---
license: cc-by-sa-4.0
datasets:
- globis-university/aozorabunko-clean
- oscar-corpus/OSCAR-2301
language:
- ja
---

# What’s this?
This is a DeBERTa V3 model pre-trained on Japanese resources.

The model has the following features:
- Based on the well-known DeBERTa V3 model
- Specialized for the Japanese language
- Does not require a morphological analyzer at inference time
- Respects word boundaries to some extent (it does not produce tokens that span multiple words, such as `の都合上` or `の判定負けを喫し`)

# How to use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = 'globis-university/deberta-v3-japanese-base'
# Load the tokenizer and the model (here with a token-classification head on top of the encoder)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
```

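If you only need contextual embeddings from the pre-trained encoder rather than a task head, a minimal sketch like the following should work; `AutoModel` and the sample sentence are illustrative choices, not part of the original instructions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = 'globis-university/deberta-v3-japanese-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)  # bare encoder, no task-specific head

inputs = tokenizer('今日はいい天気ですね。', return_tensors='pt')
with torch.no_grad():
    outputs = encoder(**inputs)

# Token-level representations: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```
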
# Tokenizer
The tokenizer was trained with the method demonstrated by Kudo.

Key points include:
- No morphological analyzer is needed at inference time
- Tokens do not cross word boundaries
- Easy to use with Hugging Face
- Smaller vocabulary size

The original DeBERTa V3 is characterized by a large vocabulary, which substantially increases the number of parameters in the embedding layer; this model adopts a smaller vocabulary to keep the embedding layer compact.

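To see how the tokenizer segments Japanese text without a morphological analyzer, you can inspect its output directly. This is only an illustrative sketch; the example sentence is arbitrary and the exact tokens depend on the learned vocabulary.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('globis-university/deberta-v3-japanese-base')

text = '日本語の自然言語処理を試してみます。'
print(tokenizer.tokenize(text))  # subword tokens; none should span a word boundary

ids = tokenizer(text)['input_ids']
print(tokenizer.convert_ids_to_tokens(ids))  # same tokens plus the special tokens added by the tokenizer
```
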
# Data
| Dataset Name | Notes | File Size (with metadata) | Repetitions |
| ------------ | ----- | ------------------------- | ----------- |
| Wikipedia | 2023/07; WikiExtractor | 3.5GB | x2 |
| Wikipedia | 2023/07; cl-tohoku's method | 4.8GB | x2 |
| WikiBooks | 2023/07; cl-tohoku's method | 43MB | x2 |
| Aozora Bunko | 2023/07; globis-university/aozorabunko-clean | 496MB | x4 |
| CC-100 | ja | 90GB | x1 |
| mC4 | ja; extracted 10% of Wikipedia-like data using DSIR | 91GB | x1 |
| OSCAR 2023 | ja; extracted 20% of Wikipedia-like data using DSIR | 26GB | x1 |

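The "Repetitions" column indicates how many times each corpus was repeated in the pre-training mix. The sketch below shows one way such repetition factors could be applied with the `datasets` library; the corpus list, split names, and mapping are illustrative assumptions, not the authors' actual preprocessing pipeline.

```python
from datasets import load_dataset, concatenate_datasets

# Hypothetical mapping of corpus -> repetition factor, mirroring the table above.
repetition = {
    'globis-university/aozorabunko-clean': 4,
    # ... the Wikipedia dumps, CC-100, mC4, and OSCAR subsets would be listed here as well
}

parts = []
for name, times in repetition.items():
    ds = load_dataset(name, split='train')  # assumes a 'train' split exists
    parts.extend([ds] * times)              # oversample by simple repetition

pretraining_corpus = concatenate_datasets(parts).shuffle(seed=42)
```
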
# Training parameters
- Number of devices: 8
- Batch size: 24 per device x 8 devices
- Learning rate: 1.92e-4
- Maximum sequence length: 512
- Optimizer: AdamW
- Learning rate scheduler: Linear schedule with warmup
- Training steps: 1,000,000
- Warmup steps: 100,000
- Precision: Mixed (fp16)

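As a rough illustration of how these hyperparameters map onto the `transformers` optimization utilities (a sketch under stated assumptions, not the authors' actual training script):

```python
import torch
from transformers import AutoModel, get_linear_schedule_with_warmup

model = AutoModel.from_pretrained('globis-university/deberta-v3-japanese-base')

optimizer = torch.optim.AdamW(model.parameters(), lr=1.92e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100_000,      # warmup steps listed above
    num_training_steps=1_000_000,  # total training steps listed above
)

# Each training step would then run: loss.backward(); optimizer.step();
# scheduler.step(); optimizer.zero_grad()
```
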
# Evaluation
| Model | JSTS | JNLI | JSQuAD | JCQA |
| ----- | ---- | ---- | ------ | ---- |
| ≤ small | | | | |
| [izumi-lab/deberta-v2-small-japanese](https://huggingface.co/izumi-lab/deberta-v2-small-japanese) | 0.890/0.846 | 0.880 | - | 0.737 |
| [globis-university/deberta-v3-japanese-xsmall](https://huggingface.co/globis-university/deberta-v3-japanese-xsmall) | 0.916/0.880 | 0.913 | 0.869/0.938 | 0.821 |
| base | | | | |
| [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) | 0.919/0.881 | 0.907 | 0.880/0.946 | 0.848 |
| [nlp-waseda/roberta-base-japanese](https://huggingface.co/nlp-waseda/roberta-base-japanese) | 0.913/0.873 | 0.895 | 0.864/0.927 | 0.840 |
| [izumi-lab/deberta-v2-base-japanese](https://huggingface.co/izumi-lab/deberta-v2-base-japanese) | 0.919/0.882 | 0.912 | - | 0.859 |
| [ku-nlp/deberta-v2-base-japanese](https://huggingface.co/ku-nlp/deberta-v2-base-japanese) | 0.922/0.886 | 0.922 | 0.899/0.951 | - |
| [ku-nlp/deberta-v3-base-japanese](https://huggingface.co/ku-nlp/deberta-v3-base-japanese) | 0.927/0.891 | 0.927 | 0.896/- | - |
| [globis-university/deberta-v3-japanese-base](https://huggingface.co/globis-university/deberta-v3-japanese-base) | 0.925/0.895 | 0.921 | 0.890/0.950 | 0.886 |
| large | | | | |
| [cl-tohoku/bert-large-japanese-v2](https://huggingface.co/cl-tohoku/bert-large-japanese-v2) | 0.926/0.893 | 0.929 | 0.893/0.956 | 0.893 |
| [nlp-waseda/roberta-large-japanese](https://huggingface.co/nlp-waseda/roberta-large-japanese) | 0.930/0.896 | 0.924 | 0.884/0.940 | 0.907 |
| [nlp-waseda/roberta-large-japanese-seq512](https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512) | 0.926/0.892 | 0.926 | 0.918/0.963 | 0.891 |
| [ku-nlp/deberta-v2-large-japanese](https://huggingface.co/ku-nlp/deberta-v2-large-japanese) | 0.925/0.892 | 0.924 | 0.912/0.959 | - |
| [globis-university/deberta-v3-japanese-large](https://huggingface.co/globis-university/deberta-v3-japanese-large) | 0.928/0.896 | 0.924 | 0.896/0.956 | 0.900 |

## License
CC BY-SA 4.0

## Acknowledgement
We used ABCI (AI Bridging Cloud Infrastructure) for computing resources.