akeyhero committed 4f460bb (parent: a997f24): Create README.md

Files changed (1): README.md added (+114 lines)

---
license: cc-by-sa-4.0
datasets:
- globis-university/aozorabunko-clean
- oscar-corpus/OSCAR-2301
- Wikipedia
- WikiBooks
- CC-100
- allenai/c4
language:
- ja
---

# What’s this?
This is a [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-xsmall) model pre-trained on Japanese resources.

It has the following features:

- Based on the well-established [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-xsmall) architecture
- Specialized for Japanese
- Requires no morphological analyzer at inference time
- Respects word boundaries to some extent (does not produce multi-word tokens such as `の都合上` or `の判定負けを喫し`)

# How to use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = 'globis-university/deberta-v3-japanese-xsmall'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
```

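Loaded this way, the token-classification head is newly initialized, so the checkpoint is intended to be fine-tuned on a downstream task. As a quick sanity check, here is a minimal sketch (not part of the original card; the sample sentence is our own) of a forward pass with the objects created above:

```python
import torch

# Tokenize a sample sentence and run a forward pass; the logits have shape
# (batch_size, sequence_length, num_labels).
inputs = tokenizer('これはテストです。', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.shape)
```
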
# Tokenizer
The tokenizer was trained with [the method introduced by Kudo](https://qiita.com/taku910/items/fbaeab4684665952d5a9).

It was designed with the following points in mind:

- No morphological analyzer is needed at inference time
- Tokens do not cross word boundaries (dictionary: `unidic-cwj-202302`)
- Easy to use with Hugging Face
- A vocabulary that is not too large

The original DeBERTa V3 is notable for its large vocabulary, but this makes the embedding layer disproportionately large: for [microsoft/deberta-v3-base](https://huggingface.co/microsoft/deberta-v3-base), the embedding layer accounts for 54% of all parameters. This model therefore uses a smaller vocabulary.

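To see how text is segmented, you can print the subword pieces directly; a minimal sketch (the sample sentence below is our own, not from the card):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('globis-university/deberta-v3-japanese-xsmall')

# Each piece is expected to stay within a single word boundary as defined by
# the unidic-cwj-202302 dictionary used when training the tokenizer.
print(tokenizer.tokenize('推論時に形態素解析器を用いません。'))
```
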
# Data
| Dataset Name | Notes | File Size (with metadata) | Sampling factor |
| ------------ | ----- | ------------------------- | --------------- |
| Wikipedia | 2023/07; [WikiExtractor](https://github.com/attardi/wikiextractor) | 3.5GB | x2 |
| Wikipedia | 2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) | 4.8GB | x2 |
| WikiBooks | 2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) | 43MB | x2 |
| Aozora Bunko | 2023/07; [globis-university/aozorabunko-clean](https://huggingface.co/datasets/globis-university/aozorabunko-clean) | 496MB | x4 |
| CC-100 | ja | 90GB | x1 |
| mC4 | ja; extracted 10%, with Wikipedia-like focus via [DSIR](https://arxiv.org/abs/2302.03169) | 91GB | x1 |
| OSCAR 2023 | ja; extracted 10%, with Wikipedia-like focus via [DSIR](https://arxiv.org/abs/2302.03169) | 26GB | x1 |

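Reading the sampling factor as an up-sampling multiplier, such a corpus mix could be assembled roughly as sketched below; this is our own illustration, not the authors' actual data pipeline, and the toy datasets stand in for the real corpora in the table:

```python
from datasets import Dataset, concatenate_datasets

# Toy stand-ins for the real corpora; in practice these would be the cleaned
# text datasets listed in the table above.
wikipedia = Dataset.from_dict({'text': ['ウィキペディアの記事のサンプル']})
cc100 = Dataset.from_dict({'text': ['CC-100 の文書のサンプル']})

# (dataset, factor) pairs: each corpus is repeated `factor` times, which is
# how we interpret the sampling factors here.
mix = [(wikipedia, 2), (cc100, 1)]

parts = []
for dataset, factor in mix:
    parts.extend([dataset] * factor)

corpus = concatenate_datasets(parts).shuffle(seed=42)
print(len(corpus))  # 2 + 1 = 3 examples in this toy mix
```
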
# Training parameters
- Number of devices: 8
- Batch size: 48 x 8
- Learning rate: 3.84e-4
- Maximum sequence length: 512
- Optimizer: AdamW
- Learning rate scheduler: Linear schedule with warmup
- Training steps: 1,000,000
- Warmup steps: 100,000
- Precision: Mixed (fp16)

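For orientation, the hyperparameters above map roughly onto the Hugging Face `TrainingArguments` shown below; this is a sketch of our own, not the authors' actual pre-training script (the maximum sequence length is handled by the tokenizer/data collator rather than by `TrainingArguments`):

```python
from transformers import TrainingArguments

# Hypothetical translation of the listed hyperparameters into Trainer settings.
training_args = TrainingArguments(
    output_dir='deberta-v3-japanese-xsmall-pretraining',  # placeholder path
    per_device_train_batch_size=48,   # 48 per device x 8 devices = 384 total
    learning_rate=3.84e-4,
    optim='adamw_torch',              # AdamW
    lr_scheduler_type='linear',       # linear schedule with warmup
    warmup_steps=100_000,
    max_steps=1_000_000,
    fp16=True,                        # mixed precision
)
```
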
# Evaluation
| Model | #params | JSTS | JNLI | JSQuAD | JCQA |
| ----- | ------- | ---- | ---- | ------ | ---- |
| ≤ small | | | | | |
| [izumi-lab/deberta-v2-small-japanese](https://huggingface.co/izumi-lab/deberta-v2-small-japanese) | 17.8M | 0.890/0.846 | 0.880 | - | 0.737 |
| [**globis-university/deberta-v3-japanese-xsmall**](https://huggingface.co/globis-university/deberta-v3-japanese-xsmall) | 33.7M | **0.916**/**0.880** | **0.913** | **0.869**/**0.938** | **0.821** |
| base | | | | | |
| [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) | 111M | 0.919/0.881 | 0.907 | 0.880/0.946 | 0.848 |
| [nlp-waseda/roberta-base-japanese](https://huggingface.co/nlp-waseda/roberta-base-japanese) | 111M | 0.913/0.873 | 0.895 | 0.864/0.927 | 0.840 |
| [izumi-lab/deberta-v2-base-japanese](https://huggingface.co/izumi-lab/deberta-v2-base-japanese) | 110M | 0.919/0.882 | 0.912 | - | 0.859 |
| [ku-nlp/deberta-v2-base-japanese](https://huggingface.co/ku-nlp/deberta-v2-base-japanese) | 112M | 0.922/0.886 | 0.922 | **0.899**/**0.951** | - |
| [ku-nlp/deberta-v3-base-japanese](https://huggingface.co/ku-nlp/deberta-v3-base-japanese) | 160M | **0.927**/0.891 | **0.927** | 0.896/- | - |
| [globis-university/deberta-v3-japanese-base](https://huggingface.co/globis-university/deberta-v3-japanese-base) | 110M | 0.925/**0.895** | 0.921 | 0.890/0.950 | **0.886** |
| large | | | | | |
| [cl-tohoku/bert-large-japanese-v2](https://huggingface.co/cl-tohoku/bert-large-japanese-v2) | 337M | 0.926/0.893 | **0.929** | 0.893/0.956 | 0.893 |
| [nlp-waseda/roberta-large-japanese](https://huggingface.co/nlp-waseda/roberta-large-japanese) | 337M | **0.930**/**0.896** | 0.924 | 0.884/0.940 | **0.907** |
| [nlp-waseda/roberta-large-japanese-seq512](https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512) | 337M | 0.926/0.892 | 0.926 | **0.918**/**0.963** | 0.891 |
| [ku-nlp/deberta-v2-large-japanese](https://huggingface.co/ku-nlp/deberta-v2-large-japanese) | 339M | 0.925/0.892 | 0.924 | 0.912/0.959 | - |
| [globis-university/deberta-v3-japanese-large](https://huggingface.co/globis-university/deberta-v3-japanese-large) | 352M | 0.928/**0.896** | 0.924 | 0.896/0.956 | 0.900 |

## License
CC BY-SA 4.0

## Acknowledgement
We used [ABCI](https://abci.ai/) for computing resources. Thank you.