akeyhero committed on
Commit: 2201e71
Parent: 82533ff

Update README.md

Files changed (1): README.md (+41, -44)
README.md CHANGED
@@ -8,25 +8,24 @@ language:
  ---
  
  # What’s this?
- This is a model based on DeBERTa V3 pre-trained on Japanese resources.
-
- The model has the following features:
- - Based on the well-known DeBERTa V3 model
- - Specialized for the Japanese language
- - Does not use a morphological analyzer during inference
- - Respects word boundaries to some extent (does not produce tokens spanning multiple words like `の都合上` or `の判定負けを喫し`)
-
- ---
-
- 日本語リソースで学習した DeBERTa V3 モデルです。
+ 日本語リソースで学習した [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-base) モデルです。
  
  以下のような特徴を持ちます:
  
- - 定評のある DeBERTa V3 を用いたモデル
+ - 定評のある [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-base) を用いたモデル
  - 日本語特化
  - 推論時に形態素解析器を用いない
  - 単語境界をある程度尊重する (`の都合上` や `の判定負けを喫し` のような複数語のトークンを生じさせない)
  
+ ---
+ This is a model based on [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-base) pre-trained on Japanese resources.
+
+ The model has the following features:
+ - Based on the well-known [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-base) model
+ - Specialized for the Japanese language
+ - Does not use a morphological analyzer during inference
+ - Respects word boundaries to some extent (does not produce tokens spanning multiple words like `の都合上` or `の判定負けを喫し`)
+
  # How to use
  ```python
  from transformers import AutoTokenizer, AutoModelForTokenClassification
@@ -37,39 +36,38 @@ model = AutoModelForTokenClassification.from_pretrained(model_name)
  ```
  
  # Tokenizer
- The tokenizer is trained using the method demonstrated by Kudo.
-
- Key points include:
- - No morphological analyzer needed during inference
- - Tokens do not cross word boundaries
- - Easy to use with Hugging Face
- - Smaller vocabulary size
-
- Although the original DeBERTa V3 is characterized by a large vocabulary size, which can result in a significant increase in the number of parameters in the embedding layer, this model adopts a smaller vocabulary size to address this.
-
- ---
-
- 工藤氏によって示された手法で学習した。
+ [工藤氏によって示された手法](https://qiita.com/taku910/items/fbaeab4684665952d5a9)で学習した。
  
  以下のことを意識している:
  
  - 推論時の形態素解析器なし
- - トークンが単語の境界を跨がない
+ - トークンが単語 (`unidic-cwj-202302`) の境界を跨がない
  - Hugging Faceで使いやすい
  - 大きすぎない語彙数
  
  本家の DeBERTa V3 は大きな語彙数で学習されていることに特徴があるが、反面埋め込み層のパラメータ数が大きくなりすぎることから、本モデルでは小さめの語彙数を採用している。
  
+ ---
+ The tokenizer is trained using [the method introduced by Kudo](https://qiita.com/taku910/items/fbaeab4684665952d5a9).
+
+ Key points include:
+ - No morphological analyzer needed during inference
+ - Tokens do not cross word boundaries (`unidic-cwj-202302`)
+ - Easy to use with Hugging Face
+ - Smaller vocabulary size
+
+ Although the original DeBERTa V3 is characterized by a large vocabulary size, which can result in a significant increase in the number of parameters in the embedding layer, this model adopts a smaller vocabulary size to address this.
+
  # Data
- | Dataset Name | Notes | File Size (with metadata) | Times |
+ | Dataset Name | Notes | File Size (with metadata) | Factor |
  | ------------- | ----- | ------------------------- | ---------- |
- | Wikipedia | 2023/07; WikiExtractor | 3.5GB | x2 |
- | Wikipedia | 2023/07; cl-tohoku's method | 4.8GB | x2 |
- | WikiBooks | 2023/07; cl-tohoku's method | 43MB | x2 |
- | Aozora Bunko | 2023/07; globis-university/aozorabunko-clean | 496MB | x4 |
+ | Wikipedia | 2023/07; [WikiExtractor](https://github.com/attardi/wikiextractor) | 3.5GB | x2 |
+ | Wikipedia | 2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) | 4.8GB | x2 |
+ | WikiBooks | 2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) | 43MB | x2 |
+ | Aozora Bunko | 2023/07; [globis-university/aozorabunko-clean](https://huggingface.co/datasets/globis-university/aozorabunko-clean) | 496MB | x4 |
  | CC-100 | ja | 90GB | x1 |
- | mC4 | ja; extracted 10% of Wikipedia-like data using DSIR | 91GB | x1 |
- | OSCAR 2023 | ja; extracted 20% of Wikipedia-like data using DSIR | 26GB | x1 |
+ | mC4 | ja; extracted 10% of Wikipedia-like data using [DSIR](https://arxiv.org/abs/2302.03169) | 91GB | x1 |
+ | OSCAR 2023 | ja; extracted 20% of Wikipedia-like data using [DSIR](https://arxiv.org/abs/2302.03169) | 26GB | x1 |
  
  # Training parameters
  - Number of devices: 8
@@ -87,27 +85,26 @@ Although the original DeBERTa V3 is characterized by a large vocabulary size, wh
  | ----- | ---- | ---- | ------ | ---- |
  | ≤ small | | | | |
  | [izumi-lab/deberta-v2-small-japanese](https://huggingface.co/izumi-lab/deberta-v2-small-japanese) | 0.890/0.846 | 0.880 | - | 0.737 |
- | [globis-university/deberta-v3-japanese-xsmall](https://huggingface.co/globis-university/deberta-v3-japanese-xsmall) | 0.916/0.880 | 0.913 | 0.869/0.938 | 0.821 |
+ | [globis-university/deberta-v3-japanese-xsmall](https://huggingface.co/globis-university/deberta-v3-japanese-xsmall) | **0.916**/**0.880** | **0.913** | **0.869**/**0.938** | **0.821** |
  | base | | | | |
  | [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) | 0.919/0.881 | 0.907 | 0.880/0.946 | 0.848 |
  | [nlp-waseda/roberta-base-japanese](https://huggingface.co/nlp-waseda/roberta-base-japanese) | 0.913/0.873 | 0.895 | 0.864/0.927 | 0.840 |
  | [izumi-lab/deberta-v2-base-japanese](https://huggingface.co/izumi-lab/deberta-v2-base-japanese) | 0.919/0.882 | 0.912 | - | 0.859 |
- | [ku-nlp/deberta-v2-base-japanese](https://huggingface.co/ku-nlp/deberta-v2-base-japanese) | 0.922/0.886 | 0.922 | 0.899/0.951 | - |
- | [ku-nlp/deberta-v3-base-japanese](https://huggingface.co/ku-nlp/deberta-v3-base-japanese) | 0.927/0.891 | 0.927 | 0.896/- | - |
- | [globis-university/deberta-v3-japanese-base](https://huggingface.co/globis-university/deberta-v3-japanese-base) | 0.925/0.895 | 0.921 | 0.890/0.950 | 0.886 |
+ | [ku-nlp/deberta-v2-base-japanese](https://huggingface.co/ku-nlp/deberta-v2-base-japanese) | 0.922/0.886 | 0.922 | **0.899**/**0.951** | - |
+ | [ku-nlp/deberta-v3-base-japanese](https://huggingface.co/ku-nlp/deberta-v3-base-japanese) | **0.927**/0.891 | **0.927** | 0.896/- | - |
+ | [**globis-university/deberta-v3-japanese-base**](https://huggingface.co/globis-university/deberta-v3-japanese-base) | 0.925/**0.895** | 0.921 | 0.890/0.950 | **0.886** |
  | large | | | | |
- | [cl-tohoku/bert-large-japanese-v2](https://huggingface.co/cl-tohoku/bert-large-japanese-v2) | 0.926/0.893 | 0.929 | 0.893/0.956 | 0.893 |
- | [roberta-large-japanese](https://huggingface.co/nlp-waseda/roberta-large-japanese) | 0.930/0.896 | 0.924 | 0.884/0.940 | 0.907 |
- | [roberta-large-japanese-seq512](https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512) | 0.926/0.892 | 0.926 | 0.918/0.963 | 0.891 |
+ | [cl-tohoku/bert-large-japanese-v2](https://huggingface.co/cl-tohoku/bert-large-japanese-v2) | 0.926/0.893 | **0.929** | 0.893/0.956 | 0.893 |
+ | [roberta-large-japanese](https://huggingface.co/nlp-waseda/roberta-large-japanese) | **0.930**/**0.896** | 0.924 | 0.884/0.940 | **0.907** |
+ | [roberta-large-japanese-seq512](https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512) | 0.926/0.892 | 0.926 | **0.918**/**0.963** | 0.891 |
  | [ku-nlp/deberta-v2-large-japanese](https://huggingface.co/ku-nlp/deberta-v2-large-japanese) | 0.925/0.892 | 0.924 | 0.912/0.959 | - |
- | [globis-university/deberta-v3-japanese-large](https://huggingface.co/globis-university/deberta-v3-japanese-large) | 0.928/0.896 | 0.924 | 0.896/0.956 | 0.900 |
+ | [globis-university/deberta-v3-japanese-large](https://huggingface.co/globis-university/deberta-v3-japanese-large) | 0.928/**0.896** | 0.924 | 0.896/0.956 | 0.900 |
  
  ## License
  CC BY SA 4.0
  
  ## Acknowledgement
- We used ABCI for computing resources.
+ 計算リソースに [ABCI](https://abci.ai/) を利用させていただきました。ありがとうございます。
  
  ---
-
- 計算リソースにABCIを利用させていただきました。
+ We used [ABCI](https://abci.ai/) for computing resources. Thank you.
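
For quick reference beyond the truncated `# How to use` snippet in the diff above, here is a minimal usage sketch. The checkpoint name is an assumption chosen for illustration (this README belongs to one of the globis-university/deberta-v3-japanese-* variants); the sketch only shows the tokenizer and model loading through the standard `transformers` Auto classes named in the README.

```python
# Minimal sketch; "globis-university/deberta-v3-japanese-base" is an assumed example
# checkpoint, not necessarily the variant this README documents.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "globis-university/deberta-v3-japanese-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# No external morphological analyzer is needed at inference time.
inputs = tokenizer("日本語リソースで学習したモデルです。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, sequence_length, num_labels)
```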
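The Tokenizer section claims that tokens do not cross word boundaries (as defined by `unidic-cwj-202302`). A small sketch to inspect that behaviour, under the same assumed checkpoint as above; the actual token pieces are not asserted here.

```python
from transformers import AutoTokenizer

# Assumed example checkpoint; swap in the variant this README documents.
tokenizer = AutoTokenizer.from_pretrained("globis-university/deberta-v3-japanese-base")

# Segment a phrase containing the README's examples; multi-word pieces such as
# `の都合上` should not appear as single tokens.
print(tokenizer.tokenize("大会の都合上、判定負けを喫した。"))
```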