Update README.md

---

# What’s this?
日本語リソースで学習した [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-base) モデルです。

以下のような特徴を持ちます:

- 定評のある [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-base) を用いたモデル
- 日本語特化
- 推論時に形態素解析器を用いない
- 単語境界をある程度尊重する (`の都合上` や `の判定負けを喫し` のような複数語のトークンを生じさせない)

---

This is a [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-base) model pre-trained on Japanese resources.

The model has the following features:

- Based on the well-known [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-base) model
- Specialized for the Japanese language
- Does not use a morphological analyzer during inference
- Respects word boundaries to some extent (does not produce tokens spanning multiple words like `の都合上` or `の判定負けを喫し`)
# How to use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# The checkpoint name is a placeholder; point it at the checkpoint you want
# (one of the globis-university/deberta-v3-japanese-* repositories).
model_name = "globis-university/deberta-v3-japanese-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
```
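As a quick sanity check, the snippet below inspects the tokenization of a short sentence and runs a forward pass. This is a minimal sketch: the example sentence is arbitrary, it assumes `tokenizer` and `model` were loaded as above, and the token-classification head is randomly initialized until you fine-tune it.

```python
# Illustrative only: look at the subword tokens and run one forward pass.
text = "試合に判定負けを喫した。"  # arbitrary example sentence
print(tokenizer.tokenize(text))   # subword tokens; they should not span multiple words

inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)       # (batch_size, sequence_length, num_labels)
```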
# Tokenizer
[工藤氏によって示された手法](https://qiita.com/taku910/items/fbaeab4684665952d5a9)で学習した。

以下のことを意識している:

- 推論時の形態素解析器なし
- トークンが単語 (`unidic-cwj-202302`) の境界を跨がない
- Hugging Faceで使いやすい
- 大きすぎない語彙数

本家の DeBERTa V3 は大きな語彙数で学習されていることに特徴があるが、反面埋め込み層のパラメータ数が大きくなりすぎることから、本モデルでは小さめの語彙数を採用している。

---

The tokenizer was trained using [the method introduced by Kudo](https://qiita.com/taku910/items/fbaeab4684665952d5a9).

Key points include:

- No morphological analyzer needed during inference
- Tokens do not cross word boundaries (as defined by `unidic-cwj-202302`)
- Easy to use with Hugging Face
- Smaller vocabulary size

The original DeBERTa V3 is characterized by a large vocabulary, but this significantly inflates the number of parameters in the embedding layer, so this model adopts a smaller vocabulary instead.
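To make that trade-off concrete, here is a rough back-of-the-envelope comparison of embedding-layer sizes. The 128K vocabulary and 768 hidden size are the original DeBERTa V3 (base) settings; the smaller vocabulary below is only an assumed illustrative value, not necessarily this model's actual setting.

```python
# Embedding parameters grow linearly with vocabulary size: roughly vocab_size * hidden_size.
hidden_size = 768          # hidden width of a base-sized model

original_vocab = 128_100   # vocabulary size of the original DeBERTa V3
smaller_vocab = 32_000     # assumed illustrative value, not this model's exact vocabulary

print(f"original: {original_vocab * hidden_size / 1e6:.1f}M embedding parameters")  # ≈ 98.4M
print(f"smaller : {smaller_vocab * hidden_size / 1e6:.1f}M embedding parameters")   # ≈ 24.6M
```

With the 128K vocabulary, the embedding table alone is comparable in size to the entire transformer body of a base-sized model, which is why a smaller vocabulary is attractive here.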
# Data
| Dataset Name | Notes | File Size (with metadata) | Factor |
| ------------- | ----- | ------------------------- | ---------- |
| Wikipedia | 2023/07; [WikiExtractor](https://github.com/attardi/wikiextractor) | 3.5GB | x2 |
| Wikipedia | 2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) | 4.8GB | x2 |
| WikiBooks | 2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) | 43MB | x2 |
| Aozora Bunko | 2023/07; [globis-university/aozorabunko-clean](https://huggingface.co/datasets/globis-university/aozorabunko-clean) | 496MB | x4 |
| CC-100 | ja | 90GB | x1 |
| mC4 | ja; extracted 10% of Wikipedia-like data using [DSIR](https://arxiv.org/abs/2302.03169) | 91GB | x1 |
| OSCAR 2023 | ja; extracted 20% of Wikipedia-like data using [DSIR](https://arxiv.org/abs/2302.03169) | 26GB | x1 |
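For intuition about how the Factor column might be used, the sketch below up-samples each corpus by repeating it the indicated number of times before shuffling. The file names are hypothetical and this is not the actual preprocessing pipeline.

```python
# Illustrative only: build a training mixture by repeating each corpus
# according to its sampling factor, then shuffling.
from datasets import load_dataset, concatenate_datasets

# Hypothetical local text dumps standing in for the corpora listed above.
wikipedia = load_dataset("text", data_files="wikipedia_ja.txt", split="train")
aozora = load_dataset("text", data_files="aozorabunko_clean.txt", split="train")
cc100 = load_dataset("text", data_files="cc100_ja.txt", split="train")

mixture = concatenate_datasets(
    [wikipedia] * 2   # x2
    + [aozora] * 4    # x4
    + [cc100] * 1     # x1
).shuffle(seed=42)
```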
# Training parameters
- Number of devices: 8
| ----- | ---- | ---- | ------ | ---- |
| ≤ small | | | | |
| [izumi-lab/deberta-v2-small-japanese](https://huggingface.co/izumi-lab/deberta-v2-small-japanese) | 0.890/0.846 | 0.880 | - | 0.737 |
| [globis-university/deberta-v3-japanese-xsmall](https://huggingface.co/globis-university/deberta-v3-japanese-xsmall) | **0.916**/**0.880** | **0.913** | **0.869**/**0.938** | **0.821** |
| base | | | | |
| [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) | 0.919/0.881 | 0.907 | 0.880/0.946 | 0.848 |
| [nlp-waseda/roberta-base-japanese](https://huggingface.co/nlp-waseda/roberta-base-japanese) | 0.913/0.873 | 0.895 | 0.864/0.927 | 0.840 |
| [izumi-lab/deberta-v2-base-japanese](https://huggingface.co/izumi-lab/deberta-v2-base-japanese) | 0.919/0.882 | 0.912 | - | 0.859 |
| [ku-nlp/deberta-v2-base-japanese](https://huggingface.co/ku-nlp/deberta-v2-base-japanese) | 0.922/0.886 | 0.922 | **0.899**/**0.951** | - |
| [ku-nlp/deberta-v3-base-japanese](https://huggingface.co/ku-nlp/deberta-v3-base-japanese) | **0.927**/0.891 | **0.927** | 0.896/- | - |
| [**globis-university/deberta-v3-japanese-base**](https://huggingface.co/globis-university/deberta-v3-japanese-base) | 0.925/**0.895** | 0.921 | 0.890/0.950 | **0.886** |
| large | | | | |
| [cl-tohoku/bert-large-japanese-v2](https://huggingface.co/cl-tohoku/bert-large-japanese-v2) | 0.926/0.893 | **0.929** | 0.893/0.956 | 0.893 |
| [nlp-waseda/roberta-large-japanese](https://huggingface.co/nlp-waseda/roberta-large-japanese) | **0.930**/**0.896** | 0.924 | 0.884/0.940 | **0.907** |
| [nlp-waseda/roberta-large-japanese-seq512](https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512) | 0.926/0.892 | 0.926 | **0.918**/**0.963** | 0.891 |
| [ku-nlp/deberta-v2-large-japanese](https://huggingface.co/ku-nlp/deberta-v2-large-japanese) | 0.925/0.892 | 0.924 | 0.912/0.959 | - |
| [globis-university/deberta-v3-japanese-large](https://huggingface.co/globis-university/deberta-v3-japanese-large) | 0.928/**0.896** | 0.924 | 0.896/0.956 | 0.900 |
## License
CC BY-SA 4.0
## Acknowledgement
計算リソースに [ABCI](https://abci.ai/) を利用させていただきました。ありがとうございます。

---

We used [ABCI](https://abci.ai/) for computing resources. Thank you.