---
license: cc-by-sa-4.0
datasets:
- globis-university/aozorabunko-clean
- oscar-corpus/OSCAR-2301
- Wikipedia
- WikiBooks
- CC-100
- mC4
language:
- ja
---

# What’s this?
This is a [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-base) model pre-trained on Japanese resources.

The model has the following features:

- Based on the well-established [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-base) architecture
- Specialized for the Japanese language
- Requires no morphological analyzer at inference time
- Respects word boundaries to some extent (it does not produce tokens that span multiple words, such as `の都合上` or `の判定負けを喫し`)

# How to use
```python
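from transformers import AutoTokenizer, AutoModelForTokenClassification

# Repository name on the Hugging Face Hub
model_name = 'globis-university/deberta-v3-japanese-base'

# The tokenizer works on raw text; no morphological analyzer is required
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Loads the pre-trained encoder with a token-classification head on top;
# the head is expected to need fine-tuning on a downstream task before use
model = AutoModelForTokenClassification.from_pretrained(model_name)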
```

# Tokenizer
The tokenizer was trained using [the method introduced by Kudo](https://qiita.com/taku910/items/fbaeab4684665952d5a9).

It was designed with the following goals:

- No morphological analyzer needed at inference time
- Tokens do not cross word boundaries (dictionary: `unidic-cwj-202302`)
- Easy to use with Hugging Face
- A vocabulary that is not too large

The original DeBERTa V3 is notable for its large vocabulary, but this inflates the embedding layer: in the [microsoft/deberta-v3-base](https://huggingface.co/microsoft/deberta-v3-base) model, the embedding layer accounts for 54% of all parameters. This model therefore adopts a smaller vocabulary.
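
As a rough illustration of why vocabulary size matters, the sketch below estimates what fraction of a model's parameters sits in the token-embedding matrix. The inputs are assumptions rather than figures from this card: a 128,100-token vocabulary, a hidden size of 768, and about 86M non-embedding parameters are taken from the public microsoft/deberta-v3-base configuration and model card, and the 32,000-token vocabulary is a purely hypothetical comparison point. `embedding_share` is a helper name invented for this example.

```python
def embedding_share(vocab_size: int, hidden_size: int, backbone_params: int) -> float:
    """Return the fraction of parameters held by the token-embedding matrix.

    Only the word-embedding matrix (vocab_size x hidden_size) is counted, so
    the result is an approximation.
    """
    embedding_params = vocab_size * hidden_size
    return embedding_params / (embedding_params + backbone_params)


# Assumed figures for microsoft/deberta-v3-base (not stated in this card)
share_large_vocab = embedding_share(128_100, 768, 86_000_000)

# Hypothetical smaller vocabulary with the same backbone, for comparison only
share_small_vocab = embedding_share(32_000, 768, 86_000_000)

print(f"large vocabulary: {share_large_vocab:.1%} of parameters in embeddings")
print(f"small vocabulary: {share_small_vocab:.1%} of parameters in embeddings")
```

Under these assumptions the large-vocabulary share comes out slightly above one half, broadly in line with the 54% quoted above, while the smaller vocabulary leaves most of the parameter budget to the Transformer layers.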

# Data
| Dataset Name  | Notes | File Size (with metadata) | Factor |
| ------------- | ----- | ------------------------- | ---------- |
| Wikipedia     | 2023/07; [WikiExtractor](https://github.com/attardi/wikiextractor) | 3.5GB | x2 |
| Wikipedia     | 2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) | 4.8GB | x2 |
| WikiBooks     | 2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) | 43MB | x2 |
| Aozora Bunko  | 2023/07; [globis-university/aozorabunko-clean](https://huggingface.co/datasets/globis-university/aozorabunko-clean) | 496MB | x4 |
| CC-100        | ja | 90GB | x1 |
| mC4           | ja; extracted 10%, with Wikipedia-like focus via [DSIR](https://arxiv.org/abs/2302.03169) | 91GB | x1 |
| OSCAR 2023    | ja; extracted 10%, with Wikipedia-like focus via [DSIR](https://arxiv.org/abs/2302.03169) | 26GB | x1 |
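
To get a feel for the resulting data mix, the sketch below multiplies each file size by its factor and reports each corpus's share of the total. It assumes that "Factor" is the number of times a corpus is repeated in the training data, which the table does not state explicitly, and that 1GB = 1000MB. The names `CORPORA` and `effective_sizes` are invented for this example.

```python
# (name, file size in GB, factor) copied from the table above
CORPORA = [
    ("Wikipedia (WikiExtractor)", 3.5, 2),
    ("Wikipedia (cl-tohoku)", 4.8, 2),
    ("WikiBooks", 0.043, 2),
    ("Aozora Bunko", 0.496, 4),
    ("CC-100", 90.0, 1),
    ("mC4", 91.0, 1),
    ("OSCAR 2023", 26.0, 1),
]


def effective_sizes(corpora):
    """Map each corpus name to its size after applying the factor (assumed to be a repeat count)."""
    return {name: size_gb * factor for name, size_gb, factor in corpora}


sizes = effective_sizes(CORPORA)
total_gb = sum(sizes.values())

for name, gb in sizes.items():
    print(f"{name:28s} {gb:8.2f} GB  {gb / total_gb:6.1%}")
print(f"{'Total':28s} {total_gb:8.2f} GB")
```

Even after upsampling, the curated sources (Wikipedia, WikiBooks, Aozora Bunko) remain a small part of the mix under this reading; the web corpora dominate by volume.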

# Training parameters
- Number of devices: 8
- Batch size: 24 x 8
- Learning rate: 1.92e-4
- Maximum sequence length: 512
- Optimizer: AdamW
- Learning rate scheduler: Linear schedule with warmup
- Training steps: 1,000,000
- Warmup steps: 100,000
- Precision: Mixed (fp16)
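
The learning-rate settings above can be sketched as a small function. This assumes the common form of a linear schedule with warmup: the rate rises linearly from 0 to the peak over the warmup steps, then decays linearly to 0 at the final step. The card does not spell out the exact implementation, so treat this as an illustration; `linear_warmup_lr` is a name invented for this example. It also assumes that "24 x 8" means 24 sequences per device across 8 devices.

```python
PEAK_LR = 1.92e-4
WARMUP_STEPS = 100_000
TOTAL_STEPS = 1_000_000

# Assumed reading of "24 x 8": per-device batch size times number of devices
GLOBAL_BATCH_SIZE = 24 * 8


def linear_warmup_lr(step: int) -> float:
    """Learning rate at a given step under linear warmup followed by linear decay."""
    if step < WARMUP_STEPS:
        # Warmup phase: scale up linearly from 0 to the peak
        return PEAK_LR * step / WARMUP_STEPS
    # Decay phase: scale down linearly, reaching 0 at TOTAL_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


print(f"global batch size: {GLOBAL_BATCH_SIZE}")
for step in (0, 50_000, 100_000, 550_000, 1_000_000):
    print(f"step {step:>9,}: lr = {linear_warmup_lr(step):.3e}")
```

Under these assumptions the rate peaks at step 100,000 and is back to half the peak at step 550,000, midway through the decay phase.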

# Evaluation
| Model | #params | JSTS | JNLI | JSQuAD | JCQA |
| ----- | ------- | ---- | ---- | ------ | ---- |
| ≤ small | | | | | |
| [izumi-lab/deberta-v2-small-japanese](https://huggingface.co/izumi-lab/deberta-v2-small-japanese) | 17.8M | 0.890/0.846 | 0.880 | - | 0.737 |
| [globis-university/deberta-v3-japanese-xsmall](https://huggingface.co/globis-university/deberta-v3-japanese-xsmall) | 33.7M | **0.916**/**0.880** | **0.913** | **0.869**/**0.938** | **0.821** |
| base | | | | | |
| [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) | 111M | 0.919/0.881 | 0.907 | 0.880/0.946 | 0.848 |
| [nlp-waseda/roberta-base-japanese](https://huggingface.co/nlp-waseda/roberta-base-japanese) | 111M | 0.913/0.873 | 0.895 | 0.864/0.927 | 0.840 |
| [izumi-lab/deberta-v2-base-japanese](https://huggingface.co/izumi-lab/deberta-v2-base-japanese) | 110M | 0.919/0.882 | 0.912 | - | 0.859 |
| [ku-nlp/deberta-v2-base-japanese](https://huggingface.co/ku-nlp/deberta-v2-base-japanese) | 112M | 0.922/0.886 | 0.922 | **0.899**/**0.951** | - |
| [ku-nlp/deberta-v3-base-japanese](https://huggingface.co/ku-nlp/deberta-v3-base-japanese) | 160M | **0.927**/0.891 | **0.927** | 0.896/- | - |
| [**globis-university/deberta-v3-japanese-base**](https://huggingface.co/globis-university/deberta-v3-japanese-base) | 110M | 0.925/**0.895** | 0.921 | 0.890/0.950 | **0.886** |
| large | | | | | |
| [cl-tohoku/bert-large-japanese-v2](https://huggingface.co/cl-tohoku/bert-large-japanese-v2) | 337M | 0.926/0.893 | **0.929** | 0.893/0.956 | 0.893 |
| [nlp-waseda/roberta-large-japanese](https://huggingface.co/nlp-waseda/roberta-large-japanese) | 337M | **0.930**/**0.896** | 0.924 | 0.884/0.940 | **0.907** |
| [nlp-waseda/roberta-large-japanese-seq512](https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512) | 337M | 0.926/0.892 | 0.926 | **0.918**/**0.963** | 0.891 |
| [ku-nlp/deberta-v2-large-japanese](https://huggingface.co/ku-nlp/deberta-v2-large-japanese) | 339M | 0.925/0.892 | 0.924 | 0.912/0.959 | - |
| [globis-university/deberta-v3-japanese-large](https://huggingface.co/globis-university/deberta-v3-japanese-large) | 352M | 0.928/**0.896** | 0.924 | 0.896/0.956 | 0.900 |

## License
CC BY-SA 4.0

## Acknowledgement
This work used [ABCI](https://abci.ai/) computing resources. Thank you.