---
license: cc-by-sa-4.0
datasets:
- globis-university/aozorabunko-clean
- oscar-corpus/OSCAR-2301
language:
- ja
---

# What’s this?
This is a [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-base) model pre-trained on Japanese resources.

The model has the following features:
- Based on the well-known [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-base) model
- Specialized for the Japanese language
- Does not use a morphological analyzer during inference
- Respects word boundaries to some extent (does not produce tokens spanning multiple words like `の都合上` or `の判定負けを喫し`)
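
The word-boundary property can be checked mechanically. The sketch below is a minimal illustration, not the authors' tooling: `boundaries` and `crosses_word_boundary` are hypothetical helpers, and the word and token segmentations are hand-written examples rather than real output from a morphological analyzer or from this tokenizer.

```python
def boundaries(pieces):
    """Return the set of character offsets at which a piece ends."""
    offsets, pos = set(), 0
    for piece in pieces:
        pos += len(piece)
        offsets.add(pos)
    return offsets


def crosses_word_boundary(words, tokens):
    """Return True if any token spans more than one word.

    `words` and `tokens` must be segmentations of the same string.
    """
    if "".join(words) != "".join(tokens):
        raise ValueError("segmentations cover different text")
    # Every word boundary must also be a token boundary;
    # otherwise some token straddles that word boundary.
    return not boundaries(words) <= boundaries(tokens)


# Hand-written segmentations (assumptions, not real tokenizer output).
words = ["の", "都合", "上"]
print(crosses_word_boundary(words, ["の", "都合", "上"]))      # False
print(crosses_word_boundary(words, ["の都合上"]))              # True
print(crosses_word_boundary(words, ["の", "都", "合", "上"]))  # False
```

Splitting a word into several subword tokens is allowed under this definition; only a token that covers parts of two or more words counts as a violation.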

# How to use
```python
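from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hub ID of this model
model_name = 'globis-university/deberta-v3-japanese-base'
# The tokenizer needs no external morphological analyzer at inference time
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Loads the pre-trained encoder; the token-classification head is expected
# to be newly initialized, so fine-tune the model before using its predictions
model = AutoModelForTokenClassification.from_pretrained(model_name)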
```

# Tokenizer
The tokenizer was trained using [the method described by Kudo](https://qiita.com/taku910/items/fbaeab4684665952d5a9).

The tokenizer was designed with the following goals:
- No morphological analyzer needed during inference
- Tokens do not cross word boundaries (as defined by `unidic-cwj-202302`)
- Easy to use with Hugging Face
- A vocabulary that is not too large

The original DeBERTa V3 is notable for its large vocabulary, but a large vocabulary greatly increases the number of parameters in the embedding layer. This model therefore adopts a smaller vocabulary.
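
To make the trade-off concrete, the sketch below estimates the size of a token-embedding matrix as vocabulary size times hidden size. The numbers are assumptions for illustration: 128,000 approximates the vocabulary of the original DeBERTa V3, 768 is a typical hidden size for a base-sized model, and 32,000 is a hypothetical smaller vocabulary, not a value stated in this card. The actual values can be read from the loaded model's `config.vocab_size` and `config.hidden_size`.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in a token-embedding matrix of shape (vocab_size, hidden_size)."""
    return vocab_size * hidden_size


HIDDEN_SIZE = 768      # typical base-sized hidden dimension (assumption)
LARGE_VOCAB = 128_000  # approximate vocabulary of the original DeBERTa V3
SMALL_VOCAB = 32_000   # hypothetical smaller vocabulary, for illustration only

large = embedding_params(LARGE_VOCAB, HIDDEN_SIZE)
small = embedding_params(SMALL_VOCAB, HIDDEN_SIZE)

print(f"large vocabulary: {large:,} parameters")  # 98,304,000
print(f"small vocabulary: {small:,} parameters")  # 24,576,000
print(f"difference:       {large - small:,}")     # 73,728,000
```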

# Data
| Dataset Name  | Notes | File Size (with metadata) | Factor |
| ------------- | ----- | ------------------------- | ---------- |
| Wikipedia     | 2023/07; [WikiExtractor](https://github.com/attardi/wikiextractor) | 3.5GB | x2 |
| Wikipedia     | 2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) | 4.8GB | x2 |
| WikiBooks     | 2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) | 43MB | x2 |
| Aozora Bunko  | 2023/07; [globis-university/aozorabunko-clean](https://huggingface.co/datasets/globis-university/aozorabunko-clean) | 496MB | x4 |
| CC-100        | ja | 90GB | x1 |
| mC4           | ja; a Wikipedia-like 10% subset extracted using [DSIR](https://arxiv.org/abs/2302.03169) | 91GB | x1 |
| OSCAR 2023    | ja; a Wikipedia-like 20% subset extracted using [DSIR](https://arxiv.org/abs/2302.03169) | 26GB | x1 |

# Training parameters
- Number of devices: 8
- Batch size: 24 x 8
- Learning rate: 1.92e-4
- Maximum sequence length: 512
- Optimizer: AdamW
- Learning rate scheduler: Linear schedule with warmup
- Training steps: 1,000,000
- Warmup steps: 100,000
- Precision: Mixed (fp16)
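
As a rough illustration of how these settings interact, the sketch below computes the learning rate at a given step under a linear schedule with warmup. It is a minimal sketch, not the training code: it assumes the rate rises linearly from 0 to the peak over the warmup steps and then decays linearly to 0 at the final step (the shape used by `get_linear_schedule_with_warmup` in Transformers), and it assumes that "24 x 8" means 24 sequences per device on 8 devices. `learning_rate` is a hypothetical helper name.

```python
PEAK_LR = 1.92e-4
WARMUP_STEPS = 100_000
TOTAL_STEPS = 1_000_000

# Assumed reading of "24 x 8": per-device batch size times number of devices.
GLOBAL_BATCH_SIZE = 24 * 8  # 192 sequences per optimizer step


def learning_rate(step: int) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < WARMUP_STEPS:
        # Warmup: scale linearly from 0 up to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Decay: scale linearly from the peak down to 0 at TOTAL_STEPS.
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 50_000, 100_000, 550_000, 1_000_000):
    print(f"step {step:>9,}: lr = {learning_rate(step):.2e}")
```

Under these assumptions the rate reaches its peak of 1.92e-4 at step 100,000 and is back to half of that at step 550,000, the midpoint of the decay phase.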

# Evaluation
| Model | JSTS (Pearson/Spearman) | JNLI (accuracy) | JSQuAD (EM/F1) | JCQA (accuracy) |
| ----- | ---- | ---- | ------ | ---- |
| ≤ small | | | | |
| [izumi-lab/deberta-v2-small-japanese](https://huggingface.co/izumi-lab/deberta-v2-small-japanese) | 0.890/0.846 | 0.880 | - | 0.737 |
| [globis-university/deberta-v3-japanese-xsmall](https://huggingface.co/globis-university/deberta-v3-japanese-xsmall) | **0.916**/**0.880** | **0.913** | **0.869**/**0.938** | **0.821** |
| base | | | | |
| [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) | 0.919/0.881 | 0.907 | 0.880/0.946 | 0.848 |
| [nlp-waseda/roberta-base-japanese](https://huggingface.co/nlp-waseda/roberta-base-japanese) | 0.913/0.873 | 0.895 | 0.864/0.927 | 0.840 |
| [izumi-lab/deberta-v2-base-japanese](https://huggingface.co/izumi-lab/deberta-v2-base-japanese) | 0.919/0.882 | 0.912 | - | 0.859 |
| [ku-nlp/deberta-v2-base-japanese](https://huggingface.co/ku-nlp/deberta-v2-base-japanese) | 0.922/0.886 | 0.922 | **0.899**/**0.951** | - |
| [ku-nlp/deberta-v3-base-japanese](https://huggingface.co/ku-nlp/deberta-v3-base-japanese) | **0.927**/0.891 | **0.927** | 0.896/- | - |
| [**globis-university/deberta-v3-japanese-base**](https://huggingface.co/globis-university/deberta-v3-japanese-base) | 0.925/**0.895** | 0.921 | 0.890/0.950 | **0.886** |
| large | | | | |
| [cl-tohoku/bert-large-japanese-v2](https://huggingface.co/cl-tohoku/bert-large-japanese-v2) | 0.926/0.893 | **0.929** | 0.893/0.956 | 0.893 |
| [nlp-waseda/roberta-large-japanese](https://huggingface.co/nlp-waseda/roberta-large-japanese) | **0.930**/**0.896** | 0.924 | 0.884/0.940 | **0.907** |
| [nlp-waseda/roberta-large-japanese-seq512](https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512) | 0.926/0.892 | 0.926 | **0.918**/**0.963** | 0.891 |
| [ku-nlp/deberta-v2-large-japanese](https://huggingface.co/ku-nlp/deberta-v2-large-japanese) | 0.925/0.892 | 0.924 | 0.912/0.959 | - |
| [globis-university/deberta-v3-japanese-large](https://huggingface.co/globis-university/deberta-v3-japanese-large) | 0.928/**0.896** | 0.924 | 0.896/0.956 | 0.900 |

## License
CC BY-SA 4.0

## Acknowledgement
We used [ABCI](https://abci.ai/) for computing resources. Thank you.