File size: 7,676 Bytes
4f460bb cc5f99f 4f460bb cc5f99f 4f460bb cc5f99f 4f460bb cc5f99f 4f460bb |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 |
---
license: cc-by-sa-4.0
datasets:
- globis-university/aozorabunko-clean
- oscar-corpus/OSCAR-2301
- Wikipedia
- WikiBooks
- CC-100
- allenai/c4
language:
- ja
library_name: transformers
---
# What’s this?
日本語リソースで学習した [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-xsmall) モデルです。
以下のような特徴を持ちます:
- 定評のある [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-xsmall) を用いたモデル
- 日本語特化
- 推論時に形態素解析器を用いない
- 単語境界をある程度尊重する (`の都合上` や `の判定負けを喫し` のような複数語のトークンを生じさせない)
---
This is a model based on [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-xsmall) pre-trained on Japanese resources.
The model has the following features:
- Based on the well-known [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-xsmall) model
- Specialized for the Japanese language
- Does not use a morphological analyzer during inference
- Respects word boundaries to some extent (does not produce tokens spanning multiple words like `の都合上` or `の判定負けを喫し`)
# How to use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
model_name = 'globis-university/deberta-v3-japanese-xsmall'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
```
# Tokenizer
[工藤氏によって示された手法](https://qiita.com/taku910/items/fbaeab4684665952d5a9)で学習しました。
以下のことを意識しています:
- 推論時の形態素解析器なし
- トークンが単語の境界を跨がない (辞書: `unidic-cwj-202302`)
- Hugging Faceで使いやすい
- 大きすぎない語彙数
本家の DeBERTa V3 は大きな語彙数で学習されていることに特徴がありますが、反面埋め込み層のパラメータ数が大きくなりすぎる ([microsoft/deberta-v3-base](https://huggingface.co/microsoft/deberta-v3-base) モデルの場合で埋め込み層が全体の 54%) ことから、本モデルでは小さめの語彙数を採用しています。
注意点として、 `xsmall` 、 `base` 、 `large` の 3 つのモデルのうち、前者二つは unigram アルゴリズムで学習しているが、 `large` モデルのみ BPE アルゴリズムで学習している。
深い理由はなく、 `large` モデルのみ語彙サイズを増やすために独立して学習を行ったが、なぜか unigram アルゴリズムでの学習がうまくいかなかったことが原因である。
原因の探究よりモデルの完成を優先して、 BPE アルゴリズムに切り替えた。
---
The tokenizer is trained using [the method introduced by Kudo](https://qiita.com/taku910/items/fbaeab4684665952d5a9).
Key points include:
- No morphological analyzer needed during inference
- Tokens do not cross word boundaries (dictionary: `unidic-cwj-202302`)
- Easy to use with Hugging Face
- Smaller vocabulary size
Although the original DeBERTa V3 is characterized by a large vocabulary size, which can result in a significant increase in the number of parameters in the embedding layer (for the [microsoft/deberta-v3-base](https://huggingface.co/microsoft/deberta-v3-base) model, the embedding layer accounts for 54% of the total), this model adopts a smaller vocabulary size to address this.
Note that, among the three models: xsmall, base, and large, the first two were trained using the unigram algorithm, while only the large model was trained using the BPE algorithm.
The reason for this is simple: while the large model was independently trained to increase its vocabulary size, for some reason, training with the unigram algorithm was not successful.
Thus, prioritizing the completion of the model over investigating the cause, we switched to the BPE algorithm.
# Data
| Dataset Name | Notes | File Size (with metadata) | Factor |
| ------------- | ----- | ------------------------- | ---------- |
| Wikipedia | 2023/07; [WikiExtractor](https://github.com/attardi/wikiextractor) | 3.5GB | x2 |
| Wikipedia | 2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) | 4.8GB | x2 |
| WikiBooks | 2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) | 43MB | x2 |
| Aozora Bunko | 2023/07; [globis-university/aozorabunko-clean](https://huggingface.co/globis-university/globis-university/aozorabunko-clean) | 496MB | x4 |
| CC-100 | ja | 90GB | x1 |
| mC4 | ja; extracted 10%, with Wikipedia-like focus via [DSIR](https://arxiv.org/abs/2302.03169) | 91GB | x1 |
| OSCAR 2023 | ja; extracted 10%, with Wikipedia-like focus via [DSIR](https://arxiv.org/abs/2302.03169) | 26GB | x1 |
# Training parameters
- Number of devices: 8
- Batch size: 48 x 8
- Learning rate: 3.84e-4
- Maximum sequence length: 512
- Optimizer: AdamW
- Learning rate scheduler: Linear schedule with warmup
- Training steps: 1,000,000
- Warmup steps: 100,000
- Precision: Mixed (fp16)
- Vocabulary size: 32,000
# Evaluation
| Model | #params | JSTS | JNLI | JSQuAD | JCQA |
| ----- | ------- | ---- | ---- | ------ | ---- |
| ≤ small | | | | | |
| [izumi-lab/deberta-v2-small-japanese](https://huggingface.co/izumi-lab/deberta-v2-small-japanese) | 17.8M | 0.890/0.846 | 0.880 | - | 0.737 |
| [**globis-university/deberta-v3-japanese-xsmall**](https://huggingface.co/globis-university/deberta-v3-japanese-xsmall) | 33.7M | **0.916**/**0.880** | **0.913** | **0.869**/**0.938** | **0.821** |
| base | | | | |
| [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) | 111M | 0.919/0.881 | 0.907 | 0.880/0.946 | 0.848 |
| [nlp-waseda/roberta-base-japanese](https://huggingface.co/nlp-waseda/roberta-base-japanese) | 111M | 0.913/0.873 | 0.895 | 0.864/0.927 | 0.840 |
| [izumi-lab/deberta-v2-base-japanese](https://huggingface.co/izumi-lab/deberta-v2-base-japanese) | 110M | 0.919/0.882 | 0.912 | - | 0.859 |
| [ku-nlp/deberta-v2-base-japanese](https://huggingface.co/ku-nlp/deberta-v2-base-japanese) | 112M | 0.922/0.886 | 0.922 | **0.899**/**0.951** | - |
| [ku-nlp/deberta-v3-base-japanese](https://huggingface.co/ku-nlp/deberta-v3-base-japanese) | 160M | **0.927**/0.891 | **0.927** | 0.896/- | - |
| [globis-university/deberta-v3-japanese-base](https://huggingface.co/globis-university/deberta-v3-japanese-base) | 110M | 0.925/**0.895** | 0.921 | 0.890/0.950 | **0.886** |
| large | | | | | |
| [cl-tohoku/bert-large-japanese-v2](https://huggingface.co/cl-tohoku/bert-large-japanese-v2) | 337M | 0.926/0.893 | **0.929** | 0.893/0.956 | 0.893 |
| [nlp-waseda/roberta-large-japanese](https://huggingface.co/nlp-waseda/roberta-large-japanese) | 337M | **0.930**/**0.896** | 0.924 | 0.884/0.940 | **0.907** |
| [nlp-waseda/roberta-large-japanese-seq512](https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512) | 337M | 0.926/0.892 | 0.926 | **0.918**/**0.963** | 0.891 |
| [ku-nlp/deberta-v2-large-japanese](https://huggingface.co/ku-nlp/deberta-v2-large-japanese) | 339M | 0.925/0.892 | 0.924 | 0.912/0.959 | - |
| [globis-university/deberta-v3-japanese-large](https://huggingface.co/globis-university/deberta-v3-japanese-large) | 352M | 0.928/**0.896** | 0.924 | 0.896/0.956 | 0.900 |
## License
CC BY SA 4.0
## Acknowledgement
計算リソースに [ABCI](https://abci.ai/) を利用させていただきました。ありがとうございます。
---
We used [ABCI](https://abci.ai/) for computing resources. Thank you. |