license: cc-by-sa-4.0
datasets:
- globis-university/aozorabunko-clean
- oscar-corpus/OSCAR-2301
- Wikipedia
- WikiBooks
- CC-100
- allenai/c4
language:
- ja
What’s this?
This is a DeBERTa V3 model pre-trained on Japanese resources.
It has the following features:
- Based on the well-established DeBERTa V3 architecture
- Specialized for the Japanese language
- Does not require a morphological analyzer at inference time
- Respects word boundaries to some extent (does not produce tokens that span multiple words, such as `の都合上` or `の判定負けを喫し`)
How to use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = 'globis-university/deberta-v3-japanese-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
```
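A quick way to check the tokenizer's behavior is to segment a sentence and run a forward pass. The sample sentence below is arbitrary, and the token-classification head is randomly initialized until the model is fine-tuned on a downstream task:

```python
import torch

# Illustrative only: segment an arbitrary Japanese sentence and inspect the tokens.
text = '日本語リソースで学習したモデルです。'
print(tokenizer.tokenize(text))

# Forward pass; logits have shape (1, sequence_length, num_labels).
inputs = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)
```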
Tokenizer
The tokenizer is trained using the method introduced by Kudo.
Key points include:
- No morphological analyzer is needed at inference time
- Tokens do not cross word boundaries (dictionary: unidic-cwj-202302)
- Easy to use with Hugging Face
- A vocabulary that is not too large

The original DeBERTa V3 is notable for its large vocabulary, but this inflates the embedding layer's share of the parameters (for the microsoft/deberta-v3-base model, the embedding layer accounts for 54% of the total), so this model adopts a smaller vocabulary instead; a rough calculation of that 54% figure is sketched below.
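A back-of-the-envelope estimate of that share, using the vocabulary size (128,100) and hidden size (768) from the microsoft/deberta-v3-base config together with its roughly 86M backbone parameters:

```python
# Rough estimate of the embedding layer's share of parameters for
# microsoft/deberta-v3-base (vocab_size=128100, hidden_size=768 from its config;
# ~86M backbone parameters per the DeBERTa V3 model card).
vocab_size = 128_100
hidden_size = 768
embedding_params = vocab_size * hidden_size            # ~98.4M
backbone_params = 86_000_000
share = embedding_params / (embedding_params + backbone_params)
print(f'{share:.0%}')                                  # ~53-54%
```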
Data
| Dataset Name | Notes | File Size (with metadata) | Factor |
|---|---|---|---|
| Wikipedia | 2023/07; WikiExtractor | 3.5GB | x2 |
| Wikipedia | 2023/07; cl-tohoku's method | 4.8GB | x2 |
| WikiBooks | 2023/07; cl-tohoku's method | 43MB | x2 |
| Aozora Bunko | 2023/07; globis-university/aozorabunko-clean | 496MB | x4 |
| CC-100 | ja | 90GB | x1 |
| mC4 | ja; extracted 10%, with Wikipedia-like focus via DSIR | 91GB | x1 |
| OSCAR 2023 | ja; extracted 10%, with Wikipedia-like focus via DSIR | 26GB | x1 |
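The Factor column appears to give the relative upsampling of each corpus. A minimal sketch of that kind of mixing with the datasets library (only globis-university/aozorabunko-clean is a real dataset ID from the table; the helper and split name are illustrative, not the exact pipeline used here):

```python
from datasets import load_dataset, concatenate_datasets

def upsample(ds, factor):
    """Repeat a dataset `factor` times so it is seen proportionally more often."""
    return concatenate_datasets([ds] * factor)

# Illustrative: load one of the listed corpora and apply its factor (x4).
aozora = load_dataset('globis-university/aozorabunko-clean', split='train')
parts = [upsample(aozora, 4)]
# ... Wikipedia (x2), CC-100 (x1), etc. would be added to `parts` here ...
corpus = concatenate_datasets(parts).shuffle(seed=42)
```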
Training parameters
- Number of devices: 8
- Batch size: 24 per device x 8 devices (192 total)
- Learning rate: 1.92e-4
- Maximum sequence length: 512
- Optimizer: AdamW
- Learning rate scheduler: Linear schedule with warmup
- Training steps: 1,000,000
- Warmup steps: 100,000
- Precision: Mixed (fp16)
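For reference, these settings map onto transformers `TrainingArguments` roughly as follows (a sketch only: DeBERTa V3 pre-training uses replaced token detection with a generator/discriminator pair, which the stock Trainer does not implement, and the output directory is a placeholder):

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters above; sequences are truncated/packed to 512
# tokens at tokenization time rather than via TrainingArguments.
args = TrainingArguments(
    output_dir='./pretrain',          # placeholder
    per_device_train_batch_size=24,   # 24 x 8 devices = 192 sequences per step
    learning_rate=1.92e-4,
    max_steps=1_000_000,
    warmup_steps=100_000,
    lr_scheduler_type='linear',       # linear schedule with warmup
    optim='adamw_torch',              # AdamW
    fp16=True,                        # mixed precision
)
```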
Evaluation
Scores are on JGLUE tasks: JSTS cells show Pearson/Spearman correlation, JSQuAD cells show EM/F1, and JNLI and JCQA (JCommonsenseQA) report accuracy.

| Model | #params | JSTS | JNLI | JSQuAD | JCQA |
|---|---|---|---|---|---|
| ≤ small | | | | | |
| izumi-lab/deberta-v2-small-japanese | 17.8M | 0.890/0.846 | 0.880 | - | 0.737 |
| globis-university/deberta-v3-japanese-xsmall | 33.7M | 0.916/0.880 | 0.913 | 0.869/0.938 | 0.821 |
| base | | | | | |
| cl-tohoku/bert-base-japanese-v3 | 111M | 0.919/0.881 | 0.907 | 0.880/0.946 | 0.848 |
| nlp-waseda/roberta-base-japanese | 111M | 0.913/0.873 | 0.895 | 0.864/0.927 | 0.840 |
| izumi-lab/deberta-v2-base-japanese | 110M | 0.919/0.882 | 0.912 | - | 0.859 |
| ku-nlp/deberta-v2-base-japanese | 112M | 0.922/0.886 | 0.922 | 0.899/0.951 | - |
| ku-nlp/deberta-v3-base-japanese | 160M | 0.927/0.891 | 0.927 | 0.896/- | - |
| globis-university/deberta-v3-japanese-base | 110M | 0.925/0.895 | 0.921 | 0.890/0.950 | 0.886 |
| large | | | | | |
| cl-tohoku/bert-large-japanese-v2 | 337M | 0.926/0.893 | 0.929 | 0.893/0.956 | 0.893 |
| nlp-waseda/roberta-large-japanese | 337M | 0.930/0.896 | 0.924 | 0.884/0.940 | 0.907 |
| nlp-waseda/roberta-large-japanese-seq512 | 337M | 0.926/0.892 | 0.926 | 0.918/0.963 | 0.891 |
| ku-nlp/deberta-v2-large-japanese | 339M | 0.925/0.892 | 0.924 | 0.912/0.959 | - |
| globis-university/deberta-v3-japanese-large | 352M | 0.928/0.896 | 0.924 | 0.896/0.956 | 0.900 |
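A minimal fine-tuning sketch for one of these tasks (JNLI, three-way classification). The shunk031/JGLUE dataset ID, its column names, and the hyperparameters are assumptions for illustration, not the exact setup used for the scores above:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = 'globis-university/deberta-v3-japanese-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Assumed dataset mirror and column names for JNLI (sentence-pair classification).
dataset = load_dataset('shunk031/JGLUE', name='JNLI')

def preprocess(batch):
    return tokenizer(batch['sentence1'], batch['sentence2'],
                     truncation=True, max_length=512)

encoded = dataset.map(preprocess, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='./jnli',        # placeholder
                           per_device_train_batch_size=32,
                           learning_rate=5e-5,
                           num_train_epochs=3),
    train_dataset=encoded['train'],
    eval_dataset=encoded['validation'],
    tokenizer=tokenizer,                               # enables dynamic padding
)
trainer.train()
print(trainer.evaluate())
```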
License
CC BY-SA 4.0
Acknowledgement
We used ABCI for computing resources. Thank you.