metadata

license: cc-by-sa-4.0
datasets:
  - globis-university/aozorabunko-clean
  - oscar-corpus/OSCAR-2301
  - Wikipedia
  - WikiBooks
  - CC-100
  - mC4
language:
  - ja

What’s this?

This is a DeBERTa V3 model pre-trained on Japanese resources.

The model has the following features:

Based on the well-established DeBERTa V3 architecture
Specialized for the Japanese language
Does not use a morphological analyzer during inference
Respects word boundaries to some extent (does not produce tokens that span multiple words, such as の都合上 or の判定負けを喫し)
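
To make the word-boundary property concrete, the following is a minimal sketch of how one might check whether a tokenization crosses word boundaries. The helper names and the word segmentation of the sample phrase are illustrative assumptions, not part of this model or its tokenizer.

```python
from itertools import accumulate


def boundary_offsets(pieces):
    """Return the set of character offsets at which each piece ends."""
    return set(accumulate(len(p) for p in pieces))


def crossing_tokens(words, tokens):
    """Return the tokens that span more than one word.

    `words` and `tokens` must concatenate to the same text.
    """
    if "".join(words) != "".join(tokens):
        raise ValueError("words and tokens must cover the same text")
    word_ends = boundary_offsets(words)
    crossing = []
    start = 0
    for token in tokens:
        end = start + len(token)
        # A token crosses a boundary if some word ends strictly inside it.
        if any(start < offset < end for offset in word_ends):
            crossing.append(token)
        start = end
    return crossing


# Illustrative word segmentation (an assumption, not actual analyzer output).
words = ["日程", "の", "都合", "上"]

print(crossing_tokens(words, ["日程", "の都合上"]))          # ['の都合上']
print(crossing_tokens(words, ["日程", "の", "都合", "上"]))  # []
```

Subword pieces that stay inside a single word (for example splitting 都合 into 都 and 合) are not reported, since only tokens that straddle a word boundary count as crossing.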

How to use

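from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the tokenizer and the pre-trained weights from the Hugging Face Hub.
model_name = 'globis-university/deberta-v3-japanese-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The token-classification head is newly initialized; fine-tune it before use.
model = AutoModelForTokenClassification.from_pretrained(model_name)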

Tokenizer

The tokenizer is trained using the method introduced by Kudo.

It was designed with the following goals:

No morphological analyzer needed during inference
Tokens do not cross word boundaries (dictionary: unidic-cwj-202302)
Easy to use with Hugging Face
A vocabulary that is not too large

The original DeBERTa V3 is notable for its large vocabulary, but this makes the embedding layer very large: in microsoft/deberta-v3-base, the embedding layer accounts for 54% of all parameters. This model therefore adopts a smaller vocabulary.
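
The embedding-layer share can be estimated with simple arithmetic. The sketch below is a rough estimate under stated assumptions: the vocabulary size (128,100), hidden size (768), and backbone parameter count (about 86M) are figures recalled from the microsoft/deberta-v3-base model card, not from this document, and the 32,000-token vocabulary is a purely hypothetical value for comparison.

```python
# Assumed figures for microsoft/deberta-v3-base (not stated in this document).
VOCAB_SIZE = 128_100
HIDDEN_SIZE = 768
BACKBONE_PARAMS = 86_000_000  # approximate, excludes the embedding matrix


def embedding_share(vocab_size, hidden_size, backbone_params):
    """Fraction of all parameters held by the word-embedding matrix."""
    embedding_params = vocab_size * hidden_size
    return embedding_params / (embedding_params + backbone_params)


large_vocab = embedding_share(VOCAB_SIZE, HIDDEN_SIZE, BACKBONE_PARAMS)
# Hypothetical smaller vocabulary, for comparison only.
small_vocab = embedding_share(32_000, HIDDEN_SIZE, BACKBONE_PARAMS)

print(f"large vocabulary: {large_vocab:.1%} of parameters in embeddings")
print(f"small vocabulary: {small_vocab:.1%} of parameters in embeddings")
```

With these approximate inputs the large-vocabulary share comes out slightly above half, in line with the 54% figure quoted above; the exact value depends on how the backbone parameters are counted.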

Data

Dataset Name	Notes	File Size (with metadata)	Factor
Wikipedia	2023/07; WikiExtractor	3.5GB	x2
Wikipedia	2023/07; cl-tohoku's method	4.8GB	x2
WikiBooks	2023/07; cl-tohoku's method	43MB	x2
Aozora Bunko	2023/07; globis-university/aozorabunko-clean	496MB	x4
CC-100	ja	90GB	x1
mC4	ja; extracted 10%, with Wikipedia-like focus via DSIR	91GB	x1
OSCAR 2023	ja; extracted 10%, with Wikipedia-like focus via DSIR	26GB	x1
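
As a rough illustration of how the factors shift the data mix, the sketch below multiplies each listed file size by its factor. It assumes that "Factor" means the number of times a corpus is repeated in the training data, which the table does not state explicitly, and it uses file sizes (which include metadata) as a stand-in for the amount of text.

```python
# (name, file size in GB as listed in the table, factor)
CORPORA = [
    ("Wikipedia (WikiExtractor)", 3.5, 2),
    ("Wikipedia (cl-tohoku's method)", 4.8, 2),
    ("WikiBooks", 0.043, 2),
    ("Aozora Bunko", 0.496, 4),
    ("CC-100", 90.0, 1),
    ("mC4", 91.0, 1),
    ("OSCAR 2023", 26.0, 1),
]


def effective_sizes(corpora):
    """Map each corpus name to its size after applying the factor."""
    return {name: size * factor for name, size, factor in corpora}


sizes = effective_sizes(CORPORA)
total = sum(sizes.values())

for name, gb in sizes.items():
    print(f"{name}: {gb:.2f} GB ({gb / total:.1%} of the mix)")
print(f"total: {total:.2f} GB")
```

Under this reading, the curated sources (Wikipedia, WikiBooks, Aozora Bunko) remain a small fraction of the mix even after upsampling, while the web corpora dominate.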

Training parameters

Number of devices: 8
Batch size: 24 x 8
Learning rate: 1.92e-4
Maximum sequence length: 512
Optimizer: AdamW
Learning rate scheduler: Linear schedule with warmup
Training steps: 1,000,000
Warmup steps: 100,000
Precision: Mixed (fp16)
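
The schedule implied by these settings can be sketched in plain Python. This is an illustration, not the training code: it assumes the learning rate decays linearly to zero after warmup (as `get_linear_schedule_with_warmup` in transformers does) and that "24 x 8" means 24 sequences per device across 8 devices.

```python
PEAK_LR = 1.92e-4
WARMUP_STEPS = 100_000
TOTAL_STEPS = 1_000_000

# Assumed reading of "24 x 8": per-device batch size times number of devices.
GLOBAL_BATCH_SIZE = 24 * 8


def learning_rate(step):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 50_000, 100_000, 550_000, 1_000_000):
    print(f"step {step:>9,}: lr = {learning_rate(step):.3e}")
print(f"global batch size: {GLOBAL_BATCH_SIZE}")
```

The rate peaks at step 100,000 and falls back to half the peak at step 550,000, the midpoint of the decay phase.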

Evaluation

Model	#params	JSTS (Pearson/Spearman)	JNLI (acc)	JSQuAD (EM/F1)	JCQA (acc)
≤ small
izumi-lab/deberta-v2-small-japanese	17.8M	0.890/0.846	0.880	-	0.737
globis-university/deberta-v3-japanese-xsmall	33.7M	0.916/0.880	0.913	0.869/0.938	0.821
base
cl-tohoku/bert-base-japanese-v3	111M	0.919/0.881	0.907	0.880/0.946	0.848
nlp-waseda/roberta-base-japanese	111M	0.913/0.873	0.895	0.864/0.927	0.840
izumi-lab/deberta-v2-base-japanese	110M	0.919/0.882	0.912	-	0.859
ku-nlp/deberta-v2-base-japanese	112M	0.922/0.886	0.922	0.899/0.951	-
ku-nlp/deberta-v3-base-japanese	160M	0.927/0.891	0.927	0.896/-	-
globis-university/deberta-v3-japanese-base	110M	0.925/0.895	0.921	0.890/0.950	0.886
large
cl-tohoku/bert-large-japanese-v2	337M	0.926/0.893	0.929	0.893/0.956	0.893
nlp-waseda/roberta-large-japanese	337M	0.930/0.896	0.924	0.884/0.940	0.907
nlp-waseda/roberta-large-japanese-seq512	337M	0.926/0.892	0.926	0.918/0.963	0.891
ku-nlp/deberta-v2-large-japanese	339M	0.925/0.892	0.924	0.912/0.959	-
globis-university/deberta-v3-japanese-large	352M	0.928/0.896	0.924	0.896/0.956	0.900

License

CC BY-SA 4.0

Acknowledgement

This model was trained using ABCI computing resources. Thank you.