metadata
license: cc-by-sa-4.0
datasets:
  - globis-university/aozorabunko-clean
  - oscar-corpus/OSCAR-2301
  - Wikipedia
  - WikiBooks
  - CC-100
  - mC4
language:
  - ja

What’s this?

This is a DeBERTa V3 model pre-trained on Japanese resources.

The model has the following features:

  • Based on the well-known DeBERTa V3 architecture
  • Specialized for Japanese
  • Does not require a morphological analyzer at inference time
  • Respects word boundaries to some extent (does not produce tokens that span multiple words, such as の都合上 or の判定負けを喫し)
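
To make the word-boundary property concrete, here is a minimal sketch of how one might check whether a tokenization crosses word boundaries. The function name and the word segmentation are hypothetical and hand-written for illustration; they are not taken from the model's tokenizer or from a real dictionary lookup.

```python
def tokens_crossing_word_boundaries(words, tokens):
    """Return the tokens that straddle a word boundary.

    `words` and `tokens` must both concatenate to the same string.
    """
    if "".join(words) != "".join(tokens):
        raise ValueError("words and tokens must cover the same text")

    # Character offsets at which a word ends, i.e. the word boundaries.
    boundaries = set()
    pos = 0
    for word in words:
        pos += len(word)
        boundaries.add(pos)

    crossing = []
    start = 0
    for token in tokens:
        end = start + len(token)
        # A token crosses a boundary if one falls strictly inside it.
        if any(start < b < end for b in boundaries):
            crossing.append(token)
        start = end
    return crossing


# Hand-written segmentation, for illustration only (an assumption, not analyzer output).
words = ["の", "都合", "上"]
multi_word = tokens_crossing_word_boundaries(words, ["の都合上"])
sub_word = tokens_crossing_word_boundaries(words, ["の", "都", "合", "上"])
print(multi_word)
print(sub_word)
```

A token that merely splits a word into smaller pieces is not flagged; only a token that contains a boundary in its interior is.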

How to use

from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = 'globis-university/deberta-v3-japanese-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

Tokenizer

The tokenizer was trained using the method introduced by Kudo.

It was designed with the following goals:

  • No morphological analyzer needed at inference time
  • Tokens do not cross word boundaries (dictionary: unidic-cwj-202302)
  • Easy to use with Hugging Face
  • A vocabulary that is not too large

The original DeBERTa V3 is characterized by a large vocabulary, but this makes the embedding layer very large: for the microsoft/deberta-v3-base model, the embedding layer accounts for 54% of all parameters. This model therefore adopts a smaller vocabulary.
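
The effect of vocabulary size on the embedding layer can be estimated with simple arithmetic. The sketch below is an approximation under stated assumptions: the figures for microsoft/deberta-v3-base (a vocabulary of roughly 128K tokens, hidden size 768, about 86M backbone parameters) are taken from that model's own card rather than from this document, and the 32,000-token vocabulary is a hypothetical value used only for comparison. The exact share depends on which parameters are counted.

```python
def embedding_share(vocab_size: int, hidden_size: int, backbone_params: int) -> float:
    """Return the fraction of parameters held by the token-embedding matrix."""
    embedding_params = vocab_size * hidden_size
    return embedding_params / (embedding_params + backbone_params)


# Approximate figures for microsoft/deberta-v3-base (assumed, see lead-in).
share_original = embedding_share(
    vocab_size=128_100, hidden_size=768, backbone_params=86_000_000
)

# Hypothetical smaller vocabulary with the same backbone, for comparison only.
share_small_vocab = embedding_share(
    vocab_size=32_000, hidden_size=768, backbone_params=86_000_000
)

print(f"large vocabulary:   {share_original:.1%}")
print(f"smaller vocabulary: {share_small_vocab:.1%}")
```

With these rough inputs the large-vocabulary share comes out a little over one half, in the same range as the 54% quoted above.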

Data

| Dataset Name | Notes | File Size (with metadata) | Factor |
| --- | --- | --- | --- |
| Wikipedia | 2023/07; WikiExtractor | 3.5GB | x2 |
| Wikipedia | 2023/07; cl-tohoku's method | 4.8GB | x2 |
| WikiBooks | 2023/07; cl-tohoku's method | 43MB | x2 |
| Aozora Bunko | 2023/07; globis-university/aozorabunko-clean | 496MB | x4 |
| CC-100 | ja | 90GB | x1 |
| mC4 | ja; extracted 10%, with Wikipedia-like focus via DSIR | 91GB | x1 |
| OSCAR 2023 | ja; extracted 10%, with Wikipedia-like focus via DSIR | 26GB | x1 |
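
To see how the factors shift the training mix, the sketch below multiplies each file size by its factor. It assumes that "Factor" is a repetition (upsampling) multiplier, which the table does not state explicitly, and that file size is a reasonable proxy for the amount of text.

```python
# (name, size in GB, factor) copied from the table; MB values converted to GB.
DATASETS = [
    ("Wikipedia (WikiExtractor)", 3.5, 2),
    ("Wikipedia (cl-tohoku's method)", 4.8, 2),
    ("WikiBooks", 0.043, 2),
    ("Aozora Bunko", 0.496, 4),
    ("CC-100", 90.0, 1),
    ("mC4", 91.0, 1),
    ("OSCAR 2023", 26.0, 1),
]


def effective_sizes(datasets):
    """Map each dataset name to size * factor (factor assumed to be a repetition count)."""
    return {name: size_gb * factor for name, size_gb, factor in datasets}


sizes = effective_sizes(DATASETS)
total_gb = sum(sizes.values())

for name, gb in sizes.items():
    print(f"{name:32s} {gb:8.3f} GB  {gb / total_gb:6.1%}")
print(f"{'total':32s} {total_gb:8.3f} GB")
```

Under this reading, the web-crawled corpora still dominate the mix even after the curated sources are upsampled.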

Training parameters

  • Number of devices: 8
  • Batch size: 24 x 8
  • Learning rate: 1.92e-4
  • Maximum sequence length: 512
  • Optimizer: AdamW
  • Learning rate scheduler: Linear schedule with warmup
  • Training steps: 1,000,000
  • Warmup steps: 100,000
  • Precision: Mixed (fp16)
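
The learning rate schedule implied by these parameters can be sketched in plain Python. This follows the usual linear-warmup, linear-decay formula (the same shape as `get_linear_schedule_with_warmup` in transformers); the actual training code is not shown in this document, so treat it as an illustration. Reading "24 x 8" as 24 sequences per device on 8 devices is likewise an assumption.

```python
PEAK_LR = 1.92e-4
WARMUP_STEPS = 100_000
TOTAL_STEPS = 1_000_000


def linear_schedule_with_warmup(step: int) -> float:
    """Learning rate at `step`: linear ramp up, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


# Assumed reading of "24 x 8": per-device batch size times number of devices.
effective_batch_size = 24 * 8

for step in (0, 50_000, 100_000, 550_000, 1_000_000):
    print(f"step {step:>9,d}: lr = {linear_schedule_with_warmup(step):.3e}")
print(f"effective batch size: {effective_batch_size}")
```

The rate peaks at the end of warmup (step 100,000) and reaches half the peak at step 550,000, the midpoint of the decay phase.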

Evaluation

| Model | #params | JSTS | JNLI | JSQuAD | JCQA |
| --- | --- | --- | --- | --- | --- |
| ≤ small | | | | | |
| izumi-lab/deberta-v2-small-japanese | 17.8M | 0.890/0.846 | 0.880 | - | 0.737 |
| globis-university/deberta-v3-japanese-xsmall | 33.7M | 0.916/0.880 | 0.913 | 0.869/0.938 | 0.821 |
| base | | | | | |
| cl-tohoku/bert-base-japanese-v3 | 111M | 0.919/0.881 | 0.907 | 0.880/0.946 | 0.848 |
| nlp-waseda/roberta-base-japanese | 111M | 0.913/0.873 | 0.895 | 0.864/0.927 | 0.840 |
| izumi-lab/deberta-v2-base-japanese | 110M | 0.919/0.882 | 0.912 | - | 0.859 |
| ku-nlp/deberta-v2-base-japanese | 112M | 0.922/0.886 | 0.922 | 0.899/0.951 | - |
| ku-nlp/deberta-v3-base-japanese | 160M | 0.927/0.891 | 0.927 | 0.896/- | - |
| globis-university/deberta-v3-japanese-base | 110M | 0.925/0.895 | 0.921 | 0.890/0.950 | 0.886 |
| large | | | | | |
| cl-tohoku/bert-large-japanese-v2 | 337M | 0.926/0.893 | 0.929 | 0.893/0.956 | 0.893 |
| nlp-waseda/roberta-large-japanese | 337M | 0.930/0.896 | 0.924 | 0.884/0.940 | 0.907 |
| nlp-waseda/roberta-large-japanese-seq512 | 337M | 0.926/0.892 | 0.926 | 0.918/0.963 | 0.891 |
| ku-nlp/deberta-v2-large-japanese | 339M | 0.925/0.892 | 0.924 | 0.912/0.959 | - |
| globis-university/deberta-v3-japanese-large | 352M | 0.928/0.896 | 0.924 | 0.896/0.956 | 0.900 |

License

CC BY-SA 4.0

Acknowledgement

We used ABCI for computing resources. Thank you.