	---
	license: cc-by-sa-4.0
	datasets:
	- globis-university/aozorabunko-clean
	- oscar-corpus/OSCAR-2301
	- Wikipedia
	- WikiBooks
	- CC-100
	- mC4
	language:
	- ja
	---

	# What’s this?
	This is a [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-base) model pre-trained on Japanese resources.

	The model has the following features:

	- Based on the well-established [DeBERTa V3](https://huggingface.co/microsoft/deberta-v3-base) architecture
	- Specialized for the Japanese language
	- Does not require a morphological analyzer at inference time
	- Respects word boundaries to some extent (it does not produce tokens that span multiple words, such as `の都合上` or `の判定負けを喫し`)
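
To make the word-boundary property concrete, here is a minimal sketch of how one could check whether a tokenization crosses word boundaries. The function name `crossing_tokens` and the hand-written word segmentation are assumptions for illustration only; they are not output from this model's tokenizer or from a real morphological analyzer.

```python
from itertools import accumulate


def crossing_tokens(words, tokens):
    """Return the tokens that span more than one word.

    `words` is a reference word segmentation and `tokens` is a subword
    tokenization of the same text (both hypothetical inputs here).
    """
    if "".join(words) != "".join(tokens):
        raise ValueError("words and tokens must spell the same text")

    # Character offsets at which each word ends
    boundaries = set(accumulate(len(w) for w in words))

    crossing = []
    start = 0
    for token in tokens:
        end = start + len(token)
        # A boundary strictly inside (start, end) means the token covers
        # the end of one word and the beginning of the next
        if any(start < b < end for b in boundaries):
            crossing.append(token)
        start = end
    return crossing


# Hand-written segmentation, assumed for illustration
words = ["日程", "の", "都合", "上"]

print(crossing_tokens(words, ["日程", "の", "都合", "上"]))
print(crossing_tokens(words, ["日", "程", "の", "都合", "上"]))
print(crossing_tokens(words, ["日程", "の都合上"]))
```

Splitting a word into several tokens is allowed under this check; only tokens that merge parts of different words, such as `の都合上`, are reported.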

	# How to use
	```python
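	from transformers import AutoTokenizer, AutoModelForTokenClassification

	model_name = 'globis-university/deberta-v3-japanese-base'
	# The tokenizer needs no external morphological analyzer
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	# Loads the pre-trained encoder with a token-classification head on top;
	# the head is expected to be newly initialized, so fine-tune it before use
	model = AutoModelForTokenClassification.from_pretrained(model_name)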
	```

	# Tokenizer
	The tokenizer was trained using [the method introduced by Kudo](https://qiita.com/taku910/items/fbaeab4684665952d5a9).

	It was designed with the following points in mind:

	- No morphological analyzer is needed at inference time
	- Tokens do not cross word boundaries (as defined by `unidic-cwj-202302`)
	- Easy to use with Hugging Face
	- A vocabulary that is not too large

	The original DeBERTa V3 is notable for its large vocabulary, but a large vocabulary also makes the embedding layer's parameter count very large. This model therefore adopts a smaller vocabulary.
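
The vocabulary trade-off can be illustrated with a rough calculation: the token embedding matrix has `vocab_size * hidden_size` parameters. The sketch below uses assumed, illustrative numbers: a hidden size of 768, a vocabulary of about 128K for the original DeBERTa V3, and a hypothetical 32K vocabulary for comparison. None of these figures is stated in this README.

```python
def embedding_parameters(vocab_size: int, hidden_size: int) -> int:
    """Number of parameters in a token embedding matrix."""
    return vocab_size * hidden_size


HIDDEN_SIZE = 768          # assumption: a common base-size hidden dimension
ORIGINAL_VOCAB = 128_000   # assumption: approximate size of the original vocabulary
SMALLER_VOCAB = 32_000     # hypothetical smaller vocabulary, for illustration only

original = embedding_parameters(ORIGINAL_VOCAB, HIDDEN_SIZE)
smaller = embedding_parameters(SMALLER_VOCAB, HIDDEN_SIZE)

print(f"original embedding parameters: {original:,}")
print(f"smaller embedding parameters:  {smaller:,}")
print(f"difference:                    {original - smaller:,}")
```

Under these assumptions the embedding layer shrinks by a factor of four, which is the kind of saving the smaller vocabulary is meant to achieve.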

	# Data
	| Dataset Name | Notes | File Size (with metadata) | Factor |
	| ------------- | ----- | ------------------------- | ---------- |
	| Wikipedia | 2023/07; [WikiExtractor](https://github.com/attardi/wikiextractor) | 3.5GB | x2 |
	| Wikipedia | 2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) | 4.8GB | x2 |
	| WikiBooks | 2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) | 43MB | x2 |
	| Aozora Bunko | 2023/07; [globis-university/aozorabunko-clean](https://huggingface.co/datasets/globis-university/aozorabunko-clean) | 496MB | x4 |
	| CC-100 | ja | 90GB | x1 |
	| mC4 | ja; extracted 10% of Wikipedia-like data using [DSIR](https://arxiv.org/abs/2302.03169) | 91GB | x1 |
	| OSCAR 2023 | ja; extracted 20% of Wikipedia-like data using [DSIR](https://arxiv.org/abs/2302.03169) | 26GB | x1 |
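
As a rough illustration of how the factors shape the training mix, the sketch below multiplies each file size by its factor and reports the resulting share. It assumes that "Factor" is a repetition (upsampling) multiplier and that 1 GB = 1000 MB. Because the sizes include metadata, the shares are only approximate.

```python
# (name, file size in GB, factor) taken from the table above
datasets = [
    ("Wikipedia (WikiExtractor)", 3.5, 2),
    ("Wikipedia (cl-tohoku's method)", 4.8, 2),
    ("WikiBooks", 0.043, 2),
    ("Aozora Bunko", 0.496, 4),
    ("CC-100", 90.0, 1),
    ("mC4", 91.0, 1),
    ("OSCAR 2023", 26.0, 1),
]


def effective_sizes(rows):
    """Map each dataset name to size * factor (assumed upsampling)."""
    return {name: size_gb * factor for name, size_gb, factor in rows}


sizes = effective_sizes(datasets)
total = sum(sizes.values())

for name, gb in sizes.items():
    print(f"{name:32s} {gb:8.2f} GB  {gb / total:6.1%}")
print(f"{'total':32s} {total:8.2f} GB")
```

Even after upsampling, the curated sources (Wikipedia, WikiBooks, Aozora Bunko) remain a small fraction of the mix compared with the web-crawled corpora.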

	# Training parameters
	- Number of devices: 8
	- Batch size: 24 x 8
	- Learning rate: 1.92e-4
	- Maximum sequence length: 512
	- Optimizer: AdamW
	- Learning rate scheduler: Linear schedule with warmup
	- Training steps: 1,000,000
	- Warmup steps: 100,000
	- Precision: Mixed (fp16)
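
The learning rate schedule can be sketched in plain Python. This is a minimal illustration, assuming the rate rises linearly from 0 to the peak during warmup and then decays linearly to 0 at the final step (the same shape as `get_linear_schedule_with_warmup` in `transformers`). The README does not state which implementation was used. The sketch also assumes that "24 x 8" means 24 samples per device across 8 devices.

```python
PEAK_LR = 1.92e-4
WARMUP_STEPS = 100_000
TOTAL_STEPS = 1_000_000

# Assumption: 24 samples per device on 8 devices
GLOBAL_BATCH_SIZE = 24 * 8


def linear_warmup_lr(step, peak_lr=PEAK_LR,
                     warmup_steps=WARMUP_STEPS, total_steps=TOTAL_STEPS):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to the peak
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from the peak down to 0
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


print(f"global batch size: {GLOBAL_BATCH_SIZE}")
for step in (0, 50_000, 100_000, 550_000, 1_000_000):
    print(f"step {step:>9,}: lr = {linear_warmup_lr(step):.3e}")
```

With these settings the rate reaches its peak at step 100,000 and is back at half the peak at step 550,000, halfway through the decay phase.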

	# Evaluation
	| Model | JSTS | JNLI | JSQuAD | JCQA |
	| ----- | ---- | ---- | ------ | ---- |
	| ≤ small | | | | |
	| [izumi-lab/deberta-v2-small-japanese](https://huggingface.co/izumi-lab/deberta-v2-small-japanese) | 0.890/0.846 | 0.880 | - | 0.737 |
	| [globis-university/deberta-v3-japanese-xsmall](https://huggingface.co/globis-university/deberta-v3-japanese-xsmall) | 0.916/0.880 | 0.913 | 0.869/0.938 | 0.821 |
	| base | | | | |
	| [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) | 0.919/0.881 | 0.907 | 0.880/0.946 | 0.848 |
	| [nlp-waseda/roberta-base-japanese](https://huggingface.co/nlp-waseda/roberta-base-japanese) | 0.913/0.873 | 0.895 | 0.864/0.927 | 0.840 |
	| [izumi-lab/deberta-v2-base-japanese](https://huggingface.co/izumi-lab/deberta-v2-base-japanese) | 0.919/0.882 | 0.912 | - | 0.859 |
	| [ku-nlp/deberta-v2-base-japanese](https://huggingface.co/ku-nlp/deberta-v2-base-japanese) | 0.922/0.886 | 0.922 | 0.899/0.951 | - |
	| [ku-nlp/deberta-v3-base-japanese](https://huggingface.co/ku-nlp/deberta-v3-base-japanese) | 0.927/0.891 | 0.927 | 0.896/- | - |
	| [globis-university/deberta-v3-japanese-base](https://huggingface.co/globis-university/deberta-v3-japanese-base) | 0.925/0.895 | 0.921 | 0.890/0.950 | 0.886 |
	| large | | | | |
	| [cl-tohoku/bert-large-japanese-v2](https://huggingface.co/cl-tohoku/bert-large-japanese-v2) | 0.926/0.893 | 0.929 | 0.893/0.956 | 0.893 |
	| [roberta-large-japanese](https://huggingface.co/nlp-waseda/roberta-large-japanese) | 0.930/0.896 | 0.924 | 0.884/0.940 | 0.907 |
	| [roberta-large-japanese-seq512](https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512) | 0.926/0.892 | 0.926 | 0.918/0.963 | 0.891 |
	| [ku-nlp/deberta-v2-large-japanese](https://huggingface.co/ku-nlp/deberta-v2-large-japanese) | 0.925/0.892 | 0.924 | 0.912/0.959 | - |
	| [globis-university/deberta-v3-japanese-large](https://huggingface.co/globis-university/deberta-v3-japanese-large) | 0.928/0.896 | 0.924 | 0.896/0.956 | 0.900 |

	## License
	CC BY-SA 4.0

	## Acknowledgement
	We used [ABCI](https://abci.ai/) for computing resources. Thank you.