Model Card for japanese-spoken-language-bert
The Japanese README is available here.
These BERT models were pre-trained on written Japanese (Wikipedia) and then fine-tuned on spoken Japanese, using CSJ and the records of the Japanese Diet. CSJ (Corpus of Spontaneous Japanese) is provided by NINJAL (https://www.ninjal.ac.jp/). We provide only the model parameters; to use these models, you have to download the other config files.
We provide the following three models:
- 1-6 layer-wise (Folder Name: models/1-6_layer-wise): fine-tuned only the 1st-6th layers of the encoder on CSJ.
- TAPT512 60k (Folder Name: models/tapt512_60k): fine-tuned on CSJ.
- DAPT128-TAPT512 (Folder Name: models/dapt128-tap512): fine-tuned on the Japanese Diet records and CSJ.
Table of Contents
- Model Card for japanese-spoken-language-bert
- Table of Contents
- Model Details
- Training Details
- Evaluation
- Citation
- More Information
- Model Card Authors
- Model Card Contact
- How to Get Started with the Model
Model Details
Model Description
These BERT models were pre-trained on written Japanese (Wikipedia) and then fine-tuned on spoken Japanese, using CSJ and the records of the Japanese Diet. CSJ (Corpus of Spontaneous Japanese) is provided by NINJAL (https://www.ninjal.ac.jp/). We provide only the model parameters; to use these models, you have to download the other config files.
We provide the following three models:
- 1-6 layer-wise (Folder Name: models/1-6_layer-wise): fine-tuned only the 1st-6th layers of the encoder on CSJ.
- TAPT512 60k (Folder Name: models/tapt512_60k): fine-tuned on CSJ.
- DAPT128-TAPT512 (Folder Name: models/dapt128-tap512): fine-tuned on the Japanese Diet records and CSJ.
Model Information
- Model type: Language model
- Language(s) (NLP): ja
- License: Copyright (c) 2021 National Institute for Japanese Language and Linguistics and Retrieva, Inc. Licensed under the Apache License, Version 2.0 (the “License”)
Training Details
Training Data
- 1-6 layer-wise: CSJ
- TAPT512 60K: CSJ
- DAPT128-TAPT512: The Japanese Diet records and CSJ
Training Procedure
We continued training the pre-trained Japanese BERT model (cl-tohoku/bert-base-japanese-whole-word-masking; referred to below as "written BERT") on spoken Japanese.
For details, see the Japanese blog post or the Japanese paper.
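In effect, this continued training is further masked-language-model training of written BERT on spoken text. The sketch below illustrates such a setup with Hugging Face transformers; it is our own minimal illustration, and the placeholder data, hyperparameters, and output directory are assumptions rather than the settings used for the released checkpoints.

```python
# Minimal sketch of continued MLM training on spoken text (illustrative only).
# The Japanese tokenizer requires the fugashi and ipadic packages.
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the written-Japanese BERT released by Tohoku University.
model_name = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# `spoken_texts` stands in for CSJ / Diet-record sentences (not distributed here).
spoken_texts = ["えーっと、今日はその件について話します。"]
dataset = Dataset.from_dict({"text": spoken_texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

# Generic masked-language-modeling objective via the default collator.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```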
Evaluation
Testing Data, Factors & Metrics
Testing Data
We use CSJ for the evaluation.
Factors
We evaluate the following tasks on CSJ:
- Dependency Parsing
- Sentence Boundary
- Important Sentence Extraction
Metrics
- Dependency Parsing: Undirected Unlabeled Attachment Score (UUAS)
- Sentence Boundary: F1 Score
- Important Sentence Extraction: F1 Score
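For reference, UUAS counts a predicted dependency arc as correct if it connects the same pair of tokens as a gold arc, ignoring direction. The helper below is a minimal illustration of that definition and is not part of the released code.

```python
# Illustrative UUAS: fraction of gold dependency edges also present in the
# prediction when edge direction is ignored.
def uuas(gold_edges, pred_edges):
    """Each edge is a (head, dependent) pair of token indices for one sentence."""
    gold = {frozenset(e) for e in gold_edges}
    pred = {frozenset(e) for e in pred_edges}
    return len(gold & pred) / len(gold)

# Example: 2 of 3 undirected edges recovered -> UUAS ≈ 0.667
print(uuas([(0, 1), (1, 2), (2, 3)], [(1, 0), (2, 1), (3, 4)]))
```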
Results
| Model | Dependency Parsing | Sentence Boundary | Important Sentence Extraction |
|---|---|---|---|
| written BERT | 39.4 | 61.6 | 36.8 |
| 1-6 layer-wise | 44.6 | 64.8 | 35.4 |
| TAPT512 60K | - | - | 40.2 |
| DAPT128-TAPT512 | 42.9 | 64.0 | 39.7 |
Citation
BibTeX:
@inproceedings{csjbert2021,
title = {CSJを用いた日本語話し言葉BERTの作成},
author = {勝又智 and 坂田大直},
booktitle = {言語処理学会第27回年次大会},
year = {2021},
}
More Information
https://tech.retrieva.jp/entry/2021/04/01/114943 (In Japanese)
Model Card Authors
Satoru Katsumata
Model Card Contact
How to Get Started with the Model
Use the code below to get started with the model.
- Run download_wikipedia_bert.py to download the BERT model trained on Wikipedia.
python download_wikipedia_bert.py
This script downloads the config files and a vocab file provided by the Inui Laboratory of Tohoku University from the Hugging Face Model Hub: https://github.com/cl-tohoku/bert-japanese
- Run sample_mlm.py to confirm you can use our models.
python sample_mlm.py
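For orientation, the snippet below sketches how the downloaded parameters might be combined with written BERT's tokenizer and config for a fill-mask check, similar in spirit to sample_mlm.py. The checkpoint path and file name are assumptions; adapt them to wherever you placed the downloaded folder.

```python
# Sketch of loading one of the released checkpoints; the local path below is
# an assumption -- point it at the folder you downloaded (e.g. models/1-6_layer-wise).
import torch
from transformers import AutoTokenizer, BertForMaskedLM

# Tokenizer and config come from the written BERT released by Tohoku University.
base = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(base)
model = BertForMaskedLM.from_pretrained(base)

# Overwrite the weights with the spoken-language parameters provided here
# (file name assumed to be a standard PyTorch state dict).
state_dict = torch.load("models/1-6_layer-wise/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict, strict=False)
model.eval()

# Simple fill-mask check.
inputs = tokenizer("今日は[MASK]について話します。", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
print(tokenizer.decode(logits[0, mask_index].argmax(-1)))
```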