Model Card for japanese-spoken-language-bert
The Japanese README is available here.
These BERT models were pre-trained on written Japanese (Wikipedia) and then fine-tuned on spoken Japanese, using CSJ and the records of the Japanese Diet. CSJ (Corpus of Spontaneous Japanese) is provided by NINJAL (https://www.ninjal.ac.jp/). We provide only the model parameters; to use these models, you have to download the other config files.
We provide the following three models:
- 1-6 layer-wise (Folder Name: models/1-6_layer-wise): fine-tuned only the 1st-6th layers of the encoder on CSJ.
- TAPT512 60k (Folder Name: models/tapt512_60k): fine-tuned on CSJ.
- DAPT128-TAPT512 (Folder Name: models/dapt128-tap512): fine-tuned on the Japanese Diet records and CSJ.
Table of Contents
- Model Card for japanese-spoken-language-bert
- Table of Contents
- Model Details
- Training Details
- Evaluation
- Citation
- More Information
- Model Card Authors
- Model Card Contact
- How to Get Started with the Model
Model Details
Model Description
These BERT models were pre-trained on written Japanese (Wikipedia) and then fine-tuned on spoken Japanese, using CSJ and the records of the Japanese Diet. CSJ (Corpus of Spontaneous Japanese) is provided by NINJAL (https://www.ninjal.ac.jp/). We provide only the model parameters; to use these models, you have to download the other config files.
We provide the following three models:
- 1-6 layer-wise (Folder Name: models/1-6_layer-wise): fine-tuned only the 1st-6th layers of the encoder on CSJ.
- TAPT512 60k (Folder Name: models/tapt512_60k): fine-tuned on CSJ.
- DAPT128-TAPT512 (Folder Name: models/dapt128-tap512): fine-tuned on the Japanese Diet records and CSJ.
Model Information
- Model type: Language model
- Language(s) (NLP): ja
- License: Copyright (c) 2021 National Institute for Japanese Language and Linguistics and Retrieva, Inc. Licensed under the Apache License, Version 2.0 (the “License”)
Training Details
Training Data
- 1-6 layer-wise: CSJ
- TAPT512 60K: CSJ
- DAPT128-TAPT512: The Japanese Diet records and CSJ
Training Procedure
We continued training the pre-trained Japanese BERT model (cl-tohoku/bert-base-japanese-whole-word-masking; referred to below as "written BERT") on spoken Japanese.
For details, see the Japanese blog post or the Japanese paper.
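In effect, this continued training is further masked-language-model training of written BERT on spoken text. The sketch below illustrates such a setup with Hugging Face transformers; it is our own minimal illustration, and the placeholder data, hyperparameters, and output directory are assumptions rather than the settings used for the released checkpoints.

```python
# Minimal sketch of continued MLM training on spoken text (illustrative only).
# The Japanese tokenizer requires the fugashi and ipadic packages.
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the written-Japanese BERT released by Tohoku University.
model_name = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# `spoken_texts` stands in for CSJ / Diet-record sentences (not distributed here).
spoken_texts = ["えーっと、今日はその件について話します。"]
dataset = Dataset.from_dict({"text": spoken_texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

# Generic masked-language-modeling objective via the default collator.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```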
Evaluation
Testing Data, Factors & Metrics
Testing Data
We use CSJ for the evaluation.
Factors
We evaluate the following tasks on CSJ:
- Dependency Parsing
- Sentence Boundary
- Important Sentence Extraction
Metrics
- Dependency Parsing: Undirected Unlabeled Attachment Score (UUAS)
- Sentence Boundary: F1 Score
- Important Sentence Extraction: F1 Score
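For reference, UUAS counts a predicted dependency arc as correct if it connects the same pair of tokens as a gold arc, ignoring direction. The helper below is a minimal illustration of that definition and is not part of the released code.

```python
# Illustrative UUAS: fraction of gold dependency edges also present in the
# prediction when edge direction is ignored.
def uuas(gold_edges, pred_edges):
    """Each edge is a (head, dependent) pair of token indices for one sentence."""
    gold = {frozenset(e) for e in gold_edges}
    pred = {frozenset(e) for e in pred_edges}
    return len(gold & pred) / len(gold)

# Example: 2 of 3 undirected edges recovered -> UUAS ≈ 0.667
print(uuas([(0, 1), (1, 2), (2, 3)], [(1, 0), (2, 1), (3, 4)]))
```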
Results
| Model | Dependency Parsing | Sentence Boundary | Important Sentence Extraction |
|---|---|---|---|
| written BERT | 39.4 | 61.6 | 36.8 |
| 1-6 layer-wise | 44.6 | 64.8 | 35.4 |
| TAPT512 60K | - | - | 40.2 |
| DAPT128-TAPT512 | 42.9 | 64.0 | 39.7 |
Citation
BibTeX:
@inproceedings{csjbert2021,
title = {CSJを用いた日本語話し言葉BERTの作成},
author = {勝又智 and 坂田大直},
booktitle = {言語処理学会第27回年次大会},
year = {2021},
}
More Information
https://tech.retrieva.jp/entry/2021/04/01/114943 (In Japanese)
Model Card Authors
Satoru Katsumata
Model Card Contact
How to Get Started with the Model
Use the code below to get started with the model.
- Run download_wikipedia_bert.py to download the BERT model trained on Wikipedia.
python download_wikipedia_bert.py
This script downloads the config files and a vocab file provided by the Inui Laboratory of Tohoku University from the Hugging Face Model Hub: https://github.com/cl-tohoku/bert-japanese
- Run sample_mlm.py to confirm you can use our models.
python sample_mlm.py
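For orientation, the snippet below sketches how the downloaded parameters might be combined with written BERT's tokenizer and config for a fill-mask check, similar in spirit to sample_mlm.py. The checkpoint path and file name are assumptions; adapt them to wherever you placed the downloaded folder.

```python
# Sketch of loading one of the released checkpoints; the local path below is
# an assumption -- point it at the folder you downloaded (e.g. models/1-6_layer-wise).
import torch
from transformers import AutoTokenizer, BertForMaskedLM

# Tokenizer and config come from the written BERT released by Tohoku University.
base = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(base)
model = BertForMaskedLM.from_pretrained(base)

# Overwrite the weights with the spoken-language parameters provided here
# (file name assumed to be a standard PyTorch state dict).
state_dict = torch.load("models/1-6_layer-wise/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict, strict=False)
model.eval()

# Simple fill-mask check.
inputs = tokenizer("今日は[MASK]について話します。", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
print(tokenizer.decode(logits[0, mask_index].argmax(-1)))
```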