---
language: ja
thumbnail: https://github.com/rinnakk/japanese-gpt2/blob/master/rinna.png
tags:
- ja
- japanese
- roberta
- masked-lm
- nlp
license: mit
datasets:
- cc100
- wikipedia
widget:
- text: "[CLS]4年に1度[MASK]は開かれる。"
mask_token: "[MASK]"
---
# japanese-roberta-base
![rinna-icon](./rinna.png)
This repository provides a base-sized Japanese RoBERTa model. The model was trained using code from the GitHub repository [rinnakk/japanese-pretrained-models](https://github.com/rinnakk/japanese-pretrained-models) by [rinna Co., Ltd.](https://corp.rinna.co.jp/)
# How to load the model
*NOTE:* Use `T5Tokenizer` to initialize the tokenizer.
~~~~
from transformers import T5Tokenizer, RobertaForMaskedLM
tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True  # workaround for a bug in tokenizer config loading
model = RobertaForMaskedLM.from_pretrained("rinna/japanese-roberta-base")
~~~~
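As a quick sanity check after loading, the special token referenced below can be inspected. This is a minimal sketch; the exact attribute values come from the tokenizer config shipped with the model.
~~~~
# the mask token is used in the prediction example below, and
# do_lower_case should now be True after the workaround above
print(tokenizer.mask_token)     # expected: [MASK]
print(tokenizer.do_lower_case)  # expected: True
~~~~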
# How to use the model for masked token prediction
## Note 1: Use `[CLS]`
To predict a masked token, be sure to prepend a `[CLS]` token to the sentence so that the model encodes it correctly, since `[CLS]` was used during model training.
## Note 2: Use `[MASK]` after tokenization
A) Directly typing `[MASK]` in an input string and B) replacing a token with `[MASK]` after tokenization yield different token sequences, and thus different prediction results. It is more appropriate to use `[MASK]` after tokenization, as this is consistent with how the model was pretrained. However, the Hugging Face Inference API only supports typing `[MASK]` in the input string, which produces less robust predictions.
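To see the difference, here is a minimal sketch (using the `tokenizer` loaded above) that compares the two token sequences; the exact segmentation in approach A depends on the sentencepiece model, which is the point of this note.
~~~~
# A) type [MASK] directly in the input string
tokens_a = tokenizer.tokenize("[CLS]4年に1度[MASK]は開かれる。")

# B) tokenize the full sentence first, then replace one token with [MASK]
#    (this matches how the model was pretrained)
tokens_b = tokenizer.tokenize("[CLS]4年に1度オリンピックは開かれる。")
tokens_b[5] = tokenizer.mask_token  # index 5 is "オリンピック" (see the example below)

# the two sequences are generally not identical, hence the different predictions
print(tokens_a)
print(tokens_b)
~~~~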
## Example
Here is an example to illustrate how our model works as a masked language model. Notice the difference between running the following code example and running the Hugging Face Inference API.
~~~~
# original text
text = "4年に1度オリンピックは開かれる。"
# prepend [CLS]
text = "[CLS]" + text
# tokenize
tokens = tokenizer.tokenize(text)
print(tokens) # output: ['[CLS]', '▁4', '年に', '1', '度', 'オリンピック', 'は', '開かれる', '。']
# mask a token
masked_idx = 5
tokens[masked_idx] = tokenizer.mask_token
print(tokens) # output: ['[CLS]', '▁4', '年に', '1', '度', '[MASK]', 'は', '開かれる', '。']
# convert to ids
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids) # output: [4, 1602, 44, 24, 368, 6, 11, 21583, 8]
# convert to tensor
import torch
token_tensor = torch.tensor([token_ids])
# get the top 10 predictions of the masked token
model = model.eval()
with torch.no_grad():
    outputs = model(token_tensor)
predictions = outputs[0][0, masked_idx].topk(10)
for i, index_t in enumerate(predictions.indices):
    index = index_t.item()
    token = tokenizer.convert_ids_to_tokens([index])[0]
    print(i, token)
"""
0 ワールドカップ
1 フェスティバル
2 オリンピック
3 サミット
4 東京オリンピック
5 総会
6 全国大会
7 イベント
8 世界選手権
9 パーティー
"""
~~~~
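If probability scores are more convenient than raw logits, the logits at the masked position can be passed through a softmax first. The following is a minimal sketch that continues from the example above; the exact scores depend on the released weights.
~~~~
import torch

# convert the logits at the masked position into probabilities
probs = torch.softmax(outputs[0][0, masked_idx], dim=-1)

# top 10 candidates with their probabilities
top = probs.topk(10)
for score, index_t in zip(top.values, top.indices):
    token = tokenizer.convert_ids_to_tokens([index_t.item()])[0]
    print(f"{token}\t{score.item():.3f}")
~~~~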
# Model architecture
A 12-layer, 768-hidden-size transformer-based masked language model.
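These numbers can be read off the loaded configuration; a minimal sketch, assuming the `model` object loaded above:
~~~~
config = model.config
print(config.num_hidden_layers)    # 12
print(config.hidden_size)          # 768
print(config.num_attention_heads)  # attention heads per layer
print(config.vocab_size)           # model vocabulary size
~~~~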
# Training
The model was trained on [Japanese CC-100](http://data.statmt.org/cc-100/ja.txt.xz) and [Japanese Wikipedia](https://dumps.wikimedia.org/jawiki/) to optimize a masked language modelling objective on 8 V100 GPUs for around 15 days. It reaches ~3.9 perplexity on a dev set sampled from CC-100.
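For a masked language model, perplexity can be estimated as the exponential of the average cross-entropy loss over masked positions. Below is a minimal sketch of that computation on a single sentence, meant only to illustrate the metric (it is not the evaluation script behind the reported ~3.9 figure) and assuming a recent `transformers` version where the model output exposes `.loss`.
~~~~
import torch

# mask one token and compute the masked-LM loss; exp(loss) is the
# perplexity contribution of that single masked position
text = "[CLS]4年に1度オリンピックは開かれる。"
token_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

labels = [-100] * len(token_ids)            # -100 positions are ignored by the loss
masked_idx = 5
labels[masked_idx] = token_ids[masked_idx]  # only score the masked position
token_ids[masked_idx] = tokenizer.mask_token_id

with torch.no_grad():
    output = model(torch.tensor([token_ids]), labels=torch.tensor([labels]))
print(torch.exp(output.loss))
~~~~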
# Tokenization
The model uses a [sentencepiece](https://github.com/google/sentencepiece)-based tokenizer; the vocabulary was trained on Japanese Wikipedia using the official sentencepiece training script.
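In practice this means raw text is segmented directly into subword pieces (word-initial pieces carry the `▁` prefix, as in the example above) rather than into pre-tokenized words. A minimal sketch for inspecting the tokenizer loaded above:
~~~~
# vocabulary size of the sentencepiece model
print(len(tokenizer))

# subword segmentation of raw Japanese text (compare with the example above)
print(tokenizer.tokenize("4年に1度オリンピックは開かれる。"))
~~~~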
# License
[The MIT license](https://opensource.org/licenses/MIT)