---
language: ja
thumbnail: https://github.com/rinnakk/japanese-gpt2/blob/master/rinna.png
tags:
- ja
- japanese
- roberta
- masked-lm
- nlp
license: mit
datasets:
- cc100
- wikipedia
widget:
- text: "[CLS]4年に1度[MASK]は開かれる。"
mask_token: "[MASK]"
---

# japanese-roberta-base

![rinna-icon](./rinna.png)

This repository provides a base-sized Japanese RoBERTa model. The model was trained using code from the GitHub repository [rinnakk/japanese-pretrained-models](https://github.com/rinnakk/japanese-pretrained-models) by [rinna Co., Ltd.](https://corp.rinna.co.jp/)

# How to load the model

*NOTE:* Use `T5Tokenizer` to instantiate the tokenizer.

~~~~
from transformers import T5Tokenizer, RobertaForMaskedLM

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True  # workaround for a bug in tokenizer config loading

model = RobertaForMaskedLM.from_pretrained("rinna/japanese-roberta-base")
~~~~

# How to use the model for masked token prediction

## Note 1: Use `[CLS]`

To predict a masked token, be sure to prepend a `[CLS]` token to the sentence so the model encodes it correctly, since `[CLS]` was used during model training.

## Note 2: Use `[MASK]` after tokenization

A) Typing `[MASK]` directly in the input string and B) replacing a token with `[MASK]` after tokenization yield different token sequences, and thus different prediction results. It is more appropriate to use `[MASK]` after tokenization, as this is consistent with how the model was pretrained. However, the Hugging Face Inference API only supports typing `[MASK]` in the input string, which produces less robust predictions. The sketch below illustrates the difference.
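
A minimal sketch of the difference, assuming `tokenizer` has been loaded as in the section above; the exact token sequences depend on the tokenizer version, so no outputs are shown here:

~~~~
text = "4年に1度オリンピックは開かれる。"

# A) type [MASK] directly inside the input string
tokens_a = tokenizer.tokenize("[CLS]4年に1度[MASK]は開かれる。")

# B) tokenize the original sentence first, then replace one token with [MASK]
tokens_b = tokenizer.tokenize("[CLS]" + text)
tokens_b[5] = tokenizer.mask_token  # position of "オリンピック" in the tokenized sentence

print(tokens_a)
print(tokens_b)  # the two sequences generally differ, and so do the resulting predictions
~~~~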

## Example

Here is an example to illustrate how our model works as a masked language model. Notice the difference between running the following code example and running the Hugging Face Inference API.

~~~~
# original text
text = "4年に1度オリンピックは開かれる。"

# prepend [CLS]
text = "[CLS]" + text

# tokenize
tokens = tokenizer.tokenize(text)
print(tokens)  # output: ['[CLS]', '▁4', '年に', '1', '度', 'オリンピック', 'は', '開かれる', '。']

# mask a token
masked_idx = 5
tokens[masked_idx] = tokenizer.mask_token
print(tokens)  # output: ['[CLS]', '▁4', '年に', '1', '度', '[MASK]', 'は', '開かれる', '。']

# convert to ids
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)  # output: [4, 1602, 44, 24, 368, 6, 11, 21583, 8]

# convert to tensor
import torch
token_tensor = torch.tensor([token_ids])

# get the top 10 predictions of the masked token
model = model.eval()
with torch.no_grad():
    outputs = model(token_tensor)
    predictions = outputs[0][0, masked_idx].topk(10)

for i, index_t in enumerate(predictions.indices):
    index = index_t.item()
    token = tokenizer.convert_ids_to_tokens([index])[0]
    print(i, token)

"""
0 ワールドカップ
1 フェスティバル
2 オリンピック
3 サミット
4 東京オリンピック
5 総会
6 全国大会
7 イベント
8 世界選手権
9 パーティー
"""
~~~~

# Model architecture
A 12-layer, 768-hidden-size transformer-based masked language model.
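
These numbers can be checked against the released checkpoint by reading them off the loaded config; a small sketch reusing `model` from the loading example above:

~~~~
# architecture hyperparameters stored in the checkpoint's config
print(model.config.num_hidden_layers)    # number of transformer layers
print(model.config.hidden_size)          # hidden size
print(model.config.num_attention_heads)  # number of attention heads
~~~~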

# Training
The model was trained on [Japanese CC-100](http://data.statmt.org/cc-100/ja.txt.xz) and [Japanese Wikipedia](https://dumps.wikimedia.org/jawiki/) to optimize a masked language modelling objective on 8 V100 GPUs for around 15 days. It reaches ~3.9 perplexity on a dev set sampled from CC-100.
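
As a rough, illustrative probe only, a pseudo-perplexity can be computed by masking one token at a time. This is a sketch reusing `tokenizer` and `model` from above; the toy sentences are placeholders, and this metric is not the one used for the reported ~3.9 figure:

~~~~
import math
import torch

# toy sentences for illustration only
sentences = ["4年に1度オリンピックは開かれる。", "日本の首都は東京である。"]

total_nll, total_count = 0.0, 0
model.eval()
for text in sentences:
    token_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("[CLS]" + text))
    # mask each position after [CLS] in turn and score the original token
    for pos in range(1, len(token_ids)):
        masked_ids = list(token_ids)
        target_id = masked_ids[pos]
        masked_ids[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(torch.tensor([masked_ids]))[0]
        log_probs = torch.log_softmax(logits[0, pos], dim=-1)
        total_nll -= log_probs[target_id].item()
        total_count += 1

print("pseudo-perplexity:", math.exp(total_nll / total_count))
~~~~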

# Tokenization
The model uses a [sentencepiece](https://github.com/google/sentencepiece)-based tokenizer; the vocabulary was trained on Japanese Wikipedia using the official sentencepiece training script.
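
For reference, here is a hypothetical sketch of how such a vocabulary could be trained with the official sentencepiece trainer; the corpus path, vocabulary size, and other settings are placeholders, not the exact values used for this model:

~~~~
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="jawiki_sentences.txt",  # hypothetical path to extracted Wikipedia text, one sentence per line
    model_prefix="spiece",
    vocab_size=32000,              # illustrative; check tokenizer.vocab_size for the released model
    character_coverage=0.9995,     # a common setting for Japanese corpora
    model_type="unigram",
)
~~~~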

# License
[The MIT license](https://opensource.org/licenses/MIT)