metadata

language: ja
thumbnail: https://github.com/rinnakk/japanese-gpt2/blob/master/rinna.png
tags:
  - ja
  - japanese
  - roberta
  - masked-lm
  - nlp
license: mit
datasets:
  - cc100
  - wikipedia
widget:
  - text: '[CLS]4年に1度[MASK]は開かれる。'
mask_token: '[MASK]'

japanese-roberta-base

This repository provides a base-sized Japanese RoBERTa model. The model was trained using code from the GitHub repository rinnakk/japanese-pretrained-models by rinna Co., Ltd.

How to load the model

NOTE: Use T5Tokenizer to initialize the tokenizer.

from transformers import T5Tokenizer, RobertaForMaskedLM

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True  # set explicitly to work around a bug in tokenizer config loading

model = RobertaForMaskedLM.from_pretrained("rinna/japanese-roberta-base")

How to use the model for masked token prediction

Note 1: Use `[CLS]`

To predict a masked token, prepend a [CLS] token to the sentence so that the model encodes it correctly, because this token was used during model training.

Note 2: Use `[MASK]` after tokenization

A) Typing [MASK] directly in an input string and B) replacing a token with [MASK] after tokenization yield different token sequences, and thus different prediction results. Using [MASK] after tokenization is more appropriate because it is consistent with how the model was pretrained. However, the Hugging Face Inference API only supports typing [MASK] in the input string, and it produces less robust predictions.
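The difference between the two approaches can be checked by tokenizing the same sentence both ways. The helper below is a minimal sketch: `compare_mask_strategies` is a hypothetical name, not part of transformers, and it assumes the word to mask is emitted by the tokenizer as a single token.

```python
def compare_mask_strategies(tokenizer, text, target):
    """Return (typed, replaced) token lists for the two masking strategies.

    `tokenizer` is any object exposing `tokenize(str)` and `mask_token`,
    such as the T5Tokenizer loaded earlier. `target` is assumed to be a
    substring of `text` that the tokenizer emits as exactly one token.
    """
    mask = tokenizer.mask_token

    # A) type the mask into the raw string, then tokenize
    typed = tokenizer.tokenize(text.replace(target, mask, 1))

    # B) tokenize first, then swap the target token for the mask
    replaced = tokenizer.tokenize(text)
    idx = replaced.index(target)  # raises ValueError if target is not a single token
    replaced[idx] = mask

    return typed, replaced
```

With the tokenizer loaded above, calling `compare_mask_strategies(tokenizer, "[CLS]4年に1度オリンピックは開かれる。", "オリンピック")` returns both sequences so they can be printed side by side.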

Example

Here is an example that illustrates how our model works as a masked language model. Notice the difference between running the following code example and running the Hugging Face Inference API.

# original text
text = "4年に1度オリンピックは開かれる。"

# prepend [CLS]
text = "[CLS]" + text

# tokenize
tokens = tokenizer.tokenize(text)
print(tokens)  # output: ['[CLS]', '▁4', '年に', '1', '度', 'オリンピック', 'は', '開かれる', '。']

# mask a token
masked_idx = 5  # position of 'オリンピック' in the token list above
tokens[masked_idx] = tokenizer.mask_token
print(tokens)  # output: ['[CLS]', '▁4', '年に', '1', '度', '[MASK]', 'は', '開かれる', '。']

# convert to ids
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)  # output: [4, 1602, 44, 24, 368, 6, 11, 21583, 8]

# convert to tensor
import torch
token_tensor = torch.tensor([token_ids])

# get the top 10 predictions of the masked token
model = model.eval()
with torch.no_grad():
    outputs = model(token_tensor)
    predictions = outputs[0][0, masked_idx].topk(10)

for i, index_t in enumerate(predictions.indices):
    index = index_t.item()
    token = tokenizer.convert_ids_to_tokens([index])[0]
    print(i, token)

"""
0 ワールドカップ
1 フェスティバル
2 オリンピック
3 サミット
4 東京オリンピック
5 総会
6 全国大会
7 イベント
8 世界選手権
9 パーティー
"""

Model architecture

A 12-layer, 768-hidden-size transformer-based masked language model.

Training

The model was trained on Japanese CC-100 and Japanese Wikipedia to optimize a masked language modelling objective on 8 V100 GPUs for around 15 days. It reaches a perplexity of about 3.9 on a dev set sampled from CC-100.
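This card does not state how the dev-set perplexity was computed. Assuming the usual definition, the exponential of the mean negative log-likelihood (in nats) over the predicted tokens, the sketch below shows how a perplexity of about 3.9 relates to the average loss. The function name `perplexity` and the sample values are illustrative only.

```python
import math

def perplexity(token_nlls):
    """Perplexity as exp(mean negative log-likelihood), with losses in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Under this definition, a perplexity of 3.9 corresponds to an average
# loss of ln(3.9), roughly 1.36 nats per predicted token.
avg_loss = math.log(3.9)
print(round(avg_loss, 2))                        # 1.36
print(round(perplexity([avg_loss] * 4), 2))      # 3.9
```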

Tokenization

The model uses a sentencepiece-based tokenizer. The vocabulary was trained on Japanese Wikipedia using the official sentencepiece training script.

License

The MIT license