Model card for model ID

This is a T5 v1.1 model, pre-trained on a Japanese corpus.

Model details

T5 is a Transformer-based encoder-decoder model. Version 1.1 includes the following improvements over the original T5:

GEGLU activation in the feed-forward hidden layer, rather than ReLU (see https://arxiv.org/abs/2002.05202).
Dropout was turned off in pre-training (quality win). Dropout should be re-enabled during fine-tuning.
No parameter sharing between the embedding and classifier layers.
"xl" and "xxl" replace "3B" and "11B". The model shapes are slightly different: larger d_model, and smaller num_heads and d_ff.
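
To make the first item concrete, here is a minimal pure-Python sketch of a GEGLU feed-forward layer next to a plain ReLU one. It is an illustration only, not the code used to train this model: the helper names (`gelu_tanh`, `matvec`, `geglu_ffn`, `relu_ffn`) are hypothetical, the tanh approximation of GELU is an assumption, and bias terms are omitted.

```python
import math


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU (assumed variant, for illustration)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def matvec(x, w):
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def relu_ffn(x, w_in, w_out):
    """Original T5 style feed-forward: ReLU(x W_in) W_out."""
    hidden = [max(0.0, v) for v in matvec(x, w_in)]
    return matvec(hidden, w_out)


def geglu_ffn(x, w_gate, w_up, w_out):
    """GEGLU feed-forward: (GELU(x W_gate) * (x W_up)) W_out."""
    gate = [gelu_tanh(v) for v in matvec(x, w_gate)]
    up = matvec(x, w_up)
    # Element-wise product of the gated branch and the linear branch.
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(hidden, w_out)


if __name__ == "__main__":
    x = [1.0, -2.0]
    w_gate = [[0.5, 0.1, -0.3], [0.2, -0.4, 0.6]]
    w_up = [[0.3, -0.2, 0.1], [0.7, 0.5, -0.1]]
    w_out = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
    print(geglu_ffn(x, w_gate, w_up, w_out))
```

The gated variant uses two input projections instead of one, which is one reason d_ff differs between the original T5 and v1.1 shapes.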

This model is based on T5 v1.1 and was pre-trained on a Japanese corpus consisting of Japanese Wikipedia and mC4/ja.

Developed by: Retrieva, Inc.
Model type: T5 v1.1
Language(s) (NLP): Japanese
License: CC-BY-SA 4.0. Although commercial use is permitted, we kindly request that you contact us beforehand.

We used T5X (https://github.com/google-research/t5x) to train this model, and the resulting checkpoint has been converted to the Hugging Face Transformers format.

The training data used is Japanese Wikipedia and mC4/ja.

The following filtering was applied:

Remove documents that do not contain any hiragana characters. This removes English-only documents and documents in Chinese.
Whitelist-style filtering using the top-level domain of each URL to remove affiliate sites.
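
The two filters above can be sketched as follows. This is a minimal illustration, not the actual preprocessing code: the function names are hypothetical, and the whitelist contents (`ALLOWED_TLDS`) are an assumption, since the real list is not given here.

```python
import re
from urllib.parse import urlparse

# Hiragana letters occupy U+3041 through U+3096.
HIRAGANA = re.compile(r"[\u3041-\u3096]")

# Hypothetical whitelist for illustration; the actual list is not given.
ALLOWED_TLDS = {"jp"}


def has_hiragana(text: str) -> bool:
    """Return True if the text contains at least one hiragana character."""
    return HIRAGANA.search(text) is not None


def tld_allowed(url: str, allowed=ALLOWED_TLDS) -> bool:
    """Return True if the URL's top-level domain is in the whitelist."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower() in allowed


def keep_document(text: str, url: str) -> bool:
    """Keep a document only if it passes both filters."""
    return has_hiragana(text) and tld_allowed(url)


if __name__ == "__main__":
    print(keep_document("これはテストです", "https://example.jp/page"))
    print(keep_document("English only text", "https://example.jp/page"))
```

The hiragana check works as a cheap language filter because Chinese text shares kanji with Japanese but never uses hiragana.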

dropout rate: 0.0
batch size: 128
bf16
input length: 512
output length: 114
Otherwise, the T5X default values (https://github.com/google-research/t5x/blob/main/t5x/examples/t5/t5_1_1/xl.gin) are used, including the following:
- optimizer: Adafactor
- base_learning_rate: 1.0
- warmup steps: 10000
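
The learning rate schedule itself is not stated here. As a rough sketch, assuming an inverse-square-root decay held constant during warmup (a schedule often paired with Adafactor in T5 pre-training), the values above would give the following learning rates. The function name `rsqrt_decay_lr` is hypothetical, and the schedule shape is an assumption rather than a confirmed setting.

```python
import math


def rsqrt_decay_lr(step: int, base_learning_rate: float = 1.0,
                   warmup_steps: int = 10_000) -> float:
    """Assumed schedule: base_lr / sqrt(max(step, warmup_steps))."""
    return base_learning_rate / math.sqrt(max(step, warmup_steps))


if __name__ == "__main__":
    for step in (0, 10_000, 40_000, 524_288):
        print(step, rsqrt_decay_lr(step))
```

Under this assumption the rate stays at 0.01 for the first 10000 steps and then decays with the inverse square root of the step count.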

We trained for 524,288 steps.
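
For a sense of scale, the settings above imply an upper bound on the number of tokens seen during pre-training. The arithmetic below assumes that the batch size counts sequences and that every sequence is filled to its maximum length, so the real count may be lower.

```python
STEPS = 524_288       # training steps
BATCH_SIZE = 128      # sequences per step (assumed unit)
INPUT_LENGTH = 512    # encoder tokens per sequence
OUTPUT_LENGTH = 114   # decoder tokens per sequence

sequences = STEPS * BATCH_SIZE
max_input_tokens = sequences * INPUT_LENGTH
max_target_tokens = sequences * OUTPUT_LENGTH

print(f"sequences:         {sequences:,}")          # 67,108,864
print(f"max input tokens:  {max_input_tokens:,}")   # 34,359,738,368
print(f"max target tokens: {max_target_tokens:,}")  # 7,650,410,496
```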

Model architecture.

Google Cloud TPU v3-128.

Jiro Nishitoba