library_name: transformers
license: apache-2.0
language:
- ja
- en
RetrievaBERT Model
RetrievaBERT is a Transformer encoder pre-trained with Megatron-LM. It is designed for Japanese text.
What's New
- November 2024 (v1.0.1): Bug fix for the model parameters.
  - The up_proj bias had been initialized with the gate projection's bias. This has been fixed.
Model Details
Model Description
RetrievaBERT is a Transformer encoder pre-trained with Megatron-LM.
It is designed for Japanese text.
This model offers several advanced features compared to traditional BERT models:
- PreNorm: Improved stability during training.
- SwiGLU: Enhanced activation function for better performance.
- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
- Max Sequence Length: 2048 tokens, allowing for longer context.
- Parameters: 1.3 billion parameters.
- Pre-training Objective: Only Masked Language Modeling (MLM), not Next Sentence Prediction (NSP).
- Token Type IDs: Not used in this model.
Model Sources
- Developed by: Retrieva, Inc.
- Model type: Based on MegatronBERT Architecture.
- Language(s) (NLP): Primarily Japanese (optional support for English).
- License: Apache 2.0
Uses
This model can be used as a Masked Language Model (MLM). However, it is primarily intended to be fine-tuned on downstream tasks. Depending on your use case, follow the appropriate section below.
Direct Use
This model is pre-trained using Masked Language Modeling.
The mask token used is <MASK|LLM-jp>.
Note that you need to set trust_remote_code to True because RetrievaBERT uses a custom model implementation.
Example code for direct use:
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
model_id = "retrieva-jp/bert-1.3b"
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer)
text = "こんにちは!私の名前は<MASK|LLM-jp>です!"
print(pipe(text))
Downstream Use
RetrievaBERT is compatible with Hugging Face's AutoModels. To fine-tune RetrievaBERT for your specific task, use the corresponding AutoModel class. For detailed configuration, refer to the config.json file.
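As a minimal sketch of the paragraph above, a task-specific head can be loaded through the matching AutoModel class. The choice of AutoModelForSequenceClassification, the helper name load_classifier, and num_labels=2 are illustrative assumptions, not part of the original card; check config.json for the classes the checkpoint actually registers. trust_remote_code=True is passed for the same reason as in the Direct Use example.

```python
MODEL_ID = "retrieva-jp/bert-1.3b"


def load_classifier(model_id: str = MODEL_ID, num_labels: int = 2):
    """Load the tokenizer and a sequence-classification model (hypothetical helper)."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # trust_remote_code=True is needed because RetrievaBERT ships custom model code.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=num_labels,  # assumption: a binary task such as MARC-ja
        trust_remote_code=True,
    )
    return tokenizer, model
```

Calling load_classifier() downloads the checkpoint. The classification head is newly initialized, so it has to be fine-tuned before its predictions are meaningful.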
Training Details
Training Data
The RetrievaBERT model was pre-trained on the union of five datasets:
- Japanese CommonCrawl Dataset by LLM-jp.
- RefinedWeb.
- Chinese Wikipedia dumped on 20240120.
- Korean Wikipedia dumped on 20240120.
- The Stack.
The model was trained on 180 billion tokens drawn from the datasets above.
Training Procedure
The model was trained on 4 to 32 H100 GPUs with a batch size of 1,024. We adopted a curriculum learning scheme similar to Sequence Length Warmup, training with the following sequence lengths and numbers of steps:
- The sequence length of 128: 31,000 steps.
- The sequence length of 256: 219,000 steps.
- The sequence length of 512: 192,000 steps.
- The sequence length of 2048: 12,000 steps.
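The curriculum above can be turned into a rough token budget. The sketch below assumes that the batch size of 1,024 counts sequences, that it stays constant across stages, and that every sequence is filled to the stage's full length; under those assumptions the result is an upper bound, not an exact count.

```python
BATCH_SIZE = 1024  # sequences per step, as stated above

# (sequence length, number of steps) for each curriculum stage
STAGES = [(128, 31_000), (256, 219_000), (512, 192_000), (2048, 12_000)]


def tokens_per_stage(stages=STAGES, batch_size=BATCH_SIZE):
    """Upper bound on tokens per stage, assuming full-length sequences."""
    return [seq_len * steps * batch_size for seq_len, steps in stages]


total_steps = sum(steps for _, steps in STAGES)
total_tokens = sum(tokens_per_stage())

print(f"total steps:  {total_steps:,}")
print(f"total tokens: {total_tokens / 1e9:.1f}B (upper bound)")
```

Under these assumptions the schedule spans 454,000 steps and at most about 187 billion tokens, which is in line with the 180 billion tokens reported under Training Data.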
Training Hyperparameters
The model was trained with the following hyperparameters.
- Learning rate: 1.5e-4.
- Learning rate decay style: Linear.
- Learning rate warmup fraction: 0.01
- Minimum learning rate: 1e-6
- Floating-point format: BF16
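The learning-rate settings above can be read as a warmup followed by a linear decay. The following is a minimal sketch of such a schedule, not Megatron-LM's implementation: starting the warmup at zero, using 454,000 total steps (the sum of the step counts listed under Training Procedure), and applying a single schedule across all curriculum stages are assumptions.

```python
PEAK_LR = 1.5e-4
MIN_LR = 1e-6
WARMUP_FRACTION = 0.01
TOTAL_STEPS = 454_000  # assumption: sum of the per-stage step counts


def learning_rate(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Linear warmup to PEAK_LR, then linear decay to MIN_LR."""
    warmup_steps = round(WARMUP_FRACTION * total_steps)
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak learning rate.
        return PEAK_LR * step / warmup_steps
    # Decay: interpolate linearly from the peak down to the minimum.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at MIN_LR past the final step
    return PEAK_LR - (PEAK_LR - MIN_LR) * progress


for step in (0, 4_540, 227_000, 454_000):
    print(f"step {step:>7,}: lr = {learning_rate(step):.3e}")
```

With these numbers the warmup covers the first 4,540 steps, after which the rate falls linearly from 1.5e-4 to 1e-6.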
Evaluation
We fine-tuned the following models and evaluated them on the JGLUE development set. We adjusted the learning rate and training epochs for each model and task in accordance with the JGLUE paper.
Model | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
---|---|---|---|---|---|---|---|
tohoku-nlp/bert-base-japanese-v3 | 0.957 | 0.914 | 0.876 | 0.906 | 0.878 | 0.946 | 0.849 |
tohoku-nlp/bert-large-japanese-v2 | 0.959 | 0.916 | 0.877 | 0.901 | 0.884 | 0.951 | 0.867 |
ku-nlp/deberta-v3-base-japanese | 0.958 | 0.925 | 0.890 | 0.902 | 0.925 | 0.910 | 0.882 |
retrieva-jp/bert-1.3b | 0.959 | 0.917 | 0.881 | 0.898 | 0.875 | 0.874 | 0.827 |
Technical Specifications
Model Architectures
The RetrievaBERT model is based on BERT with the following hyperparameters:
- Number of layers: 48
- Hidden layer size: 1536
- FFN hidden layer size: 4096
- Number of attention heads: 24
- Maximum length of position embeddings: 2048
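These hyperparameters allow a back-of-the-envelope estimate of the encoder's size. The sketch below counts only the weight matrices of the attention and SwiGLU feed-forward blocks, ignoring biases, normalization layers, and embeddings. It assumes the FFN hidden size applies to each of the gate and up projections. The number of key/value heads is not stated here and should be read from config.json, so the two values tried are purely illustrative.

```python
NUM_LAYERS = 48
HIDDEN_SIZE = 1536
FFN_HIDDEN_SIZE = 4096
NUM_ATTENTION_HEADS = 24


def layer_weight_count(num_kv_heads: int) -> int:
    """Weight-matrix parameters in one Transformer layer (no biases or norms)."""
    head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEADS  # 1536 / 24 = 64
    # Query and output projections are hidden x hidden.
    attn_qo = 2 * HIDDEN_SIZE * HIDDEN_SIZE
    # Key and value projections shrink with the number of key/value heads.
    attn_kv = 2 * HIDDEN_SIZE * head_dim * num_kv_heads
    # SwiGLU uses three matrices: gate, up, and down projections.
    ffn = 3 * HIDDEN_SIZE * FFN_HIDDEN_SIZE
    return attn_qo + attn_kv + ffn


# Illustrative settings: one KV head per query head vs. a single shared KV head.
for kv_heads in (NUM_ATTENTION_HEADS, 1):
    total = NUM_LAYERS * layer_weight_count(kv_heads)
    print(f"kv_heads={kv_heads:2d}: {total / 1e9:.2f}B weights in the encoder layers")
```

Both settings land between roughly 1.1 and 1.4 billion weights, the same order as the 1.3 billion parameters quoted in the Model Description. Embeddings are left out because the vocabulary size is not given in this card.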
As mentioned earlier, the main differences from the original BERT are:
- PreNorm: Improved stability during training.
- SwiGLU: Enhanced activation function for better performance.
- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
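To make the SwiGLU item concrete, here is a small pure-Python sketch of a SwiGLU feed-forward block. It is not the model's implementation: the parameter name down and the toy dimensions are assumptions for illustration, while gate and up follow the naming used under What's New.

```python
import math


def silu(x: float) -> float:
    """SiLU (Swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def linear(x, weight, bias):
    """Affine map; weight is a list of rows with shape (out_features, in_features)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def swiglu_ffn(x, gate_w, gate_b, up_w, up_b, down_w, down_b):
    """down(silu(gate(x)) * up(x)) for a single token vector x."""
    gate = [silu(g) for g in linear(x, gate_w, gate_b)]  # gated branch
    up = linear(x, up_w, up_b)                           # linear branch
    hidden = [g * u for g, u in zip(gate, up)]           # element-wise product
    return linear(hidden, down_w, down_b)


# Toy example: hidden size 2, FFN size 2, output size 1.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
out = swiglu_ffn([1.0, 2.0], identity, zeros, identity, zeros, [[1.0, 1.0]], [0.0])
print(out)
```

The gate and up projections are separate parameters, each with its own bias, which is the distinction behind the v1.0.1 fix noted under What's New.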
Compute Infrastructure
This model is based on results obtained from the TSUBAME deep-learning mini-camp.
Software
The model was trained using Megatron-LM.
More Information
https://note.com/retrieva/n/n715bea2c2cd1 (in Japanese)
Model Card Authors
Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba