---
library_name: transformers
license: apache-2.0
language:
- ja
- en
---
# RetrievaBERT Model
**RetrievaBERT** is a Transformer encoder pre-trained with Megatron-LM.
It is designed for Japanese text.
## What's New
- November 2024 (`v1.0.1`): Bug fix for the model parameters.
  - The bias of `up_proj` had been initialized with the bias of the gate projection. This bug has been fixed.
## Model Details
### Model Description
**RetrievaBERT** is a Transformer encoder pre-trained with Megatron-LM.
It is designed for Japanese text.
This model offers several advanced features compared to traditional BERT models:
- **PreNorm**: Improved stability during training.
- **SwiGLU**: Enhanced activation function for better performance.
- **Grouped-Query Attention (Multi-Query Attention)**: Efficient attention mechanism.
- **Max Sequence Length**: 2048 tokens, allowing for longer context.
- **Parameters**: 1.3 billion parameters.
- **Pre-training Objective**: Only Masked Language Modeling (MLM), not Next Sentence Prediction (NSP).
- **Token Type IDs**: Not used in this model.
### Model Sources
- **Developed by:** Retrieva, Inc.
- **Model type:** Based on MegatronBERT Architecture.
- **Language(s) (NLP):** Primarily Japanese (optional support for English).
- **License:** Apache 2.0
## Uses
This model can be used as a Masked Language Model (MLM).
However, it is primarily intended to be fine-tuned on downstream tasks.
Depending on your use case, follow the appropriate section below.
### Direct Use
This model is pre-trained using Masked Language Modeling.
The mask token used is `<MASK|LLM-jp>`.
Note that you need to set `trust_remote_code` to `True` because RetrievaBERT uses a custom model implementation.
Example code for direct use:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "retrieva-jp/bert-1.3b"
# trust_remote_code=True is required because the model uses a custom implementation.
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# English gloss of the input: "Hello! My name is <MASK|LLM-jp>!"
text = "こんにちは！私の名前は<MASK|LLM-jp>です！"
print(pipe(text))
```
### Downstream Use
RetrievaBERT is compatible with Hugging Face's AutoModels.
To fine-tune RetrievaBERT for your specific task, use the corresponding AutoModel class.
For detailed configuration, refer to the config.json file.
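
As an illustration of the paragraph above, the following is a minimal sketch of loading the model with a classification head. It assumes that the repository's remote code registers a sequence-classification class for `AutoModelForSequenceClassification`; the helper name `load_classifier` and the default `num_labels=2` are hypothetical and chosen only for this example.

```python
MODEL_ID = "retrieva-jp/bert-1.3b"


def load_classifier(num_labels: int = 2, model_id: str = MODEL_ID):
    """Load the tokenizer and the encoder with a newly initialized classification head."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=num_labels,   # size of the task-specific output layer
        trust_remote_code=True,  # needed for the custom model implementation
    )
    return tokenizer, model
```

Calling `tokenizer, model = load_classifier(num_labels=3)` downloads the weights. The returned model can then be fine-tuned with an ordinary PyTorch training loop or with `transformers.Trainer`.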
## Training Details
### Training Data
The RetrievaBERT model was pre-trained on the union of five datasets:
- [Japanese CommonCrawl Dataset by LLM-jp](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v2)
- [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)
- Chinese Wikipedia (dump of 2024-01-20)
- Korean Wikipedia (dump of 2024-01-20)
- [The Stack](https://huggingface.co/datasets/bigcode/the-stack)

The model was trained on 180 billion tokens drawn from these datasets.
### Training Procedure
The model was trained on 4 to 32 H100 GPUs with a batch size of 1,024.
We adopted a curriculum learning scheme similar to Sequence Length Warmup, training with the following sequence lengths and numbers of steps:
- The sequence length of 128: 31,000 steps.
- The sequence length of 256: 219,000 steps.
- The sequence length of 512: 192,000 steps.
- The sequence length of 2048: 12,000 steps.
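
As a rough consistency check, the schedule above can be turned into a total step count and an upper bound on the number of tokens processed. This is only a sketch: it assumes that the batch size of 1,024 is counted in sequences per step and that every sequence is filled to the full stage length, neither of which is stated explicitly here.

```python
# (sequence length, number of steps) for each curriculum stage listed above.
STAGES = [(128, 31_000), (256, 219_000), (512, 192_000), (2048, 12_000)]
BATCH_SIZE = 1_024  # assumption: counted in sequences per step


def curriculum_totals(stages, batch_size):
    """Return (total steps, upper bound on tokens) for a staged schedule."""
    total_steps = sum(steps for _, steps in stages)
    total_tokens = sum(seq_len * steps * batch_size for seq_len, steps in stages)
    return total_steps, total_tokens


total_steps, total_tokens = curriculum_totals(STAGES, BATCH_SIZE)
print(f"total steps: {total_steps:,}")
print(f"token upper bound: {total_tokens / 1e9:.1f}B")
```

Under these assumptions the schedule amounts to 454,000 steps and about 187 billion token positions, which is in line with the 180 billion training tokens reported under Training Data once padding and partially filled sequences are taken into account.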
#### Training Hyperparameters
The model was trained with the following hyperparameters:
- Learning rate: 1.5e-4
- Learning rate decay style: Linear
- Learning rate warmup fraction: 0.01
- Minimum learning rate: 1e-6
- Floating-point format: BF16
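
To make the learning rate settings concrete, here is a minimal sketch of a schedule with linear warmup followed by linear decay. It is an illustration, not the Megatron-LM scheduler itself. It assumes a single schedule spanning 454,000 steps (the sum of the curriculum stages listed under Training Procedure), a warmup that starts from zero, and a decay that reaches the minimum learning rate at the final step. The function name `learning_rate` is hypothetical.

```python
def learning_rate(step, total_steps=454_000, max_lr=1.5e-4,
                  min_lr=1e-6, warmup_fraction=0.01):
    """Linear warmup to max_lr, then linear decay to min_lr (illustrative sketch)."""
    warmup_steps = round(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to max_lr.
        return max_lr * step / warmup_steps
    if step >= total_steps:
        # After the schedule ends, stay at the minimum learning rate.
        return min_lr
    # Decay phase: interpolate linearly from max_lr down to min_lr.
    decay_progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + (max_lr - min_lr) * (1.0 - decay_progress)


for step in (0, 2_270, 4_540, 227_000, 454_000):
    print(step, learning_rate(step))
```

With a warmup fraction of 0.01, the peak learning rate is reached after 4,540 steps under these assumptions.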
## Evaluation
We fine-tuned the following models and evaluated them on the [JGLUE](https://github.com/yahoojapan/JGLUE) development set.
We adjusted the learning rate and training epochs for each model and task in accordance with [the JGLUE paper](https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_pdf/-char/ja).
| Model | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
| :--- |---:|---:|---:|---:|---:|---:|---:|
| tohoku-nlp/bert-base-japanese-v3 | 0.957 | 0.914 | 0.876 | 0.906 | 0.878 | 0.946 | 0.849 |
| tohoku-nlp/bert-large-japanese-v2 | 0.959 | 0.916 | 0.877 | 0.901 | 0.884 | 0.951 | 0.867 |
| ku-nlp/deberta-v3-base-japanese | 0.958 | 0.925 | 0.890 | 0.902 | 0.925 | 0.910 | 0.882 |
| retrieva-jp/bert-1.3b | 0.959 | 0.917 | 0.881 | 0.898 | 0.875 | 0.874 | 0.827 |
## Technical Specifications
### Model Architectures
The RetrievaBERT model is based on BERT with the following hyperparameters:
- Number of layers: 48
- Hidden layer size: 1536
- FFN hidden layer size: 4096
- Number of attention heads: 24
- Maximum length of position embeddings: 2048
As mentioned earlier, the main differences from the original BERT are:
- PreNorm: Improved stability during training.
- SwiGLU: Enhanced activation function for better performance.
- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
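
For readers unfamiliar with SwiGLU, the following is a minimal pure-Python sketch of a SwiGLU feed-forward block, written for clarity rather than speed. It is not the model's implementation. The function names and the argument layout (separate gate, up, and down projections, each with its own bias) are assumptions made for illustration. They mirror the separate gate and `up_proj` biases mentioned under What's New.

```python
import math


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def linear(vec, weight, bias):
    """Apply y = W x + b, where weight is a list of rows (out_features x in_features)."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def swiglu_ffn(x, gate_w, gate_b, up_w, up_b, down_w, down_b):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in linear(x, gate_w, gate_b)]  # gating branch
    up = linear(x, up_w, up_b)                           # value branch
    hidden = [g * u for g, u in zip(gate, up)]           # element-wise gating
    return linear(hidden, down_w, down_b)


# Tiny example with one input feature and one hidden unit.
print(swiglu_ffn([1.0], [[1.0]], [0.0], [[2.0]], [1.0], [[1.0]], [0.0]))
```

Because the gate and up branches are multiplied element-wise, each branch needs its own weights and bias, so copying one branch's bias into the other changes the output of the block.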
### Compute Infrastructure
[TSUBAME 4](https://www.t4.gsic.titech.ac.jp/en/hardware)
This model is based on results obtained from the [TSUBAME deep-learning mini-camp](https://www.t4.gsic.titech.ac.jp/en/minicamp-dl-202406).
#### Software
The model was trained using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
## More Information
https://note.com/retrieva/n/n715bea2c2cd1 (in Japanese)
## Model Card Authors
Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba
## Model Card Contact
pr@retrieva.jp