TinyLlama + Japanese
A continually pretrained version of TinyLlama 1.1B, trained on Japanese text.
Base Model
TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
Tokenizer
[elyza/ELYZA-japanese-Llama-2-7b](https://huggingface.co/elyza/ELYZA-japanese-Llama-2-7b)
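A minimal usage sketch with Hugging Face Transformers. The model repo id `path/to/this-model` is a placeholder for this model's actual Hub path, and the prompt and generation settings are only illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Tokenizer listed above; "path/to/this-model" is a placeholder repo id.
tokenizer = AutoTokenizer.from_pretrained("elyza/ELYZA-japanese-Llama-2-7b")
model = AutoModelForCausalLM.from_pretrained("path/to/this-model")

prompt = "日本の首都は"  # "The capital of Japan is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```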
Training Dataset
Around 9B tokens in total.
- izumi-lab/wikipedia-ja-20230720
- if001/oscar_2023_filtered
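For reference, a rough sketch of how the two corpora above might be loaded and mixed with the `datasets` library. The split names, the `"text"` column, and the shuffling step are assumptions, not the exact recipe used for training:

```python
from datasets import concatenate_datasets, load_dataset

# Load the two Japanese corpora listed above from the Hugging Face Hub.
wiki = load_dataset("izumi-lab/wikipedia-ja-20230720", split="train")
oscar = load_dataset("if001/oscar_2023_filtered", split="train")

# Keep only the text column so the corpora can be concatenated;
# the "text" column name is an assumption about the dataset schemas.
wiki = wiki.select_columns(["text"])
oscar = oscar.select_columns(["text"])

corpus = concatenate_datasets([wiki, oscar]).shuffle(seed=42)
print(corpus)
```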
Validation Dataset
- izumi-lab/wikinews-ja-20230728
- izumi-lab/wikinews-en-20230728
- if001/aozorabunko-clean-sin
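As a rough illustration, loss/perplexity on one of these validation sets could be checked with a sketch like the one below. The model repo id is again a placeholder, and the split and column names are assumptions:

```python
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("elyza/ELYZA-japanese-Llama-2-7b")
model = AutoModelForCausalLM.from_pretrained("path/to/this-model").eval()  # placeholder repo id

# Perplexity on a small sample of the Japanese wikinews set;
# the "train" split and "text" column are assumptions.
val = load_dataset("izumi-lab/wikinews-ja-20230728", split="train")
losses = []
for example in val.select(range(100)):
    enc = tokenizer(example["text"], return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    losses.append(out.loss.item())
print("perplexity:", math.exp(sum(losses) / len(losses)))
```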
Evaluation
We have not performed any evaluation yet.
Acknowledgement
We thank the creators of the valuable datasets listed above and the developers of lit-gpt.