update vocabulary size
README.md CHANGED
@@ -100,7 +100,8 @@ Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-
 - **Model:** Hugging Face Fast Tokenizer using Unigram byte-fallback model which requires `tokenizers>=0.14.0`
 - **Training algorithm:** Merging Code/English/Japanese vocabularies constructed with SentencePiece Unigram byte-fallback and re-estimating scores with the EM algorithm.
 - **Training data:** A subset of the datasets for model pre-training
-- **Vocabulary size:**
+- **Vocabulary size:** 96,867 (mixed vocabulary of Japanese, English, and source code)
+  - The actual size of the vocabulary in the pretrained model is 97,024 due to rounding up to a multiple of 256.


 ## Datasets
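
The added lines state two different figures: 96,867 tokenizer pieces versus 97,024 embedding rows in the pretrained model, the latter being the next multiple of 256. Below is a minimal sketch of how one might verify both numbers after this change; the repository ID `llm-jp/llm-jp-13b-v1.0` is an assumption and should be replaced with whichever model repository this README belongs to.

```python
# Minimal sketch, assuming a hypothetical repository ID (not taken from this diff).
import math

from transformers import AutoTokenizer  # the fast tokenizer needs tokenizers>=0.14.0

MODEL_ID = "llm-jp/llm-jp-13b-v1.0"  # assumption: substitute the actual repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

vocab = len(tokenizer)                 # expected to match the README figure of 96,867
padded = math.ceil(vocab / 256) * 256  # round up to a multiple of 256 -> 97,024
print(vocab, padded)
```

The round-up suggests the model's embedding matrix is padded for hardware efficiency, so inspecting the loaded model (e.g. the first dimension of `model.get_input_embeddings().weight`) would be expected to report 97,024 rather than 96,867.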