hkiyomaru committed on
Commit
35675cd
1 Parent(s): 75fb8d7

Update README.md

  1. README.md +1 -1
README.md CHANGED
@@ -99,7 +99,7 @@ Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-
  - **Model:** Hugging Face Fast Tokenizer using Unigram byte-fallback model which requires `tokenizers>=0.14.0`
  - **Training algorithm:** Marging Code/English/Japanese vocabularies constructed with SentencePiece Unigram byte-fallback and reestimating scores with the EM-algorithm.
  - **Training data:** A subset of the datasets for model pre-training
- - **Vocabulary size:** 48,588 (mixed vocabulary of Japanese, English, and source code)
+ - **Vocabulary size:** 97,024 (mixed vocabulary of Japanese, English, and source code)
 
 
  ## Datasets