update vocabulary size
README.md CHANGED
@@ -100,7 +100,8 @@ Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-
 - **Model:** Hugging Face Fast Tokenizer using Unigram byte-fallback model which requires `tokenizers>=0.14.0`
 - **Training algorithm:** Merging Code/English/Japanese vocabularies constructed with SentencePiece Unigram byte-fallback and re-estimating scores with the EM algorithm.
 - **Training data:** A subset of the datasets for model pre-training
-- **Vocabulary size:**
+- **Vocabulary size:** 96,867 (mixed vocabulary of Japanese, English, and source code)
+  - The actual size of the vocabulary in the pretrained model is 97,024 due to rounding up to a multiple of 256.


 ## Datasets
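
The added lines state two different figures: 96,867 tokenizer pieces versus 97,024 embedding rows in the pretrained model, the latter being the next multiple of 256. Below is a minimal sketch of how one might verify both numbers after this change; the repository ID `llm-jp/llm-jp-13b-v1.0` is an assumption and should be replaced with whichever model repository this README belongs to.

```python
# Minimal sketch, assuming a hypothetical repository ID (not taken from this diff).
import math

from transformers import AutoTokenizer  # the fast tokenizer needs tokenizers>=0.14.0

MODEL_ID = "llm-jp/llm-jp-13b-v1.0"  # assumption: substitute the actual repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

vocab = len(tokenizer)                 # expected to match the README figure of 96,867
padded = math.ceil(vocab / 256) * 256  # round up to a multiple of 256 -> 97,024
print(vocab, padded)
```

The round-up suggests the model's embedding matrix is padded for hardware efficiency, so inspecting the loaded model (e.g. the first dimension of `model.get_input_embeddings().weight`) would be expected to report 97,024 rather than 96,867.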