add tokenizer info
README.md
CHANGED
@@ -96,12 +96,13 @@ print(tokenizer.decode(output))
 ## Tokenizer (To be updated)
 
 The tokenizer of this model is based on [huggingface/tokenizers](https://github.com/huggingface/tokenizers) Unigram byte-fallback model.
-The vocabulary entries were converted from [`llm-jp-tokenizer v2.
-Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-ja-tokenizer` for details on the vocabulary construction procedure.
+The vocabulary entries were converted from [`llm-jp-tokenizer v2.2 (50k)`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v2.2).
+Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-jp-tokenizer` for details on the vocabulary construction procedure (note that pure SentencePiece training does not reproduce our vocabulary).
+
 - **Model:** Hugging Face Fast Tokenizer using Unigram byte-fallback model which requires `tokenizers>=0.14.0`
-- **Training algorithm:** SentencePiece Unigram byte-fallback
+- **Training algorithm:** Merging Code/English/Japanese vocabularies constructed with SentencePiece Unigram byte-fallback and re-estimating scores with the EM algorithm
 - **Training data:** A subset of the datasets for model pre-training
-- **Vocabulary size:**
+- **Vocabulary size:** 48,588 (mixed vocabulary of Japanese, English, and source code)
 
 
 ## Datasets (To be updated)
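For reference, a minimal usage sketch of the fast tokenizer described above. The repository ID below is a placeholder assumption (it is not stated in this diff); substitute the actual model repository.

```python
# Minimal sketch (assumptions: transformers installed, tokenizers>=0.14.0,
# and a placeholder repository ID -- replace with the actual model repo).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v1.0")  # placeholder repo ID

print(tokenizer.is_fast)  # True -> Hugging Face Fast Tokenizer backend
print(len(tokenizer))     # expected to match the ~48,588-entry vocabulary per the model card

# Byte fallback: characters not covered by the Unigram vocabulary are split into
# byte tokens instead of <unk>, so arbitrary UTF-8 text round-trips through encode/decode.
ids = tokenizer.encode("自然言語処理", add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(ids))
print(tokenizer.decode(ids))
```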