# Mana Tokenizer
The Mana Tokenizer is a custom-trained byte-pair encoding (BPE) tokenizer for Persian text. It is trained on a large combined Persian corpus with high character coverage, so it can handle diverse Persian text.
## Quick Start
You can encode and decode your data with the Mana Tokenizer like this:
```python
from mana_tokenizer import ManaTokenizer

tokenizer = ManaTokenizer()
text = "سلام من یک متن تست برای تست این تست هستم."
print(tokenizer.encode(text))
print(tokenizer.decode(tokenizer.encode(text)))
```
For comparison, this is the raw UTF-8 byte encoding of the text:
```
[216, 179, 217, 132, 216, 167, 217, 133, 32, 217, 133, 217, 134, 32, 219, 140, 218, 169, 32, 217, 133, 216, 170, 217, 134, 32, 216, 170, 216, 179, 216, 170, 32, 216, 168, 216, 177, 216, 167, 219, 140, 32, 216, 170, 216, 179, 216, 170, 32, 216, 167, 219, 140, 217, 134, 32, 216, 170, 216, 179, 216, 170, 32, 217, 135, 216, 179, 216, 170, 217, 133, 46]
سلام من یک متن تست برای تست این تست هستم.
```
And here is what the Mana Tokenizer generates:
```
[30318, 377, 363, 4340, 5828, 513, 5828, 378, 5828, 14471, 46]
سلام من یک متن تست برای تست این تست هستم.
```
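The byte sequence above can be reproduced in plain Python, which also makes the compression easy to quantify: 71 UTF-8 bytes versus 11 Mana tokens for this sentence.

```python
text = "سلام من یک متن تست برای تست این تست هستم."

utf8_ids = list(text.encode("utf-8"))  # raw UTF-8 bytes: 71 ids
mana_ids = tokenizer.encode(text)      # Mana tokens: 11 ids

print(len(utf8_ids), len(mana_ids))    # 71 11
```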
You can also add special tokens:
```python
tokenizer.register_special_tokens({"</s>": 100269})
```
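Assuming a minbpe-style API (not confirmed for this library), a registered special token should then round-trip through decoding:

```python
# Assumption: decode() consults the registered special-token table,
# as minbpe-style tokenizers do; verify against the Mana Tokenizer docs.
print(tokenizer.decode([100269]))  # "</s>"
```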
Batch encode:
```python
tokenizer.batch_encode(["یک متن طولانی"])
```
## Benchmark
- Benchmark DateTime: 2024-11-06 16:12:50
- Mana Batch Encode Time: 0.107 seconds
- Mana Batch Encode Memory Usage: 13.2 KB
- Total characters in benchmark: 131,000
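A minimal sketch of how numbers like these can be measured; the original benchmark script is not shown here, so the workload and the use of `time.perf_counter`/`tracemalloc` are assumptions:

```python
import time
import tracemalloc

from mana_tokenizer import ManaTokenizer

tokenizer = ManaTokenizer()
texts = ["یک متن طولانی"] * 10_000  # hypothetical workload, not the original one

tracemalloc.start()
start = time.perf_counter()
tokenizer.batch_encode(texts)
elapsed = time.perf_counter() - start
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"Batch encode time: {elapsed:.3f} seconds")
print(f"Peak memory: {peak / 1024:.1f} KB")
```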
## Special Tokens

- User token: `<|user|>`
- Assistant token: `<|assistant|>`
- End token: `<|end|>`
- System token: `<|system|>`
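As a sketch of how these tokens could frame a chat-style prompt; the template format, and whether the markers must first be registered as special tokens, are assumptions:

```python
# Hypothetical chat template built from the special tokens above;
# the exact format expected by downstream models is an assumption.
prompt = (
    "<|system|>تو یک دستیار مفید هستی.<|end|>"  # "You are a helpful assistant."
    "<|user|>سلام!<|end|>"                      # "Hello!"
    "<|assistant|>"
)
# Note: these markers may need to be registered as special tokens first
# (see register_special_tokens above) to encode as single ids.
print(tokenizer.encode(prompt))
```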
## Statistics
- Model Type: BPE
- Vocabulary Size: 265,703
- Character Coverage: 99.9%
- Total Number of Text Samples: 1,147,036
- Total Number of Tokens: 1,490,338
- Average Token Length: 4.51
- Corpus Size (in bytes): 1,792,210,410
## Training Details
- Training Data: Mana Persian corpus
- Training Script: Mana Trainer
- Script Version: 1.2
## License
The Mana Tokenizer is licensed under the MIT License.