
Mana Tokenizer

The Mana Tokenizer is a custom-trained BPE tokenizer designed for Persian text. It is trained on a large combined Persian corpus and uses byte-pair encoding (BPE) with high character coverage to handle diverse Persian text.

Quick Start

You can encode and decode your data with the Mana Tokenizer like this:

from mana_tokenizer import ManaTokenizer
tokenizer = ManaTokenizer()
text = "سلام من یک متن تست برای تست این تست هستم."
print(tokenizer.encode(text))
print(tokenizer.decode(tokenizer.encode(text)))

For comparison, this is the raw UTF-8 byte encoding of the same text:

[216, 179, 217, 132, 216, 167, 217, 133, 32, 217, 133, 217, 134, 32, 219, 140, 218, 169, 32, 217, 133, 216, 170, 217, 134, 32, 216, 170, 216, 179, 216, 170, 32, 216, 168, 216, 177, 216, 167, 219, 140, 32, 216, 170, 216, 179, 216, 170, 32, 216, 167, 219, 140, 217, 134, 32, 216, 170, 216, 179, 216, 170, 32, 217, 135, 216, 179, 216, 170, 217, 133, 46]
سلام من یک متن تست برای تست این تست هستم.

and here is what the Mana Tokenizer generates:

[30318, 377, 363, 4340, 5828, 513, 5828, 378, 5828, 14471, 46]
سلام من یک متن تست برای تست این تست هستم.
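
Counting both representations shows the gain: the raw UTF-8 encoding of this sentence is 72 bytes, while the Mana Tokenizer produces 11 tokens, roughly a 6.5x reduction. A minimal snippet to check the ratio on your own text (it only uses the encode call from the Quick Start):

from mana_tokenizer import ManaTokenizer

tokenizer = ManaTokenizer()
text = "سلام من یک متن تست برای تست این تست هستم."

raw_bytes = list(text.encode("utf-8"))   # byte-level representation
tokens = tokenizer.encode(text)          # Mana BPE tokens

print(len(raw_bytes), len(tokens))       # 72 and 11 for the sentence above
print(f"compression: {len(raw_bytes) / len(tokens):.1f}x")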

You can also add special tokens:

tokenizer.register_special_tokens({"</s>": 100269})

Batch encode:

tokenizer.batch_encode(["یک متن طولانی"])
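
The benchmark figures in the next section can be approximated with standard-library tools. The following is a rough sketch, not the official benchmark script; it only assumes the batch_encode call above and uses time and tracemalloc to measure wall-clock time and peak memory:

import time
import tracemalloc

from mana_tokenizer import ManaTokenizer

tokenizer = ManaTokenizer()
texts = ["یک متن طولانی"] * 10000   # repeat the sample text to build a batch

tracemalloc.start()
start = time.perf_counter()
batch = tokenizer.batch_encode(texts)
elapsed = time.perf_counter() - start
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"batch encode time: {elapsed:.4f} s")
print(f"peak memory: {peak / 1024:.2f} KB")
print(f"total characters: {sum(len(t) for t in texts)}")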

Benchmark

  • Benchmark DateTime: 2024-11-06 16:12:50
  • Mana Batch Encode Time: ~0.107 seconds
  • Mana Batch Encode Memory Usage: ~13.2 KB
  • Total Characters in Benchmark: 131,000

Special Tokens

  • user Token: <|user|>
  • assistant Token: <|assistant|>
  • end Token: <|end|>
  • system Token: <|system|>
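
Below is a sketch of how these tokens might be assembled into a chat-style prompt before encoding. The exact template used in training is not documented here, and whether the markers already map to single token ids or need to be added via register_special_tokens (as in the Quick Start) is an assumption:

from mana_tokenizer import ManaTokenizer

tokenizer = ManaTokenizer()

# Assumed chat layout built from the special tokens listed above;
# the real template may differ.
user_message = "سلام من یک متن تست برای تست این تست هستم."
prompt = f"<|user|>{user_message}<|end|><|assistant|>"

print(tokenizer.encode(prompt))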

Statistics

  • Model Type: BPE
  • Vocabulary Size: 265,703
  • Character Coverage: 99.9%
  • Total Number of Text Samples: 1,147,036
  • Total Number of Tokens: 1,490,338
  • Average Token Length: 4.51
  • Corpus Size (in bytes): 1,792,210,410

Training Details

  • Training Data: Mana Persian corpus
  • Training Script: Mana Trainer
  • Script Version: 1.2

License

The Mana Tokenizer is licensed under the MIT License.
