---
library_name: transformers
license: cc-by-4.0
datasets:
- HuggingFaceFW/fineweb
- castorini/wura
language:
- am
- en
---

# AmhT5 Tokenizer

A T5 tokenizer trained for the Amharic language, with a fertility rate of 1.8328.

Notebook used for training: https://colab.research.google.com/drive/1B-pca9jpadTHz9WYTWXzPM-A1cTaltYo#scrollTo=wLslLc0D6TnA

## Model Details

### Model Description

An MT5Tokenizer-based Amharic and English tokenizer trained on the [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [Wura](https://huggingface.co/datasets/castorini/wura) datasets. The goal is a tokenizer that represents Amharic more efficiently while remaining just as effective for English. To balance the training data, only 3 million document samples were used. The vocabulary size is the same as that of `google/mt5-small`.

### MT5 Tokenizer vs. AmhT5 Tokenizer

```python
from transformers import MT5TokenizerFast

# Baseline: the original multilingual MT5 tokenizer
mt5 = "google/mt5-small"
TOKENIZER = MT5TokenizerFast.from_pretrained(mt5, legacy=False)

tokens = TOKENIZER.tokenize("ከመዲናዋ በቅርብ ርቀት ላይ በምትገኘው ከተማ")
print(len(tokens))  # 20
print(tokens)
# ['▁ከመ', 'ዲ', 'ና', 'ዋ', '▁በ', 'ቅር', 'ብ', '▁', 'ር', 'ቀ', 'ት', '▁', 'ላይ', '▁በም', 'ት', 'ገ', 'ኘ', 'ው', '▁ከተ', 'ማ']

tokens = TOKENIZER.tokenize("A Tokenizer trained for Amharic language.")
print(len(tokens))  # 11
print(tokens)
# ['▁A', '▁', 'Token', 'izer', '▁train', 'ed', '▁for', '▁Am', 'haric', '▁language', '.']

# AmhT5: trained on Amharic and English text
amhT5 = "yonas/AmhT5-tokenizer"
TOKENIZER = MT5TokenizerFast.from_pretrained(amhT5, legacy=False)

tokens = TOKENIZER.tokenize("ከመዲናዋ በቅርብ ርቀት ላይ በምትገኘው ከተማ")
print(len(tokens))  # 11
print(tokens)
# ['▁ከ', 'መዲና', 'ዋ', '▁በ', 'ቅርብ', '▁', 'ርቀት', '▁ላይ', '▁በምት', 'ገኘው', '▁ከተማ']

tokens = TOKENIZER.tokenize("A Tokenizer trained for Amharic language.")
print(len(tokens))  # 7
print(tokens)
# ['▁A', '▁Token', 'izer', '▁trained', '▁for', '▁Amharic', '▁language.']
```
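### Measuring fertility

Fertility is the average number of subword tokens produced per whitespace-delimited word; lower is better for downstream sequence lengths. The reported 1.8328 was measured during training, but the metric itself can be sketched with a small helper. The `toy_tokenize` function below is a stand-in for illustration only; in practice you would pass `TOKENIZER.tokenize` and a held-out Amharic corpus.

```python
def fertility(tokenize, texts):
    """Average number of subword tokens per whitespace-delimited word."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# Toy tokenizer standing in for a real subword tokenizer:
# splits every word into a 3-character prefix plus the remainder.
def toy_tokenize(text):
    return [piece for word in text.split() for piece in (word[:3], word[3:]) if piece]

# Each word yields 2 pieces here, so fertility is 2.0.
print(fertility(toy_tokenize, ["hello world", "tokenizers split words"]))  # 2.0
```

With a real tokenizer, `fertility(TOKENIZER.tokenize, corpus)` gives a single comparable number per tokenizer, which is how the MT5 and AmhT5 tokenizers above can be compared beyond individual sentences.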