# AmhT5 Tokenizer
A T5 tokenizer trained for the Amharic language.

The tokenizer has a fertility rate of 1.8328, i.e. it produces about 1.83 subword tokens per word on average.
Notebook used for training: https://colab.research.google.com/drive/1B-pca9jpadTHz9WYTWXzPM-A1cTaltYo#scrollTo=wLslLc0D6TnA
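Fertility is commonly computed as the total number of subword tokens divided by the total number of words. A minimal sketch of such a measurement, assuming whitespace word splitting and a placeholder corpus (the exact evaluation set behind the 1.8328 figure is not specified here):

```python
from transformers import MT5TokenizerFast

def fertility_rate(tokenizer, texts):
    """Average number of subword tokens per whitespace-separated word."""
    total_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# Placeholder corpus, for illustration only.
sample_texts = ["A Tokenizer trained for Amharic language."]
tokenizer = MT5TokenizerFast.from_pretrained("yonas/AmhT5-tokenizer", legacy=False)
print(round(fertility_rate(tokenizer, sample_texts), 4))
```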
## Model Details
### Model Description
An MT5Tokenizer-based Amharic and English tokenizer trained on the FineWeb and Wura datasets. It aims to represent Amharic better than the original mT5 tokenizer while keeping comparable coverage of English. To balance the dataset, I used only 3 million document samples. The vocabulary size of this tokenizer is the same as that of google/mt5-small.
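Since the vocabulary sizes are meant to match, this can be checked directly (a quick sketch; both tokenizers are loaded from the Hugging Face Hub):

```python
from transformers import MT5TokenizerFast

base = MT5TokenizerFast.from_pretrained("google/mt5-small", legacy=False)
amh = MT5TokenizerFast.from_pretrained("yonas/AmhT5-tokenizer", legacy=False)

# Both tokenizers should report the same base vocabulary size.
print(base.vocab_size == amh.vocab_size)  # True
```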
## MT5 Tokenizer vs. AmhT5 Tokenizer
```python
from transformers import MT5TokenizerFast

mt5 = "google/mt5-small"
TOKENIZER = MT5TokenizerFast.from_pretrained(mt5, legacy=False)

tokens = TOKENIZER.tokenize("α¨αα²αα α ααα₯ ααα΅ αα α αα΅ααα α¨α°α")
print(len(tokens))  # 20
print(tokens)
# ['▁α¨α', 'α²', 'α', 'α', '▁α ', 'αα', 'α₯', '▁', 'α', 'α', 'α΅', '▁', 'αα', '▁α α', 'α΅', 'α', 'α', 'α', '▁α¨α°', 'α']

tokens = TOKENIZER.tokenize("A Tokenizer trained for Amharic language.")
print(len(tokens))  # 11
print(tokens)
# ['▁A', '▁', 'Token', 'izer', '▁train', 'ed', '▁for', '▁Am', 'haric', '▁language', '.']
```
```python
amhT5 = "yonas/AmhT5-tokenizer"
TOKENIZER = MT5TokenizerFast.from_pretrained(amhT5, legacy=False)

tokens = TOKENIZER.tokenize("α¨αα²αα α ααα₯ ααα΅ αα α αα΅ααα α¨α°α")
print(len(tokens))  # 11
print(tokens)
# ['▁α¨', 'αα²α', 'α', '▁α ', 'ααα₯', '▁', 'ααα΅', '▁αα', '▁α αα΅', 'ααα', '▁α¨α°α']

tokens = TOKENIZER.tokenize("A Tokenizer trained for Amharic language.")
print(len(tokens))  # 7
print(tokens)
# ['▁A', '▁Token', 'izer', '▁trained', '▁for', '▁Amharic', '▁language.']
```
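On the examples above, AmhT5 cuts the Amharic sentence from 20 tokens to 11 and the English sentence from 11 to 7. Beyond `tokenize`, the tokenizer works with the usual encode/decode round trip; a short sketch (exact ids depend on the trained vocabulary):

```python
from transformers import MT5TokenizerFast

tokenizer = MT5TokenizerFast.from_pretrained("yonas/AmhT5-tokenizer", legacy=False)

text = "A Tokenizer trained for Amharic language."
ids = tokenizer.encode(text)  # the 7 subword token ids above, plus the </s> id
print(len(ids))  # 8
print(tokenizer.decode(ids, skip_special_tokens=True))
```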