---
license: apache-2.0
language:
- ar
tags:
- tokenizer
- PBE
---
# Aranizer | Arabic Tokenizer

Aranizer is a PBE-based tokenizer designed for efficient and versatile tokenization of Arabic text.
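As a rough intuition for how BPE-style tokenizers like this one are built, the sketch below shows a single merge step: count adjacent symbol pairs across a character-split corpus and merge the most frequent pair into one symbol. This is an illustrative toy only, not Aranizer's actual training code; the corpus and frequencies are made up.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: two Arabic words split into characters, with frequencies
# ("كتب" = "he wrote", "كتاب" = "book" -- illustrative only).
corpus = {("ك", "ت", "ب"): 5, ("ك", "ت", "ا", "ب"): 3}
pair = most_frequent_pair(corpus)   # ("ك", "ت") appears 8 times in total
corpus = merge_pair(corpus, pair)
print(corpus)  # {("كت", "ب"): 5, ("كت", "ا", "ب"): 3}
```

Repeating this merge step until the vocabulary reaches a target size (32,000 here) yields the tokenizer's subword vocabulary.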
## Features
- Tokenizer Name: Aranizer
- Type: PBE tokenizer
- Vocabulary Size: 32,000
- Total Number of Tokens: 1,520,791
- Fertility Score: 1.975 (average number of subword tokens per word; lower means more efficient tokenization)
- Supports Arabic diacritization
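The fertility score above can be computed as total subword tokens divided by total whitespace-delimited words over an evaluation corpus. A minimal sketch, with made-up per-sentence counts (not the counts behind the 1.975 figure):

```python
def fertility(token_counts, word_counts):
    """Fertility = total subword tokens / total whitespace words."""
    return sum(token_counts) / sum(word_counts)

# Hypothetical counts for three short Arabic sentences (illustrative only).
token_counts = [8, 5, 6]  # tokens produced by the tokenizer
word_counts = [4, 3, 3]   # whitespace-delimited words

print(round(fertility(token_counts, word_counts), 3))  # → 1.9
```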
## How to Use the Aranizer Tokenizer

The Aranizer tokenizer can be loaded with the Hugging Face `transformers` library. Below is an example of how to load and use the tokenizer in a Python project:
```python
from transformers import AutoTokenizer

# Load the Aranizer tokenizer
tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-PBE-32k")

# Example usage
text = "اكتب النص العربي"  # "Write the Arabic text"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens:", tokens)
print("Token IDs:", token_ids)
```
## Citation
```bibtex
@article{koubaa2024arabiangpt,
  title={ArabianGPT: Native Arabic GPT-based Large Language Model},
  author={Koubaa, Anis and Ammar, Adel and Ghouti, Lahouari and Najar, Omar and Sibaee, Serry},
  year={2024},
  publisher={Preprints}
}
```