metadata

license: apache-2.0
language:
  - ar
tags:
  - SP
  - Aranizer
  - Arabic Tokenizer

Aranizer | Arabic Tokenizer

Aranizer is an Arabic tokenizer built on SentencePiece. It has a vocabulary of 32,000 tokens and a reported fertility score of 1.803, with a reported total of 1,387,929 tokens. It is intended for general-purpose Arabic NLP tasks.

Features

Tokenizer Name: Aranizer
Type: SentencePiece tokenizer
Vocabulary Size: 32,000
Total Number of Tokens: 1,387,929
Fertility Score: 1.803
Arabic Diacritization: Supported
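
For context on the fertility figure: fertility is commonly taken to mean the average number of tokens produced per word. This card does not state how its score was computed, so the following is only a minimal sketch under that assumption, with words split on whitespace. The names `fertility` and `toy_tokenize` are hypothetical helpers introduced for illustration and are not part of Aranizer or transformers.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())      # words = whitespace-separated chunks
        total_tokens += len(tokenize(text))   # tokens = whatever the tokenizer returns
    if total_words == 0:
        raise ValueError("cannot compute fertility on an empty corpus")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts each word into 3-character pieces."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return pieces
```

To measure a real tokenizer, pass its `tokenize` method in place of `toy_tokenize`, for example `fertility(corpus, tokenizer.tokenize)`. The result depends on the corpus and on how words are counted, so it may not reproduce the 1.803 reported here.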

How to Use the Aranizer Tokenizer

The Aranizer tokenizer can be loaded with the Hugging Face transformers library. The example below loads the tokenizer and tokenizes a short Arabic string:

from transformers import AutoTokenizer

# Load the Aranizer tokenizer
tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-SP-32k")

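# Example usage (the text means "Write the Arabic text")
text = "اكتب النص العربي"
tokens = tokenizer.tokenize(text)  # split the text into subword pieces
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # map each piece to its vocabulary ID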

print("Tokens:", tokens)
print("Token IDs:", token_ids)
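
A common follow-up is to go straight from text to IDs and back again. The sketch below assumes a tokenizer object exposing the standard transformers `encode` and `decode` methods. `round_trip` is a hypothetical helper name, not something provided by Aranizer.

```python
from typing import Any, List, Tuple


def round_trip(tokenizer: Any, text: str) -> Tuple[List[int], str]:
    """Encode text to token IDs, then decode the IDs back to a string."""
    # add_special_tokens=False keeps markers such as BOS/EOS out of the ID list
    ids = tokenizer.encode(text, add_special_tokens=False)
    # skip_special_tokens=True drops any special tokens that still appear
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return ids, decoded
```

With the tokenizer loaded above, `ids, decoded = round_trip(tokenizer, text)` returns the IDs and the reconstructed string. SentencePiece-based tokenizers may normalize whitespace or characters, so the decoded text is not guaranteed to match the input exactly.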

Citation

@article{koubaa2024arabiangpt,
  title={ArabianGPT: Native Arabic GPT-based Large Language Model},
  author={Koubaa, Anis and Ammar, Adel and Ghouti, Lahouari and Necar, Omer and Sibaee, Serry},
  year={2024},
  publisher={Preprints}
}