---
license: apache-2.0
language:
- ar
tags:
- SP
- Aranizer
- Arabic Tokenizer
---

# Aranizer | Arabic Tokenizer

**Aranizer** is an Arabic SentencePiece-based tokenizer designed for efficient and versatile tokenization. It features a vocabulary of 32,000 tokens and achieves a fertility score of 1.803 over a total of 1,387,929 processed tokens, making it suitable for a wide range of Arabic NLP tasks.

## Features

- **Tokenizer Name**: Aranizer
- **Type**: SentencePiece tokenizer
- **Vocabulary Size**: 32,000
- **Total Number of Tokens**: 1,387,929
- **Fertility Score**: 1.803 (see the sketch below for how this metric is typically estimated)
- **Diacritics**: supports Arabic diacritized text

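
Fertility is commonly defined as the average number of subword tokens produced per whitespace-delimited word, so lower values mean fewer splits per word. The snippet below is a minimal, illustrative sketch of how such a score can be estimated on a small sample; it is not the evaluation setup behind the 1.803 figure, and the sample sentences are arbitrary:

```python
from transformers import AutoTokenizer

# Illustrative sketch: estimate fertility (subword tokens per whitespace word)
# on a tiny, arbitrary sample. The reported 1.803 was measured on a much larger
# corpus, so numbers on a small sample will differ.
tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-SP-32k")

sample_texts = [
    "اكتب النص العربي",
    "معالجة اللغة العربية الطبيعية",
]

total_tokens = sum(len(tokenizer.tokenize(text)) for text in sample_texts)
total_words = sum(len(text.split()) for text in sample_texts)

print("Estimated fertility on this sample:", total_tokens / total_words)
```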
## How to Use the Aranizer Tokenizer

The Aranizer tokenizer can be easily loaded using the `transformers` library from Hugging Face. Below is an example of how to load and use the tokenizer in your Python project:

```python
from transformers import AutoTokenizer

# Load the Aranizer tokenizer
tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-SP-32k")

# Example usage: tokenize an Arabic sentence and map tokens to their IDs
text = "اكتب النص العربي"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens:", tokens)
print("Token IDs:", token_ids)
```
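
The tokenizer also follows the standard `transformers` calling interface for producing model-ready input IDs and for decoding IDs back to text. The snippet below is a small follow-on sketch that reuses the `tokenizer` and `text` variables from the example above; nothing here is Aranizer-specific:

```python
# Encode the text in one call; returns a dict-like object with input_ids
# (and an attention_mask) ready to feed to a model.
encoded = tokenizer(text)
print("Input IDs:", encoded["input_ids"])

# Decode the IDs back into a readable string.
decoded = tokenizer.decode(encoded["input_ids"], skip_special_tokens=True)
print("Decoded text:", decoded)
```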
## Citation

```bibtex
@article{koubaa2024arabiangpt,
  title={ArabianGPT: Native Arabic GPT-based Large Language Model},
  author={Koubaa, Anis and Ammar, Adel and Ghouti, Lahouari and Necar, Omer and Sibaee, Serry},
  year={2024},
  publisher={Preprints}
}
```