---
license: apache-2.0
language:
- ar
tags:
- tokenizer
- PBE
---

# Aranizer | Arabic Tokenizer

**Aranizer** is a PBE-based Arabic tokenizer designed for efficient and versatile tokenization.

## Features

- **Tokenizer Name**: Aranizer
- **Type**: PBE tokenizer
- **Vocabulary Size**: 32,000
- **Total Number of Tokens**: 1,520,791
- **Fertility Score**: 1.975 (see the measurement sketch below)
- **Diacritics**: supports diacritized Arabic text (illustrated in the sketch below)

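The fertility score above is commonly defined as the average number of tokens produced per whitespace-separated word; lower values indicate more compact tokenization. Below is a minimal sketch of how such a figure can be measured, using the same model ID as in the usage example further down and a tiny illustrative sample (the reported 1.975 comes from the tokenizer's own evaluation corpus, not from this snippet). The second sample sentence is fully diacritized, illustrating the diacritics support listed above.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-PBE-32k")

# Illustrative sample only; not the corpus behind the reported fertility score
sample_texts = [
    "اكتب النص العربي",
    "السَّلَامُ عَلَيْكُمْ وَرَحْمَةُ اللَّهِ",  # diacritized input is tokenized as-is
]

total_tokens = sum(len(tokenizer.tokenize(text)) for text in sample_texts)
total_words = sum(len(text.split()) for text in sample_texts)

# Fertility = tokens per whitespace-separated word
print("Fertility on this sample:", total_tokens / total_words)
```
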
## How to Use the Aranizer Tokenizer

The Aranizer tokenizer can be loaded with the Hugging Face `transformers` library. The example below shows how to load it and tokenize text in a Python project:

```python
from transformers import AutoTokenizer

# Load the Aranizer tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-PBE-32k")

# Example usage
text = "اكتب النص العربي"  # "Write the Arabic text"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens:", tokens)
print("Token IDs:", token_ids)
```

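In most downstream pipelines you will want token IDs directly rather than string tokens. As a small follow-up sketch using the same model ID, calling the tokenizer directly and then decoding gives a full round trip:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-PBE-32k")
text = "اكتب النص العربي"  # "Write the Arabic text"

# Calling the tokenizer returns input IDs ready to feed a model
encoded = tokenizer(text)
print("Input IDs:", encoded["input_ids"])

# Decode back to text to check the round trip
print("Decoded:", tokenizer.decode(encoded["input_ids"], skip_special_tokens=True))
```
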
## Citation

```bibtex
@article{koubaa2024arabiangpt,
  title={ArabianGPT: Native Arabic GPT-based Large Language Model},
  author={Koubaa, Anis and Ammar, Adel and Ghouti, Lahouari and Necar, Omer and Sibaee, Serry},
  year={2024},
  publisher={Preprints}
}
```