update readme.md
README.md
- **Vocabulary Size**: 32,000
- **Total Number of Tokens**: 1,387,929
- **Fertility Score**: 1.803
- It supports Arabic Diacritization
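The fertility score above is the average number of subword tokens the tokenizer produces per word (lower is better for Arabic coverage). A minimal sketch of the calculation; the corpus counts below are illustrative stand-ins, not the actual evaluation data:

```python
def fertility(num_tokens: int, num_words: int) -> float:
    """Average number of subword tokens produced per word."""
    return num_tokens / num_words

# Illustrative counts only: 1,803 tokens over 1,000 words gives a
# fertility of 1.803, matching the score reported above.
print(fertility(1803, 1000))  # 1.803
```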
## How to Use the Aranizer Tokenizer
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-SP-32k")

# Example usage
text = "اكتب النص العربي"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens:", tokens)
print("Token IDs:", token_ids)
```
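Under the hood, `convert_tokens_to_ids` is essentially a vocabulary lookup. A toy sketch of that mapping; the tokens and IDs below are made up for illustration and are not the real Aranizer-SP-32k vocabulary (real SentencePiece vocabularies also mark word boundaries with a `▁` prefix):

```python
# Toy vocabulary mapping tokens to integer IDs (illustrative only;
# the actual Aranizer-SP-32k vocabulary has 32,000 entries).
vocab = {"<unk>": 0, "اكتب": 5, "النص": 17, "العربي": 42}

def convert_tokens_to_ids(tokens, vocab, unk_id=0):
    # Unknown tokens fall back to the <unk> ID, as SentencePiece-style
    # tokenizers typically do.
    return [vocab.get(t, unk_id) for t in tokens]

print(convert_tokens_to_ids(["اكتب", "النص", "العربي"], vocab))  # [5, 17, 42]
```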
## Citation

```
@article{koubaa2024arabiangpt,
  title={ArabianGPT: Native Arabic GPT-based Large Language Model},
  author={Koubaa, Anis and Ammar, Adel and Ghouti, Lahouari and Necar, Omer and Sibaee, Serry},
  year={2024},
  publisher={Preprints}
}
```