Commit a5f1035 by riotu-lab
1 parent: 5bbca3c

update readme.md

Files changed (1)
  1. README.md +43 -1
README.md CHANGED
@@ -5,4 +5,46 @@ language:
  tags:
  - tokenizer
  - PBE
- ---
+ ---
+
+ # Aranizer | Arabic Tokenizer
+
+ **Aranizer** is an Arabic PBE-based tokenizer designed for efficient and versatile tokenization.
+
+ ## Features
+
+ - **Tokenizer Name**: Aranizer
+ - **Type**: PBE tokenizer
+ - **Vocabulary Size**: 32,000
+ - **Total Number of Tokens**: 1,520,791
+ - **Fertility Score**: 1.975
+ - **Diacritization**: Supports Arabic diacritization
+
+ ## How to Use the Aranizer Tokenizer
+
+ The Aranizer tokenizer can be loaded with the Hugging Face `transformers` library. The example below loads the tokenizer and tokenizes a short Arabic sentence:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Load the Aranizer tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-PBE-32k")
+
+ # Split a sample sentence ("Write the Arabic text") into tokens, then map them to vocabulary IDs
+ text = "اكتب النص العربي"
+ tokens = tokenizer.tokenize(text)
+ token_ids = tokenizer.convert_tokens_to_ids(tokens)
+
+ print("Tokens:", tokens)
+ print("Token IDs:", token_ids)
+ ```
+
+ ## Citation
+ ```bibtex
+ @article{koubaa2024arabiangpt,
+   title={ArabianGPT: Native Arabic GPT-based Large Language Model},
+   author={Koubaa, Anis and Ammar, Adel and Ghouti, Lahouari and Necar, Omer and Sibaee, Serry},
+   year={2024},
+   publisher={Preprints}
+ }
+ ```
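
Note on the fertility score listed under Features: the README does not say how the value 1.975 was computed or on which corpus. The sketch below assumes the common definition of fertility as the average number of tokens produced per whitespace-separated word. The `fertility` helper and the `toy_tokenize` stand-in are hypothetical names introduced here for illustration; they are not part of Aranizer or `transformers`.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts each word into chunks of at most 3 characters."""
    return [word[i:i + 3] for word in text.split() for i in range(0, len(word), 3)]


# Same sample sentence as in the README example
sample = ["اكتب النص العربي"]
print("Fertility:", fertility(sample, toy_tokenize))
```

To measure the real tokenizer, pass `tokenizer.tokenize` from the README example in place of `toy_tokenize`. The result depends on the evaluation text, so it will not necessarily match the reported score.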