riotu-lab committed on
Commit
d5fae53
1 Parent(s): 3d6691c

update readme.md

Files changed (1)
  1. README.md +11 -1
README.md CHANGED
@@ -19,6 +19,7 @@ tags:
  - **Vocabulary Size**: 32,000
  - **Total Number of Tokens**: 1,387,929
  - **Fertility Score**: 1.803
+ - It supports Arabic Diacritization
 
  ## How to Use the Aranizer Tokenizer
 
@@ -31,7 +32,7 @@ from transformers import AutoTokenizer
  tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-SP-32k")
 
  # Example usage
- text = "This is a sample text to tokenize."
+ text = "اكتب النص العربي"
  tokens = tokenizer.tokenize(text)
  token_ids = tokenizer.convert_tokens_to_ids(tokens)
 
@@ -39,3 +40,12 @@ print("Tokens:", tokens)
  print("Token IDs:", token_ids)
  ```
 
+ ## Citation
+
+ @article{koubaa2024arabiangpt,
+ title={ArabianGPT: Native Arabic GPT-based Large Language Model},
+ author={Koubaa, Anis and Ammar, Adel and Ghouti, Lahouari and Necar, Omer and Sibaee, Serry},
+ year={2024},
+ publisher={Preprints}
+ }
+
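The README touched by this commit reports a fertility score and states that the tokenizer supports Arabic diacritization, but it does not say how either was measured. The sketch below is one way a reader might check both. It assumes fertility means the average number of tokens per whitespace-delimited word (the usual definition, not confirmed by the README) and treats the code points U+064B to U+0652 as the diacritics of interest. The helper names `fertility`, `arabic_diacritics`, and `keeps_diacritics` are hypothetical and not part of any library. Each takes a `tokenize` callable, so it can be tried with a stand-in before loading the real model.

```python
from typing import Callable, Iterable, List

Tokenize = Callable[[str], List[str]]


def fertility(texts: Iterable[str], tokenize: Tokenize) -> float:
    """Average number of tokens per whitespace-delimited word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words to measure")
    return total_tokens / total_words


def arabic_diacritics(text: str) -> List[str]:
    """Return the Arabic diacritics (U+064B..U+0652) found in text, in order."""
    return [ch for ch in text if "\u064b" <= ch <= "\u0652"]


def keeps_diacritics(text: str, tokenize: Tokenize) -> bool:
    """True if the tokens, joined together, carry the same diacritics as text."""
    return arabic_diacritics("".join(tokenize(text))) == arabic_diacritics(text)


if __name__ == "__main__":
    # Stand-in tokenizer: one token per non-space character (illustration only).
    char_tokenize = lambda t: [c for c in t if not c.isspace()]
    print("Fertility:", fertility(["ab cd"], char_tokenize))
    # "kataba" with three fatha marks, written with escapes for clarity.
    sample = "\u0643\u064e\u062a\u064e\u0628\u064e"
    print("Diacritics kept:", keeps_diacritics(sample, char_tokenize))
```

With the tokenizer loaded as in the README, `tokenizer.tokenize` can be passed as the `tokenize` argument. The result will only approach the reported 1.803 if the same evaluation corpus and word-splitting rule are used, and neither is specified here.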