update readme.md
README.md
- **Vocabulary Size**: 32,000
- **Total Number of Tokens**: 1,387,929
- **Fertility Score**: 1.803
- It supports Arabic Diacritization
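The fertility score above is the average number of subword tokens the tokenizer produces per word (lower is better for Arabic coverage). A minimal sketch of the calculation; the corpus counts below are illustrative stand-ins, not the actual evaluation data:

```python
def fertility(num_tokens: int, num_words: int) -> float:
    """Average number of subword tokens produced per word."""
    return num_tokens / num_words

# Illustrative counts only: 1,803 tokens over 1,000 words gives a
# fertility of 1.803, matching the score reported above.
print(fertility(1803, 1000))  # 1.803
```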
## How to Use the Aranizer Tokenizer
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-SP-32k")

# Example usage
text = "اكتب النص العربي"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens:", tokens)
print("Token IDs:", token_ids)
```
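Under the hood, `convert_tokens_to_ids` is essentially a vocabulary lookup. A toy sketch of that mapping; the tokens and IDs below are made up for illustration and are not the real Aranizer-SP-32k vocabulary (real SentencePiece vocabularies also mark word boundaries with a `▁` prefix):

```python
# Toy vocabulary mapping tokens to integer IDs (illustrative only;
# the actual Aranizer-SP-32k vocabulary has 32,000 entries).
vocab = {"<unk>": 0, "اكتب": 5, "النص": 17, "العربي": 42}

def convert_tokens_to_ids(tokens, vocab, unk_id=0):
    # Unknown tokens fall back to the <unk> ID, as SentencePiece-style
    # tokenizers typically do.
    return [vocab.get(t, unk_id) for t in tokens]

print(convert_tokens_to_ids(["اكتب", "النص", "العربي"], vocab))  # [5, 17, 42]
```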
## Citation

```
@article{koubaa2024arabiangpt,
  title={ArabianGPT: Native Arabic GPT-based Large Language Model},
  author={Koubaa, Anis and Ammar, Adel and Ghouti, Lahouari and Necar, Omer and Sibaee, Serry},
  year={2024},
  publisher={Preprints}
}
```