---
license: apache-2.0
language:
- ar
tags:
- SP
- Aranizer
- Arabic Tokenizer
---

# Aranizer | Arabic Tokenizer

**Aranizer** is an Arabic SentencePiece-based tokenizer designed for efficient and versatile tokenization. It has a vocabulary of 32,000 tokens and achieves a fertility score of 1.803 over a total of 1,387,929 processed tokens, making it suitable for a wide range of Arabic NLP tasks.

## Features

- **Tokenizer Name**: Aranizer
- **Type**: SentencePiece tokenizer
- **Vocabulary Size**: 32,000
- **Total Number of Tokens**: 1,387,929
- **Fertility Score**: 1.803 (average subword tokens per word; see the sketch below)
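
Fertility is commonly measured as the average number of subword tokens produced per whitespace-separated word, so lower values indicate a more compact segmentation. Below is a minimal sketch of how such a score can be computed, assuming a whitespace word split; the corpus shown is a hypothetical placeholder, and the reported 1.803 comes from the authors' own evaluation data:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-SP-32k")

# Hypothetical sample corpus; replace with a real evaluation corpus.
corpus = [
    "اللغة العربية من أكثر اللغات انتشارا في العالم.",
    "تعلم الآلة فرع من فروع الذكاء الاصطناعي.",
]

# Fertility = total subword tokens / total whitespace-separated words
num_tokens = sum(len(tokenizer.tokenize(sentence)) for sentence in corpus)
num_words = sum(len(sentence.split()) for sentence in corpus)

print("Fertility:", num_tokens / num_words)
```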

## How to Use the Aranizer Tokenizer

The Aranizer tokenizer can be easily loaded using the `transformers` library from Hugging Face. Below is an example of how to load and use the tokenizer in your Python project:

```python
from transformers import AutoTokenizer

# Load the Aranizer tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-SP-32k")

# Example usage with an Arabic sentence
text = "هذا مثال لنص عربي نريد تجزئته."
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens:", tokens)
print("Token IDs:", token_ids)
```
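
Since Aranizer loads as a standard `transformers` tokenizer, the usual `__call__` and `decode` APIs work as well. A minimal round-trip sketch (the sample sentence is arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-SP-32k")

# Encode a sentence directly to input IDs
encoded = tokenizer("هذا مثال لنص عربي.")
print("Input IDs:", encoded["input_ids"])

# Decode the IDs back to text, dropping any special tokens that were added
print("Decoded:", tokenizer.decode(encoded["input_ids"], skip_special_tokens=True))
```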