riotu-lab/Aranizer-SP-32k

Arabic Tokenizer

riotu-lab committed on Aug 25

Commit 3d6691c • 1 Parent(s): af3e973

Update readme.md

Files changed (1)

README.md +41 -3

@@ -1,3 +1,41 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language:
+- ar
+tags:
+- SP
+- Aranizer
+- Arabic Tokenizer
+---
+# Aranizer | Arabic Tokenizer
+**Aranizer** is a SentencePiece-based tokenizer for Arabic, intended for a wide range of NLP tasks. It has a vocabulary of 32,000 tokens and a reported fertility score of 1.803, that is, about 1.8 tokens per word on average. The total number of tokens processed is 1,387,929.
+## Features
+- **Tokenizer Name**: Aranizer
+- **Type**: SentencePiece tokenizer
+- **Vocabulary Size**: 32,000
+- **Total Number of Tokens**: 1,387,929
+- **Fertility Score**: 1.803
+## How to Use the Aranizer Tokenizer
+The Aranizer tokenizer can be loaded with the Hugging Face `transformers` library. The example below loads the tokenizer, splits a sample string into tokens, and maps those tokens to their vocabulary IDs:
+```python
+from transformers import AutoTokenizer
+# Load the Aranizer tokenizer
+tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-SP-32k")
+# Example usage (Arabic input, since the tokenizer targets Arabic)
+text = "هذا نص تجريبي."
+tokens = tokenizer.tokenize(text)
+token_ids = tokenizer.convert_tokens_to_ids(tokens)
+print("Tokens:", tokens)
+print("Token IDs:", token_ids)
+```
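
The README reports a fertility score but does not say how it was computed or on which corpus. As a rough illustration only, fertility is commonly taken to be the average number of tokens produced per word. The sketch below assumes words are whitespace-separated and uses a hypothetical stand-in tokenizer (`toy_tokenize`) so that it runs without downloading anything; neither the function names nor the word definition come from the README.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


# Stand-in tokenizer for illustration only (an assumption, not Aranizer):
# it splits each word into chunks of at most 3 characters.
def toy_tokenize(text: str) -> List[str]:
    return [word[i:i + 3] for word in text.split() for i in range(0, len(word), 3)]


sample = ["abcdef gh", "ijklm"]
score = fertility(sample, toy_tokenize)
print(f"Fertility: {score:.3f}")
```

To measure the real tokenizer, one could pass `tokenizer.tokenize` from the README example in place of `toy_tokenize`, along with a list of Arabic sentences. The result depends on the evaluation text and on how words are counted, so it may not match the reported 1.803.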