---
language:
- fa
- en
license: mit
library_name: transformers
tags:
- Tokenizer
---
# Improved LLaMA 2 Tokenizer with Persian Language Support

## Model Description

This tokenizer is an improved version of the LLaMA 2 tokenizer, specifically enhanced to provide better support for the Persian language. It combines the original LLaMA 2 tokenizer with a custom tokenizer trained on the Persian Wikipedia corpus, resulting in improved tokenization for Persian text while maintaining support for other languages.

### Key Features
- Enhanced support for Persian language tokenization
- Maintains the multilingual capabilities of the original LLaMA 2 tokenizer
- Improved handling of Persian-specific characters and word structures
- Larger vocabulary (36,954 tokens, up from the original 32,000) to accommodate Persian tokens
## Training Data

The tokenizer was created in two steps:

1. A separate tokenizer with 5000 merges was trained on the Persian Wikipedia corpus to capture Persian-specific tokenization patterns (a sketch of this step is shown below).
2. This Persian-specific tokenizer was then merged with the original LLaMA 2 tokenizer.
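
The card does not include the training script for step 1. A minimal sketch using the `sentencepiece` library (the same tokenizer family LLaMA 2 uses) might look like the following; the corpus path `fa_wiki.txt` is a hypothetical plain-text export of the Persian Wikipedia dump, not a file shipped with this repository.

```python
# Sketch of step 1: train a small Persian BPE tokenizer on Persian Wikipedia.
# "fa_wiki.txt" is a placeholder path: one sentence or paragraph per line.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="fa_wiki.txt",
    model_prefix="fa_wiki_bpe",  # writes fa_wiki_bpe.model and fa_wiki_bpe.vocab
    model_type="bpe",            # same algorithm family as the LLaMA 2 tokenizer
    vocab_size=5000,             # the 5000-entry Persian vocabulary described in this card
    character_coverage=1.0,      # keep all Persian characters
)
```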
## Training Procedure

1. Persian Wikipedia tokenizer training:
   - Corpus: Persian Wikipedia dump
   - Tokenization algorithm: BPE
   - Vocabulary size: 5000
2. Merging with the LLaMA 2 tokenizer (a sketch of the merge is shown below):
   - Base tokenizer: LLaMA 2 tokenizer
   - Final vocabulary size: 36954
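
The merge script itself is not part of this card. Assuming both tokenizers are SentencePiece models, a minimal sketch of step 2 following the common proto-merging recipe could look like this; the paths `llama2/tokenizer.model` and `fa_wiki_bpe.model` are placeholders for the original LLaMA 2 tokenizer and the Persian tokenizer from step 1.

```python
# Sketch of step 2: append every Persian piece that the original LLaMA 2
# vocabulary does not already contain, then save the merged model.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

llama_proto = sp_pb2.ModelProto()
with open("llama2/tokenizer.model", "rb") as f:  # original LLaMA 2 tokenizer (placeholder path)
    llama_proto.ParseFromString(f.read())

persian_proto = sp_pb2.ModelProto()
with open("fa_wiki_bpe.model", "rb") as f:       # Persian tokenizer from step 1 (placeholder path)
    persian_proto.ParseFromString(f.read())

existing = {p.piece for p in llama_proto.pieces}
for p in persian_proto.pieces:
    if p.piece not in existing:
        llama_proto.pieces.add(piece=p.piece, score=0.0)

print("merged vocabulary size:", len(llama_proto.pieces))  # 36954 for this tokenizer

with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
```

The merged `.model` file can then be wrapped with `transformers.LlamaTokenizer(vocab_file="merged_tokenizer.model")` and saved with `save_pretrained` to produce tokenizer files like the ones in this repository.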
## Usage

To use this tokenizer with the Hugging Face Transformers library:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-username/llama2-persian-tokenizer")

# Example usage with a Persian sentence ("This is an example in the Persian language.")
text = "این یک مثال به زبان فارسی است."
tokens = tokenizer(text)
print(tokens)
```
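
To see the effect on Persian text, one quick check is to compare token counts against the original LLaMA 2 tokenizer. The sketch below assumes access to the gated `meta-llama/Llama-2-7b-hf` repository for the original tokenizer; with this merged tokenizer the same sentence should come out as fewer tokens.

```python
from transformers import AutoTokenizer

persian_tok = AutoTokenizer.from_pretrained("your-username/llama2-persian-tokenizer")
original_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated; requires access

text = "این یک مثال به زبان فارسی است."  # "This is an example in the Persian language."
print("merged tokenizer :", len(persian_tok.tokenize(text)), "tokens")
print("original LLaMA 2 :", len(original_tok.tokenize(text)), "tokens")
```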