---
language:
- fa
- en
license: mit
library_name: transformers
tags:
- Tokenizer
---
# Improved LLaMA 2 Tokenizer with Persian Language Support

## Model Description

This tokenizer is an improved version of the LLaMA 2 tokenizer, specifically enhanced to provide better support for the Persian language. It combines the original LLaMA 2 tokenizer with a custom tokenizer trained on the Persian Wikipedia corpus, resulting in improved tokenization for Persian text while maintaining support for other languages.

### Key Features
- Enhanced support for Persian language tokenization
- Maintains the multilingual capabilities of the original LLaMA 2 tokenizer
- Improved handling of Persian-specific characters and word structures
- Larger vocabulary (36,954 tokens, up from the original 32,000) to accommodate Persian tokens
## Training Data

The tokenizer was created in two steps:

1. A separate tokenizer with 5000 merges was trained on the Persian Wikipedia corpus to capture Persian-specific tokenization patterns (a sketch of this step is shown below).
2. This Persian-specific tokenizer was then merged with the original LLaMA 2 tokenizer.
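
The card does not include the training script for step 1. A minimal sketch using the `sentencepiece` library (the same tokenizer family LLaMA 2 uses) might look like the following; the corpus path `fa_wiki.txt` is a hypothetical plain-text export of the Persian Wikipedia dump, not a file shipped with this repository.

```python
# Sketch of step 1: train a small Persian BPE tokenizer on Persian Wikipedia.
# "fa_wiki.txt" is a placeholder path: one sentence or paragraph per line.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="fa_wiki.txt",
    model_prefix="fa_wiki_bpe",  # writes fa_wiki_bpe.model and fa_wiki_bpe.vocab
    model_type="bpe",            # same algorithm family as the LLaMA 2 tokenizer
    vocab_size=5000,             # the 5000-entry Persian vocabulary described in this card
    character_coverage=1.0,      # keep all Persian characters
)
```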
## Training Procedure

1. Persian Wikipedia tokenizer training:
   - Corpus: Persian Wikipedia dump
   - Tokenization algorithm: BPE
   - Vocabulary size: 5000
2. Merging with the LLaMA 2 tokenizer (a sketch of the merge is shown below):
   - Base tokenizer: LLaMA 2 tokenizer
   - Final vocabulary size: 36954
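
The merge script itself is not part of this card. Assuming both tokenizers are SentencePiece models, a minimal sketch of step 2 following the common proto-merging recipe could look like this; the paths `llama2/tokenizer.model` and `fa_wiki_bpe.model` are placeholders for the original LLaMA 2 tokenizer and the Persian tokenizer from step 1.

```python
# Sketch of step 2: append every Persian piece that the original LLaMA 2
# vocabulary does not already contain, then save the merged model.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

llama_proto = sp_pb2.ModelProto()
with open("llama2/tokenizer.model", "rb") as f:  # original LLaMA 2 tokenizer (placeholder path)
    llama_proto.ParseFromString(f.read())

persian_proto = sp_pb2.ModelProto()
with open("fa_wiki_bpe.model", "rb") as f:       # Persian tokenizer from step 1 (placeholder path)
    persian_proto.ParseFromString(f.read())

existing = {p.piece for p in llama_proto.pieces}
for p in persian_proto.pieces:
    if p.piece not in existing:
        llama_proto.pieces.add(piece=p.piece, score=0.0)

print("merged vocabulary size:", len(llama_proto.pieces))  # 36954 for this tokenizer

with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
```

The merged `.model` file can then be wrapped with `transformers.LlamaTokenizer(vocab_file="merged_tokenizer.model")` and saved with `save_pretrained` to produce tokenizer files like the ones in this repository.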
## Usage

To use this tokenizer with the Hugging Face Transformers library:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-username/llama2-persian-tokenizer")

# Example usage with a Persian sentence ("This is an example in the Persian language.")
text = "این یک مثال به زبان فارسی است."
tokens = tokenizer(text)
print(tokens)
```
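
To see the effect on Persian text, one quick check is to compare token counts against the original LLaMA 2 tokenizer. The sketch below assumes access to the gated `meta-llama/Llama-2-7b-hf` repository for the original tokenizer; with this merged tokenizer the same sentence should come out as fewer tokens.

```python
from transformers import AutoTokenizer

persian_tok = AutoTokenizer.from_pretrained("your-username/llama2-persian-tokenizer")
original_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated; requires access

text = "این یک مثال به زبان فارسی است."  # "This is an example in the Persian language."
print("merged tokenizer :", len(persian_tok.tokenize(text)), "tokens")
print("original LLaMA 2 :", len(original_tok.tokenize(text)), "tokens")
```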