# Arabic text classification using deep learning (ArabicT5)
# Our experiment
The category mapping:

```python
category_mapping = {
    'Politics': 1, 'Finance': 2, 'Medical': 3, 'Sports': 4,
    'Culture': 5, 'Tech': 6, 'Religion': 7,
}
```
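Since the model generates the numeric label as text, inverting this mapping recovers the category name. A minimal sketch; `id_to_category` is a hypothetical helper name, not part of the original card:

```python
# Hypothetical inverse mapping: generated label string -> category name.
id_to_category = {str(v): k for k, v in category_mapping.items()}

print(id_to_category["5"])  # Culture
```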
Training parameters:

| Parameter             | Value |
| :-------------------- | :---: |
| Training batch size   | 8     |
| Evaluation batch size | 8     |
| Learning rate         | 1e-4  |
| Max input length      | 200   |
| Max target length     | 3     |
| Number of workers     | 4     |
| Epochs                | 2     |
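The card does not include training code. The sketch below shows one way these hyperparameters could be wired into the `transformers` Trainer API; `output_dir` and anything not listed in the table is an assumption, not from the original card:

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the hyperparameter table above; values not in the table
# (e.g. output_dir) are assumptions.
training_args = Seq2SeqTrainingArguments(
    output_dir="arabict5-classification",  # hypothetical path
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=2,
    dataloader_num_workers=4,
    predict_with_generate=True,   # decode labels during evaluation
    generation_max_length=3,      # "Max target length" from the table
)
# The input side ("Max input length" = 200) is applied at tokenization
# time, e.g. tokenizer(text, max_length=200, truncation=True).
```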
Results:

| Metric          | Value  |
| :-------------- | :----: |
| Validation loss | 0.0479 |
| Accuracy        | 96.49% |
| BLEU            | 96.49% |
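Accuracy here is plausibly exact-match between the generated label string and the gold label, which would also explain why accuracy and BLEU coincide on single-token targets; this is an inference, not stated in the card. A toy sketch with placeholder lists:

```python
# Exact-match accuracy over generated label strings (placeholder data).
predictions = ["5", "1", "4"]  # hypothetical model outputs
references = ["5", "1", "2"]   # hypothetical gold labels

accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)
print(f"{accuracy:.2%}")  # 66.67% on this toy data
```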
# SANAD: Single-label Arabic News Articles Dataset for automatic text categorization
# Arabic text classification using deep learning models
Paper: <https://www.sciencedirect.com/science/article/abs/pii/S0306457319303413>
Their experiment: "Our experimental results showed that all models did very well on SANAD corpus with a minimum accuracy of 93.43%, achieved by CGRU, and top performance of 95.81%, achieved by HANGRU."

| Model | Accuracy |
| :---------------------: | :---------------------: |
| CGRU | 93.43% |
| HANGRU | 95.81% |
# Example usage
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "Hezam/ArabicT5_Classification"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

# Sample Arabic news snippet about a Moroccan TV channel.
text = "الزين فيك القناه الاولي المغربيه الزين فيك القناه الاولي المغربيه اخبارنا المغربيه متابعه تفاجا زوار موقع القناه الاولي المغربي"

# Tokenize the input, padding/truncating to the training input length.
tokens = tokenizer(
    text,
    max_length=200,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)

# Generate the label; the target is at most 3 tokens long.
output = model.generate(
    tokens["input_ids"],
    max_length=3,
    length_penalty=10,
)
output = [
    tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    for ids in output
]
print(output)
# ['5']
```
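The generated string `'5'` corresponds to `Culture` in the category mapping above (e.g. `id_to_category[output[0]]` with the hypothetical inverse mapping sketched earlier).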