# Arabic text classification using deep learning (ArabicT5)
# Our experiment
The category mapping:

```python
category_mapping = {
    'Politics': 1, 'Finance': 2, 'Medical': 3, 'Sports': 4,
    'Culture': 5, 'Tech': 6, 'Religion': 7,
}
```
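Since the model generates the numeric label as text, inverting this mapping recovers the category name. A minimal sketch; `id_to_category` is a hypothetical helper name, not part of the original card:

```python
# Hypothetical inverse mapping: generated label string -> category name.
id_to_category = {str(v): k for k, v in category_mapping.items()}

print(id_to_category["5"])  # Culture
```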
Training parameters:

| Parameter             | Value |
| :-------------------- | :---: |
| Training batch size   | 8     |
| Evaluation batch size | 8     |
| Learning rate         | 1e-4  |
| Max input length      | 200   |
| Max target length     | 3     |
| Number of workers     | 4     |
| Epochs                | 2     |
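The card does not include training code. The sketch below shows one way these hyperparameters could be wired into the `transformers` Trainer API; `output_dir` and anything not listed in the table is an assumption, not from the original card:

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the hyperparameter table above; values not in the table
# (e.g. output_dir) are assumptions.
training_args = Seq2SeqTrainingArguments(
    output_dir="arabict5-classification",  # hypothetical path
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=2,
    dataloader_num_workers=4,
    predict_with_generate=True,   # decode labels during evaluation
    generation_max_length=3,      # "Max target length" from the table
)
# The input side ("Max input length" = 200) is applied at tokenization
# time, e.g. tokenizer(text, max_length=200, truncation=True).
```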
Results:

| Metric          | Value  |
| :-------------- | :----: |
| Validation loss | 0.0479 |
| Accuracy        | 96.49% |
| BLEU            | 96.49% |
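Accuracy here is plausibly exact-match between the generated label string and the gold label, which would also explain why accuracy and BLEU coincide on single-token targets; this is an inference, not stated in the card. A toy sketch with placeholder lists:

```python
# Exact-match accuracy over generated label strings (placeholder data).
predictions = ["5", "1", "4"]  # hypothetical model outputs
references = ["5", "1", "2"]   # hypothetical gold labels

accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)
print(f"{accuracy:.2%}")  # 66.67% on this toy data
```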
# SANAD: Single-label Arabic News Articles Dataset for automatic text categorization
# Arabic text classification using deep learning models
Paper: <https://www.sciencedirect.com/science/article/abs/pii/S0306457319303413>
Their experiment: "Our experimental results showed that all models did very well on SANAD corpus with a minimum accuracy of 93.43%, achieved by CGRU, and top performance of 95.81%, achieved by HANGRU."

| Model | Accuracy |
| :---------------------: | :---------------------: |
| CGRU | 93.43% |
| HANGRU | 95.81% |
# Example usage
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "Hezam/ArabicT5_Classification"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

# Sample Arabic news snippet about a Moroccan TV channel.
text = "الزين فيك القناه الاولي المغربيه الزين فيك القناه الاولي المغربيه اخبارنا المغربيه متابعه تفاجا زوار موقع القناه الاولي المغربي"

# Tokenize the input, padding/truncating to the training input length.
tokens = tokenizer(
    text,
    max_length=200,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)

# Generate the label; the target is at most 3 tokens long.
output = model.generate(
    tokens["input_ids"],
    max_length=3,
    length_penalty=10,
)
output = [
    tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    for ids in output
]
print(output)
# ['5']
```
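The generated string `'5'` corresponds to `Culture` in the category mapping above (e.g. `id_to_category[output[0]]` with the hypothetical inverse mapping sketched earlier).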