Model Card for NADI-2024-baseline
A BERT-based model fine-tuned for single-label Arabic Dialect Identification (ADI). At inference time, its logits are used to generate multilabel predictions rather than only the single most probable dialect.
Model Description
- Model type: A Dialect Identification model fine-tuned on the training sets of NADI 2020, 2021, and 2023, and MADAR 2018.
- Language(s) (NLP): Arabic.
- Fine-tuned from model: MarBERTv2
Multilabel country-level Dialect Identification
Baseline I (Top 90%)
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
DIALECTS = [
    "Algeria",
    "Bahrain",
    "Egypt",
    "Iraq",
    "Jordan",
    "Kuwait",
    "Lebanon",
    "Libya",
    "Morocco",
    "Oman",
    "Palestine",
    "Qatar",
    "Saudi_Arabia",
    "Sudan",
    "Syria",
    "Tunisia",
    "UAE",
    "Yemen",
]
assert len(DIALECTS) == 18
MODEL_NAME = "AMR-KELEG/NADI2024-baseline"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
def predict_top_p(text, P=0.9):
    """Predict the top dialects with a cumulative confidence of at least P."""
    assert 0 <= P <= 1

    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        logits = model(**tokenizer(text, return_tensors="pt")).logits
    probabilities = torch.softmax(logits, dim=1).flatten().tolist()
    # Dialect indices sorted from most to least probable
    topk_predictions = torch.topk(logits, 18).indices.flatten().tolist()

    predictions = [0 for _ in range(18)]
    total_prob = 0
    # Add dialects in order of probability until their total reaches P
    for i in range(18):
        total_prob += probabilities[topk_predictions[i]]
        predictions[topk_predictions[i]] = 1
        if total_prob >= P:
            break

    return [DIALECTS[i] for i, p in enumerate(predictions) if p == 1]
s1 = "كيفك يا زلمة"
s1_pred = predict_top_p(s1) # ['Jordan', 'Lebanon', 'Palestine', 'Syria']
print(s1, s1_pred)
s2 = "خليلي في مساج بريفي كيفاش الاتصال"
s2_pred = predict_top_p(s2) # ['Algeria', 'Tunisia']
print(s2, s2_pred)
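The selection rule used by the baseline can also be tried without downloading the model. The following is a minimal sketch in plain Python, assuming a made-up probability distribution over four dialects; the helper names `softmax` and `top_p_labels` and the toy values are hypothetical and are not part of the released code.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_labels(probabilities, labels, p=0.9):
    """Return the most probable labels whose cumulative probability reaches p."""
    assert 0 <= p <= 1
    # Label indices sorted from most to least probable
    ranked = sorted(
        range(len(probabilities)), key=lambda i: probabilities[i], reverse=True
    )
    selected = set()
    cumulative = 0.0
    for i in ranked:
        cumulative += probabilities[i]
        selected.add(i)
        if cumulative >= p:
            break
    # Report the selected labels in their original order
    return [labels[i] for i in range(len(labels)) if i in selected]


# Hypothetical values, chosen only to illustrate the rule
toy_labels = ["Egypt", "Jordan", "Lebanon", "Syria"]
toy_probs = [0.05, 0.30, 0.25, 0.40]
print(top_p_labels(toy_probs, toy_labels, p=0.9))  # ['Jordan', 'Lebanon', 'Syria']
```

With these toy values, Syria (0.40) and Jordan (0.30) together reach only 0.70, so Lebanon (0.25) is added to pass the 0.9 threshold. The most probable label is always returned, even when `p` is 0.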
Citation
If you find the model useful, please cite the following paper:
@inproceedings{abdul-mageed-etal-2024-nadi,
title = "{NADI} 2024: The Fifth Nuanced {A}rabic Dialect Identification Shared Task",
author = "Abdul-Mageed, Muhammad and
Keleg, Amr and
Elmadany, AbdelRahim and
Zhang, Chiyu and
Hamed, Injy and
Magdy, Walid and
Bouamor, Houda and
Habash, Nizar",
editor = "Habash, Nizar and
Bouamor, Houda and
Eskander, Ramy and
Tomeh, Nadi and
Abu Farha, Ibrahim and
Abdelali, Ahmed and
Touileb, Samia and
Hamed, Injy and
Onaizan, Yaser and
Alhafni, Bashar and
Antoun, Wissam and
Khalifa, Salam and
Haddad, Hatem and
Zitouni, Imed and
AlKhamissi, Badr and
Almatham, Rawan and
Mrini, Khalil",
booktitle = "Proceedings of The Second Arabic Natural Language Processing Conference",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.arabicnlp-1.79",
pages = "709--728",
abstract = "We describe the findings of the fifth Nuanced Arabic Dialect Identification Shared Task (NADI 2024). NADI{'}s objective is to help advance SoTA Arabic NLP by providing guidance, datasets, modeling opportunities, and standardized evaluation conditions that allow researchers to collaboratively compete on prespecified tasks. NADI 2024 targeted both dialect identification cast as a multi-label task (Subtask 1), identification of the Arabic level of dialectness (Subtask 2), and dialect-to-MSA machine translation (Subtask 3). A total of 51 unique teams registered for the shared task, of whom 12 teams have participated (with 76 valid submissions during the test phase). Among these, three teams participated in Subtask 1, three in Subtask 2, and eight in Subtask 3. The winning teams achieved 50.57 F1 on Subtask 1, 0.1403 RMSE for Subtask 2, and 20.44 BLEU in Subtask 3, respectively. Results show that Arabic dialect processing tasks such as dialect identification and machine translation remain challenging. We describe the methods employed by the participating teams and briefly offer an outlook for NADI.",
}