Baligh: Dialect to MSA Translation Model
Overview
Baligh aims to advance the translation of Arabic dialects into Modern Standard Arabic (MSA) using state-of-the-art language models. Developed through a collaboration among experts in Arabic linguistics and AI, the model is intended to bridge the linguistic gap between the diverse dialects spoken across the Arab world and the standardized form of Arabic.
Model description
This model, named Fasih, is a Natural Language Processing (NLP) model for Arabic that translates various Arabic dialects into Modern Standard Arabic (MSA). It is a fine-tuned version of AraT5v2, a state-of-the-art transformer model, adapted to understand and translate more than 25 distinct Arabic dialects.
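Since the model is a fine-tuned AraT5v2 checkpoint, it can presumably be loaded as a sequence-to-sequence model with the Hugging Face `transformers` library. The following is a minimal sketch under that assumption, not a documented usage recipe: the helper names `load_translator` and `translate` are illustrative, the repository identifier is not given here and must be supplied by the caller, and the sketch assumes no task prefix is required and that beam search with 4 beams is a reasonable decoding choice.

```python
from typing import List, Tuple


def load_translator(model_id: str) -> Tuple[object, object]:
    """Load a tokenizer and seq2seq model.

    `model_id` is the Hub repository identifier of the checkpoint; it is
    not stated in this card, so the caller must provide it.
    """
    # Imported lazily so the rest of the module works without transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def translate(
    sentences: List[str], tokenizer, model, max_new_tokens: int = 64
) -> List[str]:
    """Translate a batch of dialect sentences into MSA.

    Assumption: the raw dialect sentence is the model input (no prefix).
    """
    # Tokenize the batch into padded PyTorch tensors.
    inputs = tokenizer(
        sentences, return_tensors="pt", padding=True, truncation=True
    )
    # Beam search is an assumed decoding setting, not one given by the card.
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, num_beams=4
    )
    # Convert generated token ids back to text, dropping pad/eos tokens.
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

To use it, call `load_translator` once with the identifier of this model's repository, then pass the returned tokenizer and model to `translate` together with a list of dialect sentences. Running it requires `transformers` and PyTorch to be installed.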
Fine-tuning Details
The fine-tuning process was designed to capture the nuances of each dialect. Using the MADAR Corpus, a diverse dataset of over 100K samples, we ensured broad representation of dialects from across the Arab world, as shown in the table:
Region | Sub-region | Cities |
---|---|---|
Maghreb | Morocco | Rabat (RAB), Fes (FES) |
Maghreb | Algeria | Algiers (ALG) |
Maghreb | Tunisia | Tunis (TUN), Sfax (SFX) |
Maghreb | Libya | Tripoli (TRI), Benghazi (BEN) |
Nile Basin | Egypt/Sudan | Cairo (CAI), Alexandria (ALX), Aswan (ASW), Khartoum (KHA) |
Levant | South Levant | Jerusalem (JER), Amman (AMM), Salt (SAL) |
Levant | North Levant | Beirut (BEI), Damascus (DAM), Aleppo (ALE) |
Gulf | Iraq | Mosul (MOS), Baghdad (BAG), Basra (BAS) |
Gulf | Yemen | Sana’a (SAN) |
Gulf | Gulf | Doha (DOH), Muscat (MUS), Riyadh (RIY), Jeddah (JED) |
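For anyone filtering or grouping samples by dialect, the table can be expressed as a small lookup structure. The sketch below transcribes the city codes listed above; the names `MADAR_CITIES`, `all_codes`, and `locate` are illustrative and are not part of the model or of any corpus tooling, and the grouping of Yemen under the Gulf region follows the table as laid out here.

```python
from typing import Dict, List, Optional, Tuple

# Region -> sub-region -> {city code: city name}, transcribed from the table.
MADAR_CITIES: Dict[str, Dict[str, Dict[str, str]]] = {
    "Maghreb": {
        "Morocco": {"RAB": "Rabat", "FES": "Fes"},
        "Algeria": {"ALG": "Algiers"},
        "Tunisia": {"TUN": "Tunis", "SFX": "Sfax"},
        "Libya": {"TRI": "Tripoli", "BEN": "Benghazi"},
    },
    "Nile Basin": {
        "Egypt/Sudan": {
            "CAI": "Cairo", "ALX": "Alexandria",
            "ASW": "Aswan", "KHA": "Khartoum",
        },
    },
    "Levant": {
        "South Levant": {"JER": "Jerusalem", "AMM": "Amman", "SAL": "Salt"},
        "North Levant": {"BEI": "Beirut", "DAM": "Damascus", "ALE": "Aleppo"},
    },
    "Gulf": {
        "Iraq": {"MOS": "Mosul", "BAG": "Baghdad", "BAS": "Basra"},
        "Yemen": {"SAN": "Sana’a"},
        "Gulf": {
            "DOH": "Doha", "MUS": "Muscat",
            "RIY": "Riyadh", "JED": "Jeddah",
        },
    },
}


def all_codes() -> List[str]:
    """Return every city code in table order."""
    return [
        code
        for sub_regions in MADAR_CITIES.values()
        for cities in sub_regions.values()
        for code in cities
    ]


def locate(code: str) -> Optional[Tuple[str, str, str]]:
    """Return (region, sub-region, city) for a code, or None if unknown."""
    key = code.upper()
    for region, sub_regions in MADAR_CITIES.items():
        for sub_region, cities in sub_regions.items():
            if key in cities:
                return region, sub_region, cities[key]
    return None
```

For example, `locate("cai")` resolves to the Nile Basin region, the Egypt/Sudan sub-region, and Cairo, while an unrecognized code returns `None`.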
Citations
The dataset used for training the Project Fasih model includes data from the following source:
Bouamor, H., Habash, N., Salameh, M., Zaghouani, W., Rambow, O., Abdulrahim, D., ... & Oflazer, K. (2018, May). The MADAR Arabic Dialect Corpus and Lexicon. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
This corpus has been instrumental in understanding and translating the nuances of over 25 Arabic dialects into Modern Standard Arabic (MSA), aiding significantly in the development and refinement of our model.
Acknowledgments
Special thanks to Prince Sultan University, particularly the Robotics and Internet of Things Lab.
Contact Information
For inquiries: riotu@psu.edu.sa.
Disclaimer for the Use of Baligh
We disclaim all responsibility for any inaccuracies or inappropriate content generated by the model. Users should apply the model's outputs at their own risk. Further improvements to enhance its performance are underway.