EgyBERT is a large language model focused exclusively on Egyptian dialectal texts. The model was pretrained on two large-scale corpora: the Egyptian Tweets Corpus (ETC), which contains over 34 million tweets, and the Egyptian Forum Corpus, which includes over 44 million sentences collected from various online forums. Together, the two corpora comprise 10.4 GB of text. The code files, along with the results, are available in this repository.
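
The model can be loaded with the Hugging Face Transformers library. Below is a minimal sketch assuming the model is published on the Hub under an identifier such as `faisalq/EgyBERT`; adjust the ID to match the actual repository name.

```python
# Minimal sketch: loading EgyBERT for masked-language-model inference.
# The model ID below is an assumption; replace it with the actual Hub ID.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_id = "faisalq/EgyBERT"  # assumed identifier, adjust as needed

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Fill-mask example on an Egyptian dialect sentence containing the mask token.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask(f"القاهرة هي {tokenizer.mask_token} مصر"))
```

The pretrained checkpoint can also be fine-tuned on downstream Egyptian dialect tasks (e.g., text classification) using the standard Transformers fine-tuning workflow.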
BibTeX
If you use the EgyBERT model in your scientific publication, or if you find the resources in this repository useful, please cite our paper as follows (citation details to be updated):
@article{qarah2024egybert,
  title={EgyBERT: A Large Language Model Pretrained on Egyptian Dialect Corpora},
  author={Qarah, Faisal},
  journal={arXiv preprint arXiv:2408.03524},
  year={2024}
}