EgyBERT is a large language model focused exclusively on Egyptian dialectal texts. The model was pretrained on two large-scale corpora: the Egyptian Tweets Corpus (ETC), which contains over 34 million tweets, and the Egyptian Forum Corpus, which includes over 44 million sentences collected from various online forums. Together, the two corpora comprise 10.4 GB of text. The code files, along with the results, are available in this repository.
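
The model can be loaded with the Hugging Face Transformers library. Below is a minimal sketch assuming the model is published on the Hub under an identifier such as `faisalq/EgyBERT`; adjust the ID to match the actual repository name.

```python
# Minimal sketch: loading EgyBERT for masked-language-model inference.
# The model ID below is an assumption; replace it with the actual Hub ID.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_id = "faisalq/EgyBERT"  # assumed identifier, adjust as needed

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Fill-mask example on an Egyptian dialect sentence containing the mask token.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask(f"القاهرة هي {tokenizer.mask_token} مصر"))
```

The pretrained checkpoint can also be fine-tuned on downstream Egyptian dialect tasks (e.g., text classification) using the standard Transformers fine-tuning workflow.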
BibTeX
If you use the EgyBERT model in your scientific publication, or if you find the resources in this repository useful, please cite our paper as follows (citation details to be updated):
@article{qarah2024egybert,
  title={EgyBERT: A Large Language Model Pretrained on Egyptian Dialect Corpora},
  author={Qarah, Faisal},
  journal={arXiv preprint arXiv:2408.03524},
  year={2024}
}