---
license: cc-by-nc-4.0
---

**EgyBERT** is a large language model focused exclusively on Egyptian dialectal texts. The model was pretrained on two large-scale corpora: the Egyptian Tweets Corpus (ETC), which contains over 34 million tweets, and the Egyptian Forum Corpus, which includes over 44 million sentences collected from various online forums. Together, the datasets comprise **10.4GB of text**. The code and results are available in the [EgyBERT repository](https://github.com/FaisalQarah/EgyBERT).

# BibTex

If you use the EgyBERT model in your scientific publication, or if you find the resources in this repository useful, kindly cite our paper as follows (citation details to be updated):

```bibtex
@article{qarah2024egybert,
  title={EgyBERT: A Large Language Model Pretrained on Egyptian Dialect Corpora},
  author={Qarah, Faisal},
  journal={arXiv preprint arXiv:2408.03524},
  year={2024}
}
```
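
# Usage

Since EgyBERT is a BERT-style masked language model, it can be loaded with the Hugging Face `transformers` library. The following is a minimal sketch, assuming the model is published on the Hub under `faisalq/EgyBERT` (a hypothetical identifier; substitute the actual repository name if it differs):

```python
# Minimal usage sketch for EgyBERT as a masked language model.
# NOTE: the model identifier "faisalq/EgyBERT" is an assumption;
# replace it with the actual Hugging Face Hub repository name.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="faisalq/EgyBERT")

# Fill in a masked token in an Egyptian dialectal sentence
# ("Cairo is the [MASK] of Egypt.").
predictions = fill_mask("القاهرة هي [MASK] مصر.")
for p in predictions:
    print(p["token_str"], round(p["score"], 4))
```

The same checkpoint can also serve as an encoder for downstream fine-tuning (e.g., via `AutoModelForSequenceClassification`) on Egyptian dialect classification tasks.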