---
license: cc-by-nc-4.0
language:
- ar
tags:
- Arabic BERT
- Saudi Dialect
- Twitter
- Masked Language Model
widget:
- text: "اللي ما يعرف الصقر [MASK]."
---
**SaudiBERT** is the first pretrained large language model focused exclusively on Saudi dialect text. The model was pretrained on two large-scale corpora: the Saudi Tweets Mega Corpus (STMC), which contains over 141 million tweets, and the Saudi Forum Corpus, which includes over 70 million sentences collected from various Saudi online forums. Together, the datasets comprise **26.3 GB of text**. The code files and results are available in the [SaudiBERT repository](https://github.com/FaisalQarah/SaudiBERT).
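As a minimal usage sketch, the model can be queried through the Hugging Face `transformers` fill-mask pipeline with the same example sentence as the widget above. The Hub id `faisalq/SaudiBERT` is assumed here for illustration; substitute the actual model id if it differs.

```python
from transformers import pipeline

# Assumed Hub id for illustration; replace with the actual SaudiBERT model id.
fill_mask = pipeline("fill-mask", model="faisalq/SaudiBERT")

# Predict the masked token in the widget's example sentence
# and print each candidate with its score.
for prediction in fill_mask("اللي ما يعرف الصقر [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```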
# BibTeX
If you use the SaudiBERT model in your scientific publication, or if you find the resources in this repository useful, please cite our paper as follows (citation details to be updated):

```bibtex
@article{qarah2024saudibert,
  title={SaudiBERT: A Large Language Model Pretrained on Saudi Dialect Corpora},
  author={Qarah, Faisal},
  journal={arXiv preprint arXiv:2405.06239},
  year={2024}
}
```