
MARBERT is one of two models described in the paper "ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic". Arabic has multiple varieties, and MARBERT is a large-scale pre-trained masked language model focused on both Dialectal Arabic (DA) and Modern Standard Arabic (MSA). To train MARBERT, we randomly sample 1B Arabic tweets from a large in-house dataset of about 6B tweets. We only include tweets with at least 3 Arabic words, based on character string matching, regardless of whether the tweet also contains non-Arabic strings. That is, we do not remove non-Arabic text so long as the tweet meets the 3-Arabic-word criterion. The resulting dataset makes up 128GB of text (15.6B tokens). We use the same network architecture as ARBERT (BERT-base), but without the next sentence prediction (NSP) objective, since tweets are short. See our repo for how to modify the BERT code to remove NSP. For more information about MARBERT, please visit our GitHub repo.
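The tweet selection rule described above (keep a tweet if it has at least 3 Arabic words, without stripping non-Arabic text) can be sketched roughly as follows. This is an illustration, not the code used to build the training set. The names `ARABIC_LETTER`, `count_arabic_words`, and `keep_tweet` are hypothetical, and the sketch assumes that an "Arabic word" is a whitespace-delimited token containing at least one letter in the Unicode range U+0621 to U+064A. The actual character matching used for MARBERT may differ.

```python
import re

# Assumption: a token counts as an Arabic word if it contains at least one
# basic Arabic letter (U+0621 to U+064A). The real criterion may be different.
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")


def count_arabic_words(tweet: str) -> int:
    """Count whitespace-delimited tokens that contain an Arabic letter."""
    return sum(1 for token in tweet.split() if ARABIC_LETTER.search(token))


def keep_tweet(tweet: str, min_arabic_words: int = 3) -> bool:
    """Return True if the tweet meets the minimum Arabic word count.

    Non-Arabic tokens are ignored for counting but are not removed.
    """
    return count_arabic_words(tweet) >= min_arabic_words


# Hypothetical sample tweets, for illustration only.
tweets = [
    "أنا أحب القراءة كثيرا",
    "good morning صباح الخير يا جماعة",
    "hello world مرحبا",
    "just English here",
]

# Tweets that pass the filter keep their non-Arabic content untouched.
kept = [t for t in tweets if keep_tweet(t)]
```

In this sketch the first two sample tweets pass the filter, including the mixed-language one, while the last two are dropped because they contain fewer than 3 Arabic words.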