SlEng-bert

SlEng-bert is a bilingual, Slovene-English masked language model.

SlEng-bert was trained from scratch on conversational, non-standard, and slang language in Slovene and English. The model has 12 transformer layers and is roughly equal in size to the BERT and RoBERTa base models. The only pre-training task was masked language modeling; no additional tasks, such as next sentence prediction (NSP), were used.
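To make the masked language modeling objective concrete, here is a minimal sketch of the input corruption step. It assumes the masking recipe from the original BERT paper (select about 15% of tokens; of those, replace 80% with a mask token, 10% with a random token, and leave 10% unchanged). The exact masking settings used for SlEng-bert are not stated here, and the `mask_tokens` helper, the `[MASK]` string, and the toy vocabulary are illustrative placeholders rather than the model's actual tokenizer or training code.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder; the real tokenizer may use a different mask symbol


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for one masked language modeling example.

    labels[i] holds the original token at positions the model must predict
    and None everywhere else. The 80/10/10 split is an assumption taken
    from the original BERT recipe.
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_TOKEN          # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


if __name__ == "__main__":
    toy_vocab = ["to", "je", "this", "is", "lol", "ful", "nice"]
    sentence = "to je ful nice lol".split()
    corrupted, targets = mask_tokens(sentence, toy_vocab, mask_prob=0.5, seed=1)
    print(corrupted)
    print(targets)
```

During pre-training, the loss is computed only at positions where the label is not `None`; all other positions are ignored.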

The tokenizer and corpora used to train SlEng-bert were also used to train the SloBERTa-SlEng model. The two models differ in how they were trained: SlEng-bert was trained from scratch for 40 epochs, whereas SloBERTa-SlEng is SloBERTa further pre-trained for 2 epochs on the new corpora.

Training corpora

The model was trained on English and Slovene tweets, the Slovene corpora MaCoCu and FRENK, and a small subset of the English OSCAR corpus. We tried to keep the sizes of the English and Slovene corpora as equal as possible. In total, the training corpora contained about 2.7 billion words.
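One simple way to keep two corpora at roughly equal size is to cap both at the word count of the smaller one. The sketch below illustrates that idea only; it is not the procedure actually used to build the SlEng-bert corpora, and the `balance_by_words` helper and the whitespace-based word count are assumptions made for the example.

```python
def count_words(docs):
    """Total number of whitespace-separated words in a list of documents."""
    return sum(len(doc.split()) for doc in docs)


def take_up_to(docs, budget):
    """Keep documents in order until adding the next one would exceed the budget."""
    kept, used = [], 0
    for doc in docs:
        n = len(doc.split())
        if used + n > budget:
            break
        kept.append(doc)
        used += n
    return kept


def balance_by_words(docs_a, docs_b):
    """Trim both corpora to the word count of the smaller one (hypothetical helper)."""
    budget = min(count_words(docs_a), count_words(docs_b))
    return take_up_to(docs_a, budget), take_up_to(docs_b, budget)


if __name__ == "__main__":
    slovene = ["to je test", "še en stavek"]
    english = ["this is a test", "another sentence here", "and more words still"]
    sl, en = balance_by_words(slovene, english)
    print(count_words(sl), count_words(en))
```

Because whole documents are kept or dropped, the resulting sizes are only approximately equal, which matches the "as equal as possible" goal rather than an exact split.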