# Slovak RoBERTa Masked Language Model
### 83 million parameters in the small model
Medium and Large models coming soon!
The pretrained RoBERTa tokenizer vocab and merges files are included.
## Training params
Dataset: An 8 GB Slovak monolingual dataset, including the monolingual portion of ParaCrawl, OSCAR, and several gigabytes of additional text that I collected and cleaned myself.
Preprocessing: Tokenized with a pretrained ByteLevelBPETokenizer trained on the same dataset. Uncased, with `<s>`, `<pad>`, `</s>`, `<unk>`, and `<mask>` special tokens.
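As a rough illustration, the tokenizer can be reloaded from the included vocab and merges files with the `tokenizers` library. The file names `vocab.json` and `merges.txt` below are assumptions based on the usual RoBERTa layout, not confirmed names from this repo:

```python
from tokenizers import ByteLevelBPETokenizer

# Hypothetical file names, assuming the standard RoBERTa layout;
# point them at the vocab/merges files shipped with this model.
tokenizer = ByteLevelBPETokenizer(
    "vocab.json",
    "merges.txt",
    lowercase=True,  # the tokenizer is uncased
)

# Register the special tokens listed above.
tokenizer.add_special_tokens(["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

encoding = tokenizer.encode("Ako sa máš?")
print(encoding.tokens)
```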
## Evaluation results
- Mnoho ľudí tu `<mask>`
  - žije.
  - žijú.
  - je.
  - trpí.
- Ako sa `<mask>`
  - máte
  - máš
  - má
  - hovorí
- Plážová sezóna pod Zoborom patrí medzi `<mask>` obdobia.
  - ročné
  - najkrajšie
  - najobľúbenejšie
  - najnáročnejšie
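Masked predictions like the ones above can be reproduced with the `transformers` fill-mask pipeline. A minimal sketch, assuming the checkpoint is available locally or on the Hub; `path/to/slovak-roberta` is a placeholder, not the real model id:

```python
from transformers import pipeline

# "path/to/slovak-roberta" is a placeholder; replace it with the
# actual checkpoint directory or Hub id for this model.
fill_mask = pipeline("fill-mask", model="path/to/slovak-roberta")

# Prints the top candidate tokens and their scores.
for prediction in fill_mask("Mnoho ľudí tu <mask>"):
    print(prediction["token_str"], round(prediction["score"], 3))
```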
Limitations: The current model is fairly small, though it performs well for its size. It is meant to be fine-tuned on downstream tasks, e.g. part-of-speech tagging, question answering, or anything in GLUE or SuperGLUE. A sketch of the fine-tuning setup follows below.
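As a hedged sketch of that fine-tuning path, a token-classification head (e.g. for part-of-speech tagging) can be attached with `transformers`. The checkpoint path and label count are placeholders I am assuming for illustration:

```python
from transformers import RobertaForTokenClassification, RobertaTokenizerFast

# Placeholder checkpoint path and label count; swap in the real
# model location and the tag set of your downstream task.
model = RobertaForTokenClassification.from_pretrained(
    "path/to/slovak-roberta", num_labels=17
)
tokenizer = RobertaTokenizerFast.from_pretrained("path/to/slovak-roberta")
```

From here, standard `Trainer`-based fine-tuning on a labeled Slovak dataset applies.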
Credit: If you use this or any of my models in research or professional work, please credit me, Christopher Brousseau, in that work.