The official NLPre-PL dataset is a version of the NKJP1M corpus divided uniformly at the paragraph level. NKJP1M is the 1-million-token balanced subcorpus of the National Corpus of Polish (Narodowy Korpus Języka Polskiego).
The NLPre-PL dataset aims to divide the paragraphs fairly, both length-wise and topic-wise, into train, development, and test sets. This ensures a similar distribution of the number of segments per paragraph in each set and avoids the situation where paragraphs with a small (or large) number of segments are available only in one set, e.g. only at test time.

🤗 NLPre-PL Dataset
🤗 PDB-UD Dataset
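The fair division described above can be illustrated with a minimal stratified-split sketch. This is not the procedure used to build NLPre-PL; it only shows the general idea under stated assumptions. The field names `topic` and `n_segments`, the length-bucket edges, and the 80/10/10 ratios are all hypothetical.

```python
import random
from collections import defaultdict


def length_bin(n_segments, edges=(10, 30, 60)):
    """Map a segment count to a coarse length bucket (edges are hypothetical)."""
    for i, edge in enumerate(edges):
        if n_segments <= edge:
            return i
    return len(edges)


def fair_split(paragraphs, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split paragraphs so each (topic, length bucket) stratum is spread over all sets.

    `paragraphs` is a list of dicts with the assumed keys "topic" and "n_segments".
    Only the train and dev ratios are used; the test set receives the remainder.
    """
    # Group paragraphs by topic and length bucket.
    strata = defaultdict(list)
    for p in paragraphs:
        strata[(p["topic"], length_bin(p["n_segments"]))].append(p)

    rng = random.Random(seed)
    splits = {"train": [], "dev": [], "test": []}
    for key in sorted(strata):
        group = list(strata[key])
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_dev = int(len(group) * ratios[1])
        splits["train"] += group[:n_train]
        splits["dev"] += group[n_train:n_train + n_dev]
        splits["test"] += group[n_train + n_dev:]
    return splits
```

Because each stratum is split separately, short and long paragraphs of every topic appear in all three sets, provided the stratum is large enough. Very small strata may leave the development set empty for that stratum, so a real pipeline would need to merge or otherwise handle rare buckets.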
Listed below are all available models, trained for the purpose of creating the NLPre-PL benchmark.
UD TAGSET
- COMBO + HerBERT + PDB-UD
- COMBO + fasttext + PDB-UD
- COMBO + HerBERT + NLPrePL-fair-by-name
- COMBO + HerBERT + NLPrePL-fair-by-type
- COMBO + fasttext + NLPrePL-fair-by-name
- COMBO + fasttext + NLPrePL-fair-by-type
NKJP TAGSET
- COMBO + HerBERT + NLPrePL-fair-by-name
- COMBO + HerBERT + NLPrePL-fair-by-type
- COMBO + fasttext + NLPrePL-fair-by-name
- COMBO + fasttext + NLPrePL-fair-by-type