NLPre-PL Dataset

The official NLPre-PL dataset is a uniformly paragraph-level divided version of the NKJP1M corpus, the 1-million-token balanced subcorpus of the National Corpus of Polish (Narodowy Korpus Języka Polskiego).

The NLPre-PL dataset aims to divide the paragraphs fairly, both length-wise and topic-wise, into train, development, and test sets. Thus, we ensure a similar distribution of the number of segments per paragraph across the splits and avoid the situation where paragraphs with a small (or large) number of segments are available only in one split, e.g. only at test time.
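The idea of a length-wise and topic-wise fair split can be illustrated with a small stratified split. This is only a minimal sketch and not the procedure actually used to build NLPre-PL: the paragraph representation (a dict with hypothetical `topic` and `segments` keys), the length bucket edges, and the 80/10/10 ratios are all assumptions made for the example.

```python
import random
from collections import defaultdict


def length_bucket(num_segments, edges=(5, 20, 50)):
    """Map a segment count to a coarse length bucket (edges are illustrative)."""
    for i, edge in enumerate(edges):
        if num_segments <= edge:
            return i
    return len(edges)


def stratified_split(paragraphs, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split paragraphs into train/dev/test so that every
    (topic, length bucket) stratum is divided in roughly the same proportions."""
    rng = random.Random(seed)

    # Group paragraphs by topic and by length bucket.
    strata = defaultdict(list)
    for par in paragraphs:
        key = (par["topic"], length_bucket(len(par["segments"])))
        strata[key].append(par)

    splits = {"train": [], "dev": [], "test": []}
    for key in sorted(strata):
        group = strata[key]
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_dev = round(len(group) * ratios[1])
        # Whatever remains after train and dev goes to test.
        splits["train"].extend(group[:n_train])
        splits["dev"].extend(group[n_train:n_train + n_dev])
        splits["test"].extend(group[n_train + n_dev:])
    return splits
```

Because each stratum is split separately, short and long paragraphs from every topic end up in all three sets, as long as the stratum is large enough to be divided.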

🤗 NLPre-PL Dataset
🤗 PDB-UD Dataset

NLPre-PL Trained models

Listed below are all available models trained for the purpose of creating the NLPre-PL benchmark.


COMBO + HerBERT + PDB-UD
COMBO + fasttext + PDB-UD
COMBO + HerBERT + NLPrePL-fair-by-name
COMBO + HerBERT + NLPrePL-fair-by-type
COMBO + fasttext + NLPrePL-fair-by-name
COMBO + fasttext + NLPrePL-fair-by-type


COMBO + HerBERT + NLPrePL-fair-by-name
COMBO + HerBERT + NLPrePL-fair-by-type
COMBO + fasttext + NLPrePL-fair-by-name
COMBO + fasttext + NLPrePL-fair-by-type