The official NLPre-PL dataset is a uniformly paragraph-level divided version of the NKJP1M corpus, the 1-million-token balanced subcorpus of the National Corpus of Polish (Narodowy Korpus Języka Polskiego).
The NLPre-PL dataset aims to divide the paragraphs fairly, both length-wise and topic-wise, into train, development, and test sets. This keeps the distribution of the number of segments per paragraph similar across the three sets and avoids the situation in which paragraphs with a small (or large) number of segments are available only in one set, e.g. only at test time.
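The idea of a length-wise and topic-wise balanced split can be illustrated with a minimal sketch. This is not the procedure used to build NLPre-PL; it only shows one common way to stratify paragraphs. The field names `topic` and `n_segments`, the bucket edges, and the 80/10/10 ratios are all assumptions made for the example.

```python
import random
from collections import defaultdict


def length_bucket(n_segments, edges=(5, 10, 20)):
    """Map a segment count to a coarse length bucket (edges are illustrative)."""
    for i, edge in enumerate(edges):
        if n_segments <= edge:
            return i
    return len(edges)


def stratified_split(paragraphs, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split paragraphs so each (topic, length bucket) stratum is spread
    over train/dev/test in roughly the given proportions.

    `paragraphs` is a list of dicts with hypothetical keys
    "topic" and "n_segments".
    """
    rng = random.Random(seed)

    # Group paragraphs by topic and length bucket.
    strata = defaultdict(list)
    for p in paragraphs:
        strata[(p["topic"], length_bucket(p["n_segments"]))].append(p)

    splits = {"train": [], "dev": [], "test": []}
    for key in sorted(strata):  # sorted for reproducibility
        group = strata[key]
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_dev = round(len(group) * ratios[1])
        splits["train"] += group[:n_train]
        splits["dev"] += group[n_train:n_train + n_dev]
        # Whatever remains goes to test; very small strata may leave it empty.
        splits["test"] += group[n_train + n_dev:]
    return splits
```

Because every stratum is divided separately, short and long paragraphs from each topic end up in all three sets whenever the stratum is large enough. Very small strata would need extra handling, which this sketch leaves out.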
🤗 NLPre-PL Dataset 🤗 PDB-UD Dataset
Listed here are all available models trained for the purpose of creating the NLPre-PL benchmark.