fasttext-oh-eli5 / README.md
license: mit

fastText classifier used for quality filtering in DataComp-LM (DCLM) to produce the DCLM-Baseline dataset.

The model assigns each document one of two labels: `__label__hq` ("high-quality", i.e., resembling OH2.5 and Reddit ELI5 data) or `__label__cc` ("low-quality", i.e., resembling web-crawled data from Common Crawl). We filter documents by the score the model gives to `__label__hq`, keeping those above a percentile-based threshold.
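As a rough illustration of the scoring and percentile filtering described above, here is a minimal Python sketch. It is not the pipeline used in the dclm repo. The helper names (`hq_score`, `score_document`, `keep_top_fraction`), the `keep_fraction` value, and the model path in the usage note are all assumptions made for the example. The only fastText calls it relies on are `fasttext.load_model` and `model.predict(text, k=...)`.

```python
import math
from typing import List, Sequence

HQ_LABEL = "__label__hq"


def hq_score(labels: Sequence[str], probs: Sequence[float]) -> float:
    """Return the probability assigned to __label__hq, or 0.0 if it is absent."""
    for label, prob in zip(labels, probs):
        if label == HQ_LABEL:
            return float(prob)
    return 0.0


def score_document(model, text: str) -> float:
    """Score one document with a loaded fastText model (hypothetical helper)."""
    # fastText's predict() processes a single line, so flatten newlines first.
    # k=2 asks for both labels so the __label__hq score is always returned.
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    return hq_score(labels, probs)


def keep_top_fraction(
    docs: Sequence[str], scores: Sequence[float], keep_fraction: float
) -> List[str]:
    """Keep the documents whose score falls in the top `keep_fraction`.

    Ties at the threshold are all kept, so slightly more than the requested
    fraction may be returned.
    """
    if not docs:
        return []
    n_keep = max(1, math.ceil(len(docs) * keep_fraction))
    threshold = sorted(scores, reverse=True)[n_keep - 1]
    return [doc for doc, score in zip(docs, scores) if score >= threshold]
```

With the `fasttext` Python package installed, usage might look like the following, where the path and the 10% fraction are placeholders rather than values taken from this repository:

```python
import fasttext

model = fasttext.load_model("path/to/model.bin")  # hypothetical local path
docs = ["first document text", "second document text"]
scores = [score_document(model, d) for d in docs]
kept = keep_top_fraction(docs, scores, keep_fraction=0.1)
```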

See our dclm repo for documentation on how we applied this model to filter data in our experiments.

See the fastText documentation for general information on fastText classifiers and how to use them from Python.