Article Releasing the largest multilingual open pretraining dataset By Pclanglais • 17 days ago • 96
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training Paper • 2409.04599 • Published Sep 6 • 1
Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models Paper • 2311.09194 • Published Nov 15, 2023
Toxicity of the Commons: Curating Open-Source Pre-Training Data Paper • 2410.22587 • Published Oct 29 • 8
Toxic Commons Collection Tools for de-toxifying public domain data, especially multilingual and historical text data and data with OCR errors. • 3 items • Updated about 1 month ago • 2