Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20 • 67
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 27
Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub Aug 2, 2023 • 1
UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages Paper • 2411.14343 • Published 6 days ago • 7 • 2
RedPajama: an Open Dataset for Training Large Language Models Paper • 2411.12372 • Published 9 days ago • 47 • 3