Loubna Ben Allal

loubnabnl

https://loubnabnl.github.io/

AI & ML interests

LLMs, ML for code, Synthetic data

Recent Activity

updated a collection about 8 hours ago

SmolLM2

Articles

SmolLM - blazingly fast and remarkably powerful

Jul 16

• 271

CodeGemma - an official Google release for code LLMs

Apr 9

• 99

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Mar 20

• 67

Posts 4

Post

1220

Making SmolLM2 reproducible: open-sourcing our training & evaluation toolkit 🛠️ https://github.com/huggingface/smollm/

- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents

Apache 2.0 licensed. V2 pre-training data mix coming soon!

Which other tools should we add next?

Post

4999

🍷 The FineWeb technical report is out, and so is 📚 FineWeb-Edu, a 1.3-trillion-token dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.

Technical report: HuggingFaceFW/blogpost-fineweb-v1
Dataset: HuggingFaceFW/fineweb-edu

We used Llama 3 generations to train an educational quality classifier, filtering the 15 trillion tokens of FineWeb to select only those with high educational value (an approach also used in Llama 3 and Phi-3 training datasets). We're releasing both FineWeb-Edu and the classifier, along with a larger, less heavily filtered version containing 5.4 trillion tokens.
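For readers who want to see the filtering step in concrete terms, here is a minimal sketch of threshold-based filtering on a quality score. It is not the FineWeb-Edu pipeline: `toy_score` is a hypothetical keyword heuristic standing in for the trained classifier, and the 0 to 5 score range and the two thresholds are assumptions chosen only for illustration.

```python
from typing import Callable, Iterable, Iterator


def filter_by_score(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float,
) -> Iterator[str]:
    """Yield only the documents whose score meets the threshold."""
    for doc in docs:
        if score_fn(doc) >= threshold:
            yield doc


def toy_score(text: str) -> float:
    """Hypothetical stand-in for an educational quality classifier.

    Counts a few 'educational' keywords and maps the count to an
    assumed 0-5 scale. A real classifier would be a trained model.
    """
    keywords = ("theorem", "experiment", "definition", "lesson")
    hits = sum(word in text.lower() for word in keywords)
    return min(5.0, 2.0 * hits)


docs = [
    "Buy cheap watches now!",
    "Lesson 3: a short experiment on plant growth.",
    "Definition of a prime number.",
]

# A stricter threshold keeps less data; a looser one keeps more.
strict = list(filter_by_score(docs, toy_score, threshold=3.0))
lenient = list(filter_by_score(docs, toy_score, threshold=2.0))
```

Lowering the threshold trades quality for quantity, which is the same trade-off behind releasing both a heavily filtered set and a larger, less heavily filtered one.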

You can find more details about the dataset and the experiments we ran in the FineWeb technical report. It's a 45-minute read, but it contains all the secret sauce for building high-quality web datasets.

Enjoy!
