Hugging Face TB Research

Enterprise

community

AI & ML interests

Exploring synthetic datasets, generated by Large Language Models (TB is for Textbook, as inspired by the "Textbooks are all your need" paper)

Organization Card

Community About org cards

HuggingFaceTB

This is the home for small LLMs (SmolLM) and high quality pre-training datasets, such as Cosmopedia and Smollm-Corpus.

We released:

Cosmopedia: the largest open synthetic dataset, with 25B tokens and more than 30M samples. It contains synthetic textbooks, blog posts, stories, posts, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1.
Cosmo-1B a 1B model trained on Cosmopedia.
FineWeb-Edu: a filtered version of FineWeb dataset for educational content
Smollm-Corpus: the pre-training corpus of SmolLM models including Cosmopedia v0.2, FineWeb-Edu and Python-Edu.
SmolLM models and SmolLM2: a series of strong small models in three sizes: 135M, 360M and 1.7B

For more details check our blog posts: https://huggingface.co/blog/cosmopedia and https://huggingface.co/blog/smollm

Collections 6

spaces 3

SmolLM 360M Instruct WebGPU

A blazingly fast and powerful AI chatbot that runs locally.

Instant SmolLM

Run SmolLM-360M-Instruct in realtime with MLC WebLLM

Web clusters

models 26

HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF

Text Generation • Updated 4 days ago • 3.49k • 25

HuggingFaceTB/SmolLM2-135M-Instruct

Text Generation • Updated 4 days ago • 7.34k • 50

HuggingFaceTB/SmolLM2-360M

Text Generation • Updated 4 days ago • 2.54k • 20

HuggingFaceTB/SmolLM2-360M-Instruct

Text Generation • Updated 4 days ago • 7.95k • 42

HuggingFaceTB/SmolLM2-1.7B

Text Generation • Updated 4 days ago • 5.84k • 62

HuggingFaceTB/SmolLM2-1.7B-Instruct

Text Generation • Updated 4 days ago • 25.6k • • 290

HuggingFaceTB/SmolLM2-135M

Text Generation • Updated 4 days ago • 5.06k • 26

HuggingFaceTB/SmolLM2-360M-Instruct-GGUF

Updated 8 days ago • 707 • 12

HuggingFaceTB/SmolLM-1.7B

Text Generation • Updated 24 days ago • 9.93k • 159

HuggingFaceTB/SmolLM-135M-Instruct

Text Generation • Updated Sep 4 • 26.5k • 96

datasets 29

HuggingFaceTB/MATH

Updated 24 days ago • 213 • 1

HuggingFaceTB/smollm-corpus

Viewer • Updated Sep 6 • 237M • 29.3k • 239

HuggingFaceTB/everyday-conversations-llama3.1-2k

Viewer • Updated Aug 17 • 2.38k • 539 • 76

HuggingFaceTB/instruct-data-basics-smollm-H4

Viewer • Updated Aug 17 • 767 • 143

HuggingFaceTB/self-oss-instruct-sc2-H4

Viewer • Updated Aug 17 • 50.7k • 355 • 1

HuggingFaceTB/Magpie-Pro-300K-Filtered-H4

Viewer • Updated Aug 17 • 300k • 162 • 2

HuggingFaceTB/OpenHermes-2.5-H4

Viewer • Updated Aug 17 • 1M • 150 • 1

HuggingFaceTB/bisac_expanded_topics

Viewer • Updated Aug 14 • 34.2k • 40

HuggingFaceTB/cosmopedia

Viewer • Updated Aug 12 • 31.1M • 10.7k • 561

HuggingFaceTB/python-edu-annotations

Viewer • Updated Jul 31 • 491k • 57 • 2