Update README.md
README.md
CHANGED
@@ -7,4 +7,14 @@ sdk: static
 pinned: false
 ---
 
-
+# HuggingFaceTB
+This is the home of synthetic datasets for pre-training, such as [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia). We're trying to scale synthetic data generation by curating
+diverse prompts that cover a wide range of topics and efficiently scaling the generations on GPUs with tools like [llm-swarm](https://github.com/huggingface/llm-swarm).
+
+We recently released:
+
+- [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia): the largest open synthetic dataset, with 25B tokens and more than 30M samples. It contains synthetic textbooks, blog posts, stories, posts, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1.
+- [Cosmo-1B](https://huggingface.co/HuggingFaceTB/cosmo-1b) a 1B model trained on Cosmopedia.
+
+For more details check our blogpost: https://huggingface.co/blog/cosmopedia
+