davidberenstein1957 HF staff committed on
Commit
fd936a6
·
1 Parent(s): 7079e08

feat: add hidden token button

docs: add message free generation
docs: update formatting FAQ
feat: remove session state message
feat: add interactive pipeline composer

src/distilabel_dataset_generator/apps/faq.py CHANGED
@@ -1,50 +1,63 @@
  import gradio as gr

  with gr.Blocks() as app:
-     gr.Markdown(
-         """### FAQ

- <img src="https://huggingface.co/spaces/argilla/distilabel-dataset-generator/resolve/main/assets/image.png" alt="Distilabel Dataset Generator" style="width: 300px;">

- #### What is Distilabel Dataset Generator?

- Distilabel Dataset Generator is a tool that allows you to easily create high-quality datasets for training and fine-tuning language models. It leverages the power of distilabel and advanced language models to generate synthetic data tailored to your specific needs.

- This tool simplifies the process of creating custom datasets, enabling you to:
- - Define the characteristics of your desired dataset
- - Generate system prompts automatically
- - Create sample datasets for quick iteration
- - Produce full-scale datasets with customizable parameters
- - Push your generated datasets directly to the Hugging Face Hub

- By using Distilabel Dataset Generator, you can rapidly prototype and create datasets for, accelerating your AI development process.

- #### How is this free?

- The current implementation is based on [Free Serverless Hugging Face Inference Endpoints](https://huggingface.co/docs/api-inference/index). They are rate limited but free to use for anyone on the Hugging Face Hub. You can re-use the underlying pipeline to generate data with other [distilabel LLM integrations](https://distilabel.argilla.io/dev/components-gallery/llms/).

- #### What is distilabel?

- Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.

- #### What is synthetic data?

- Synthetic data is data generated by an AI model, instead of being collected from the real world.

- #### What is AI feedback?

- AI feedback is feedback provided by an AI model, instead of being provided by a human.

- #### How is distilabel different from other synthetic data generation frameworks?

- Distilabel can be used for generating synthetic data and AI feedback for a wide variety of projects including traditional predictive NLP (classification, extraction, etc.), or generative and large language model scenarios (instruction following, dialogue generation, judging etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback. So, Distilabel is focused and specifically designed to be a tool that for scalable and reliable synthetic data generation.

- #### What do people use distilabel for?

- The Argilla community uses distilabel to create amazing [datasets](https://huggingface.co/datasets?other=distilabel) and [models](https://huggingface.co/models?other=distilabel).
-
- - The [1M OpenHermesPreference](https://huggingface.co/datasets/argilla/OpenHermesPreferences) is a dataset of ~1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how we can use distilabel to **synthesize data on an immense scale**.
- - Our [distilabeled Intel Orca DPO dataset](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) and the [improved OpenHermes model](https://huggingface.co/argilla/distilabeled-OpenHermes-2.5-Mistral-7B), show how we **improve model performance by filtering out 50%** of the original dataset through **AI feedback**.
- - The [haiku DPO data](https://github.com/davanstrien/haiku-dpo) outlines how anyone can create a **dataset for a specific task** and **the latest research papers** to improve the quality of the dataset.
-         """
-     )

@@ -1,50 +1,63 @@
  import gradio as gr

  with gr.Blocks() as app:
+     with gr.Row():
+         with gr.Column(scale=1):
+             pass
+         with gr.Column(scale=3):
+             gr.HTML(
+                 """
+ <div style="text-align: justify;">
+ <div style="text-align: center;">
+ <img src="https://huggingface.co/spaces/argilla/distilabel-dataset-generator/resolve/main/assets/image.png" alt="Distilabel Dataset Generator" style="width: 300px; margin: 20px auto;">
+ </div>

+ <h4 style="text-align: center;">What is Distilabel Dataset Generator?</h4>

+ <p>Distilabel Dataset Generator is a tool that allows you to easily create high-quality datasets for training and fine-tuning language models. It leverages the power of distilabel and advanced language models to generate synthetic data tailored to your specific needs.</p>

+ <p>This tool simplifies the process of creating custom datasets, enabling you to:</p>
+ <ul>
+ <li>Define the characteristics of your desired dataset</li>
+ <li>Generate system prompts automatically</li>
+ <li>Create sample datasets for quick iteration</li>
+ <li>Produce full-scale datasets with customizable parameters</li>
+ <li>Push your generated datasets directly to the Hugging Face Hub</li>
+ </ul>

+ <p>By using Distilabel Dataset Generator, you can rapidly prototype and create datasets for, accelerating your AI development process.</p>

+ <h4 style="text-align: center;">How is this free?</h4>

+ <p>The current implementation is based on <a href="https://huggingface.co/docs/api-inference/index" target="_blank">Free Serverless Hugging Face Inference Endpoints</a>. They are rate limited but free to use for anyone on the Hugging Face Hub. You can re-use the underlying pipeline to generate data with other <a href="https://distilabel.argilla.io/dev/components-gallery/llms/" target="_blank">distilabel LLM integrations</a>.</p>

+ <h4 style="text-align: center;">What is distilabel?</h4>

+ <p>Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.</p>

+ <h4 style="text-align: center;">What is synthetic data?</h4>

+ <p>Synthetic data is data generated by an AI model, instead of being collected from the real world.</p>

+ <h4 style="text-align: center;">What is AI feedback?</h4>

+ <p>AI feedback is feedback provided by an AI model, instead of being provided by a human.</p>

+ <h4 style="text-align: center;">How is distilabel different from other frameworks?</h4>

+ <p>Distilabel can be used for generating synthetic data and AI feedback for a wide variety of projects including traditional predictive NLP (classification, extraction, etc.), or generative and large language model scenarios (instruction following, dialogue generation, judging etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback. So, Distilabel is focused and specifically designed to be a tool that for scalable and reliable synthetic data generation.</p>

+ <h4 style="text-align: center;">What do people use distilabel for?</h4>

+ <p>The Argilla community uses distilabel to create amazing <a href="https://huggingface.co/datasets?other=distilabel" target="_blank">datasets</a> and <a href="https://huggingface.co/models?other=distilabel" target="_blank">models</a>.</p>

+ <ul>
+ <li>The <a href="https://huggingface.co/datasets/argilla/OpenHermesPreferences" target="_blank">1M OpenHermesPreference</a> is a dataset of ~1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how we can use distilabel to <strong>synthesize data on an immense scale</strong>.</li>
+ <li>Our <a href="https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs" target="_blank">distilabeled Intel Orca DPO dataset</a> and the <a href="https://huggingface.co/argilla/distilabeled-OpenHermes-2.5-Mistral-7B" target="_blank">improved OpenHermes model</a>, show how we <strong>improve model performance by filtering out 50%</strong> of the original dataset through <strong>AI feedback</strong>.</li>
+ <li>The <a href="https://github.com/davanstrien/haiku-dpo" target="_blank">haiku DPO data</a> outlines how anyone can create a <strong>dataset for a specific task</strong> and <strong>the latest research papers</strong> to improve the quality of the dataset.</li>
+ </ul>
+ </div>
+                 """
+             )
+         with gr.Column(scale=1):
+             pass
src/distilabel_dataset_generator/apps/sft.py CHANGED
@@ -5,20 +5,20 @@ from typing import Union
  import gradio as gr
  import pandas as pd
  from distilabel.distiset import Distiset
- from huggingface_hub import whoami

  from src.distilabel_dataset_generator.pipelines.sft import (
      DEFAULT_DATASET,
      DEFAULT_DATASET_DESCRIPTION,
      DEFAULT_SYSTEM_PROMPT,
      PROMPT_CREATION_PROMPT,
      get_pipeline,
      get_prompt_generation_step,
  )
  from src.distilabel_dataset_generator.utils import (
      get_login_button,
      get_org_dropdown,
-     get_token
  )


@@ -60,13 +60,13 @@ def generate_sample_dataset(system_prompt, progress=gr.Progress()):


  def generate_dataset(
-     system_prompt,
-     num_turns=1,
-     num_rows=5,
-     private=True,
-     org_name=None,
-     repo_name=None,
-     oauth_token: Union[gr.OAuthToken, None] = None,
      progress=gr.Progress(),
  ):
      repo_id = (
@@ -141,18 +141,13 @@ def generate_dataset(
      return pd.DataFrame(outputs)


- def generate_pipeline_code() -> str:
-     with open("src/distilabel_dataset_generator/pipelines/sft.py", "r") as f:
-         pipeline_code = f.read()
-
-     return pipeline_code
-
  def swap_visibilty(profile: Union[gr.OAuthProfile, None]):
      if profile is None:
-         return gr.update(elem_classes=["main_ui_logged_out"])
      else:
          return gr.update(elem_classes=["main_ui_logged_in"])


  css = """
  .main_ui_logged_out{opacity: 0.3; pointer-events: none}
  """
@@ -160,9 +155,16 @@ css = """
  with gr.Blocks(
      title="⚗️ Distilabel Dataset Generator",
      head="⚗️ Distilabel Dataset Generator",
-     css=css
  ) as app:
-     get_login_button()
      gr.Markdown("## Iterate on a sample dataset")
      with gr.Column() as main_ui:
          dataset_description = gr.TextArea(
@@ -237,6 +239,12 @@ with gr.Blocks(
      )

      with gr.Row(variant="panel"):
          org_name = get_org_dropdown()
          repo_name = gr.Textbox(label="Repo name", placeholder="dataset_name")
          private = gr.Checkbox(
@@ -281,6 +289,7 @@ with gr.Blocks(
              private,
              org_name,
              repo_name,
          ],
          outputs=[table],
          show_progress=True,
@@ -294,10 +303,28 @@ with gr.Blocks(

      with gr.Accordion("Run this pipeline on Distilabel", open=False):
          pipeline_code = gr.Code(
-             value=generate_pipeline_code(),
              language="python",
              label="Distilabel Pipeline Code",
          )

      app.load(get_org_dropdown, outputs=[org_name])
      app.load(fn=swap_visibilty, outputs=main_ui)
 
@@ -5,20 +5,20 @@ from typing import Union
  import gradio as gr
  import pandas as pd
  from distilabel.distiset import Distiset

  from src.distilabel_dataset_generator.pipelines.sft import (
      DEFAULT_DATASET,
      DEFAULT_DATASET_DESCRIPTION,
      DEFAULT_SYSTEM_PROMPT,
      PROMPT_CREATION_PROMPT,
+     generate_pipeline_code,
      get_pipeline,
      get_prompt_generation_step,
  )
  from src.distilabel_dataset_generator.utils import (
      get_login_button,
      get_org_dropdown,
+     get_token,
  )


@@ -60,13 +60,13 @@ def generate_sample_dataset(system_prompt, progress=gr.Progress()):


  def generate_dataset(
+     system_prompt: str,
+     num_turns: int = 1,
+     num_rows: int = 5,
+     private: bool = True,
+     org_name: str = None,
+     repo_name: str = None,
+     oauth_token: str = None,
      progress=gr.Progress(),
  ):
      repo_id = (
@@ -141,18 +141,13 @@ def generate_dataset(
      return pd.DataFrame(outputs)


  def swap_visibilty(profile: Union[gr.OAuthProfile, None]):
      if profile is None:
+         return gr.update(elem_classes=["main_ui_logged_out"]), gr.Mark
      else:
          return gr.update(elem_classes=["main_ui_logged_in"])

+
  css = """
  .main_ui_logged_out{opacity: 0.3; pointer-events: none}
  """
@@ -160,9 +155,16 @@ css = """
  with gr.Blocks(
      title="⚗️ Distilabel Dataset Generator",
      head="⚗️ Distilabel Dataset Generator",
+     css=css,
  ) as app:
+     with gr.Row():
+         with gr.Column(scale=1):
+             get_login_button()
+         with gr.Column(scale=2):
+             gr.Markdown(
+                 "This token will only be used to push the dataset to the Hugging Face Hub. It won't be incurring any costs because we are using Free Serverless Inference Endpoints."
+             )
+
      gr.Markdown("## Iterate on a sample dataset")
      with gr.Column() as main_ui:
          dataset_description = gr.TextArea(
@@ -237,6 +239,12 @@ with gr.Blocks(
      )

      with gr.Row(variant="panel"):
+         hf_token = gr.Textbox(
+             label="Hugging Face Token",
+             placeholder="hf_...",
+             type="password",
+             visible=False,
+         )
          org_name = get_org_dropdown()
          repo_name = gr.Textbox(label="Repo name", placeholder="dataset_name")
          private = gr.Checkbox(
@@ -281,6 +289,7 @@ with gr.Blocks(
              private,
              org_name,
              repo_name,
+             hf_token,
          ],
          outputs=[table],
          show_progress=True,
@@ -294,10 +303,28 @@ with gr.Blocks(

      with gr.Accordion("Run this pipeline on Distilabel", open=False):
          pipeline_code = gr.Code(
+             value=generate_pipeline_code(
+                 system_prompt.value, num_turns.value, num_rows.value
+             ),
              language="python",
              label="Distilabel Pipeline Code",
          )

+     system_prompt.change(
+         fn=generate_pipeline_code,
+         inputs=[system_prompt, num_turns, num_rows],
+         outputs=[pipeline_code],
+     )
+     num_turns.change(
+         fn=generate_pipeline_code,
+         inputs=[system_prompt, num_turns, num_rows],
+         outputs=[pipeline_code],
+     )
+     num_rows.change(
+         fn=generate_pipeline_code,
+         inputs=[system_prompt, num_turns, num_rows],
+         outputs=[pipeline_code],
+     )
+     app.load(get_token, outputs=[hf_token])
      app.load(get_org_dropdown, outputs=[org_name])
      app.load(fn=swap_visibilty, outputs=main_ui)
src/distilabel_dataset_generator/pipelines/sft.py CHANGED
@@ -129,13 +129,64 @@ DEFAULT_DATASET = pd.DataFrame(
          ],
      }
  )


- def get_pipeline(num_turns, num_rows, system_prompt):
      if num_turns == 1:
-         output_mappings = {"instruction": "prompt", "response": "completion"}
      else:
-         output_mappings = {"conversation": "messages"}
      with Pipeline(name="sft") as pipeline:
          magpie = MagpieGenerator(
              llm=InferenceEndpointsLLM(
@@ -147,13 +198,7 @@ def get_pipeline(num_turns, num_rows, system_prompt):
                      "temperature": 0.8, # it's the best value for Llama 3.1 70B Instruct
                      "do_sample": True,
                      "max_new_tokens": 2048,
-                     "stop_sequences": [
-                         "<|eot_id|>",
-                         "<|end_of_text|>",
-                         "<|start_header_id|>",
-                         "<|end_header_id|>",
-                         "assistant",
-                     ],
                  },
              ),
              batch_size=2,
@@ -179,6 +224,7 @@ def get_prompt_generation_step():
                  "temperature": 0.8,
                  "max_new_tokens": 2048,
                  "do_sample": True,
              },
          ),
          use_system_prompt=True,
 
@@ -129,13 +129,64 @@ DEFAULT_DATASET = pd.DataFrame(
          ],
      }
  )
+ _STOP_SEQUENCES = [
+     "<|eot_id|>",
+     "<|start_header_id|>",
+     "assistant",
+     " \n\n",
+ ]


+ def _get_output_mappings(num_turns):
      if num_turns == 1:
+         return {"instruction": "prompt", "response": "completion"}
      else:
+         return {"conversation": "messages"}
+
+
+ def generate_pipeline_code(system_prompt, num_turns, num_rows):
+     input_mappings = _get_output_mappings(num_turns)
+     code = f"""
+ from distilabel.pipeline import Pipeline
+ from distilabel.steps import KeepColumns
+ from distilabel.steps.tasks import MagpieGenerator
+ from distilabel.llms import InferenceEndpointsLLM
+
+ MODEL = "{MODEL}"
+ SYSTEM_PROMPT = "{system_prompt}"
+
+ with Pipeline(name="sft") as pipeline:
+     magpie = MagpieGenerator(
+         llm=InferenceEndpointsLLM(
+             model_id=MODEL,
+             tokenizer_id=MODEL,
+             magpie_pre_query_template="llama3",
+             generation_kwargs={{
+                 "temperature": 0.8,
+                 "do_sample": True,
+                 "max_new_tokens": 2048,
+                 "stop_sequences": {_STOP_SEQUENCES}
+             }}
+         ),
+         n_turns={num_turns},
+         num_rows={num_rows},
+         system_prompt=SYSTEM_PROMPT,
+         output_mappings={input_mappings},
+     )
+     keep_columns = KeepColumns(
+         columns={list(input_mappings.values())} + ["model_name"],
+     )
+     magpie.connect(keep_columns)
+
+ if __name__ == "__main__":
+     distiset = pipeline.run()
+ """
+     return code
+
+
+ def get_pipeline(num_turns, num_rows, system_prompt):
+     input_mappings = _get_output_mappings(num_turns)
+     output_mappings = input_mappings
      with Pipeline(name="sft") as pipeline:
          magpie = MagpieGenerator(
              llm=InferenceEndpointsLLM(
@@ -147,13 +198,7 @@ def get_pipeline(num_turns, num_rows, system_prompt):
                      "temperature": 0.8, # it's the best value for Llama 3.1 70B Instruct
                      "do_sample": True,
                      "max_new_tokens": 2048,
+                     "stop_sequences": _STOP_SEQUENCES,
                  },
              ),
              batch_size=2,
@@ -179,6 +224,7 @@ def get_prompt_generation_step():
                  "temperature": 0.8,
                  "max_new_tokens": 2048,
                  "do_sample": True,
+                 "stop_sequences": _STOP_SEQUENCES,
              },
          ),
          use_system_prompt=True,
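
Review note on `generate_pipeline_code`: the template places `system_prompt` between literal double quotes, so a prompt that contains a double quote or a newline would presumably render a snippet that no longer parses. One possible alternative, sketched here with the standard library only, is the `!r` conversion, which inserts `repr(value)` and therefore always yields a valid Python literal. The names `render_snippet` and `parse_assignments` are hypothetical and not part of this commit, and the snippet is a reduced stand-in for the real template.

```python
import ast

# Same values as _STOP_SEQUENCES in the diff, copied here to keep the sketch self-contained.
STOP_SEQUENCES = ["<|eot_id|>", "<|start_header_id|>", "assistant", " \n\n"]


def render_snippet(system_prompt: str, num_turns: int, num_rows: int) -> str:
    """Render a small Python snippet from user-controlled values (hypothetical helper)."""
    return (
        # !r inserts repr(value), so quotes and newlines are escaped for us.
        f"SYSTEM_PROMPT = {system_prompt!r}\n"
        f"N_TURNS = {num_turns!r}\n"
        f"NUM_ROWS = {num_rows!r}\n"
        # Doubled braces render as literal braces, as in the template above.
        f"GENERATION_KWARGS = {{'stop_sequences': {STOP_SEQUENCES!r}}}\n"
    )


def parse_assignments(source: str) -> dict:
    """Read simple `NAME = literal` assignments without executing the source."""
    values = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            values[node.targets[0].id] = ast.literal_eval(node.value)
    return values


tricky_prompt = 'You are a "helpful" assistant.\nAnswer briefly.'
rendered = render_snippet(tricky_prompt, num_turns=1, num_rows=5)
parsed = parse_assignments(rendered)
print(parsed["SYSTEM_PROMPT"] == tricky_prompt)
```

Parsing the rendered text with `ast` is only a cheap way to check that it round-trips; whether the app needs this depends on which prompts users are expected to enter.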
src/distilabel_dataset_generator/utils.py CHANGED
@@ -8,7 +8,7 @@ from gradio.oauth import (
  )
  from huggingface_hub import whoami

- if (
      all(
          [
              OAUTH_CLIENT_ID,
@@ -18,28 +18,19 @@ if (
          ]
      )
      or get_space() is None
- ):
      from gradio.oauth import OAuthToken
  else:
      OAuthToken = str


  def get_login_button():
-     if (
-         all(
-             [
-                 OAUTH_CLIENT_ID,
-                 OAUTH_CLIENT_SECRET,
-                 OAUTH_SCOPES,
-                 OPENID_PROVIDER_URL,
-             ]
-         )
-         or get_space() is None
-     ):
-         return gr.LoginButton(
-             value="Sign in with Hugging Face! (This resets the session state.)",
-             size="lg",
-         )


  def get_duplicate_button():
@@ -51,6 +42,7 @@ def list_orgs(oauth_token: OAuthToken = None):
      if oauth_token is None:
          return []
      data = whoami(oauth_token.token)
      organisations = [
          entry["entity"]["name"]
          for entry in data["auth"]["accessToken"]["fineGrained"]["scoped"]
 
@@ -8,7 +8,7 @@ from gradio.oauth import (
  )
  from huggingface_hub import whoami

+ _CHECK_IF_SPACE_IS_SET = (
      all(
          [
              OAUTH_CLIENT_ID,
@@ -18,28 +18,19 @@ if (
          ]
      )
      or get_space() is None
+ )
+
+ if _CHECK_IF_SPACE_IS_SET:
      from gradio.oauth import OAuthToken
  else:
      OAuthToken = str


  def get_login_button():
+     return gr.LoginButton(
+         value="Sign in with Hugging Face!",
+         size="lg",
+     )


  def get_duplicate_button():
@@ -51,6 +42,7 @@ def list_orgs(oauth_token: OAuthToken = None):
      if oauth_token is None:
          return []
      data = whoami(oauth_token.token)
+     print(data["auth"])
      organisations = [
          entry["entity"]["name"]
          for entry in data["auth"]["accessToken"]["fineGrained"]["scoped"]
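
Review note on `list_orgs`: the added `print(data["auth"])` looks like a leftover debug statement, and the chained indexing raises `KeyError` whenever one of the nested keys is missing. A minimal defensive sketch follows. It assumes the payload shape implied by the keys visible in this hunk, and it assumes that some tokens may lack a `fineGrained` section; `extract_org_names` is a hypothetical helper, not a function from this repository or from `huggingface_hub`.

```python
def extract_org_names(whoami_payload: dict) -> list:
    """Collect entity names from a whoami-style payload, tolerating missing keys."""
    auth = whoami_payload.get("auth") or {}
    access_token = auth.get("accessToken") or {}
    fine_grained = access_token.get("fineGrained") or {}
    scoped = fine_grained.get("scoped") or []

    names = []
    for entry in scoped:
        entity = entry.get("entity") or {}
        name = entity.get("name")
        # Skip entries without a name and avoid duplicates, keeping first-seen order.
        if name and name not in names:
            names.append(name)
    return names


# Sample payloads shaped after the keys used in the diff (values are made up).
fine_grained_payload = {
    "auth": {
        "accessToken": {
            "fineGrained": {
                "scoped": [
                    {"entity": {"name": "argilla"}},
                    {"entity": {"name": "my-user"}},
                    {"entity": {"name": "argilla"}},
                ]
            }
        }
    }
}
other_payload = {"auth": {"accessToken": {"role": "write"}}}

print(extract_org_names(fine_grained_payload))
print(extract_org_names(other_payload))
```

Returning an empty list for payloads without a scoped section mirrors the existing `return []` branch for a missing token.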