Text Generation
Transformers
Safetensors
English
mistral
text-generation-inference
Inference Endpoints
instruction-pretrain committed on
Commit dd4444c • 1 Parent(s): f894ca2

Update README.md

Files changed (1)
  1. README.md +24 -8
README.md CHANGED
@@ -15,6 +15,7 @@ We explore supervised multitask pre-training by proposing ***Instruction Pre-Tra
 </p>
 
 **************************** **Updates** ****************************
+* 2024/7/31: Updated pre-training suggestions in the `Advanced Usage` section of [instruction-synthesizer](https://huggingface.co/instruction-pretrain/instruction-synthesizer)
 * 2024/7/15: We scaled up the pre-trained tokens from 100B to 250B, with the number of synthesized instruction-response pairs reaching 500M! Below, we show the performance trend on downstream tasks throughout the pre-training process:
 <p align='left'>
 <img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0okCfRkC6uALTfuNxt0Fa.png" width="500">
@@ -22,8 +23,9 @@ We explore supervised multitask pre-training by proposing ***Instruction Pre-Tra
 * 2024/6/21: Released the [paper](https://huggingface.co/papers/2406.14491), [code](https://github.com/microsoft/LMOps), and [resources](https://huggingface.co/instruction-pretrain)
 
 ## Resources
-**🤗 We share our data and models with example usages, feel free to open any issues or discussions! 🤗**
+**🤗 We share our data and models with example usages, feel free to open any issues or discussions at [this page](https://huggingface.co/papers/2406.14491)! 🤗**
 
+- Thanks to the demo [davanstrien/instruction-synthesizer](https://huggingface.co/spaces/davanstrien/instruction-synthesizer) for implementing our approach
 - Context-Based Instruction Synthesizer: [instruction-synthesizer](https://huggingface.co/instruction-pretrain/instruction-synthesizer)
 - Fine-Tuning Data for the Synthesizer: [ft-instruction-synthesizer-collection](https://huggingface.co/datasets/instruction-pretrain/ft-instruction-synthesizer-collection)
 - General Models Pre-Trained from Scratch (on 100B tokens):
@@ -43,9 +45,12 @@ We conduct multitask fine-tuning on a language model to develop an instruction s
 <img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0889QyG59QM3rPeZlcTzZ.png" width="700">
 </p>
 
-### Basic Usage: Synthesize instruction-response pairs based on a given raw text
+### 1. Basic Usage: Synthesize instruction-response pairs based on a given raw text
 
 **💗 Here is an amazing demo that implements our approach: [davanstrien/instruction-synthesizer](https://huggingface.co/spaces/davanstrien/instruction-synthesizer) 💗**
+<details>
+<summary> Click to expand </summary>
+
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
@@ -98,11 +103,15 @@ print(f'# Context:\n{context}\n')
 for index, pair in enumerate(instruction_response_pairs):
     print(f'## Instruction {index + 1}:\n{pair["Q"]}\n## Response {index + 1}:\n{pair["A"]}\n')
 ```
+</details>
 
-### Advanced Usage: Convert Raw Corpora into Instruction-Augmented Corpora at Scale
+### 2. Advanced Usage: Convert Raw Corpora into Instruction-Augmented Corpora at Scale
 We use vLLM to accelerate the synthesis process. On a single A100-80GB GPU, it takes about 2 days to synthesize instruction-response pairs for 1 billion tokens of raw corpora.
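Before the step-by-step setup below, here is a rough sketch of what the batched vLLM generation at the core of this stage looks like. The prompt string, sampling settings, and text chunks are placeholder assumptions; the actual prompt templates and output parsing ship with the scripts in the LMOps repo.

```python
# Minimal sketch of batched synthesis with vLLM (placeholder prompt format;
# the real templates and output parsing come from the LMOps scripts).
from vllm import LLM, SamplingParams

llm = LLM(model="instruction-pretrain/instruction-synthesizer")
sampling_params = SamplingParams(temperature=0.0, max_tokens=400)

raw_texts = ["<raw text chunk 1>", "<raw text chunk 2>"]  # chunks of your corpora
prompts = [f"{text}\n\n" for text in raw_texts]           # placeholder prompt

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    # Each completion holds the synthesized instruction-response pairs,
    # which still need to be parsed into Q/A form.
    print(output.outputs[0].text)
```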
 
-1. Set up dependencies:
+<details>
+<summary> Click to expand </summary>
+
+1). Set up dependencies:
 
 ```bash
 git clone https://github.com/microsoft/LMOps.git
@@ -115,7 +124,7 @@ Install vLLM with pip or from [source](https://vllm.readthedocs.io/en/latest/get
 pip install vllm
 ```
 
-2. Synthesize and Templify Few-shot Examples for Pre-Training
+2). Synthesize and Templify Few-shot Examples for Pre-Training
 
 A one-shot example consists of a piece of raw text followed by its instruction-response pairs. We conduct multi-round inference to synthesize few-shot examples: the instruction-response pairs of different raw texts share the same pattern.
 
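To make the shared pattern concrete, here is an illustrative sketch (not the repository's templify script; the `Q:`/`A:` delimiters and the `text`/`pairs` field names are assumptions) of grouping M one-shot examples into a single few-shot pre-training example:

```python
# Illustrative sketch only: group M one-shot examples (raw text + its synthesized
# instruction-response pairs) into a single few-shot pre-training example.
# Delimiters and field names are assumptions, not the repo's exact format.
from typing import Dict, List

def render_one_shot(text: str, pairs: List[Dict[str, str]]) -> str:
    """A one-shot example: the raw text followed by its instruction-response pairs."""
    qa_block = "\n".join(f"Q: {p['Q']}\nA: {p['A']}" for p in pairs)
    return f"{text}\n{qa_block}"

def templify_few_shot(examples: List[Dict], M: int = 2) -> List[str]:
    """Concatenate M consecutive one-shot examples so their pairs share one pattern."""
    return [
        "\n\n".join(render_one_shot(e["text"], e["pairs"]) for e in examples[i:i + M])
        for i in range(0, len(examples) - M + 1, M)
    ]

# Example with M = 2: two raw texts, each followed by its own pairs
examples = [
    {"text": "Raw text A ...", "pairs": [{"Q": "What does A describe?", "A": "..."}]},
    {"text": "Raw text B ...", "pairs": [{"Q": "Why does B hold?", "A": "..."}]},
]
print(templify_few_shot(examples, M=2)[0])
```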
@@ -168,13 +177,20 @@ for idx, text in enumerate(instruction_augmented_texts):
 
 # Now you can use `instruction_augmented_texts` for pre-training!
 ```
+</details>
 
 **Pre-Training Suggestions:**
 
-Except for the pre-training data, *Instruction Pre-Training* keeps all other pre-training settings the same as *Vanilla Pre-Training*.
+Except for the pre-training data, *Instruction Pre-Training* keeps all other settings the same as *Vanilla Pre-Training*.
+
+Therefore, you can easily use any training framework, such as [OLMo](https://github.com/allenai/OLMo) (for pre-training from scratch) and [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) (for continual pre-training), to train on the templified instruction-augmented corpora.
 
 1. For general pre-training from scratch, we recommend setting M = 2 and mixing the instruction-augmented corpora with unchanged raw corpora.
-2. For domain-adaptive continual pre-training, we recommend setting M = 3 and mixing the instruction-augmented corpora with general instructions from [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) at a 1:1 ratio (counted by tokens).
+2. For domain-adaptive continual pre-training, we recommend setting M = 3 and mixing the instruction-augmented corpora with general instructions from [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) at a 1:1 ratio (counted by tokens). Each example from OpenOrca is formulated as "{question} {response}", with a blank space used to connect the question and response.
+
+Let's try our method in continual pre-training for a quick start; it works easily!
+
+Feel free to ask for any suggestions at [this page](https://huggingface.co/papers/2406.14491); we will reply ASAP 🤗!
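As a rough sketch of suggestion 2 above (not the authors' exact mixing recipe; the tokenizer choice and the streaming access pattern are assumptions, while the `question`/`response` fields do match the OpenOrca dataset), general instructions can be interleaved with the instruction-augmented texts until the token counts are roughly 1:1:

```python
# Sketch: mix instruction-augmented texts with OpenOrca at ~1:1 by token count.
# Each OpenOrca example is rendered as "{question} {response}" with a blank space.
from datasets import load_dataset
from transformers import AutoTokenizer

# Any tokenizer works for counting; the synthesizer's tokenizer is used here as an example.
tokenizer = AutoTokenizer.from_pretrained("instruction-pretrain/instruction-synthesizer")
orca = iter(load_dataset("Open-Orca/OpenOrca", split="train", streaming=True))

def num_tokens(text: str) -> int:
    return len(tokenizer(text, add_special_tokens=False)["input_ids"])

def mix_with_general_instructions(instruction_augmented_texts):
    """Yield domain texts and OpenOrca examples at roughly a 1:1 token ratio."""
    domain_tokens, general_tokens = 0, 0
    for text in instruction_augmented_texts:
        yield text
        domain_tokens += num_tokens(text)
        while general_tokens < domain_tokens:  # top up general instructions
            example = next(orca)
            general_text = f'{example["question"]} {example["response"]}'
            yield general_text
            general_tokens += num_tokens(general_text)
```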
 
 
 
 
 
 ## Citation
 If you find our work helpful, please cite us:
@@ -189,7 +205,7 @@ Instruction Pre-Training
 }
 ```
 
-[AdaptLLM](https://huggingface.co/papers/2309.09530)
+[Adapt LLM to Domains](https://huggingface.co/papers/2309.09530)
 ```bibtex
 @inproceedings{
 cheng2024adapting,
 