instruction-pretrain committed "Update README.md" (commit dd4444c, parent f894ca2)

README.md:
We explore supervised multitask pre-training by proposing ***Instruction Pre-Training***.

**************************** **Updates** ****************************
* 2024/7/31: Updated pre-training suggestions in the `Advanced Usage` section of [instruction-synthesizer](https://huggingface.co/instruction-pretrain/instruction-synthesizer)
* 2024/7/15: We scaled up the pre-trained tokens from 100B to 250B, with the number of synthesized instruction-response pairs reaching 500M! Below, we show the performance trend on downstream tasks throughout the pre-training process:
<p align='left'>
<img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0okCfRkC6uALTfuNxt0Fa.png" width="500">
</p>

* 2024/6/21: Released the [paper](https://huggingface.co/papers/2406.14491), [code](https://github.com/microsoft/LMOps), and [resources](https://huggingface.co/instruction-pretrain)

## Resources

**🤗 We share our data and models with example usages; feel free to open any issues or discussions at [this page](https://huggingface.co/papers/2406.14491)! 🤗**

- Thanks to the demo [davanstrien/instruction-synthesizer](https://huggingface.co/spaces/davanstrien/instruction-synthesizer) for implementing our approach
- Context-Based Instruction Synthesizer: [instruction-synthesizer](https://huggingface.co/instruction-pretrain/instruction-synthesizer)
- Fine-Tuning Data for the Synthesizer: [ft-instruction-synthesizer-collection](https://huggingface.co/datasets/instruction-pretrain/ft-instruction-synthesizer-collection)
- General Models Pre-Trained from Scratch (on 100B tokens):

We conduct multitask fine-tuning on a language model to develop an instruction synthesizer.

<p align='left'>
<img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0889QyG59QM3rPeZlcTzZ.png" width="700">
</p>

### 1. Basic Usage: Synthesize instruction-response pairs based on a given raw text

**Here is an amazing demo that implements our approach: [davanstrien/instruction-synthesizer](https://huggingface.co/spaces/davanstrien/instruction-synthesizer)**
<details>
<summary> Click to expand </summary>

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# ... (model loading, generation, and output-parsing code omitted in this excerpt) ...

print(f'# Context:\n{context}\n')
for index, pair in enumerate(instruction_response_pairs):
    print(f'## Instruction {index + 1}:\n{pair["Q"]}\n## Response {index + 1}:\n{pair["A"]}\n')
```
</details>

### 2. Advanced Usage: Convert Raw Corpora into Instruction-Augmented Corpora at Scale

We use vLLM to accelerate the synthesis process. On a single A100-80GB GPU, it takes about 2 days to synthesize instruction-response pairs for 1 billion tokens of raw corpora.

<details>
<summary> Click to expand </summary>

1). Set up dependencies:

```bash
git clone https://github.com/microsoft/LMOps.git
# ... (additional setup commands omitted in this excerpt) ...
pip install vllm
```
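
Before moving on to step 2, it can help to confirm that vLLM can serve the synthesizer. The snippet below is a minimal sketch only: the raw text is a placeholder, and the actual prompting and parsing are handled by the LMOps scripts used in the next step.

```python
# Sanity-check sketch: load the synthesizer with vLLM and run one generation.
# The input text is a placeholder; the LMOps scripts build the real prompts and
# parse the generated instruction-response pairs.
from vllm import LLM, SamplingParams

llm = LLM(model="instruction-pretrain/instruction-synthesizer")
sampling_params = SamplingParams(temperature=0.0, max_tokens=400)

raw_text = "Cells are the basic structural and functional units of living organisms."
outputs = llm.generate([raw_text], sampling_params)
print(outputs[0].outputs[0].text)
```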

2). Synthesize and Templify Few-shot Examples for Pre-Training

A one-shot example consists of a piece of raw text followed by its instruction-response pairs. We conduct multi-round inference to synthesize few-shot examples: the instruction-response pairs of different raw texts share the same pattern.

```python
# ... (synthesis and templification code omitted in this excerpt) ...

# Now you can use `instruction_augmented_texts` for pre-training!
```
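
To make the templifying step concrete, here is a rough sketch of how M one-shot examples (raw text plus its synthesized instruction-response pairs) could be grouped into a single pre-training sequence. The delimiters and field names are illustrative assumptions; the LMOps scripts define the exact template.

```python
# Sketch: group M one-shot examples into instruction-augmented pre-training texts.
# The concatenation format below is an assumption for illustration only.
def templify(example):
    """example = {'text': raw_text, 'pairs': [{'Q': question, 'A': answer}, ...]}"""
    qa = "\n".join(f"Question: {p['Q']}\nAnswer: {p['A']}" for p in example["pairs"])
    return f"{example['text']}\n{qa}"

def build_few_shot_texts(examples, M=2):
    """Concatenate every M consecutive one-shot examples into one sequence."""
    return ["\n\n".join(templify(e) for e in examples[i:i + M])
            for i in range(0, len(examples), M)]

examples = [
    {"text": "Raw text about cell biology ...",
     "pairs": [{"Q": "What are cells?",
                "A": "The basic structural and functional units of living organisms."}]},
    {"text": "Raw text about gradient descent ...",
     "pairs": [{"Q": "What does gradient descent minimize?",
                "A": "A differentiable loss function."}]},
]
instruction_augmented_texts = build_few_shot_texts(examples, M=2)
```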
</details>

**Pre-Training Suggestions:**

Except for the pre-training data, *Instruction Pre-Training* keeps all other settings the same as *Vanilla Pre-Training*.

Therefore, you can easily use any training framework, such as [OLMo](https://github.com/allenai/OLMo) (for pre-training from scratch) and [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) (for continual pre-training), to train on the templified instruction-augmented corpora.

1. For general pre-training from scratch, we recommend setting M = 2 and mixing the instruction-augmented corpora with unchanged raw corpora.
2. For domain-adaptive continual pre-training, we recommend setting M = 3 and mixing the instruction-augmented corpora with general instructions from [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) at a 1:1 ratio (counted by tokens). Each OpenOrca example is formatted as "{question} {response}", with a single blank space connecting the question and the response; a rough mixing sketch follows below.
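
As an illustration of suggestion 2, the sketch below formats OpenOrca examples as "{question} {response}" and adds them until their token count roughly matches that of the instruction-augmented corpora. The column names and the choice of tokenizer for counting are assumptions, not part of the original recipe.

```python
# Sketch: mix instruction-augmented corpora with OpenOrca at ~1:1 by token count.
# Assumptions: OpenOrca exposes 'question'/'response' columns, and the synthesizer's
# tokenizer is a reasonable proxy for counting tokens.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("instruction-pretrain/instruction-synthesizer")

def num_tokens(text):
    return len(tokenizer(text, add_special_tokens=False)["input_ids"])

# `instruction_augmented_texts` comes from the Advanced Usage step above.
domain_budget = sum(num_tokens(t) for t in instruction_augmented_texts)

general_texts, general_tokens = [], 0
for row in load_dataset("Open-Orca/OpenOrca", split="train", streaming=True):
    text = f"{row['question']} {row['response']}"  # blank space joins question and response
    general_texts.append(text)
    general_tokens += num_tokens(text)
    if general_tokens >= domain_budget:  # stop at roughly a 1:1 token ratio
        break

mixed_corpus = instruction_augmented_texts + general_texts
```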

Let's try our method in continual pre-training for a quick start; it works easily!

Feel free to ask for any suggestions at [this page](https://huggingface.co/papers/2406.14491); we will reply ASAP 🤗!

## Citation

If you find our work helpful, please cite us:

Instruction Pre-Training
```bibtex
...
}
```

[Adapt LLM to Domains](https://huggingface.co/papers/2309.09530)
```bibtex
@inproceedings{
cheng2024adapting,