Text Generation
Transformers
Safetensors
English
mistral
text-generation-inference
Inference Endpoints
instruction-pretrain committed on
Commit 1f19d3c
1 Parent(s): ec2d76f

Update README.md

Files changed (1)
  1. README.md +52 -82
README.md CHANGED
@@ -26,6 +26,8 @@ We explore supervised multitask pre-training by proposing ***Instruction Pre-Tra
  - Domain-Specific Models Pre-Trained from Llama3-8B:
  - [Finance-Llama3-8B](https://huggingface.co/instruction-pretrain/finance-Llama3-8B)
  - [Biomedicine-Llama3-8B](https://huggingface.co/instruction-pretrain/medicine-Llama3-8B)

  ## Synthesize Instruction-Response Pairs to Augment Any Raw Corpora
@@ -92,105 +94,73 @@ for index, pair in enumerate(instruction_response_pairs):
  print(f'## Instruction {index + 1}:\n{pair["Q"]}\n## Response {index + 1}:\n{pair["A"]}\n')
  ```

- ### Advanced Usage: Synthesize Few-shot Examples
- A one-shot example consists of a piece of raw text followed by its instruction-response pairs. You can conduct multi-round inference to synthesize a few-shot example: the instruction-response pairs of different raw texts share the same pattern.

- To accelerate synthesis, we use the [vLLM framework](https://github.com/vllm-project/vllm?tab=readme-ov-file):
- <details>
- <summary> Click to expand </summary>

- 1. Set up dependencies:
  Install vLLM with pip or from [source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):

  ```bash
  pip install vllm
  ```

- 2. Synthesize:
- ```python
- from vllm import LLM, SamplingParams
-
- # Put your list of raw texts here,
- # a list of M raw texts can be converted into an M-shot example:
- text_list = [
- "Genetically and medically susceptible workers.\nThe likelihood of an individual becoming ill from a hazardous material or condition is strongly influenced by both their genetic makeup and their underlying state of health. Although the past decade has seen great advances in understanding human variation in health and genetic polymorphisms and in the diagnosis and treatment of disease, much less progress has been made in effectively using this information to protect worker health. Scientific evidence for increased susceptibility often is weak and rarely satisfies legal thresholds for sufficient risk to warrant exclusion from a particular job. When public safety is a major concern, many legally mandated exclusions are not well justified. Medical opinions about fitness to work should be based upon a systematic and credible analysis of the condition, its relationship to ability and risk for a particular job, and knowledge of possible accommodations. Conclusions should reflect the limitations of scientific knowledge and guidance from antidiscrimination legislation.",
- "Exclusive Breastfeeding for Twin Babies and Its Influencing Factors: A Study in East Java, Indonesia.\nThis study aimed to identify the factors that influence the success of exclusive breastfeeding in twins. This cross-sectional study was conducted on 184 mothers who had twins aged 6-23 months in Malang Raya, East Java, Indonesia and used the consecutive sampling technique. The data was collected through distributing questionnaires containing questions related to knowledge about exclusive breastfeeding, breastfeeding self-efficacy, and the support of family and certified health workers. Multinomial regression statistical test results show that the most influential factor for the success of exclusive breastfeeding with twins was breastfeeding self-efficacy (OR 0.111; 95% CI 0.033-0.387). A high level of breastfeeding self-efficacy can increase a mother's confidence to be able to provide exclusive breastfeeding for twins. This study suggests that nurses can provide breastfeeding counselling to improve breastfeeding self-efficacy."]
-
- # Create a sampling params object.
- sampling_params = SamplingParams(temperature=0, max_tokens=400)
-
- # Load the model and tokenizer
- llm = LLM(model="instruction-pretrain/instruction-synthesizer", max_model_len=4096)
-
- # Templates (please do NOT change them)
- context_template = ' <CON> {context} </CON>'
- QA_template = '<QUE> {question} <ANS> {answer} </END>'
- delimiter = '\n\n'
- bos_token = '<s>'
- eos_token = '</s>'
-
- def cook_context(raw_context):
-     """Format the context."""
-     return bos_token + context_template.replace('{context}', raw_context) + delimiter
-
- def cook_instruction_response_pairs(QA_list):
-     """Format downstream instruction(Q)-response(A) pairs."""
-     ins_res_list = []
-     for qa_entry in QA_list:
-         qa = QA_template.replace('{question}', qa_entry['Q']).replace('{answer}', qa_entry['A'])
-         ins_res_list.append(qa)
-     return delimiter.join(ins_res_list) + eos_token
-
- def parse_pred(pred):
-     """Extract the list of instruction-response pairs from the prediction"""
-     QA_str_list = pred.split('</END>')
-     if not pred.endswith('</END>'):
-         QA_str_list = QA_str_list[:-1]
-
-     QA_list = []
-     raw_questions = []
-     for QA_str in QA_str_list:
-         try:
-             assert len(QA_str.split('<ANS>')) == 2, f'invalid QA string: {QA_str}'
-             Q_str, A_str = QA_str.split('<ANS>')
-             Q_str, A_str = Q_str.strip(), A_str.strip()
-             assert Q_str.startswith('<QUE>'), f'invalid question string: {Q_str} in QA_str: {QA_str}'
-             assert len(A_str) > 0, f'invalid answer string in QA_str: {QA_str}'
-             Q_str = Q_str.replace('<QUE>', '').strip()
-             assert Q_str.lower() not in raw_questions, f'duplicate question: {Q_str}'
-             QA_list.append({'Q': Q_str, 'A': A_str})
-             raw_questions.append(Q_str.lower())
-         except:
-             pass
-
-     return QA_list
-
- def get_instruction_response_pairs(context):
-     '''Prompt the synthesizer to generate instruction-response pairs based on the given context'''
-     outputs = llm.generate(context, sampling_params, use_tqdm=False)
-     pred = outputs[0].outputs[0].text
-     return parse_pred(pred)
-
- # Process each text and generate instruction-response pairs in multi-round inference:
- previous_examples = []
- for cur_text in text_list:
-     # Prepend raw texts and instruction-response pairs of previous examples to the current text
-     context = ''
-     for previous_example in previous_examples:
-         context += cook_context(previous_example['text']) + cook_instruction_response_pairs(previous_example['instruction_response_pairs'])
-     context += cook_context(cur_text)
-
-     # Get the generated instruction-response pairs
-     instruction_response_pairs = get_instruction_response_pairs(context)
-     previous_examples.append({'text': cur_text, 'instruction_response_pairs': instruction_response_pairs})
-
- # Concatenate the raw texts and instruction-response pairs of M rounds to constitute an M-shot example
- for example in previous_examples:
-     print(f'# Raw Text:\n{example["text"]}\n')
-     for index, pair in enumerate(example['instruction_response_pairs']):
-         print(f'## Instruction {index + 1}:\n{pair["Q"]}\n## Response {index + 1}:\n{pair["A"]}\n')
  ```
- </details>

  ## Citation
 
  - Domain-Specific Models Pre-Trained from Llama3-8B:
  - [Finance-Llama3-8B](https://huggingface.co/instruction-pretrain/finance-Llama3-8B)
  - [Biomedicine-Llama3-8B](https://huggingface.co/instruction-pretrain/medicine-Llama3-8B)
+ - General Instruction-Augmented Corpora: [general-instruction-augmented-corpora](https://huggingface.co/datasets/instruction-pretrain/general-instruction-augmented-corpora)
+ - Domain-Specific Instruction-Augmented Corpora (no finance data to avoid ethical issues): [medicine-instruction-augmented-corpora](https://huggingface.co/datasets/instruction-pretrain/medicine-instruction-augmented-corpora)

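A minimal sketch for pulling one of these instruction-augmented corpora down for inspection, assuming the standard Hugging Face `datasets` API; the `train` split name and the file layout are assumptions, so check the dataset card before relying on them:

```python
from datasets import load_dataset

# Assumption: the corpus exposes a "train" split; if the repo is organized as raw
# JSONL shards instead, pass data_files=... as described on the dataset card.
corpus = load_dataset("instruction-pretrain/general-instruction-augmented-corpora", split="train")

print(corpus)      # column names and number of rows
print(corpus[0])   # one instruction-augmented record
```
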
  ## Synthesize Instruction-Response Pairs to Augment Any Raw Corpora
 
  print(f'## Instruction {index + 1}:\n{pair["Q"]}\n## Response {index + 1}:\n{pair["A"]}\n')
  ```

+ ### Advanced Usage: Convert Raw Corpora into Instruction-Augmented Corpora at Scale
+ 1. Set up dependencies:

+ ```bash
+ git clone https://github.com/microsoft/LMOps.git
+ cd LMOps/instruction_pretrain
+ ```

  Install vLLM with pip or from [source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):

  ```bash
  pip install vllm
  ```

+ 2. Synthesize and Templify Few-shot Examples for Pre-Training

+ A one-shot example consists of a piece of raw text followed by its instruction-response pairs. We conduct multi-round inference to synthesize few-shot examples: the instruction-response pairs of different raw texts share the same pattern.
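For intuition, the sketch below (schematic, not the repo's code) lays out how one cooked one-shot example looks, using the `<CON>`/`<QUE>`/`<ANS>` tags defined by the synthesizer templates earlier on this card; an M-shot example concatenates M such blocks:

```python
# Schematic only: the placeholders in braces stand for real raw text and synthesized pairs.
one_shot_example = (
    "<s> <CON> {raw text} </CON>\n\n"
    "<QUE> {instruction 1} <ANS> {response 1} </END>\n\n"
    "<QUE> {instruction 2} <ANS> {response 2} </END></s>"
)
print(one_shot_example)
```
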
 
+ Suppose there are N pieces of raw text in the corpora, and you would like to convert them into M-shot examples:

+ ```python
+ from vllm import LLM, SamplingParams
+ from utils.read_compre import get_dataset, cook_pt_entries, run
+
+ # Put your list of raw texts here
+ raw_texts = [
+ "Genetically and medically susceptible workers.\nThe likelihood of an individual becoming ill from a hazardous material or condition is strongly influenced by both their genetic makeup and their underlying state of health. Although the past decade has seen great advances in understanding human variation in health and genetic polymorphisms and in the diagnosis and treatment of disease, much less progress has been made in effectively using this information to protect worker health. Scientific evidence for increased susceptibility often is weak and rarely satisfies legal thresholds for sufficient risk to warrant exclusion from a particular job. When public safety is a major concern, many legally mandated exclusions are not well justified. Medical opinions about fitness to work should be based upon a systematic and credible analysis of the condition, its relationship to ability and risk for a particular job, and knowledge of possible accommodations. Conclusions should reflect the limitations of scientific knowledge and guidance from antidiscrimination legislation.",
+ "Exclusive Breastfeeding for Twin Babies and Its Influencing Factors: A Study in East Java, Indonesia.\nThis study aimed to identify the factors that influence the success of exclusive breastfeeding in twins. This cross-sectional study was conducted on 184 mothers who had twins aged 6-23 months in Malang Raya, East Java, Indonesia and used the consecutive sampling technique. The data was collected through distributing questionnaires containing questions related to knowledge about exclusive breastfeeding, breastfeeding self-efficacy, and the support of family and certified health workers. Multinomial regression statistical test results show that the most influential factor for the success of exclusive breastfeeding with twins was breastfeeding self-efficacy (OR 0.111; 95% CI 0.033-0.387). A high level of breastfeeding self-efficacy can increase a mother's confidence to be able to provide exclusive breastfeeding for twins. This study suggests that nurses can provide breastfeeding counselling to improve breastfeeding self-efficacy."]
 
+ N = len(raw_texts) # Number of raw texts
+ M = 2 # M-shot example
+ max_model_len = 4096 # max sequence length of the LM you intend to pre-train
+ max_new_tokens = 400 # max number of tokens for the augmented instruction-response pairs
+
+ # Create a sampling params object.
+ sampling_params = SamplingParams(temperature=0, max_tokens=max_new_tokens)
+
+ # Load the model and tokenizer
+ llm = LLM(model="instruction-pretrain/instruction-synthesizer", max_model_len=max_model_len)
+
+ # 1. multi-round inference to get the prediction
+ prev_examples = []
+ BSZ = (N+M-1)//M
+ for round in range(M):
+     cur_raw_texts = raw_texts[round*BSZ: (round+1)*BSZ]
+     # load data
+     split = get_dataset(prev_examples=prev_examples,
+                         cur_raw_texts=cur_raw_texts,
+                         max_model_len=max_model_len,
+                         max_new_tokens=max_new_tokens)
+     prev_examples = run(split, llm, sampling_params)
+
+ # 2. templify the data for subsequent pre-training
+ instruction_augmented_texts = []
+ for idx, entry in enumerate(prev_examples):
+     texts = cook_pt_entries(read_collection=entry, random_seed=idx+12345)
+     # change the random seed for each entry for diversity
+     instruction_augmented_texts.extend(texts)
+
+ # 3. print out the results
+ for idx, text in enumerate(instruction_augmented_texts):
+     print(f'## Instruction-augmented Text {idx+1}\n{text}\n')
+
+ # Now you can use `instruction_augmented_texts` for pre-training!
  ```
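A minimal follow-on sketch: once `instruction_augmented_texts` has been produced by the block above, one simple way to persist it for a pre-training pipeline is JSONL; the output path and the `"text"` key are illustrative choices, not something the repo prescribes.

```python
import json

# Continuing from the block above: write one JSON object per line.
with open("instruction_augmented_corpus.jsonl", "w", encoding="utf-8") as f:
    for text in instruction_augmented_texts:
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```
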
 
  ## Citation