Text Generation
Transformers
Safetensors
English
mistral
text-generation-inference
Inference Endpoints
instruction-pretrain committed on
Commit 1f19d3c
1 Parent(s): ec2d76f

Update README.md

Files changed (1)
  1. README.md +52 -82
README.md CHANGED
@@ -26,6 +26,8 @@ We explore supervised multitask pre-training by proposing ***Instruction Pre-Tra
  - Domain-Specific Models Pre-Trained from Llama3-8B:
  - [Finance-Llama3-8B](https://huggingface.co/instruction-pretrain/finance-Llama3-8B)
  - [Biomedicine-Llama3-8B](https://huggingface.co/instruction-pretrain/medicine-Llama3-8B)

  ## Synthesize Instruction-Response Pairs to Augment Any Raw Corpora
@@ -92,105 +94,73 @@ for index, pair in enumerate(instruction_response_pairs):
  print(f'## Instruction {index + 1}:\n{pair["Q"]}\n## Response {index + 1}:\n{pair["A"]}\n')
  ```

- ### Advanced Usage: Synthesize Few-shot Examples
- A one-shot example consists of a piece of raw text followed by its instruction-response pairs. You can conduct multi-round inference to synthesize a few-shot example: the instruction-response pairs of different raw texts share the same pattern.

- To accelerate synthesis, we use the [vLLM framework](https://github.com/vllm-project/vllm?tab=readme-ov-file):
- <details>
- <summary> Click to expand </summary>

- 1. Set up dependencies:
  Install vLLM with pip or from [source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):

  ```bash
  pip install vllm
  ```

- 2. Synthesize:
- ```python
- from vllm import LLM, SamplingParams
-
- # Put your list of raw texts here,
- # a list of M raw texts can be converted into an M-shot example:
- text_list = [
- "Genetically and medically susceptible workers.\nThe likelihood of an individual becoming ill from a hazardous material or condition is strongly influenced by both their genetic makeup and their underlying state of health. Although the past decade has seen great advances in understanding human variation in health and genetic polymorphisms and in the diagnosis and treatment of disease, much less progress has been made in effectively using this information to protect worker health. Scientific evidence for increased susceptibility often is weak and rarely satisfies legal thresholds for sufficient risk to warrant exclusion from a particular job. When public safety is a major concern, many legally mandated exclusions are not well justified. Medical opinions about fitness to work should be based upon a systematic and credible analysis of the condition, its relationship to ability and risk for a particular job, and knowledge of possible accommodations. Conclusions should reflect the limitations of scientific knowledge and guidance from antidiscrimination legislation.",
- "Exclusive Breastfeeding for Twin Babies and Its Influencing Factors: A Study in East Java, Indonesia.\nThis study aimed to identify the factors that influence the success of exclusive breastfeeding in twins. This cross-sectional study was conducted on 184 mothers who had twins aged 6-23 months in Malang Raya, East Java, Indonesia and used the consecutive sampling technique. The data was collected through distributing questionnaires containing questions related to knowledge about exclusive breastfeeding, breastfeeding self-efficacy, and the support of family and certified health workers. Multinomial regression statistical test results show that the most influential factor for the success of exclusive breastfeeding with twins was breastfeeding self-efficacy (OR 0.111; 95% CI 0.033-0.387). A high level of breastfeeding self-efficacy can increase a mother's confidence to be able to provide exclusive breastfeeding for twins. This study suggests that nurses can provide breastfeeding counselling to improve breastfeeding self-efficacy."]
-
- # Create a sampling params object.
- sampling_params = SamplingParams(temperature=0, max_tokens=400)
-
- # Load the model and tokenizer
- llm = LLM(model="instruction-pretrain/instruction-synthesizer", max_model_len=4096)
-
- # Templates (please do NOT change them)
- context_template = ' <CON> {context} </CON>'
- QA_template = '<QUE> {question} <ANS> {answer} </END>'
- delimiter = '\n\n'
- bos_token = '<s>'
- eos_token = '</s>'
-
- def cook_context(raw_context):
-     """Format the context."""
-     return bos_token + context_template.replace('{context}', raw_context) + delimiter
-
- def cook_instruction_response_pairs(QA_list):
-     """Format downstream instruction(Q)-response(A) pairs."""
-     ins_res_list = []
-     for qa_entry in QA_list:
-         qa = QA_template.replace('{question}', qa_entry['Q']).replace('{answer}', qa_entry['A'])
-         ins_res_list.append(qa)
-     return delimiter.join(ins_res_list) + eos_token
-
- def parse_pred(pred):
-     """Extract the list of instruction-response pairs from the prediction"""
-     QA_str_list = pred.split('</END>')
-     if not pred.endswith('</END>'):
-         QA_str_list = QA_str_list[:-1]
-
-     QA_list = []
-     raw_questions = []
-     for QA_str in QA_str_list:
-         try:
-             assert len(QA_str.split('<ANS>')) == 2, f'invalid QA string: {QA_str}'
-             Q_str, A_str = QA_str.split('<ANS>')
-             Q_str, A_str = Q_str.strip(), A_str.strip()
-             assert Q_str.startswith('<QUE>'), f'invalid question string: {Q_str} in QA_str: {QA_str}'
-             assert len(A_str) > 0, f'invalid answer string in QA_str: {QA_str}'
-             Q_str = Q_str.replace('<QUE>', '').strip()
-             assert Q_str.lower() not in raw_questions, f'duplicate question: {Q_str}'
-             QA_list.append({'Q': Q_str, 'A': A_str})
-             raw_questions.append(Q_str.lower())
-         except:
-             pass
-
-     return QA_list
-
- def get_instruction_response_pairs(context):
-     '''Prompt the synthesizer to generate instruction-response pairs based on the given context'''
-     outputs = llm.generate(context, sampling_params, use_tqdm=False)
-     pred = outputs[0].outputs[0].text
-     return parse_pred(pred)
-
- # Process each text and generate instruction-response pairs in multi-round inference:
- previous_examples = []
- for cur_text in text_list:
-     # Prepend raw texts and instruction-response pairs of previous examples to the current text
-     context = ''
-     for previous_example in previous_examples:
-         context += cook_context(previous_example['text']) + cook_instruction_response_pairs(previous_example['instruction_response_pairs'])
-     context += cook_context(cur_text)
-
-     # Get the generated instruction-response pairs
-     instruction_response_pairs = get_instruction_response_pairs(context)
-     previous_examples.append({'text': cur_text, 'instruction_response_pairs': instruction_response_pairs})
-
- # Concatenate the raw texts and instruction-response pairs of M rounds to constitute an M-shot example
- for example in previous_examples:
-     print(f'# Raw Text:\n{example["text"]}\n')
-     for index, pair in enumerate(example['instruction_response_pairs']):
-         print(f'## Instruction {index + 1}:\n{pair["Q"]}\n## Response {index + 1}:\n{pair["A"]}\n')
  ```
- </details>

  ## Citation
 
  - Domain-Specific Models Pre-Trained from Llama3-8B:
  - [Finance-Llama3-8B](https://huggingface.co/instruction-pretrain/finance-Llama3-8B)
  - [Biomedicine-Llama3-8B](https://huggingface.co/instruction-pretrain/medicine-Llama3-8B)
+ - General Instruction-Augmented Corpora: [general-instruction-augmented-corpora](https://huggingface.co/datasets/instruction-pretrain/general-instruction-augmented-corpora)
+ - Domain-Specific Instruction-Augmented Corpora (no finance data to avoid ethical issues): [medicine-instruction-augmented-corpora](https://huggingface.co/datasets/instruction-pretrain/medicine-instruction-augmented-corpora)

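A minimal sketch for pulling one of these instruction-augmented corpora down for inspection, assuming the standard Hugging Face `datasets` API; the `train` split name and the file layout are assumptions, so check the dataset card before relying on them:

```python
from datasets import load_dataset

# Assumption: the corpus exposes a "train" split; if the repo is organized as raw
# JSONL shards instead, pass data_files=... as described on the dataset card.
corpus = load_dataset("instruction-pretrain/general-instruction-augmented-corpora", split="train")

print(corpus)      # column names and number of rows
print(corpus[0])   # one instruction-augmented record
```
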
  ## Synthesize Instruction-Response Pairs to Augment Any Raw Corpora
 
  print(f'## Instruction {index + 1}:\n{pair["Q"]}\n## Response {index + 1}:\n{pair["A"]}\n')
  ```

+ ### Advanced Usage: Convert Raw Corpora into Instruction-Augmented Corpora at Scale
+ 1. Set up dependencies:

+ ```bash
+ git clone https://github.com/microsoft/LMOps.git
+ cd LMOps/instruction_pretrain
+ ```

  Install vLLM with pip or from [source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):

  ```bash
  pip install vllm
  ```

+ 2. Synthesize and Templify Few-shot Examples for Pre-Training

+ A one-shot example consists of a piece of raw text followed by its instruction-response pairs. We conduct multi-round inference to synthesize few-shot examples: the instruction-response pairs of different raw texts share the same pattern.
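For intuition, the sketch below (schematic, not the repo's code) lays out how one cooked one-shot example looks, using the `<CON>`/`<QUE>`/`<ANS>` tags defined by the synthesizer templates earlier on this card; an M-shot example concatenates M such blocks:

```python
# Schematic only: the placeholders in braces stand for real raw text and synthesized pairs.
one_shot_example = (
    "<s> <CON> {raw text} </CON>\n\n"
    "<QUE> {instruction 1} <ANS> {response 1} </END>\n\n"
    "<QUE> {instruction 2} <ANS> {response 2} </END></s>"
)
print(one_shot_example)
```
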
 
+ Suppose there are N pieces of raw text in the corpora, and you would like to convert them into M-shot examples:

+ ```python
+ from vllm import LLM, SamplingParams
+ from utils.read_compre import get_dataset, cook_pt_entries, run
+
+ # Put your list of raw texts here
+ raw_texts = [
+ "Genetically and medically susceptible workers.\nThe likelihood of an individual becoming ill from a hazardous material or condition is strongly influenced by both their genetic makeup and their underlying state of health. Although the past decade has seen great advances in understanding human variation in health and genetic polymorphisms and in the diagnosis and treatment of disease, much less progress has been made in effectively using this information to protect worker health. Scientific evidence for increased susceptibility often is weak and rarely satisfies legal thresholds for sufficient risk to warrant exclusion from a particular job. When public safety is a major concern, many legally mandated exclusions are not well justified. Medical opinions about fitness to work should be based upon a systematic and credible analysis of the condition, its relationship to ability and risk for a particular job, and knowledge of possible accommodations. Conclusions should reflect the limitations of scientific knowledge and guidance from antidiscrimination legislation.",
+ "Exclusive Breastfeeding for Twin Babies and Its Influencing Factors: A Study in East Java, Indonesia.\nThis study aimed to identify the factors that influence the success of exclusive breastfeeding in twins. This cross-sectional study was conducted on 184 mothers who had twins aged 6-23 months in Malang Raya, East Java, Indonesia and used the consecutive sampling technique. The data was collected through distributing questionnaires containing questions related to knowledge about exclusive breastfeeding, breastfeeding self-efficacy, and the support of family and certified health workers. Multinomial regression statistical test results show that the most influential factor for the success of exclusive breastfeeding with twins was breastfeeding self-efficacy (OR 0.111; 95% CI 0.033-0.387). A high level of breastfeeding self-efficacy can increase a mother's confidence to be able to provide exclusive breastfeeding for twins. This study suggests that nurses can provide breastfeeding counselling to improve breastfeeding self-efficacy."]
 
+ N = len(raw_texts) # Number of raw texts
+ M = 2 # M-shot example
+ max_model_len = 4096 # max sequence length of the LM you intend to pre-train
+ max_new_tokens = 400 # max number of tokens for the augmented instruction-response pairs
+
+ # Create a sampling params object.
+ sampling_params = SamplingParams(temperature=0, max_tokens=max_new_tokens)
+
+ # Load the model and tokenizer
+ llm = LLM(model="instruction-pretrain/instruction-synthesizer", max_model_len=max_model_len)
+
+ # 1. multi-round inference to get the prediction
+ prev_examples = []
+ BSZ = (N+M-1)//M
+ for round in range(M):
+     cur_raw_texts = raw_texts[round*BSZ: (round+1)*BSZ]
+     # load data
+     split = get_dataset(prev_examples=prev_examples,
+                         cur_raw_texts=cur_raw_texts,
+                         max_model_len=max_model_len,
+                         max_new_tokens=max_new_tokens)
+     prev_examples = run(split, llm, sampling_params)
+
+ # 2. templify the data for subsequent pre-training
+ instruction_augmented_texts = []
+ for idx, entry in enumerate(prev_examples):
+     texts = cook_pt_entries(read_collection=entry, random_seed=idx+12345)
+     # change the random seed for each entry for diversity
+     instruction_augmented_texts.extend(texts)
+
+ # 3. print out the results
+ for idx, text in enumerate(instruction_augmented_texts):
+     print(f'## Instruction-augmented Text {idx+1}\n{text}\n')
+
+ # Now you can use `instruction_augmented_texts` for pre-training!
  ```
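A minimal follow-on sketch: once `instruction_augmented_texts` has been produced by the block above, one simple way to persist it for a pre-training pipeline is JSONL; the output path and the `"text"` key are illustrative choices, not something the repo prescribes.

```python
import json

# Continuing from the block above: write one JSON object per line.
with open("instruction_augmented_corpus.jsonl", "w", encoding="utf-8") as f:
    for text in instruction_augmented_texts:
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```
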
 
  ## Citation