{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cbe0f126",
   "metadata": {
    "jupyter": {
     "source_hidden": false
    }
   },
   "source": [
    "# Introducing Genstruct\n",
    "Generating high-quality synthetic instruction data is an important challenge. Standard approaches rely heavily on in-context learning and prompting of large language models to generate instruction pairs. This has limitations in terms of quality, diversity, and lack of explicit reasoning.\n",
    "\n",
    "Two previous methods aimed to improve upon this naive prompting approach:\n",
    "- Retrieval-augmented generation (RAG) pipelines convert passages from sources like Wikipedia into instructional pairs.\n",
    "- [Ada-Instruct](https://arxiv.org/abs/2310.04484) instead trains a custom model to generate instructions, rather than relying on prompting. This improves quality and diversity compared to prompting alone. Further, the authors of the Ada-Instruct paper found that training could be performed with as few as 10 examples.\n",
    "\n",
    "Genstruct is a new method that combines and extends these previous approaches. Like Ada-instruct, it is a custom trained model rather than relying on prompting. However, Ada-Instruct relies heavily on ungrounded generation, which can lead to hallucinations.  To mitigate this, Genstruct generates instructions based upon a user-provided context, like RAG methods.\n",
    "\n",
    "Additionally, Genstruct goes beyond prior work by focusing on the generation of complex questions and multi-step reasoning for each generated instruction pair, rather than just direct questions and responses."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf417800",
   "metadata": {
    "jupyter": {
     "source_hidden": false
    }
   },
   "source": [
    "## Generating instruction pairs\n",
    "Ada-Instruct is trained based on Mistral. Specifically, it is trained over the [MetaMath-Mistral-7B](meta-math/MetaMath-Mistral-7B) model, in order to improve reasoning with math-heavy topcs.\n",
    "\n",
    "Like any other Mistral model, it can be imported from Huggingface Hub as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "7492d81a",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false,
     "source_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/user/.conda/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "Loading checkpoint shards:  33%|β–ˆβ–ˆβ–ˆβ–Ž      | 1/3 [00:01<00:03,  1.75s/it]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "Loading checkpoint shards:  67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 2/3 [00:03<00:01,  1.72s/it]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:04<00:00,  1.64s/it]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:04<00:00,  1.66s/it]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "\n",
    "MODEL_NAME = 'NousResearch/Genstruct-7B'\n",
    "\n",
    "model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map='cuda', load_in_8bit=True)\n",
    "tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)"
   ]
  },
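  {
   "cell_type": "markdown",
   "id": "3f9a2b1c",
   "metadata": {
    "jupyter": {
     "source_hidden": false
    }
   },
   "source": [
    "The deprecation warning above comes from passing `load_in_8bit` directly. On newer transformers versions, the equivalent call passes a `BitsAndBytesConfig` instead; a minimal sketch, assuming the `bitsandbytes` package is installed:\n",
    "```\n",
    "from transformers import AutoModelForCausalLM, BitsAndBytesConfig\n",
    "\n",
    "model = AutoModelForCausalLM.from_pretrained(\n",
    "    MODEL_NAME,\n",
    "    device_map='cuda',\n",
    "    quantization_config=BitsAndBytesConfig(load_in_8bit=True),\n",
    ")\n",
    "```"
   ]
  },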
  {
   "cell_type": "markdown",
   "id": "34f73db8",
   "metadata": {
    "jupyter": {
     "source_hidden": false
    }
   },
   "source": [
    "Genstruct works by generating instructions and answers from a user-provided context and title. It utilizes a custom prompt format, as in the following example:\n",
    "```\n",
    "[[[Title]]] p-value\n",
    "[[[Content]]] The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.\n",
    "\n",
    "The following is an interaction between a user and an AI assistant that is related to the above text.\n",
    "\n",
    "[[[User]]]\n",
    "```\n",
    "\n",
    "The model then completes from `[[[User]]]`, generating an instruction and a response.\n",
    "\n",
    "\n",
    "To simplify its use, the Genstruct tokenizer includes a 'chat template'. It accepts a list containing a single dict, with members 'title' and 'content' - for the title and content of the context to generate from:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "2617d9f5",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false,
     "source_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "msg =[{\n",
    "    'title': 'p-value',\n",
    "    'content': \"The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.\"\n",
    "}]\n",
    "inputs = tokenizer.apply_chat_template(msg, return_tensors='pt').cuda()"
   ]
  },
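  {
   "cell_type": "markdown",
   "id": "8c4d5e6f",
   "metadata": {
    "jupyter": {
     "source_hidden": false
    }
   },
   "source": [
    "To check what the template produces, it can be rendered as plain text rather than token ids; the result should match the prompt format shown above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b7e2a910",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false,
     "source_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# Render the chat template as a string (instead of token ids) to inspect\n",
    "# the prompt that the model will complete from.\n",
    "print(tokenizer.apply_chat_template(msg, tokenize=False))"
   ]
  },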
  {
   "cell_type": "markdown",
   "id": "997b8d92",
   "metadata": {
    "jupyter": {
     "source_hidden": false
    }
   },
   "source": [
    "Generation can then be performed with `model.generate()`, as follows (or with vllm or whaatever other pipeline you prefer):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "1429b6bc",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false,
     "source_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[[Title]]] p-value\n",
      "[[[Content]]] The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.\n",
      "\n",
      "The following is an interaction between a user and an AI assistant that is related to the above text.\n",
      "\n",
      "[[[User]]]  The share prices of two rival companies, A and B, have been monitored for many years, allowing a large number of data points for rigorous statistical analysis. This year's summer, which is known to affect share prices, had two distinct sub-periods, A and B, which were roughly equal in length. The company 'A's share price, during sub-period A, was found to be 2.35, using a test statistic T. The same statistic, for sub-period B, was 1.45.\n",
      "Which company, A or B, had a smaller p-value?\n",
      "[[[Assistant]]] In the context of statistical analysis, the p-value is a key component of null hypothesis testing. It signifies the probability of obtaining results equal to or more extreme than the observed value of the statistic, under the assumption that the null hypothesis is true. In other words, the lower the p-value is, the less likely the result is if the null hypothesis were true.\n",
      "\n",
      "In this case, we are comparing the share prices of companies A and B over two distinct sub-periods, A and B, during which the summer had a notable impact on share prices. Using a test statistic T, we found that for sub-period A, the value was 2.35, and for sub-period B, it was 1.45.\n",
      "\n",
      "When we calculate the p-value for these results, assuming the null hypothesis is true, if we were to get a result as extreme as 2.35 (or more extreme), the probability of that occurring is lower for company A than it is for company B and the statistic 1.45. This means that, all other things being equal, the evidence provided by the data is stronger against the null hypothesis for company A than it is for company B.\n",
      "\n",
      "Therefore, company A would have a smaller p-value than company B, which means that, based on the data, we would have a lower probability of getting the observed result of 2.35 for company A if the null hypothesis were true. Consequently, the result for company A is a stronger indicator that it's time to reject the null hypothesis.\n",
      "\n",
      "So, the company with the smaller p-value is A.\n"
     ]
    }
   ],
   "source": [
    "gen = tokenizer.decode(model.generate(inputs, max_new_tokens=512)[0]).split(tokenizer.eos_token)[0]\n",
    "print(gen)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0848af10",
   "metadata": {
    "jupyter": {
     "source_hidden": false
    }
   },
   "source": [
    "Note that the model is optimized for single-paragraph extracts from Wikipedia articles. You may have varying luck with other input types.\n",
    "\n",
    "## Filtering outputs using a reward model\n",
    "The model may occasionally generate incorrect or improperly formatted output - the likelihood of this can be reduced with clever sampling methods, such as rejection sampling using a reward model, or even simple regex filtering.\n",
    "\n",
    "For instance, we might consider `OpenAssistant/reward-model-deberta-v3-large-v2` as a reward model, and perform best-of-n sampling:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "a93868ac",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false,
     "source_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[[Title]]] p-value\n",
      "[[[Content]]] The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.\n",
      "\n",
      "The following is an interaction between a user and an AI assistant that is related to the above text.\n",
      "\n",
      "[[[User]]]  Two medical procedures were compared by flipping 2 coins, procedure A assumed to be better and so it was labeled head, while procedure B was labeled as tail for a flip. The coins where then flipped 25 times, with the following results:[{'Tails', 12}, {'Heads', 13}]\n",
      "\n",
      "Which procedure had better results with statistical significance?\n",
      "[[[Assistant]]] The statistical significance of the outcomes between the two procedures can be assessed using the p-value, which represents the probability of obtaining results as extreme as, or more extreme than, those observed, if the null hypothesis is true.\n",
      "\n",
      "In this case, let's assume that the null hypothesis would suggest that there is no difference between the two procedures, so each one should result in heads or tails with approximately equal probability (assuming fair coins).\n",
      "\n",
      "To calculate the p-value, we can use the statistic T, which in this context could be any relevant statistic calculated from the data, such as the difference in the number of flips resulting in heads or tails. We want to find the p-value corresponding to the observed value of T when the data is Tails = 12, Heads\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "from transformers import AutoModelForSequenceClassification\n",
    "\n",
    "N = 4\n",
    "\n",
    "rm_tokenizer = AutoTokenizer.from_pretrained('OpenAssistant/reward-model-deberta-v3-large-v2')\n",
    "rm_model = AutoModelForSequenceClassification.from_pretrained('OpenAssistant/reward-model-deberta-v3-large-v2', torch_dtype=torch.bfloat16)\n",
    "\n",
    "def extract_pair(resp):\n",
    "    response = resp.split('[[[Content]]]')[1]\n",
    "    inst, resp = resp.split('[[[User]]]')[:2]\n",
    "    return inst.strip(), resp.strip()\n",
    "    \n",
    "def score(resp):\n",
    "    inst, resp = extract_pair(resp.split(tokenizer.eos_token)[0])\n",
    "    \n",
    "    with torch.no_grad():\n",
    "        inputs = rm_tokenizer(inst, resp, return_tensors='pt')\n",
    "        score = float(rm_model(**inputs).logits[0].cpu())\n",
    "        return score\n",
    "\n",
    "gens = tokenizer.batch_decode(model.generate(inputs, max_new_tokens=256, num_return_sequences=N, do_sample=True))\n",
    "print(max(gens, key=score))"
   ]
  }
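,
  {
   "cell_type": "markdown",
   "id": "d41c0f2e",
   "metadata": {
    "jupyter": {
     "source_hidden": false
    }
   },
   "source": [
    "As noted above, simple regex filtering can also catch improperly formatted completions, either before or instead of reward-model scoring. The following is a minimal sketch; it only assumes that a well-formed completion contains a `[[[User]]]` block followed by a `[[[Assistant]]]` block:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a5b4c3d",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false,
     "source_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# A well-formed completion contains a user turn followed by an assistant turn.\n",
    "# This is a minimal format check, not an exhaustive validity test.\n",
    "pair_pattern = re.compile(\n",
    "    re.escape('[[[User]]]') + '(.+?)' + re.escape('[[[Assistant]]]') + '(.+)',\n",
    "    re.DOTALL,\n",
    ")\n",
    "\n",
    "def is_well_formed(resp):\n",
    "    return pair_pattern.search(resp) is not None\n",
    "\n",
    "# Keep only the generations that pass the format check.\n",
    "well_formed = [g for g in gens if is_well_formed(g)]\n",
    "print(f'{len(well_formed)}/{len(gens)} generations are well-formed')"
   ]
  }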
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}