{
"cells": [
{
"cell_type": "markdown",
"id": "cbe0f126",
"metadata": {
"jupyter": {
"source_hidden": false
}
},
"source": [
"# Introducing Genstruct\n",
"Generating high-quality synthetic instruction data is an important challenge. Standard approaches rely heavily on in-context learning and prompting of large language models to generate instruction pairs. This has limitations in terms of quality, diversity, and lack of explicit reasoning.\n",
"\n",
"Two previous methods aimed to improve upon this naive prompting approach:\n",
"- Retrieval-augmented generation (RAG) pipelines convert passages from sources like Wikipedia into instructional pairs.\n",
"- [Ada-Instruct](https://arxiv.org/abs/2310.04484) instead trains a custom model to generate instructions, rather than relying on prompting. This improves quality and diversity compared to prompting alone. Further, the authors of the Ada-Instruct paper found that training could be performed with as few as 10 examples.\n",
"\n",
"Genstruct is a new method that combines and extends these previous approaches. Like Ada-instruct, it is a custom trained model rather than relying on prompting. However, Ada-Instruct relies heavily on ungrounded generation, which can lead to hallucinations. To mitigate this, Genstruct generates instructions based upon a user-provided context, like RAG methods.\n",
"\n",
"Additionally, Genstruct goes beyond prior work by focusing on the generation of complex questions and multi-step reasoning for each generated instruction pair, rather than just direct questions and responses."
]
},
{
"cell_type": "markdown",
"id": "bf417800",
"metadata": {
"jupyter": {
"source_hidden": false
}
},
"source": [
"## Generating instruction pairs\n",
"Ada-Instruct is trained based on Mistral. Specifically, it is trained over the [MetaMath-Mistral-7B](meta-math/MetaMath-Mistral-7B) model, in order to improve reasoning with math-heavy topcs.\n",
"\n",
"Like any other Mistral model, it can be imported from Huggingface Hub as follows:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "7492d81a",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/user/.conda/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Loading checkpoint shards: 33%|ββββ | 1/3 [00:01<00:03, 1.75s/it]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Loading checkpoint shards: 67%|βββββββ | 2/3 [00:03<00:01, 1.72s/it]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Loading checkpoint shards: 100%|ββββββββββ| 3/3 [00:04<00:00, 1.64s/it]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Loading checkpoint shards: 100%|ββββββββββ| 3/3 [00:04<00:00, 1.66s/it]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"from transformers import AutoModelForCausalLM, AutoTokenizer\n",
"\n",
"MODEL_NAME = 'NousResearch/Genstruct-7B'\n",
"\n",
"model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map='cuda', load_in_8bit=True)\n",
"tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)"
]
},
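{
"cell_type": "markdown",
"id": "3f9a2b10",
"metadata": {
"jupyter": {
"source_hidden": false
}
},
"source": [
"As the warning above notes, the `load_in_8bit` argument is deprecated in newer versions of `transformers`. A minimal equivalent sketch using `BitsAndBytesConfig` (assuming `bitsandbytes` is installed) would look roughly like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5c7d8e21",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
}
},
"outputs": [],
"source": [
"from transformers import BitsAndBytesConfig\n",
"\n",
"# Equivalent 8-bit loading via an explicit quantization config (sketch; requires bitsandbytes)\n",
"quant_config = BitsAndBytesConfig(load_in_8bit=True)\n",
"model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map='cuda', quantization_config=quant_config)"
]
},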
{
"cell_type": "markdown",
"id": "34f73db8",
"metadata": {
"jupyter": {
"source_hidden": false
}
},
"source": [
"Genstruct works by generating instructions and answers from a user-provided context and title. It utilizes a custom prompt format, as in the following example:\n",
"```\n",
"[[[Title]]] p-value\n",
"[[[Content]]] The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.\n",
"\n",
"The following is an interaction between a user and an AI assistant that is related to the above text.\n",
"\n",
"[[[User]]]\n",
"```\n",
"\n",
"The model then completes from `[[[User]]]`, generating an instruction and a response.\n",
"\n",
"\n",
"To simplify its use, the Genstruct tokenizer includes a 'chat template'. It accepts a list containing a single dict, with members 'title' and 'content' - for the title and content of the context to generate from:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "2617d9f5",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
}
},
"outputs": [],
"source": [
"msg =[{\n",
" 'title': 'p-value',\n",
" 'content': \"The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.\"\n",
"}]\n",
"inputs = tokenizer.apply_chat_template(msg, return_tensors='pt').cuda()"
]
},
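{
"cell_type": "markdown",
"id": "9e1f2a3b",
"metadata": {
"jupyter": {
"source_hidden": false
}
},
"source": [
"As a quick sanity check (not required for generation), the templated token ids can be decoded back to text to confirm they match the prompt format shown above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7b8c9d0e",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
}
},
"outputs": [],
"source": [
"# Decode the templated input to verify it matches the documented [[[Title]]]/[[[Content]]] format\n",
"print(tokenizer.decode(inputs[0].tolist(), skip_special_tokens=True))"
]
},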
{
"cell_type": "markdown",
"id": "997b8d92",
"metadata": {
"jupyter": {
"source_hidden": false
}
},
"source": [
"Generation can then be performed with `model.generate()`, as follows (or with vllm or whaatever other pipeline you prefer):"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "1429b6bc",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[[Title]]] p-value\n",
"[[[Content]]] The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.\n",
"\n",
"The following is an interaction between a user and an AI assistant that is related to the above text.\n",
"\n",
"[[[User]]] The share prices of two rival companies, A and B, have been monitored for many years, allowing a large number of data points for rigorous statistical analysis. This year's summer, which is known to affect share prices, had two distinct sub-periods, A and B, which were roughly equal in length. The company 'A's share price, during sub-period A, was found to be 2.35, using a test statistic T. The same statistic, for sub-period B, was 1.45.\n",
"Which company, A or B, had a smaller p-value?\n",
"[[[Assistant]]] In the context of statistical analysis, the p-value is a key component of null hypothesis testing. It signifies the probability of obtaining results equal to or more extreme than the observed value of the statistic, under the assumption that the null hypothesis is true. In other words, the lower the p-value is, the less likely the result is if the null hypothesis were true.\n",
"\n",
"In this case, we are comparing the share prices of companies A and B over two distinct sub-periods, A and B, during which the summer had a notable impact on share prices. Using a test statistic T, we found that for sub-period A, the value was 2.35, and for sub-period B, it was 1.45.\n",
"\n",
"When we calculate the p-value for these results, assuming the null hypothesis is true, if we were to get a result as extreme as 2.35 (or more extreme), the probability of that occurring is lower for company A than it is for company B and the statistic 1.45. This means that, all other things being equal, the evidence provided by the data is stronger against the null hypothesis for company A than it is for company B.\n",
"\n",
"Therefore, company A would have a smaller p-value than company B, which means that, based on the data, we would have a lower probability of getting the observed result of 2.35 for company A if the null hypothesis were true. Consequently, the result for company A is a stronger indicator that it's time to reject the null hypothesis.\n",
"\n",
"So, the company with the smaller p-value is A.\n"
]
}
],
"source": [
"gen = tokenizer.decode(model.generate(inputs, max_new_tokens=512)[0]).split(tokenizer.eos_token)[0]\n",
"print(gen)"
]
},
{
"cell_type": "markdown",
"id": "0848af10",
"metadata": {
"jupyter": {
"source_hidden": false
}
},
"source": [
"Note that the model is optimized for single-paragraph extracts from Wikipedia articles. You may have varying luck with other input types.\n",
"\n",
"## Filtering outputs using a reward model\n",
"The model may occasionally generate incorrect or improperly formatted output - the likelihood of this can be reduced with clever sampling methods, such as rejection sampling using a reward model, or even simple regex filtering.\n",
"\n",
"For instance, we might consider `OpenAssistant/reward-model-deberta-v3-large-v2` as a reward model, and perform best-of-n sampling:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a93868ac",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[[Title]]] p-value\n",
"[[[Content]]] The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.\n",
"\n",
"The following is an interaction between a user and an AI assistant that is related to the above text.\n",
"\n",
"[[[User]]] Two medical procedures were compared by flipping 2 coins, procedure A assumed to be better and so it was labeled head, while procedure B was labeled as tail for a flip. The coins where then flipped 25 times, with the following results:[{'Tails', 12}, {'Heads', 13}]\n",
"\n",
"Which procedure had better results with statistical significance?\n",
"[[[Assistant]]] The statistical significance of the outcomes between the two procedures can be assessed using the p-value, which represents the probability of obtaining results as extreme as, or more extreme than, those observed, if the null hypothesis is true.\n",
"\n",
"In this case, let's assume that the null hypothesis would suggest that there is no difference between the two procedures, so each one should result in heads or tails with approximately equal probability (assuming fair coins).\n",
"\n",
"To calculate the p-value, we can use the statistic T, which in this context could be any relevant statistic calculated from the data, such as the difference in the number of flips resulting in heads or tails. We want to find the p-value corresponding to the observed value of T when the data is Tails = 12, Heads\n"
]
}
],
"source": [
"import torch\n",
"from transformers import AutoModelForSequenceClassification\n",
"\n",
"N = 4\n",
"\n",
"rm_tokenizer = AutoTokenizer.from_pretrained('OpenAssistant/reward-model-deberta-v3-large-v2')\n",
"rm_model = AutoModelForSequenceClassification.from_pretrained('OpenAssistant/reward-model-deberta-v3-large-v2', torch_dtype=torch.bfloat16)\n",
"\n",
"def extract_pair(resp):\n",
" response = resp.split('[[[Content]]]')[1]\n",
" inst, resp = resp.split('[[[User]]]')[:2]\n",
" return inst.strip(), resp.strip()\n",
" \n",
"def score(resp):\n",
" inst, resp = extract_pair(resp.split(tokenizer.eos_token)[0])\n",
" \n",
" with torch.no_grad():\n",
" inputs = rm_tokenizer(inst, resp, return_tensors='pt')\n",
" score = float(rm_model(**inputs).logits[0].cpu())\n",
" return score\n",
"\n",
"gens = tokenizer.batch_decode(model.generate(inputs, max_new_tokens=256, num_return_sequences=N, do_sample=True))\n",
"print(max(gens, key=score))"
]
}
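,
{
"cell_type": "markdown",
"id": "4d5e6f70",
"metadata": {
"jupyter": {
"source_hidden": false
}
},
"source": [
"For the simple regex filtering mentioned above, a minimal sketch is shown below. The `is_well_formed` helper is hypothetical (not part of Genstruct or `transformers`); it merely checks that a generation contains a `[[[User]]]` block followed by a non-empty `[[[Assistant]]]` block:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2a3b4c5d",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
}
},
"outputs": [],
"source": [
"import re\n",
"\n",
"def is_well_formed(resp):\n",
"    # Drop everything after the first EOS token\n",
"    body = resp.split(tokenizer.eos_token)[0]\n",
"    # Require a [[[User]]] block followed by a non-empty [[[Assistant]]] block\n",
"    return re.search(r'\\[\\[\\[User\\]\\]\\][\\s\\S]+\\[\\[\\[Assistant\\]\\]\\][\\s\\S]+', body) is not None\n",
"\n",
"filtered = [g for g in gens if is_well_formed(g)]\n",
"print(f'{len(filtered)}/{len(gens)} generations passed the format check')"
]
}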
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}