{
"cells": [
{
"cell_type": "markdown",
"id": "cbe0f126",
"metadata": {
"jupyter": {
"source_hidden": false
}
},
"source": [
"# Introducing Genstruct\n",
"Generating high-quality synthetic instruction data is an important challenge. Standard approaches rely heavily on in-context learning and prompting of large language models to generate instruction pairs. This has limitations in terms of quality, diversity, and lack of explicit reasoning.\n",
"\n",
"Two previous methods aimed to improve upon this naive prompting approach:\n",
"- Retrieval-augmented generation (RAG) pipelines convert passages from sources like Wikipedia into instructional pairs.\n",
"- [Ada-Instruct](https://arxiv.org/abs/2310.04484) instead trains a custom model to generate instructions, rather than relying on prompting. This improves quality and diversity compared to prompting alone. Further, the authors of the Ada-Instruct paper found that training could be performed with as few as 10 examples.\n",
"\n",
"Genstruct is a new method that combines and extends these previous approaches. Like Ada-instruct, it is a custom trained model rather than relying on prompting. However, Ada-Instruct relies heavily on ungrounded generation, which can lead to hallucinations. To mitigate this, Genstruct generates instructions based upon a user-provided context, like RAG methods.\n",
"\n",
"Additionally, Genstruct goes beyond prior work by focusing on the generation of complex questions and multi-step reasoning for each generated instruction pair, rather than just direct questions and responses."
]
},
{
"cell_type": "markdown",
"id": "bf417800",
"metadata": {
"jupyter": {
"source_hidden": false
}
},
"source": [
"## Generating instruction pairs\n",
"Ada-Instruct is trained based on Mistral. Specifically, it is trained over the [MetaMath-Mistral-7B](meta-math/MetaMath-Mistral-7B) model, in order to improve reasoning with math-heavy topcs.\n",
"\n",
"Like any other Mistral model, it can be imported from Huggingface Hub as follows:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "7492d81a",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/user/.conda/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Loading checkpoint shards: 33%|ββββ | 1/3 [00:01<00:03, 1.75s/it]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Loading checkpoint shards: 67%|βββββββ | 2/3 [00:03<00:01, 1.72s/it]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Loading checkpoint shards: 100%|ββββββββββ| 3/3 [00:04<00:00, 1.64s/it]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Loading checkpoint shards: 100%|ββββββββββ| 3/3 [00:04<00:00, 1.66s/it]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"from transformers import AutoModelForCausalLM, AutoTokenizer\n",
"\n",
"MODEL_NAME = 'NousResearch/Genstruct-7B'\n",
"\n",
"model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map='cuda', load_in_8bit=True)\n",
"tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)"
]
},
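{
"cell_type": "markdown",
"id": "3f9a2b10",
"metadata": {
"jupyter": {
"source_hidden": false
}
},
"source": [
"As the warning above notes, the `load_in_8bit` argument is deprecated in newer versions of `transformers`. A minimal equivalent sketch using `BitsAndBytesConfig` (assuming `bitsandbytes` is installed) would look roughly like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5c7d8e21",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
}
},
"outputs": [],
"source": [
"from transformers import BitsAndBytesConfig\n",
"\n",
"# Equivalent 8-bit loading via an explicit quantization config (sketch; requires bitsandbytes)\n",
"quant_config = BitsAndBytesConfig(load_in_8bit=True)\n",
"model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map='cuda', quantization_config=quant_config)"
]
},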
{
"cell_type": "markdown",
"id": "34f73db8",
"metadata": {
"jupyter": {
"source_hidden": false
}
},
"source": [
"Genstruct works by generating instructions and answers from a user-provided context and title. It utilizes a custom prompt format, as in the following example:\n",
"```\n",
"[[[Title]]] p-value\n",
"[[[Content]]] The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.\n",
"\n",
"The following is an interaction between a user and an AI assistant that is related to the above text.\n",
"\n",
"[[[User]]]\n",
"```\n",
"\n",
"The model then completes from `[[[User]]]`, generating an instruction and a response.\n",
"\n",
"\n",
"To simplify its use, the Genstruct tokenizer includes a 'chat template'. It accepts a list containing a single dict, with members 'title' and 'content' - for the title and content of the context to generate from:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "2617d9f5",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
}
},
"outputs": [],
"source": [
"msg =[{\n",
" 'title': 'p-value',\n",
" 'content': \"The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.\"\n",
"}]\n",
"inputs = tokenizer.apply_chat_template(msg, return_tensors='pt').cuda()"
]
},
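{
"cell_type": "markdown",
"id": "9e1f2a3b",
"metadata": {
"jupyter": {
"source_hidden": false
}
},
"source": [
"As a quick sanity check (not required for generation), the templated token ids can be decoded back to text to confirm they match the prompt format shown above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7b8c9d0e",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
}
},
"outputs": [],
"source": [
"# Decode the templated input to verify it matches the documented [[[Title]]]/[[[Content]]] format\n",
"print(tokenizer.decode(inputs[0].tolist(), skip_special_tokens=True))"
]
},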
{
"cell_type": "markdown",
"id": "997b8d92",
"metadata": {
"jupyter": {
"source_hidden": false
}
},
"source": [
"Generation can then be performed with `model.generate()`, as follows (or with vllm or whaatever other pipeline you prefer):"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "1429b6bc",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[[Title]]] p-value\n",
"[[[Content]]] The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.\n",
"\n",
"The following is an interaction between a user and an AI assistant that is related to the above text.\n",
"\n",
"[[[User]]] The share prices of two rival companies, A and B, have been monitored for many years, allowing a large number of data points for rigorous statistical analysis. This year's summer, which is known to affect share prices, had two distinct sub-periods, A and B, which were roughly equal in length. The company 'A's share price, during sub-period A, was found to be 2.35, using a test statistic T. The same statistic, for sub-period B, was 1.45.\n",
"Which company, A or B, had a smaller p-value?\n",
"[[[Assistant]]] In the context of statistical analysis, the p-value is a key component of null hypothesis testing. It signifies the probability of obtaining results equal to or more extreme than the observed value of the statistic, under the assumption that the null hypothesis is true. In other words, the lower the p-value is, the less likely the result is if the null hypothesis were true.\n",
"\n",
"In this case, we are comparing the share prices of companies A and B over two distinct sub-periods, A and B, during which the summer had a notable impact on share prices. Using a test statistic T, we found that for sub-period A, the value was 2.35, and for sub-period B, it was 1.45.\n",
"\n",
"When we calculate the p-value for these results, assuming the null hypothesis is true, if we were to get a result as extreme as 2.35 (or more extreme), the probability of that occurring is lower for company A than it is for company B and the statistic 1.45. This means that, all other things being equal, the evidence provided by the data is stronger against the null hypothesis for company A than it is for company B.\n",
"\n",
"Therefore, company A would have a smaller p-value than company B, which means that, based on the data, we would have a lower probability of getting the observed result of 2.35 for company A if the null hypothesis were true. Consequently, the result for company A is a stronger indicator that it's time to reject the null hypothesis.\n",
"\n",
"So, the company with the smaller p-value is A.\n"
]
}
],
"source": [
"gen = tokenizer.decode(model.generate(inputs, max_new_tokens=512)[0]).split(tokenizer.eos_token)[0]\n",
"print(gen)"
]
},
{
"cell_type": "markdown",
"id": "0848af10",
"metadata": {
"jupyter": {
"source_hidden": false
}
},
"source": [
"Note that the model is optimized for single-paragraph extracts from Wikipedia articles. You may have varying luck with other input types.\n",
"\n",
"## Filtering outputs using a reward model\n",
"The model may occasionally generate incorrect or improperly formatted output - the likelihood of this can be reduced with clever sampling methods, such as rejection sampling using a reward model, or even simple regex filtering.\n",
"\n",
"For instance, we might consider `OpenAssistant/reward-model-deberta-v3-large-v2` as a reward model, and perform best-of-n sampling:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a93868ac",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[[Title]]] p-value\n",
"[[[Content]]] The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.\n",
"\n",
"The following is an interaction between a user and an AI assistant that is related to the above text.\n",
"\n",
"[[[User]]] Two medical procedures were compared by flipping 2 coins, procedure A assumed to be better and so it was labeled head, while procedure B was labeled as tail for a flip. The coins where then flipped 25 times, with the following results:[{'Tails', 12}, {'Heads', 13}]\n",
"\n",
"Which procedure had better results with statistical significance?\n",
"[[[Assistant]]] The statistical significance of the outcomes between the two procedures can be assessed using the p-value, which represents the probability of obtaining results as extreme as, or more extreme than, those observed, if the null hypothesis is true.\n",
"\n",
"In this case, let's assume that the null hypothesis would suggest that there is no difference between the two procedures, so each one should result in heads or tails with approximately equal probability (assuming fair coins).\n",
"\n",
"To calculate the p-value, we can use the statistic T, which in this context could be any relevant statistic calculated from the data, such as the difference in the number of flips resulting in heads or tails. We want to find the p-value corresponding to the observed value of T when the data is Tails = 12, Heads\n"
]
}
],
"source": [
"import torch\n",
"from transformers import AutoModelForSequenceClassification\n",
"\n",
"N = 4\n",
"\n",
"rm_tokenizer = AutoTokenizer.from_pretrained('OpenAssistant/reward-model-deberta-v3-large-v2')\n",
"rm_model = AutoModelForSequenceClassification.from_pretrained('OpenAssistant/reward-model-deberta-v3-large-v2', torch_dtype=torch.bfloat16)\n",
"\n",
"def extract_pair(resp):\n",
" response = resp.split('[[[Content]]]')[1]\n",
" inst, resp = resp.split('[[[User]]]')[:2]\n",
" return inst.strip(), resp.strip()\n",
" \n",
"def score(resp):\n",
" inst, resp = extract_pair(resp.split(tokenizer.eos_token)[0])\n",
" \n",
" with torch.no_grad():\n",
" inputs = rm_tokenizer(inst, resp, return_tensors='pt')\n",
" score = float(rm_model(**inputs).logits[0].cpu())\n",
" return score\n",
"\n",
"gens = tokenizer.batch_decode(model.generate(inputs, max_new_tokens=256, num_return_sequences=N, do_sample=True))\n",
"print(max(gens, key=score))"
]
}
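,
{
"cell_type": "markdown",
"id": "4d5e6f70",
"metadata": {
"jupyter": {
"source_hidden": false
}
},
"source": [
"For the simple regex filtering mentioned above, a minimal sketch is shown below. The `is_well_formed` helper is hypothetical (not part of Genstruct or `transformers`); it merely checks that a generation contains a `[[[User]]]` block followed by a non-empty `[[[Assistant]]]` block:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2a3b4c5d",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
}
},
"outputs": [],
"source": [
"import re\n",
"\n",
"def is_well_formed(resp):\n",
"    # Drop everything after the first EOS token\n",
"    body = resp.split(tokenizer.eos_token)[0]\n",
"    # Require a [[[User]]] block followed by a non-empty [[[Assistant]]] block\n",
"    return re.search(r'\\[\\[\\[User\\]\\]\\][\\s\\S]+\\[\\[\\[Assistant\\]\\]\\][\\s\\S]+', body) is not None\n",
"\n",
"filtered = [g for g in gens if is_well_formed(g)]\n",
"print(f'{len(filtered)}/{len(gens)} generations passed the format check')"
]
}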
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}