{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval\n",
    "\n",
    "This notebook shows how to use an implementation of RAPTOR with llama-index, leveraging the RAPTOR llama-pack.\n",
    "\n",
    "RAPTOR builds a tree for retrieval: it clusters the document chunks, summarizes each cluster, and then recursively repeats the process on the summaries, layer by layer.\n",
    "\n",
    "There are two retrieval modes:\n",
    "- `tree_traversal` -- traverse the tree of clusters, performing top-k at each level of the tree.\n",
    "- `collapsed` -- treat the entire tree as one flat pool of nodes and perform a single top-k.\n",
    "\n",
    "See [the paper](https://arxiv.org/abs/2401.18059) for full algorithm details."
   ]
  },
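  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two modes can be illustrated with a toy sketch in plain Python. This is not the RAPTOR pack's implementation: the `TREE` dictionary, its 2-D vectors, and the helpers `cosine`, `top_k`, `collapsed`, and `tree_traversal` are hypothetical names made up for this example, and the pack may differ in which nodes it finally returns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "# Hypothetical toy tree: \"c*\" nodes are chunks, \"s*\" and \"root\" are cluster summaries.\n",
    "TREE = {\n",
    "    \"root\": {\"vec\": (1.0, 1.0), \"children\": [\"s1\", \"s2\"]},\n",
    "    \"s1\": {\"vec\": (1.0, 0.2), \"children\": [\"c1\", \"c2\"]},\n",
    "    \"s2\": {\"vec\": (0.2, 1.0), \"children\": [\"c3\", \"c4\"]},\n",
    "    \"c1\": {\"vec\": (1.0, 0.0), \"children\": []},\n",
    "    \"c2\": {\"vec\": (0.9, 0.4), \"children\": []},\n",
    "    \"c3\": {\"vec\": (0.0, 1.0), \"children\": []},\n",
    "    \"c4\": {\"vec\": (0.4, 0.9), \"children\": []},\n",
    "}\n",
    "\n",
    "\n",
    "def cosine(a, b):\n",
    "    # Cosine similarity between two 2-D vectors.\n",
    "    dot = sum(x * y for x, y in zip(a, b))\n",
    "    return dot / (math.hypot(*a) * math.hypot(*b))\n",
    "\n",
    "\n",
    "def top_k(query, ids, k):\n",
    "    # Rank the given node ids by similarity to the query and keep the best k.\n",
    "    return sorted(ids, key=lambda i: cosine(query, TREE[i][\"vec\"]), reverse=True)[:k]\n",
    "\n",
    "\n",
    "def collapsed(query, k):\n",
    "    # Ignore the hierarchy: rank every node, summaries and chunks alike.\n",
    "    return top_k(query, list(TREE), k)\n",
    "\n",
    "\n",
    "def tree_traversal(query, k):\n",
    "    # Start at the top, keep the top-k per level, then search only their children.\n",
    "    selected = []\n",
    "    frontier = top_k(query, [\"root\"], k)\n",
    "    while frontier:\n",
    "        selected.extend(frontier)\n",
    "        children = [c for i in frontier for c in TREE[i][\"children\"]]\n",
    "        frontier = top_k(query, children, k)\n",
    "    return selected\n",
    "\n",
    "\n",
    "toy_query = (1.0, 0.0)\n",
    "print(\"collapsed:\", collapsed(toy_query, k=2))\n",
    "print(\"tree_traversal:\", tree_traversal(toy_query, k=1))"
   ]
  },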
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install llama-index llama-index-packs-raptor llama-index-vector-stores-chroma"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.packs.raptor import RaptorPack\n",
    "\n",
    "# optionally download the pack to inspect/modify it yourself!\n",
    "# from llama_index.core.llama_pack import download_llama_pack\n",
    "# RaptorPack = download_llama_pack(\"RaptorPack\", \"./raptor_pack\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.\n",
      "ERROR: could not open HSTS store at '/home/loganm/.wget-hsts'. HSTS will be disabled.\n",
      "--2024-02-29 22:16:11--  https://arxiv.org/pdf/2401.18059.pdf\n",
      "Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.195.42, 151.101.131.42, ...\n",
      "Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 2547113 (2.4M) [application/pdf]\n",
      "Saving to: ‘./raptor_paper.pdf’\n",
      "\n",
      "./raptor_paper.pdf  100%[===================>]   2.43M  12.5MB/s    in 0.2s    \n",
      "\n",
      "2024-02-29 22:16:12 (12.5 MB/s) - ‘./raptor_paper.pdf’ saved [2547113/2547113]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget https://arxiv.org/pdf/2401.18059.pdf -O ./raptor_paper.pdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Constructing the Clusters/Hierarchy Tree"
   ]
  },
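  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Building the tree proceeds level by level: group the current nodes, summarize each group, and repeat on the summaries. The next cell is a rough, hypothetical sketch of that recursion, not the pack's code: `build_tree` is an invented helper, consecutive fixed-size groups stand in for real clustering, and joining first words stands in for LLM-generated summaries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_tree(chunks, group_size=2):\n",
    "    # Level 0 is the original chunks; each further level summarizes groups of the previous one.\n",
    "    # group_size is assumed to be at least 2 so that every level shrinks.\n",
    "    levels = [list(chunks)]\n",
    "    while len(levels[-1]) > 1:\n",
    "        current = levels[-1]\n",
    "        # Stand-in for clustering: consecutive fixed-size groups.\n",
    "        groups = [current[i : i + group_size] for i in range(0, len(current), group_size)]\n",
    "        # Stand-in for LLM summarization: join the first word of each member.\n",
    "        summaries = [\" + \".join(text.split()[0] for text in group) for group in groups]\n",
    "        levels.append(summaries)\n",
    "    return levels\n",
    "\n",
    "\n",
    "levels = build_tree([\"alpha one\", \"beta two\", \"gamma three\", \"delta four\", \"epsilon five\"])\n",
    "for depth, nodes_at_level in enumerate(levels):\n",
    "    print(depth, nodes_at_level)"
   ]
  },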
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import SimpleDirectoryReader\n",
    "\n",
    "documents = SimpleDirectoryReader(input_files=[\"./raptor_paper.pdf\"]).load_data()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generating embeddings for level 0.\n",
      "Performing clustering for level 0.\n",
      "Generating summaries for level 0 with 10 clusters.\n",
      "Level 0 created summaries/clusters: 10\n",
      "Generating embeddings for level 1.\n",
      "Performing clustering for level 1.\n",
      "Generating summaries for level 1 with 1 clusters.\n",
      "Level 1 created summaries/clusters: 1\n",
      "Generating embeddings for level 2.\n",
      "Performing clustering for level 2.\n",
      "Generating summaries for level 2 with 1 clusters.\n",
      "Level 2 created summaries/clusters: 1\n"
     ]
    }
   ],
   "source": [
    "from llama_index.core.node_parser import SentenceSplitter\n",
    "from llama_index.llms.openai import OpenAI\n",
    "from llama_index.embeddings.openai import OpenAIEmbedding\n",
    "from llama_index.vector_stores.chroma import ChromaVectorStore\n",
    "import chromadb\n",
    "\n",
    "client = chromadb.PersistentClient(path=\"./raptor_paper_db\")\n",
    "collection = client.get_or_create_collection(\"raptor\")\n",
    "\n",
    "vector_store = ChromaVectorStore(chroma_collection=collection)\n",
    "\n",
    "raptor_pack = RaptorPack(\n",
    "    documents,\n",
    "    embed_model=OpenAIEmbedding(\n",
    "        model=\"text-embedding-3-small\"\n",
    "    ),  # used for embedding clusters\n",
    "    llm=OpenAI(model=\"gpt-3.5-turbo\", temperature=0.1),  # used for generating summaries\n",
    "    vector_store=vector_store,  # used for storage\n",
    "    similarity_top_k=2,  # top k for each layer, or overall top-k for collapsed\n",
    "    mode=\"collapsed\",  # sets default mode\n",
    "    transformations=[\n",
    "        SentenceSplitter(chunk_size=400, chunk_overlap=50)\n",
    "    ],  # transformations applied for ingestion\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Retrieval"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2\n",
      "Specifically, RAPTOR’s F-1 scores are at least 1.8% points higher than DPR and at least 5.3% points\n",
      "higher than BM25.\n",
      "Retriever GPT-3 F-1 Match GPT-4 F-1 Match UnifiedQA F-1 Match\n",
      "Title + Abstract 25.2 22.2 17.5\n",
      "BM25 46.6 50.2 26.4\n",
      "DPR 51.3 53.0 32.1\n",
      "RAPTOR 53.1 55.7 36.6\n",
      "Table 4: Comparison of accuracies on the QuAL-\n",
      "ITY dev dataset for two different language mod-\n",
      "els (GPT-3, UnifiedQA 3B) using various retrieval\n",
      "methods. RAPTOR outperforms the baselines of\n",
      "BM25 and DPR by at least 2.0% in accuracy.\n",
      "Model GPT-3 Acc. UnifiedQA Acc.\n",
      "BM25 57.3 49.9\n",
      "DPR 60.4 53.9\n",
      "RAPTOR 62.4 56.6\n",
      "Table 5: Results on F-1 Match scores of various\n",
      "models on the QASPER dataset.\n",
      "Model F-1 Match\n",
      "LongT5 XL (Guo et al., 2022) 53.1\n",
      "CoLT5 XL (Ainslie et al., 2023) 53.9\n",
      "RAPTOR + GPT-4 55.7Comparison to State-of-the-art Systems\n",
      "Building upon our controlled comparisons,\n",
      "we examine RAPTOR’s performance relative\n",
      "to other state-of-the-art models.\n"
     ]
    }
   ],
   "source": [
    "nodes = raptor_pack.run(\"What baselines is raptor compared against?\", mode=\"collapsed\")\n",
    "print(len(nodes))\n",
    "print(nodes[0].text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Retrieved parent IDs from level 2: ['cc3b3f41-f4ca-4020-b11f-be7e0ce04c4f']\n",
      "Retrieved 1 from parents at level 2.\n",
      "Retrieved parent IDs from level 1: ['a4ca9426-a312-4a01-813a-c9b02aefc7e8']\n",
      "Retrieved 2 from parents at level 1.\n",
      "Retrieved parent IDs from level 0: ['63126782-2778-449f-99c0-1e8fd90caa36', 'd8f68d31-d878-41f1-aeb6-a7dde8ed5143']\n",
      "Retrieved 4 from parents at level 0.\n",
      "4\n",
      "Specifically, RAPTOR’s F-1 scores are at least 1.8% points higher than DPR and at least 5.3% points\n",
      "higher than BM25.\n",
      "Retriever GPT-3 F-1 Match GPT-4 F-1 Match UnifiedQA F-1 Match\n",
      "Title + Abstract 25.2 22.2 17.5\n",
      "BM25 46.6 50.2 26.4\n",
      "DPR 51.3 53.0 32.1\n",
      "RAPTOR 53.1 55.7 36.6\n",
      "Table 4: Comparison of accuracies on the QuAL-\n",
      "ITY dev dataset for two different language mod-\n",
      "els (GPT-3, UnifiedQA 3B) using various retrieval\n",
      "methods. RAPTOR outperforms the baselines of\n",
      "BM25 and DPR by at least 2.0% in accuracy.\n",
      "Model GPT-3 Acc. UnifiedQA Acc.\n",
      "BM25 57.3 49.9\n",
      "DPR 60.4 53.9\n",
      "RAPTOR 62.4 56.6\n",
      "Table 5: Results on F-1 Match scores of various\n",
      "models on the QASPER dataset.\n",
      "Model F-1 Match\n",
      "LongT5 XL (Guo et al., 2022) 53.1\n",
      "CoLT5 XL (Ainslie et al., 2023) 53.9\n",
      "RAPTOR + GPT-4 55.7Comparison to State-of-the-art Systems\n",
      "Building upon our controlled comparisons,\n",
      "we examine RAPTOR’s performance relative\n",
      "to other state-of-the-art models.\n"
     ]
    }
   ],
   "source": [
    "nodes = raptor_pack.run(\n",
    "    \"What baselines is raptor compared against?\", mode=\"tree_traversal\"\n",
    ")\n",
    "print(len(nodes))\n",
    "print(nodes[0].text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading\n",
    "\n",
    "Since we saved to a vector store, we can reuse it without rebuilding the tree. (For local vector stores, the retriever has `persist` and `from_persist_dir` methods.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.packs.raptor import RaptorRetriever\n",
    "\n",
    "retriever = RaptorRetriever(\n",
    "    [],  # no new documents: reuse what is already in the vector store\n",
    "    embed_model=OpenAIEmbedding(\n",
    "        model=\"text-embedding-3-small\"\n",
    "    ),  # used for embedding clusters\n",
    "    llm=OpenAI(model=\"gpt-3.5-turbo\", temperature=0.1),  # used for generating summaries\n",
    "    vector_store=vector_store,  # used for storage\n",
    "    similarity_top_k=2,  # top k for each layer, or overall top-k for collapsed\n",
    "    mode=\"tree_traversal\",  # sets default mode\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# if using a default vector store\n",
    "# retriever.persist(\"./persist\")\n",
    "# retriever = RaptorRetriever.from_persist_dir(\"./persist\", ...)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Query Engine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.query_engine import RetrieverQueryEngine\n",
    "\n",
    "query_engine = RetrieverQueryEngine.from_args(\n",
    "    retriever, llm=OpenAI(model=\"gpt-3.5-turbo\", temperature=0.1)\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = query_engine.query(\"What baselines was RAPTOR compared against?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "BM25 and DPR\n"
     ]
    }
   ],
   "source": [
    "print(str(response))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama-index-4aB9_5sa-py3.10",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}