<h1 align=center> Contextual RAG </h1>

![anthropic blog post](https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2F2496e7c6fedd7ffaa043895c23a4089638b0c21b-3840x2160.png&w=3840&q=75)

This is an approach proposed by Anthropic in a recent [blog post](https://www.anthropic.com/news/contextual-retrieval). It improves retrieval by attaching to each document chunk a short, in-context summary that situates the chunk within the document it came from.
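To make the idea concrete, here is a minimal sketch of contextualizing a single chunk. The prompt wording, the `contextualize_chunk` helper, and the `fake_llm` stand-in are all assumptions made for illustration; they are not taken from the blog post or from any library.

```python
from typing import Callable

# Hypothetical prompt wording; the blog post's exact prompt may differ.
CONTEXT_PROMPT = (
    "<document>\n{document}\n</document>\n"
    "Here is a chunk from the document:\n"
    "<chunk>\n{chunk}\n</chunk>\n"
    "Write a short context that situates this chunk within the document."
)


def contextualize_chunk(document: str, chunk: str, generate: Callable[[str], str]) -> str:
    """Return the chunk with a model-written context prepended to it."""
    prompt = CONTEXT_PROMPT.format(document=document, chunk=chunk)
    context = generate(prompt).strip()
    return f"{context}\n\n{chunk}"


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return "This chunk is from a document about state machines."


document = (
    "State machines model systems with a finite set of states. "
    "Transitions are triggered by events."
)
chunk = "Transitions are triggered by events."
contextualized = contextualize_chunk(document, chunk, fake_llm)
print(contextualized)
```

The contextualized text, rather than the bare chunk, is what would then be embedded and indexed.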

<h2 align=center> Problems </h2>

As the explanation suggests, each chunk has to be contextualized with respect to the rest of the document. In practice this means passing the whole document into the prompt, along with the chunk, once for every chunk. There are two problems with this:

1. This is very expensive in terms of input token count, because the full document is sent again for every chunk.
2. For models with smaller context windows, the whole document may exceed the limit. (Further, there is a sense in which fitting a whole document into a model's context window defeats the point of performing RAG.)
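A rough back-of-the-envelope calculation shows how quickly the cost grows. The token counts and the context window below are hypothetical figures chosen for illustration, and the estimate ignores prompt boilerplate and output tokens.

```python
import math


def contextualization_input_tokens(doc_tokens: int, chunk_tokens: int) -> int:
    """Estimate input tokens when every chunk is sent together with the full document."""
    num_chunks = math.ceil(doc_tokens / chunk_tokens)
    # Each call carries the whole document; the chunks themselves add up to one more document.
    return num_chunks * doc_tokens + doc_tokens


# Hypothetical figures, not measurements from this notebook.
doc_tokens = 50_000
chunk_tokens = 1_000
context_window = 32_000

total = contextualization_input_tokens(doc_tokens, chunk_tokens)
fits = doc_tokens + chunk_tokens <= context_window

print(f"Estimated input tokens: {total:,}")
print(f"Document plus one chunk fits in the window: {fits}")
```

Because the number of calls scales with the document length, the total input grows roughly quadratically with the size of the document.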


<h2 align=center> Whole Document Summarization </h2>

The solution I have come up with is to first summarize the document down to a more manageable size, and then use that summary in place of the full text when contextualizing each chunk.
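The overall flow might look like the sketch below: the document is summarized once, and every chunk is then contextualized against that summary. The function names and the two stand-in callables are hypothetical; in practice both would be backed by model calls.

```python
from typing import Callable, Dict, List


def contextualize_with_summary(
    chunks: List[str],
    summarize: Callable[[List[str]], str],
    generate_context: Callable[[str, str], str],
) -> List[str]:
    """Summarize the document once, then contextualize each chunk against the summary."""
    summary = summarize(chunks)  # the only step that reads the whole document
    contextualized = []
    for chunk in chunks:
        context = generate_context(summary, chunk)  # prompt holds summary + one chunk
        contextualized.append(f"{context}\n\n{chunk}")
    return contextualized


# Stand-ins for model calls, with counters to show how often each one runs.
calls: Dict[str, int] = {"summarize": 0, "context": 0}


def fake_summarize(chunks: List[str]) -> str:
    calls["summarize"] += 1
    return "A short note on state machines."


def fake_context(summary: str, chunk: str) -> str:
    calls["context"] += 1
    return f"From a document summarized as: {summary}"


chunks = ["States are finite.", "Events trigger transitions.", "Guards restrict transitions."]
contextualized = contextualize_with_summary(chunks, fake_summarize, fake_context)
print(calls)
print(contextualized[1])
```

Each per-chunk prompt now contains only the summary and one chunk, so its size no longer depends on the length of the document.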

<h3 align=center> Refine </h3>
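The refine strategy summarizes the first chunk and then updates that running summary with each following chunk, one call per chunk. A library-free sketch of the control flow is shown here; the `refine_summarize` helper and the toy callables are illustrative assumptions and stand in for the model calls that the chain makes.

```python
from typing import Callable, List


def refine_summarize(
    chunks: List[str],
    initial: Callable[[str], str],
    refine: Callable[[str, str], str],
) -> str:
    """Summarize the first chunk, then fold every later chunk into the running summary."""
    if not chunks:
        return ""
    summary = initial(chunks[0])
    for chunk in chunks[1:]:
        # Each step sees the summary so far plus one new chunk.
        summary = refine(summary, chunk)
    return summary


def first_words(text: str, n: int = 3) -> str:
    # Toy "summary": keep the first n words of the text.
    return " ".join(text.split()[:n])


def toy_refine(existing: str, chunk: str) -> str:
    # Toy refinement: append the new chunk's toy summary to the existing one.
    return f"{existing}; {first_words(chunk)}"


chunks = ["alpha beta gamma delta", "one two three four"]
result = refine_summarize(chunks, first_words, toy_refine)
print(result)
```

Since the calls are sequential, a document split into N chunks needs N model calls, which matters for providers that enforce request quotas.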

In [2]:
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate
from langchain_text_splitters import CharacterTextSplitter
from langchain.document_loaders import PyMuPDFLoader

In [1]:
from langchain_google_genai import ChatGoogleGenerativeAI

In [3]:
from langchain.chains.summarize import load_summarize_chain

In [5]:
import os
from dotenv import load_dotenv

if not load_dotenv():
    print("API keys may not have been loaded successfully")
google_api_key = os.getenv("GOOGLE_API_KEY")

In [6]:
llm = ChatGoogleGenerativeAI(model="gemini-pro", api_key=google_api_key)

In [7]:
loader = PyMuPDFLoader("data/State Machines.pdf")
docs = loader.load()

In [10]:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000, chunk_overlap=0)
split_docs = text_splitter.split_documents(docs)

In [11]:
prompt = """
                  Please provide a summary of the following text.
                  TEXT: {text}
                  SUMMARY:
                  """

question_prompt = PromptTemplate(
    template=prompt, input_variables=["text"]
)

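refine_prompt_template = """
              Here is the existing summary so far:
              {existing_answer}
              Refine it using the additional text delimited by triple backquotes.
              Return your response in bullet points that cover the key points of the text.
              ```{text}```
              BULLET POINT SUMMARY:
              """

# The refine chain passes the running summary to this prompt as `existing_answer`;
# without that variable each step would ignore the earlier chunks.
refine_template = PromptTemplate(
    template=refine_prompt_template, input_variables=["existing_answer", "text"]
)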

# Load refine chain
chain = load_summarize_chain(
    llm=llm,
    chain_type="refine",
    question_prompt=question_prompt,
    refine_prompt=refine_template,
    return_intermediate_steps=True,
    input_key="input_documents",
    output_key="output_text",
)
# Calling the chain object directly is deprecated in recent LangChain releases; use invoke()
result = chain.invoke({"input_documents": split_docs}, return_only_outputs=True)

Retrying langchain_google_genai.chat_models._chat_with_retry.<locals>._chat_with_retry in 2.0 seconds as it raised ResourceExhausted: 429 Resource has been exhausted (e.g. check quota)..


ResourceExhausted: 429 Resource has been exhausted (e.g. check quota).

<h3 align=center> Remarks </h3>

The refine chain is configured and starts running, but the run fails with this error:

```python
ResourceExhausted: 429 Resource has been exhausted (e.g. check quota).
```

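This is a problem on the part of our LLM provider, not the code: a 429 response means the API quota or rate limit for the key has been exceeded. The refine chain makes one LLM call per chunk, so a long document can use up the allowance quickly.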
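One common way to work within a quota is to retry with exponentially growing pauses. The sketch below is generic and is not tied to any provider's client: `QuotaExceeded` is a hypothetical stand-in for the provider's 429 error, and the delays are arbitrary. It would not help if the quota is exhausted for the whole day.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class QuotaExceeded(Exception):
    """Hypothetical stand-in for a provider's 429 / quota error."""


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, waiting base_delay * 2**attempt seconds after each quota error."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except QuotaExceeded:
            if attempt == max_retries:
                raise  # out of retries, surface the error to the caller
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")


# Demo: a call that fails twice before succeeding, with sleep replaced by a recorder.
attempts = {"count": 0}
delays: List[float] = []


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise QuotaExceeded("429 Resource has been exhausted")
    return "ok"


result = call_with_backoff(flaky_call, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the demo instant; with the default `time.sleep` the same call would pause for 2 and then 4 seconds.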

<h3 align=center> Next Steps </h3>

The best approach will be to use local models for this kind of heavy inference. For that we will turn to either **Ollama** or Hugging Face **Transformers**.