
Support for LangChain Integration

#2 by kiran2405 - opened

Is it possible to load this quantised model for integration with LangChain via LangChain's Hugging Face Local Pipeline integration? The original MPT-7B-Instruct could be loaded in a similar fashion.

Check out ctransformers. This has LangChain integration and supports CPU inference on these GGML MPT models.
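
For reference, a minimal sketch of plain CPU inference with ctransformers outside LangChain (the file path below is a placeholder for wherever the GGML file was downloaded):

from ctransformers import AutoModelForCausalLM

# Load a local GGML MPT file; model_type tells ctransformers which backend to use
llm = AutoModelForCausalLM.from_pretrained('/path/to/mpt-7b-instruct.ggmlv3.q5_0.bin',
                                           model_type='mpt')
print(llm("Explain what a vector store is in one sentence."))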

Here is my partial code with this model; the rest can be found in the LangChain and ctransformers docs. It works well for me.

from langchain.vectorstores import FAISS
from ctransformers.langchain import CTransformers
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceInstructEmbeddings

# Load the quantised GGML model for CPU inference via ctransformers
llm = CTransformers(model='D:\\Ai\\models\\MPT-7B-Instruct-GGML\\mpt-7b-instruct.ggmlv3.q5_0.bin',
                    model_type='mpt')

# Embedding model (must match the one used when the FAISS index was built)
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2",
                                                      model_kwargs={"device": "cpu"})

# Load a previously saved FAISS index and expose it as a retriever
db = FAISS.load_local("faiss_index", instructor_embeddings)
retriever = db.as_retriever(search_kwargs={"k": 3})

# Retrieval-augmented QA chain that "stuffs" retrieved chunks into the prompt
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                       chain_type="stuff",
                                       retriever=retriever)
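
A usage sketch for the chain above (the question string is just an example):

result = qa_chain.run("What does the document say about installation?")
print(result)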

If this code is used with the llama-65B-GGML model, the qa_chain.run call takes a very long time. How can this be solved?
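
Not a definitive fix (a 65B GGML model on CPU is inherently slow), but a sketch of ctransformers config options that may help, assuming the 'threads' and 'gpu_layers' keys (the latter needs the CT_CUBLAS build mentioned later in this thread); values and path are illustrative only:

from ctransformers.langchain import CTransformers

# More CPU threads, and offload some layers to the GPU if ctransformers
# was built with CT_CUBLAS=1; the numbers here are placeholders.
config = {'threads': 8, 'gpu_layers': 40}
llm = CTransformers(model='/path/to/llama-65b.ggmlv3.q4_0.bin',
                    model_type='llama', config=config)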

When trying the code above, it returns OSError: /lib64/libm.so.6: version `GLIBC_2.29' not found for the ctransformers library. Is there any way to use ctransformers without upgrading the GLIBC version?

@nicoleds Try building from source, which will also give you GPU acceleration if you have the CUDA toolkit installed:

CT_CUBLAS=1 pip install ctransformers --no-binary ctransformers

Thanks for the reply. What if I only have a CPU and no GPU available?

Then just leave out the CT_CUBLAS=1 part:

pip install ctransformers --no-binary ctransformers

@vsns It would be great if you could share a more complete example where this works for you. I have been trying your example and others from LangChain on many of these models, but the responses are nonsensical and/or completely outside the context. Very similar code just works with OpenAI models (ada for embeddings and 3.5-turbo as the model), making me wonder whether I am doing something wrong or these models are just not capable.

Here you go:

import typer

# 0xVs

from ctransformers.langchain import CTransformers
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import SentenceTransformersTokenTextSplitter
from langchain.document_loaders import PDFPlumberLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from rich import print
from rich.prompt import Prompt

app = typer.Typer()
device = "cpu"



@app.command()
def import_pdfs(dir: str, embedding_model="sentence-transformers/all-MiniLM-L6-v2"):
    loader = DirectoryLoader(dir, glob="./*.pdf", loader_cls=PDFPlumberLoader, show_progress=True)
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)

    embeddings = HuggingFaceInstructEmbeddings(model_name=embedding_model, 
                                               model_kwargs={"device": device})
    db = FAISS.from_documents(docs, embeddings)
    db.save_local("faiss_index")



@app.command()
def question(model_path: str = "./models/mpt-7b-instruct.ggmlv3.q5_0.bin",
             model_type='mpt',
             embedding_model="sentence-transformers/all-MiniLM-L6-v2",
             search_breadth : int = 5, threads : int = 6, temperature : float = 0.4):
    embeddings = HuggingFaceInstructEmbeddings(model_name=embedding_model, 
                                               model_kwargs={"device": device})
    config = {'temperature': temperature, 'threads' : threads}
    llm = CTransformers(model=model_path, model_type=model_type, config=config)
    db = FAISS.load_local("faiss_index", embeddings)
    retriever = db.as_retriever(search_kwargs={"k": search_breadth})
    memory = ConversationBufferMemory(memory_key="chat_history", output_key="answer", return_messages=True)
    qa = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever,
                                               memory=memory, return_source_documents=True)
    while True:
        query = Prompt.ask('[bright_yellow]\nQuestion[/bright_yellow] ')
        res = qa({"question": query})
        print("[spring_green4]"+res['answer']+"[/spring_green4]")
        if "source_documents" in res:
            print("\n[italic grey46]References[/italic grey46]:")
            for ref in res["source_documents"]:
                print("> [grey19]" + ref.metadata['source'] + "[/grey19]")

if __name__ == "__main__":
    app()

Some notes:

  1. In my experience (take it with a pinch of salt), for QA the quality of the vector data matters more than the model (I avoid proprietary systems and models).
  2. I haven't tested the code much, so multiple optimizations are possible: a different embedding model, a custom prompt template (a sketch follows below), configuration tweaks, etc.
  3. I'm currently considering VMware/open-llama-7b-open-instruct with llama-cpp-python, since I'm not getting good results when I use documents from narrow domains with little text.
  4. Ultimately I plan to ship a single static binary (with the naive assumption that qdrant can be packed inside it) using Rustformers and falcon-40b-instruct, once support is available there.

(Attached screenshot: docQA.png)
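
A sketch of the custom prompt template idea from note 2, using LangChain's PromptTemplate with RetrievalQA; the template wording is just an example, and llm and retriever are built as in the earlier snippets:

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# {context} and {question} are the variables the "stuff" chain fills in
template = """Use the following context to answer the question. If the answer is not in the context, say you don't know.

{context}

Question: {question}
Answer:"""
prompt = PromptTemplate(template=template, input_variables=["context", "question"])

qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever,
                                       chain_type_kwargs={"prompt": prompt})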

What should "question" be in 'python test.py question'? A string? Another Python file?

I am getting:

AttributeError: 'CTransformers' object has no attribute 'task'

That appears to be caused by this block of code:

huggingface_pipeline.py:169, in HuggingFacePipeline._call(self, prompt, stop, run_manager)
    162 def _call(
    163     self,
    164     prompt: str,
    165     stop: Optional[List[str]] = None,
    166     run_manager: Optional[CallbackManagerForLLMRun] = None,
    167 ) -> str:
    168     response = self.pipeline(prompt)
--> 169     if self.pipeline.task == "text-generation":
    170         # Text generation return includes the starter text.
    171         text = response[0]["generated_text"][len(prompt) :]
    172     elif self.pipeline.task == "text2text-generation":

It looks like we need to add some sort of pipeline abstraction to ctransformers now?
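
For what it's worth, the earlier snippets in this thread pass the CTransformers wrapper to the chain directly instead of going through HuggingFacePipeline (which expects a transformers pipeline and hence a .task attribute). A minimal sketch of that approach, with the path as a placeholder and retriever built as above:

from ctransformers.langchain import CTransformers
from langchain.chains import RetrievalQA

# Use the ctransformers LangChain wrapper directly as the LLM
llm = CTransformers(model='/path/to/mpt-7b-instruct.ggmlv3.q5_0.bin', model_type='mpt')
qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)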

How can I increase the context_length and max_input_seq_token of this quantized MPT model?
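
Not an authoritative answer, but ctransformers exposes a context_length config key; a hedged sketch, assuming the MPT GGML backend honours it (it may be ignored for some model types), with the path as a placeholder:

from ctransformers.langchain import CTransformers

# context_length and max_new_tokens are documented ctransformers config keys;
# whether a larger context actually works for this MPT conversion is untested here.
config = {'context_length': 4096, 'max_new_tokens': 512}
llm = CTransformers(model='/path/to/mpt-7b-instruct.ggmlv3.q5_0.bin',
                    model_type='mpt', config=config)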
