
Support for LangChain Integration

#2 by kiran2405 - opened

Is it possible to load this quantised model for integration with LangChain via LangChain's Hugging Face Local Pipeline integration? The original MPT-7B-Instruct could be loaded in a similar fashion.

Check out ctransformers. This has LangChain integration and supports CPU inference on these GGML MPT models.
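
For reference, a minimal sketch of plain CPU inference with ctransformers outside LangChain (the file path below is a placeholder for wherever the GGML file was downloaded):

from ctransformers import AutoModelForCausalLM

# Load a local GGML MPT file; model_type tells ctransformers which backend to use
llm = AutoModelForCausalLM.from_pretrained('/path/to/mpt-7b-instruct.ggmlv3.q5_0.bin',
                                           model_type='mpt')
print(llm("Explain what a vector store is in one sentence."))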

Here is my partial code with this model; the rest can be found in the LangChain and ctransformers docs. It works well for me.

from langchain.vectorstores import FAISS
from ctransformers.langchain import CTransformers
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceInstructEmbeddings

# Load the quantised GGML model for CPU inference via ctransformers
llm = CTransformers(model='D:\\Ai\\models\\MPT-7B-Instruct-GGML\\mpt-7b-instruct.ggmlv3.q5_0.bin',
                    model_type='mpt')

# Embedding model (must match the one used when the FAISS index was built)
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2",
                                                      model_kwargs={"device": "cpu"})

# Load a previously saved FAISS index and expose it as a retriever
db = FAISS.load_local("faiss_index", instructor_embeddings)
retriever = db.as_retriever(search_kwargs={"k": 3})

# Retrieval-augmented QA chain that "stuffs" retrieved chunks into the prompt
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                       chain_type="stuff",
                                       retriever=retriever)
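
A usage sketch for the chain above (the question string is just an example):

result = qa_chain.run("What does the document say about installation?")
print(result)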

If this code is used with the llama-65B-GGML model, the qa_chain.run call takes a very long time. How can this be solved?
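
Not a definitive fix (a 65B GGML model on CPU is inherently slow), but a sketch of ctransformers config options that may help, assuming the 'threads' and 'gpu_layers' keys (the latter needs the CT_CUBLAS build mentioned later in this thread); values and path are illustrative only:

from ctransformers.langchain import CTransformers

# More CPU threads, and offload some layers to the GPU if ctransformers
# was built with CT_CUBLAS=1; the numbers here are placeholders.
config = {'threads': 8, 'gpu_layers': 40}
llm = CTransformers(model='/path/to/llama-65b.ggmlv3.q4_0.bin',
                    model_type='llama', config=config)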

When trying the code above, it returns OSError: /lib64/libm.so.6: version `GLIBC_2.29' not found for the ctransformers library. Is there any way to use ctransformers without upgrading the GLIBC version?

@nicoleds Try building from source, which will also give you GPU acceleration if you have the CUDA toolkit installed:

CT_CUBLAS=1 pip install ctransformers --no-binary ctransformers

Thanks for the reply. What if I only have a CPU and no GPU available?

Then just leave out the CT_CUBLAS=1 part:

pip install ctransformers --no-binary ctransformers

@vsns It would be great if you could share a more complete example where this works for you. I have been trying your example and others from LangChain on many of these models, but the responses are nonsensical and/or completely outside the context. Very similar code just works with OpenAI models (ada for embeddings and 3.5-turbo as the model), making me wonder whether I am doing something wrong or these models are just not capable.

Here you go:

import typer

# 0xVs

from ctransformers.langchain import CTransformers
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import SentenceTransformersTokenTextSplitter
from langchain.document_loaders import PDFPlumberLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from rich import print
from rich.prompt import Prompt

app = typer.Typer()
device = "cpu"



@app.command()
def import_pdfs(dir: str, embedding_model="sentence-transformers/all-MiniLM-L6-v2"):
    loader = DirectoryLoader(dir, glob="./*.pdf", loader_cls=PDFPlumberLoader, show_progress=True)
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)

    embeddings = HuggingFaceInstructEmbeddings(model_name=embedding_model, 
                                               model_kwargs={"device": device})
    db = FAISS.from_documents(docs, embeddings)
    db.save_local("faiss_index")



@app.command()
def question(model_path: str = "./models/mpt-7b-instruct.ggmlv3.q5_0.bin",
             model_type='mpt',
             embedding_model="sentence-transformers/all-MiniLM-L6-v2",
             search_breadth : int = 5, threads : int = 6, temperature : float = 0.4):
    embeddings = HuggingFaceInstructEmbeddings(model_name=embedding_model, 
                                               model_kwargs={"device": device})
    config = {'temperature': temperature, 'threads' : threads}
    llm = CTransformers(model=model_path, model_type=model_type, config=config)
    db = FAISS.load_local("faiss_index", embeddings)
    retriever = db.as_retriever(search_kwargs={"k": search_breadth})
    memory = ConversationBufferMemory(memory_key="chat_history", output_key="answer", return_messages=True)
    qa = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever,
                                               memory=memory, return_source_documents=True)
    while True:
        query = Prompt.ask('[bright_yellow]\nQuestion[/bright_yellow] ')
        res = qa({"question": query})
        print("[spring_green4]"+res['answer']+"[/spring_green4]")
        if "source_documents" in res:
            print("\n[italic grey46]References[/italic grey46]:")
            for ref in res["source_documents"]:
                print("> [grey19]" + ref.metadata['source'] + "[/grey19]")

if __name__ == "__main__":
    app()

Some notes:

  1. In my experience (take it with a pinch of salt), for QA the quality of the vector data matters more than the model (I avoid proprietary systems and models).
  2. I haven't tested the code much, so multiple optimizations are possible: a different embedding model, a custom prompt template (a sketch follows below), configuration tweaks, etc.
  3. I'm currently considering VMware/open-llama-7b-open-instruct with llama-cpp-python, since I'm not getting good results when I use documents from narrow domains with little text.
  4. Ultimately I plan to ship a single static binary (with the naive assumption that qdrant can be packed inside it) using Rustformers and falcon-40b-instruct, once support is available there.

(Attached screenshot: docQA.png)
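
A sketch of the custom prompt template idea from note 2, using LangChain's PromptTemplate with RetrievalQA; the template wording is just an example, and llm and retriever are built as in the earlier snippets:

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# {context} and {question} are the variables the "stuff" chain fills in
template = """Use the following context to answer the question. If the answer is not in the context, say you don't know.

{context}

Question: {question}
Answer:"""
prompt = PromptTemplate(template=template, input_variables=["context", "question"])

qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever,
                                       chain_type_kwargs={"prompt": prompt})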

What should "question" be in 'python test.py question'? A string? Another Python file?

I am getting:

AttributeError: 'CTransformers' object has no attribute 'task'

That appears to be caused by this block of code:

huggingface_pipeline.py:169, in HuggingFacePipeline._call(self, prompt, stop, run_manager)
    162 def _call(
    163     self,
    164     prompt: str,
    165     stop: Optional[List[str]] = None,
    166     run_manager: Optional[CallbackManagerForLLMRun] = None,
    167 ) -> str:
    168     response = self.pipeline(prompt)
--> 169     if self.pipeline.task == "text-generation":
    170         # Text generation return includes the starter text.
    171         text = response[0]["generated_text"][len(prompt) :]
    172     elif self.pipeline.task == "text2text-generation":

It looks like we need to add some sort of pipeline abstraction to ctransformers now?
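
For what it's worth, the earlier snippets in this thread pass the CTransformers wrapper to the chain directly instead of going through HuggingFacePipeline (which expects a transformers pipeline and hence a .task attribute). A minimal sketch of that approach, with the path as a placeholder and retriever built as above:

from ctransformers.langchain import CTransformers
from langchain.chains import RetrievalQA

# Use the ctransformers LangChain wrapper directly as the LLM
llm = CTransformers(model='/path/to/mpt-7b-instruct.ggmlv3.q5_0.bin', model_type='mpt')
qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)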

How can I increase the context_length and max_input_seq_token of this quantized MPT model?
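
Not an authoritative answer, but ctransformers exposes a context_length config key; a hedged sketch, assuming the MPT GGML backend honours it (it may be ignored for some model types), with the path as a placeholder:

from ctransformers.langchain import CTransformers

# context_length and max_new_tokens are documented ctransformers config keys;
# whether a larger context actually works for this MPT conversion is untested here.
config = {'context_length': 4096, 'max_new_tokens': 512}
llm = CTransformers(model='/path/to/mpt-7b-instruct.ggmlv3.q5_0.bin',
                    model_type='mpt', config=config)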
