How to get up to 4096 context length?
The Mistral-7B-Instruct-v0.1-GGUF card mentions "The model will work at sequence lengths of 4096, or lower."
But when I load the model, it only seems to support a max context length of 512.
model._llm.context_length --> 512
When I run a larger prompt I get:
WARNING:ctransformers:Number of tokens (850) exceeded maximum context length (512).
How can I utilize the longer context length for the Mistral-7B-Instruct-v0.1-GGUF model?
You have to set it manually. It's normally set to 512 by default for all models. Also, the model should support a context length of around 8k (slightly lower).
OK, do you have any suggestions or pointers on how to do so?
You can use
pip install llama-cpp-python
wget https://huggingface.co/TheBloke/WizardLM-13B-V1.2-GGUF/resolve/main/wizardlm-13b-v1.2.Q5_K_M.gguf
And after that, for example:
from llama_cpp import Llama

# Load the GGUF model with a 4096-token context window and offload all layers to the GPU
llm = Llama(model_path="wizardlm-13b-v1.2.Q5_K_M.gguf", n_ctx=4096, n_gpu_layers=-1)

prompt = "Explain what a context window is."
print(llm(prompt, max_tokens=1024, temperature=0))
Just change the model name and path.
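If you want to confirm the larger window actually took effect, something like this should work (assuming the n_ctx() accessor on the Llama object):

# Sanity check: should report the context size configured at load time (4096 here)
print(llm.n_ctx())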
Thanks everyone for the suggestions. I was just pointed to the context_length parameter in ctransformers. The context length is raised to 4096 by:
from ctransformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
    hf=True,
    context_length=4096)  # raises the window from the 512 default
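For reference, the same check from the original post should now report the larger window:

# The attribute queried in the question now reflects the configured value
print(model._llm.context_length)  # --> 4096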