Non-urgent-micro-issue when running the model locally
First of all, thank you so much for releasing this model so that we developers can learn and experiment with function calling. I'm eager to dive into your evaluation tests, blog posts, and the technology this team is developing! ✨
I found that https://colab.research.google.com/drive/19JYixRPPlanmW5q49WYi_tU8rhHeCEKW#scrollTo=0V3AuFLPSCCV&line=2&uniqifier=1 is a good place to start because it doesn't require any deep langchain knowledge and just shows you a basic structure to make simple calls. (But the provided example in https://huggingface.co/Nexusflow/NexusRaven-V2-13B/blob/main/langdemo.py is also super clear)
Anyway, in the Colab I linked, the `query_raven` method can be replaced by the following code block:
```python
import requests

# Query the hosted NexusRaven-V2 inference endpoint (served with TGI)
output = requests.post(
    "https://rjmy54al17scvxjr.us-east-1.aws.endpoints.huggingface.cloud",
    headers={"Content-Type": "application/json"},
    json={
        "inputs": prompt,
        "parameters": {
            "temperature": 0.001,
            "max_new_tokens": 2000,
            "stop": ["<bot_end>"],
            "do_sample": False,
        },
    },
).json()

# TGI returns a list of generations; strip the "Call:" prefix to get the bare function call
call = output[0]["generated_text"].replace("Call:", "").strip()
```
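As a side note, this request assumes `prompt` has already been built the way the Colab and the model card do it: the available function definitions with their docstrings, followed by the user query ending in `<human_end>`. Roughly something like this (the function here is made up, just to show the shape):

```python
# Hypothetical function definition, only to illustrate the NexusRaven prompt shape
prompt = '''
Function:
def get_weather(city_name: str):
    """
    Returns the current weather for the given city.
    """

User Query: What's the weather like in Madrid?<human_end>
'''
```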
The endpoint request works out of the box and allows for local model execution with Hugging Face's TGI. I however prefer to use llama.cpp as a backend because it's more portable and configurable. To do this, you need to run a llama.cpp server (more info on that in the llama.cpp server README) and replace the request with the following:
```python
# Same request, pointed at a locally running llama.cpp server
output = requests.post(
    "http://localhost:8080/completion",
    headers={"Content-Type": "application/json"},
    json={
        "prompt": prompt,
        "temperature": 0.001,
        "n_predict": 2000,
        "stop": ["<bot_end>", "Thought:"],
        "top_k": 1,
        "top_p": 1.0,
    },
).json()

# llama.cpp's /completion returns a single object with the generation in "content"
call = output["content"].replace("Call:", "").strip()
```
(In case you wanted to use this with LangChain, an option would be to subclass LLM from langchain.llms.base and put this POST request inside it; see the sketch below. Alternatively, you can use a backend like text-generation-webui with its OpenAI-compatible extension to make use of that LLM interface, but that just runs llama.cpp under the hood anyway.)
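Roughly what that subclass could look like, assuming the classic langchain.llms.base.LLM interface (the class name and defaults here are my own):

```python
from typing import Any, List, Optional

import requests
from langchain.llms.base import LLM


class LlamaCppServerRaven(LLM):
    """Minimal LangChain wrapper around a local llama.cpp /completion endpoint."""

    endpoint: str = "http://localhost:8080/completion"

    @property
    def _llm_type(self) -> str:
        return "llama-cpp-server"

    def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any) -> str:
        response = requests.post(
            self.endpoint,
            headers={"Content-Type": "application/json"},
            json={
                "prompt": prompt,
                "temperature": 0.001,
                "n_predict": 2000,
                # <bot_end> is the model's own stop token; "Thought:" is the extra safety net discussed below
                "stop": stop or ["<bot_end>", "Thought:"],
                "top_k": 1,
                "top_p": 1.0,
            },
        ).json()
        return response["content"]


# llm = LlamaCppServerRaven()
# call = llm(prompt).replace("Call:", "").strip()
```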
There is, however, one small modification: the "Thought:" stop word needs to be added. In my experience, running the llama.cpp server locally, even with fully deterministic settings, produces different outputs than the Hugging Face endpoint does. I've tried many different parameter combinations to achieve parity with HF's server, but it doesn't seem to matter, so just be sure to add the second stop word in case you come across this issue!
As of today this setup is working; if anybody has any feedback on what could be causing the token-generation issue, it would be most appreciated!
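In case it helps anyone reproduce the divergence, this is roughly how I compare the two backends on the same prompt (endpoint URLs and greedy settings exactly as above):

```python
import requests

# Hosted TGI endpoint (as above)
hf = requests.post(
    "https://rjmy54al17scvxjr.us-east-1.aws.endpoints.huggingface.cloud",
    headers={"Content-Type": "application/json"},
    json={"inputs": prompt, "parameters": {"temperature": 0.001, "max_new_tokens": 2000,
                                           "stop": ["<bot_end>"], "do_sample": False}},
).json()[0]["generated_text"]

# Local llama.cpp server (as above)
local = requests.post(
    "http://localhost:8080/completion",
    headers={"Content-Type": "application/json"},
    json={"prompt": prompt, "temperature": 0.001, "n_predict": 2000,
          "stop": ["<bot_end>", "Thought:"], "top_k": 1, "top_p": 1.0},
).json()["content"]

# Print where the two generations first diverge
for i, (a, b) in enumerate(zip(hf, local)):
    if a != b:
        print(f"First divergence at character {i}: {hf[i:i+40]!r} vs {local[i:i+40]!r}")
        break
else:
    print("Outputs match up to the length of the shorter one")
```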
Also, I have tried this with different quantizations (the largest one being f16, which should have barely any loss) and the same thing happens: the "<bot_end>" token is never generated.
(Most of this discussion is really just meant to share how this is working for me at the moment, in case others find it helpful!)