Quantizations of https://huggingface.co/mistralai/Mistral-Small-Instruct-2409
Inference Clients/UIs
From original readme
Mistral-Small-Instruct-2409 is an instruct fine-tuned version with the following characteristics:
- 22B parameters
- Vocabulary of 32768 tokens
- Supports function calling
- 32k sequence length
Usage Examples
vLLM (recommended)
We recommend using this model with the vLLM library to implement production-ready inference pipelines.
Installation
Make sure you install vLLM >= v0.6.1.post1:
pip install --upgrade vllm
Also make sure you have mistral_common >= 1.4.1 installed:
pip install --upgrade mistral_common
You can also make use of a ready-to-go docker image.
Offline
from vllm import LLM
from vllm.sampling_params import SamplingParams
model_name = "mistralai/Mistral-Small-Instruct-2409"
sampling_params = SamplingParams(max_tokens=8192)
# note that running Mistral-Small on a single GPU requires at least 44 GB of GPU RAM
# If you want to divide the GPU requirement over multiple devices, please add *e.g.* `tensor_parallel_size=2`
llm = LLM(model=model_name, tokenizer_mode="mistral", config_format="mistral", load_format="mistral")
prompt = "How often does the letter r occur in Mistral?"
messages = [
    {
        "role": "user",
        "content": prompt
    },
]
outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
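The 44 GB figure in the comment above is consistent with a back-of-the-envelope estimate: 22B parameters stored at 2 bytes each. The sketch below works through that arithmetic. It assumes 16-bit (bfloat16) weights and counts the weights only, ignoring the KV cache, activations, and framework overhead, so real usage will be higher. The function names are hypothetical, and the even split across GPUs is an idealization of tensor parallelism.

```python
def estimate_weight_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def per_gpu_weight_memory_gb(num_params: float, num_gpus: int,
                             bytes_per_param: float = 2.0) -> float:
    """Idealized even split of the weights across num_gpus devices."""
    return estimate_weight_memory_gb(num_params, bytes_per_param) / num_gpus


# Assumption: 22B parameters in bfloat16 (2 bytes per parameter).
full = estimate_weight_memory_gb(22e9)
split = per_gpu_weight_memory_gb(22e9, num_gpus=2)

print(f"bf16 weights on one GPU: ~{full:.0f} GB")
print(f"bf16 weights per GPU with 2-way tensor parallelism: ~{split:.0f} GB")
```

The same functions give a rough idea of what a quantized variant needs by lowering `bytes_per_param`, although real quantization formats add some per-block overhead that this estimate does not model.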
Server
You can also use Mistral Small in a server/client setting.
- Spin up a server:
vllm serve mistralai/Mistral-Small-Instruct-2409 --tokenizer_mode mistral --config_format mistral --load_format mistral
Note: Running Mistral-Small on a single GPU requires at least 44 GB of GPU RAM.
If you want to divide the GPU requirement over multiple devices, add e.g. --tensor-parallel-size 2
- And ping the client:
curl --location 'http://<your-node-url>:8000/v1/chat/completions' \
    --header 'Content-Type: application/json' \
    --header 'Authorization: Bearer token' \
    --data '{
        "model": "mistralai/Mistral-Small-Instruct-2409",
        "messages": [
            {
                "role": "user",
                "content": "How often does the letter r occur in Mistral?"
            }
        ]
    }'
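The same request can be sent from Python. The following is a minimal sketch using only the standard library. It assumes the server was started as shown above and returns an OpenAI-style response body, where the reply sits under `choices[0].message.content`. The helper names `build_chat_request` and `ask` are hypothetical, and the base URL is a placeholder for your own node.

```python
import json
import urllib.request

MODEL_NAME = "mistralai/Mistral-Small-Instruct-2409"


def build_chat_request(base_url: str, prompt: str, token: str = "token") -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl call."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def ask(base_url: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-compatible response layout.
    """
    request = build_chat_request(base_url, prompt)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `ask("http://<your-node-url>:8000", "How often does the letter r occur in Mistral?")` should return the model's answer as a string.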
Mistral-inference
We recommend using mistral-inference to quickly try out / "vibe-check" the model.
Install
Make sure to have mistral_inference >= 1.4.1 installed.
pip install mistral_inference --upgrade
Download
from huggingface_hub import snapshot_download
from pathlib import Path
mistral_models_path = Path.home().joinpath('mistral_models', '22B-Instruct-Small')
mistral_models_path.mkdir(parents=True, exist_ok=True)
snapshot_download(repo_id="mistralai/Mistral-Small-Instruct-2409", allow_patterns=["params.json", "consolidated.safetensors", "tokenizer.model.v3"], local_dir=mistral_models_path)
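Before loading the model, it can help to confirm that the three files named in `allow_patterns` actually arrived. The sketch below is one way to do that. The helper `missing_model_files` is hypothetical and not part of `huggingface_hub` or `mistral_inference`, and the demo runs against a temporary directory rather than a real download.

```python
import tempfile
from pathlib import Path

# File names taken from the allow_patterns list in the download step.
EXPECTED_FILES = ("params.json", "consolidated.safetensors", "tokenizer.model.v3")


def missing_model_files(model_dir: Path, expected=EXPECTED_FILES) -> list[str]:
    """Return the expected file names that are not present in model_dir."""
    return [name for name in expected if not (model_dir / name).is_file()]


# Demo on a throwaway directory that contains only one of the three files.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "params.json").write_text("{}")
    missing = missing_model_files(demo_dir)

print(missing)  # ['consolidated.safetensors', 'tokenizer.model.v3']
```

In practice you would call `missing_model_files(mistral_models_path)` after the download and expect an empty list.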
Chat
After installing mistral_inference, a mistral-chat CLI command should be available in your environment. You can chat with the model using:
mistral-chat $HOME/mistral_models/22B-Instruct-Small --instruct --max_tokens 256
Instruction following
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tokenizer.model.v3")
model = Transformer.from_folder(mistral_models_path)
completion_request = ChatCompletionRequest(messages=[UserMessage(content="How often does the letter r occur in Mistral?")])
tokens = tokenizer.encode_chat_completion(completion_request).tokens
out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])
print(result)
Function calling
from mistral_common.protocol.instruct.tool_calls import Function, Tool
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tokenizer.model.v3")
model = Transformer.from_folder(mistral_models_path)
completion_request = ChatCompletionRequest(
    tools=[
        Tool(
            function=Function(
                name="get_current_weather",
                description="Get the current weather",
                parameters={
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "format": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The temperature unit to use. Infer this from the user's location.",
                        },
                    },
                    "required": ["location", "format"],
                },
            )
        )
    ],
    messages=[
        UserMessage(content="What's the weather like today in Paris?"),
    ],
)
tokens = tokenizer.encode_chat_completion(completion_request).tokens
out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])
print(result)
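Once the model has produced a tool call, the calling code still has to parse it and run the matching function. The sketch below shows one way to do that. It assumes the decoded `result` is a JSON list of objects with `name` and `arguments` keys; the source does not show the actual output, so treat that format as an assumption. `get_current_weather` here is a local stub, and `TOOL_REGISTRY`, `REQUIRED_ARGS`, and `dispatch_tool_calls` are hypothetical names.

```python
import json


def get_current_weather(location: str, format: str) -> str:
    """Stub standing in for a real weather lookup."""
    return f"(stub) weather for {location} in {format}"


# Maps tool names to local callables and to the schema's required arguments.
TOOL_REGISTRY = {"get_current_weather": get_current_weather}
REQUIRED_ARGS = {"get_current_weather": ("location", "format")}


def dispatch_tool_calls(raw: str) -> list[str]:
    """Parse an assumed JSON list of tool calls and run each known tool."""
    results = []
    for call in json.loads(raw):
        name, args = call["name"], call["arguments"]
        if name not in TOOL_REGISTRY:
            raise ValueError(f"unknown tool: {name}")
        absent = [key for key in REQUIRED_ARGS[name] if key not in args]
        if absent:
            raise ValueError(f"missing arguments for {name}: {absent}")
        results.append(TOOL_REGISTRY[name](**args))
    return results


# Hypothetical model output, written by hand for illustration.
example_output = (
    '[{"name": "get_current_weather", '
    '"arguments": {"location": "Paris, France", "format": "celsius"}}]'
)
print(dispatch_tool_calls(example_output))
```

Checking the required arguments before dispatch mirrors the `required` list in the tool schema, so a malformed call fails with a clear message instead of a `TypeError` inside the tool.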
Usage in Hugging Face Transformers
You can also use the Hugging Face transformers library to run inference with various chat templates, or to fine-tune the model.
Example for inference:
from transformers import LlamaTokenizerFast, MistralForCausalLM
import torch
device = "cuda"
tokenizer = LlamaTokenizerFast.from_pretrained('mistralai/Mistral-Small-Instruct-2409')
tokenizer.pad_token = tokenizer.eos_token
model = MistralForCausalLM.from_pretrained('mistralai/Mistral-Small-Instruct-2409', torch_dtype=torch.bfloat16)
model = model.to(device)
prompt = "How often does the letter r occur in Mistral?"
messages = [
{"role": "user", "content": prompt},
]
model_input = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(device)
gen = model.generate(model_input, max_new_tokens=150)
dec = tokenizer.batch_decode(gen)
print(dec)
You should obtain output similar to:
<s>
[INST]
How often does the letter r occur in Mistral?
[/INST]
To determine how often the letter "r" occurs in the word "Mistral,"
we can simply count the instances of "r" in the word.
The word "Mistral" is broken down as follows:
- M
- i
- s
- t
- r
- a
- l
Counting the "r"s, we find that there is only one "r" in "Mistral."
Therefore, the letter "r" occurs once in the word "Mistral."
</s>