Help: How to run inference with this model in FP8
I rewrote the code from the Instruct following section of the Mistral Inference usage instructions as follows and ran it in Google Colab.
```python
import torch  # Added
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

# mistral_models_path is assumed to be set by the preceding download step
tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json")
model = Transformer.from_folder(mistral_models_path, dtype=torch.float8_e4m3fn)  # Changed

prompt = "How expensive would it be to ask a window cleaner to clean all windows in Paris. Make a reasonable guess in US Dollar."

completion_request = ChatCompletionRequest(messages=[UserMessage(content=prompt)])
tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.35, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.decode(out_tokens[0])
print(result)
```
When the model is loaded, VRAM usage is 11.6 GB, but the following error occurs in generate():
```
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2262         # remove once script supports set_grad_enabled
   2263         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2264     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2265
   2266

RuntimeError: "index_select_cuda" not implemented for 'Float8_e4m3fn'
```
- mistral_inference==1.3.1
- torch==2.3.1+cu121
- safetensors==0.4.3
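For what it's worth, the failure seems reproducible outside mistral_inference entirely: the embedding lookup (an index_select under the hood) has no CUDA kernel for float8_e4m3fn in this torch version, so it fails before any FP8 matmul is even attempted. A minimal sketch, assuming a CUDA runtime like the Colab one above:

```python
import torch

# Minimal reproduction (sketch): casting weights to FP8 works,
# but the embedding lookup itself has no FP8 CUDA kernel.
weight = torch.randn(32, 8, device="cuda").to(torch.float8_e4m3fn)
ids = torch.tensor([1, 2, 3], device="cuda")

# Raises: RuntimeError: "index_select_cuda" not implemented for 'Float8_e4m3fn'
out = torch.nn.functional.embedding(ids, weight)
```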
Am I doing something wrong? If I could run inference in FP8, the model would fit in my 16 GB of VRAM. I would really appreciate it if someone could show me how to run inference in FP8.
I have the same problem. I can load the model in FP8 with transformers, but it does not work with mistral_inference.
Does anyone know how to load the model in FP8 with mistral_inference?
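For reference, FP8 loading via transformers can look roughly like the sketch below. This is only a sketch under assumptions, not a confirmed recipe: it assumes a recent transformers release that ships FbgemmFp8Config, fbgemm-gpu installed, and a GPU with FP8 support (e.g. H100); the repo id is a placeholder for the model this thread is about.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, FbgemmFp8Config

# Placeholder repo id -- replace with the model this thread is about.
model_id = "mistralai/Mistral-Nemo-Instruct-2407"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=FbgemmFp8Config(),  # quantize linear weights to FP8 on load
    torch_dtype=torch.bfloat16,             # non-quantized modules stay in bf16
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```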
What GPU are you using if you can load the quantized model with the transformers lib? An A100 or better?