Inference Speed of the INT8 Quantized Model Is Slower Than the Non-Quantized Version

#9
by fliu1998 - opened

Hi everyone,

I downloaded the meta-llama/Llama-Guard-3-8B-INT8 model and ran it on my A100 40GB GPU. Each inference, where the input is a conversation composed of a prompt and a response, takes around 4 seconds. However, with the meta-llama/Llama-Guard-3-8B model (the non-quantized version), the same input on the same hardware takes around 0.7 seconds.

My environment uses transformers 4.43.1 and torch 2.3.0 with CUDA 12.4, which meets the model's requirements.

Does anyone know why the INT8 quantized model is slower than the non-quantized version?
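For context, here is a minimal sketch of how such a latency comparison could be reproduced with the standard transformers API. The conversation content and generation settings below are illustrative, not the exact inputs from the run above, and I'm assuming the INT8 checkpoint's quantization config is picked up automatically by from_pretrained:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Swap model_id between the two checkpoints to compare latency.
model_id = "meta-llama/Llama-Guard-3-8B-INT8"  # or "meta-llama/Llama-Guard-3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
)

# Illustrative prompt/response pair; real inputs will differ.
conversation = [
    {"role": "user", "content": "How do I reset my router password?"},
    {"role": "assistant", "content": "Hold the reset button for ten seconds, then log in with the default credentials."},
]
input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt").to(model.device)

# Warm-up call so one-time CUDA overhead doesn't skew the timing.
model.generate(input_ids=input_ids, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)

torch.cuda.synchronize()
start = time.perf_counter()
output = model.generate(input_ids=input_ids, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
torch.cuda.synchronize()

print(f"latency: {time.perf_counter() - start:.2f} s")
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```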


No, I don't know, but I'm experiencing the same thing. Also, all of my responses come back as nothing but a long string of "!!!!!!!!" characters.
