Inference Speed of INT8 Quantized Model Is Slower Than the Non-Quantized Version
Hi everyone,
I downloaded the meta-llama/Llama-Guard-3-8B-INT8 model and ran it on an A100 40GB GPU. Each inference, with the input being a conversation composed of a prompt and a response, takes around 4 seconds. However, when I use the non-quantized meta-llama/Llama-Guard-3-8B model with the same input and hardware, inference takes around 0.7 seconds.
My environment includes transformers 4.43.1 and torch 2.3.0, with CUDA 12.4, which meets the stated requirements.
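For context, a minimal timing sketch along these lines matches the setup described above; the example conversation, `max_new_tokens`, and loading options (`device_map="auto"`, bfloat16 for the full-precision model) are placeholders rather than my exact script:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model IDs from the post; swap between the two to compare timings.
model_id = "meta-llama/Llama-Guard-3-8B-INT8"  # or "meta-llama/Llama-Guard-3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # applies to the full-precision model; the INT8 repo ships its own quantization config
    device_map="auto",
)

# Placeholder single prompt/response conversation, as described above.
conversation = [
    {"role": "user", "content": "How do I reset my router?"},
    {"role": "assistant", "content": "Hold the reset button for 10 seconds."},
]
input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt").to(model.device)

# Warm up once, then time a single generation.
model.generate(input_ids=input_ids, max_new_tokens=32)
torch.cuda.synchronize()
start = time.perf_counter()
output = model.generate(input_ids=input_ids, max_new_tokens=32)
torch.cuda.synchronize()
print(f"Inference time: {time.perf_counter() - start:.2f}s")
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```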
Does anyone know why the INT8 quantized model is slower than the non-quantized version?
No, I don't know, but I'm experiencing the same thing. On top of that, every response comes back as nothing but a long run of exclamation marks (roughly 100 "!" characters) and nothing else.