Inference Speed of INT8 Quantized Model Is Slower Than the Non-Quantized Version
Hi everyone,
I downloaded the meta-llama/Llama-Guard-3-8B-INT8 model and ran it on an A100 40GB GPU. Each inference, with the input being a conversation composed of a prompt and a response, takes around 4 seconds. However, when I use the non-quantized meta-llama/Llama-Guard-3-8B model with the same input and hardware, inference takes around 0.7 seconds.
My environment includes transformers 4.43.1 and torch 2.3.0, with CUDA 12.4, which meets the stated requirements.
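For context, a minimal timing sketch along these lines matches the setup described above; the example conversation, `max_new_tokens`, and loading options (`device_map="auto"`, bfloat16 for the full-precision model) are placeholders rather than my exact script:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model IDs from the post; swap between the two to compare timings.
model_id = "meta-llama/Llama-Guard-3-8B-INT8"  # or "meta-llama/Llama-Guard-3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # applies to the full-precision model; the INT8 repo ships its own quantization config
    device_map="auto",
)

# Placeholder single prompt/response conversation, as described above.
conversation = [
    {"role": "user", "content": "How do I reset my router?"},
    {"role": "assistant", "content": "Hold the reset button for 10 seconds."},
]
input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt").to(model.device)

# Warm up once, then time a single generation.
model.generate(input_ids=input_ids, max_new_tokens=32)
torch.cuda.synchronize()
start = time.perf_counter()
output = model.generate(input_ids=input_ids, max_new_tokens=32)
torch.cuda.synchronize()
print(f"Inference time: {time.perf_counter() - start:.2f}s")
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```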
Does anyone know why the INT8 quantized model is slower than the non-quantized version?
No, I don't know, but I'm experiencing the same thing. On top of that, every response comes back as nothing but a long run of exclamation marks (roughly 100 "!" characters) and nothing else.