inference and generation runtime - how to reduce latency
#38 · opened by wamozart
Hi all,
This is a great model, but I was wondering how I can speed up inference. My app accepts two images plus a text prompt and generates a comparison of them.
Running on an EC2 g4.2xl, inference and response time is about 5-6 seconds. I tried a newer-generation GPU (H family) but saw no improvement, which is kind of weird. I also tried loading the model in 4-bit but ran into some issues. The only improvement I found was with onnxruntime-genai (https://huggingface.co/microsoft/Phi-3-vision-128k-instruct-onnx-cuda), but unfortunately that implementation doesn't support multiple images as input.
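For reference, here is a minimal sketch of what I mean by the two-image + 4-bit setup, assuming the standard transformers + bitsandbytes path for microsoft/Phi-3-vision-128k-instruct (file names, the prompt text, and max_new_tokens are placeholders, not my real app code):

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig

model_id = "microsoft/Phi-3-vision-128k-instruct"

# 4-bit NF4 quantization via bitsandbytes (this is the part that gave me issues)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    _attn_implementation="eager",  # "flash_attention_2" if flash-attn is installed
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Two images plus text, as in the comparison app described above
image1 = Image.open("left.jpg")    # placeholder path
image2 = Image.open("right.jpg")   # placeholder path

messages = [
    {"role": "user",
     "content": "<|image_1|>\n<|image_2|>\nCompare these two images."},
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(prompt, [image1, image2], return_tensors="pt").to(model.device)

with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=256,  # placeholder; lower values directly cut latency
        do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
    )

# Strip the prompt tokens before decoding the reply
response = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(response)
```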
Would appreciate any suggestions.
Thanks