How to parallelize starcoder inference?
#93
by Cubby9059 · opened
Hello,
I am trying to deploy StarCoder as an internal coding assistant for a team of 100 people. However, the model takes too long to produce predictions, especially when parallel requests are made. I am using an NVIDIA A100 40GB. Any suggestions on how to speed up inference?
Thank you.
You can try deploying the model with the Text Generation Inference (TGI) library, which we use for the Inference Endpoints. It batches in-flight requests together on the GPU, so it handles many parallel users much better than calling `model.generate` per request.
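For reference, here is a minimal sketch of what that could look like. The Docker launch command, port, and flag values are assumptions based on a typical TGI setup rather than a verified configuration for this exact deployment; check the text-generation-inference README for the current options.

```python
# Minimal sketch (assumptions noted): serve StarCoder with Text Generation
# Inference (TGI) and query it from Python with concurrent requests.
#
# 1) Launch the TGI server (run in a shell); exact flags/image tag are
#    assumptions for illustration:
#    docker run --gpus all --shm-size 1g -p 8080:80 \
#        -e HUGGING_FACE_HUB_TOKEN=<your_token> \
#        ghcr.io/huggingface/text-generation-inference:latest \
#        --model-id bigcode/starcoder
#
# 2) Send concurrent requests; TGI batches in-flight requests server-side.
from concurrent.futures import ThreadPoolExecutor

from text_generation import Client  # pip install text-generation

client = Client("http://127.0.0.1:8080", timeout=60)

prompts = [
    "def fibonacci(n):",
    "class LRUCache:",
    "def parse_csv(path):",
]

def complete(prompt: str) -> str:
    # Each call is a separate HTTP request; the server batches
    # concurrent requests together on the GPU.
    return client.generate(prompt, max_new_tokens=64).generated_text

with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for prompt, completion in zip(prompts, pool.map(complete, prompts)):
        print(prompt, "->", completion[:80])
```

Note that on a single 40GB A100 the ~15B-parameter model in fp16 leaves limited headroom for the KV cache under load, so if you hit out-of-memory errors you may also want to look at the launcher's quantization options.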