Any plans on when vLLM will be supported?

#26
by karlyukang - opened

Good work!
Any plans on when vLLM will be supported?

When will GGUF support it?

Not sure if this is the right place to ask, but I seem to be getting very different generated text when running the same Molmo variant through vLLM and through HF. I understand that generations can differ because of sampling parameters like temperature, but, for example, when asked whether something is present in an image, the vLLM model frequently generates false positives, while the HF model consistently generates negatives for the same prompt.

Are there any specific sampling parameters we should use with vLLM to make its generations more consistent with the HF model's?

We aren't officially affiliated with vLLM, so we can't provide direct insight into its internal workings. Differences in generated text can arise from several factors, such as default sampling parameters (e.g., temperature, top_k, top_p) or the precision/quantization the weights are loaded in, which may differ between vLLM and the HF model. It's also possible that vLLM handles certain tasks, like image-related queries, differently in terms of model architecture or optimizations. I'd recommend ensuring that both vLLM and HF use the same sampling parameters (temperature, top_k, and top_p) during inference. It may also help to run a few controlled tests across both platforms to see how they handle specific tasks.
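For concreteness, here's a rough sketch of what pinning the same parameters on both sides could look like (the numeric values below are only illustrative placeholders, not recommended settings):

```python
# Sketch: use one set of sampling values for both backends.
# The numeric values are placeholders; tune them for your use case.
from transformers import GenerationConfig
from vllm import SamplingParams

TEMPERATURE, TOP_K, TOP_P, MAX_TOKENS = 0.7, 50, 0.9, 200

# HF side: pass this to model.generate(...) (or Molmo's generate_from_batch).
hf_config = GenerationConfig(
    do_sample=True,
    temperature=TEMPERATURE,
    top_k=TOP_K,
    top_p=TOP_P,
    max_new_tokens=MAX_TOKENS,
)

# vLLM side: pass this to llm.generate(...).
vllm_params = SamplingParams(
    temperature=TEMPERATURE,
    top_k=TOP_K,
    top_p=TOP_P,
    max_tokens=MAX_TOKENS,
)
```

Note that even with identical parameters, sampled outputs won't match token-for-token across backends; matching parameters only makes the output distributions comparable.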

@Aryanne A GGUF release is under discussion; I'll update you once I have more information. :)

@ksmehrab make sure you're using float32 :)
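e.g., something along these lines (the model id here is an assumption, so substitute the Molmo variant you're actually using; also note that float32 roughly doubles memory use compared to bfloat16):

```python
# Sketch: load the model in float32 on both backends.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from vllm import LLM

MODEL_ID = "allenai/Molmo-7B-D-0924"  # assumed id; use your variant

# HF side: load the weights in float32.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.float32,
)

# vLLM side: request float32 when initializing the engine.
llm = LLM(model=MODEL_ID, trust_remote_code=True, dtype="float32")
```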

@amanrangapur Thanks. I'm already using the same values in the HF GenerationConfig and the vLLM SamplingParams.
@nph4rd Thanks, I do set the model to float32 in the LLM initialization. Is there a way to do the same for my inputs (i.e., prompt and multi_modal_data)?

EDIT: Actually, when I turn off sampling (i.e., perform greedy decoding), I get the same output from both. That is what I was aiming for. I think the differences I observed may have been due to some sampling parameter I had missed. Thanks for the help.
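For anyone hitting the same issue, greedy decoding can be requested like this on both sides (a sketch; the max token count is just a placeholder):

```python
# Sketch: disable sampling so both backends decode deterministically.
from transformers import GenerationConfig
from vllm import SamplingParams

hf_greedy = GenerationConfig(do_sample=False, max_new_tokens=200)
# In vLLM, temperature=0.0 selects greedy decoding.
vllm_greedy = SamplingParams(temperature=0.0, max_tokens=200)
```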
