Any plans on when vLLM will be supported?

#26
by karlyukang - opened

Good work!
Any plans on when vLLM will be supported?

When will GGUF support it?

Not sure if this is the right place to ask, but I seem to be getting very different generated text when running the same Molmo variant through vLLM and through HF. I understand that generations can differ because of sampling parameters like temperature, but, for example, when asked whether something is present in an image, the vLLM model frequently generates false positives, while the HF model consistently generates negatives for the same prompt.

Are there any specific sampling parameters we should use with vLLM to make its generations more consistent with the HF model's?

We aren't officially affiliated with vLLM, so we can't provide direct insight into its internal workings. Differences in generated text can arise from several factors, such as default sampling parameters (e.g., temperature, top_k, top_p) or the precision/quantization the weights are loaded in, which may differ between vLLM and the HF model. It's also possible that vLLM handles certain tasks, like image-related queries, differently in terms of model architecture or optimizations. I'd recommend ensuring that both vLLM and HF use the same sampling parameters (temperature, top_k, and top_p) during inference. It may also help to run a few controlled tests across both platforms to see how they handle specific tasks.
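For concreteness, here's a rough sketch of what pinning the same parameters on both sides could look like (the numeric values below are only illustrative placeholders, not recommended settings):

```python
# Sketch: use one set of sampling values for both backends.
# The numeric values are placeholders; tune them for your use case.
from transformers import GenerationConfig
from vllm import SamplingParams

TEMPERATURE, TOP_K, TOP_P, MAX_TOKENS = 0.7, 50, 0.9, 200

# HF side: pass this to model.generate(...) (or Molmo's generate_from_batch).
hf_config = GenerationConfig(
    do_sample=True,
    temperature=TEMPERATURE,
    top_k=TOP_K,
    top_p=TOP_P,
    max_new_tokens=MAX_TOKENS,
)

# vLLM side: pass this to llm.generate(...).
vllm_params = SamplingParams(
    temperature=TEMPERATURE,
    top_k=TOP_K,
    top_p=TOP_P,
    max_tokens=MAX_TOKENS,
)
```

Note that even with identical parameters, sampled outputs won't match token-for-token across backends; matching parameters only makes the output distributions comparable.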

@Aryanne A GGUF release is under discussion; I'll update you once I have more information. :)

@ksmehrab make sure you're using float32 :)
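e.g., something along these lines (the model id here is an assumption, so substitute the Molmo variant you're actually using; also note that float32 roughly doubles memory use compared to bfloat16):

```python
# Sketch: load the model in float32 on both backends.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from vllm import LLM

MODEL_ID = "allenai/Molmo-7B-D-0924"  # assumed id; use your variant

# HF side: load the weights in float32.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.float32,
)

# vLLM side: request float32 when initializing the engine.
llm = LLM(model=MODEL_ID, trust_remote_code=True, dtype="float32")
```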

@amanrangapur Thanks. I'm already using the same values in the HF GenerationConfig and the vLLM SamplingParams.
@nph4rd Thanks, I do set the model to float32 in the LLM initialization. Is there a way to do the same for my inputs (i.e., prompt and multi_modal_data)?

EDIT: Actually, when I turn off sampling (i.e., perform greedy decoding), I get the same output from both. That is what I was aiming for. I think the differences I observed may have been due to some sampling parameter I had missed. Thanks for the help.
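For anyone hitting the same issue, greedy decoding can be requested like this on both sides (a sketch; the max token count is just a placeholder):

```python
# Sketch: disable sampling so both backends decode deterministically.
from transformers import GenerationConfig
from vllm import SamplingParams

hf_greedy = GenerationConfig(do_sample=False, max_new_tokens=200)
# In vLLM, temperature=0.0 selects greedy decoding.
vllm_greedy = SamplingParams(temperature=0.0, max_tokens=200)
```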
