Does the Llama-3.2 Vision model support multiple images?

#43
by JOJOHuang - opened

Does this model support multiple images? If so, is this the right way to pass them?

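import requests
from PIL import Image

# Image.open() takes a local path or file-like object, so fetch each URL first.
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)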

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text", "text": "please describe these two images"}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor([image1, image2], input_text, return_tensors="pt").to(model.device)

Meta Llama org

Thanks for the question! We recommend using one image per inference; the model doesn't work reliably with multiple images.

Ok~ Thanks for your reply!

Sanyam changed discussion status to closed

Hey Sanyam,

Thanks for the response.

Any idea why this is happening?

Is it a limitation of the model size or the lack of training?

What I understood from the documentation was that the model was trained with videos, so I was curious why it is not performant on multiple images.

I get a CUDA out-of-memory error when I use multiple images.

I have the same question. Can this model run inference on video files? For example, by using cv2 to generate a set of frames?
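
On the cv2 part of this question: frame extraction itself is independent of the model. Below is a minimal sketch, assuming OpenCV (`cv2`) and Pillow are installed. `evenly_spaced_indices` and `sample_frames` are hypothetical helper names, not part of any library. Whether the model handles the resulting frames well is a separate matter, since the reply earlier in this thread recommends one image per inference.

```python
def evenly_spaced_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick up to num_samples frame indices spread evenly across the video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    # Take the middle of each of the num_samples equal-width segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


def sample_frames(video_path: str, num_samples: int = 8):
    """Return sampled frames as PIL images (hypothetical helper)."""
    import cv2  # imported lazily so the index helper works without OpenCV
    from PIL import Image

    capture = cv2.VideoCapture(video_path)
    total = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for index in evenly_spaced_indices(total, num_samples):
        capture.set(cv2.CAP_PROP_POS_FRAMES, index)
        ok, frame = capture.read()
        if not ok:
            continue  # skip frames that fail to decode
        # OpenCV decodes to BGR; PIL expects RGB.
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    capture.release()
    return frames
```

Sampling a small, fixed number of frames also keeps memory use bounded, which may matter given the out-of-memory report above.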

I have the same question. I am trying to run inference on video files by extracting frames and transcripts, so that the video can be understood as a whole. That requires accumulating understanding across frames rather than single-frame inference, and it seems Llama 3.2 Vision is unable to do this.
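
One possible workaround that stays within the one-image-per-inference recommendation from earlier in this thread is to caption each frame separately, then combine the captions and the transcript in a final text-only prompt. The sketch below is only an illustration: `describe_frame` is a hypothetical callable you would supply yourself (for example, a wrapper around a single-image call to the model), and the prompt wording is an assumption, not something from the model documentation. Note that this only accumulates text, so visual detail that is not captured in a caption is lost.

```python
from typing import Any, Callable, Sequence


def build_summary_prompt(captions: Sequence[str], transcript: str = "") -> str:
    """Combine per-frame captions (in order) and an optional transcript into one prompt."""
    lines = ["These are descriptions of consecutive frames from one video:"]
    lines += [f"Frame {i}: {caption}" for i, caption in enumerate(captions, start=1)]
    if transcript:
        lines += ["", "Transcript:", transcript]
    lines += ["", "Describe what happens in the video as a whole."]
    return "\n".join(lines)


def summarize_video(frames: Sequence[Any],
                    describe_frame: Callable[[Any], str],
                    transcript: str = "") -> str:
    """Caption each frame with one single-image call, then build the text-only prompt."""
    # describe_frame is hypothetical: it should send exactly one image to the model.
    captions = [describe_frame(frame) for frame in frames]
    return build_summary_prompt(captions, transcript)
```

The returned string would then be sent as an ordinary text message, with no images attached.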

Same here. I would also like to have multi-image support within one conversation. What is the ETA on this? Will it be supported in the future?
