Does the Llama-3.2 Vision model support multiple images?

#43

by JOJOHuang - opened Sep 29

Sep 29

Does this model support multiple images? If so, would it be used like this?

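import requests
from PIL import Image

# processor and model are assumed to be loaded already.
# Image.open does not fetch URLs itself, so download each image first.
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text", "text": "please describe these two images"}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor([image1, image2], input_text, return_tensors="pt").to(model.device)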

Sanyam

Meta Llama org Sep 29

Thanks for the question! We recommend using one image per inference request. The model does not work reliably with multiple images.
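One way to follow the single-image recommendation when you have several images is to send one request per image and combine the text replies afterwards. The sketch below is only an illustration: `single_image_messages`, `describe_each`, and the `describe_image` callable are hypothetical names, and `describe_image` is assumed to wrap whatever processor and generate calls you already use for a single image.

```python
from typing import Callable, Dict, List, Sequence


def single_image_messages(prompt: str) -> List[Dict]:
    """Build a chat message list containing exactly one image placeholder."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def describe_each(images: Sequence, prompt: str, describe_image: Callable) -> List[str]:
    """Run one single-image request per image and collect the replies in order.

    describe_image(image, messages) is a hypothetical wrapper around your own
    processor + model.generate code; it is not part of any library.
    """
    return [describe_image(image, single_image_messages(prompt)) for image in images]
```

The per-image replies can then be compared or merged in plain text, which avoids putting more than one image into a single prompt.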

JOJOHuang

Sep 30

OK, thanks for your reply!

Sanyam changed discussion status to closed Oct 3

h3045

Oct 16

Hey Sanyam,

Thanks for the response.

Any idea why this is happening?

Is it a limitation of the model size or the lack of training?

What I understood from the documentation was that the model was trained with videos, so I was curious why it is not performant on multiple images.

danmir

Oct 22

I get a CUDA out-of-memory error when I use multiple images.

sraliu

Oct 29

I have the same question. Can this model run inference on video files, for example by using cv2 to extract a set of frames?
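For the frame-extraction part of this question, a minimal sketch with cv2 could look like the following. It only samples evenly spaced frames and converts them to PIL images; it does not imply that the model accepts video, and given the single-image recommendation earlier in this thread, each frame would still be sent to the model on its own. The function names and the default of 8 samples are assumptions made for illustration.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick up to num_samples evenly spaced frame indices (bucket midpoints)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(i * step + step / 2) for i in range(num_samples)]


def extract_frames(video_path: str, num_samples: int = 8) -> list:
    """Read the sampled frames from a video file and return them as PIL images."""
    # Imported here so the index helper above works without OpenCV installed.
    import cv2
    from PIL import Image

    cap = cv2.VideoCapture(video_path)
    try:
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for idx in sample_frame_indices(total, num_samples):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the sampled frame
            ok, frame = cap.read()
            if ok:
                # OpenCV returns BGR arrays; PIL expects RGB.
                frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        return frames
    finally:
        cap.release()
```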

globalinnovationhub

6 days ago

I have the same question. I am trying to run inference on video files by extracting frames and transcripts so that the video can be understood as a whole. However, that requires accumulating understanding across frames rather than single-frame inference, and Llama 3.2 Vision seems unable to do this.
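A possible workaround for the accumulation step is to caption each frame separately and then fold the captions and the transcript into one text-only prompt for a final summary. The sketch below covers only the prompt assembly. `build_video_summary_prompt` is a hypothetical helper, the prompt wording is an assumption, and whether the resulting summaries are good enough is untested.

```python
from typing import Sequence, Tuple


def build_video_summary_prompt(
    frame_captions: Sequence[Tuple[float, str]],
    transcript: str = "",
) -> str:
    """Combine (timestamp_seconds, caption) pairs and a transcript into one prompt."""
    lines = [
        "The following are descriptions of frames sampled from one video, in time order."
    ]
    # Sort by timestamp so the model sees the frames chronologically.
    for seconds, caption in sorted(frame_captions, key=lambda item: item[0]):
        lines.append(f"[{seconds:.1f}s] {caption.strip()}")
    if transcript.strip():
        lines.append("Transcript:")
        lines.append(transcript.strip())
    lines.append("Summarize what happens in the video as a whole.")
    return "\n".join(lines)
```

The returned string can be sent as an ordinary text message, so no request ever contains more than one image.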

ascension-hf

3 days ago

Same here. I would also like multi-image support within a single conversation. What is the ETA on this? Will it be supported in the future?
