Batch inputs (image, prompt)

#10
by jeeyungk - opened

Can we use a batch of images as input to LLaVA?

Llava Hugging Face org

Hi! Yes, LLaVA-1.5 can take batched inputs; see the code snippet below:

import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

prompts = [
    "USER: <image>\nWhat are the things I should be cautious about when I visit this place? What should I bring with me? ASSISTANT:",
    "USER: <image>\nWhat is this? ASSISTANT:",
]

image1 = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)
image2 = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

inputs = processor(prompts, images=[image1, image2], padding=True, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(output)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

Llava Hugging Face org

Hi,

You need to place the inputs on the GPU as well, so the processor call in the snippet above becomes:

inputs = processor(prompts, images=[image1, image2], padding=True, return_tensors="pt").to("cuda")
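
For reference, here is a minimal sketch of the full corrected snippet, which also decodes the generated ids back to text with processor.batch_decode. Casting the floating-point inputs to torch.float16 is an extra assumption on my part; depending on your transformers version it may or may not be needed to match the model's dtype.

import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

prompts = [
    "USER: <image>\nWhat are the things I should be cautious about when I visit this place? What should I bring with me? ASSISTANT:",
    "USER: <image>\nWhat is this? ASSISTANT:",
]

image1 = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)
image2 = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# Move the batch to the GPU; the float16 cast only touches floating-point tensors
# (pixel_values), not the integer input_ids.
inputs = processor(prompts, images=[image1, image2], padding=True, return_tensors="pt").to("cuda", torch.float16)

output = model.generate(**inputs, max_new_tokens=20)
# Decode the generated token ids into one readable string per prompt.
print(processor.batch_decode(output, skip_special_tokens=True))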

Why does it only recognize the first picture and not answer about both pictures?

Llava Hugging Face org

@ZIHANGDU18 the model was not trained in a multi-image setting and thus may perform poorly without proper fine-tuning. Try out the new LLaVA-Interleave series, tuned on multi-image data :)

https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19
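
In case it helps, here is a quick sketch of how a multi-image prompt looks with one of those checkpoints. The checkpoint name llava-hf/llava-interleave-qwen-0.5b-hf and the apply_chat_template usage are assumptions based on the model cards in that collection, so please verify against the card of the checkpoint you choose.

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-interleave-qwen-0.5b-hf"  # assumed checkpoint name, see the collection above
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image1 = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)
image2 = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# One user turn that references both images; the processor's chat template inserts
# the matching <image> placeholders into the prompt string.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What is the difference between these two images?"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(output, skip_special_tokens=True)[0])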
