Consistency of results.
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
prompt = "USER:\n Describe what is going on in image? <image>\nASSISTANT:"
image_file = "upload/00000230.jpg"

# Load the model in fp16 and move it to GPU 0
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)
processor = AutoProcessor.from_pretrained(model_id)

raw_image = Image.open(image_file)
inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to(0, torch.float16)

# Greedy decoding: with do_sample=False, the same inputs should give the same output
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
Using the code above, I generated captions for several images and observed inconsistent results for the same image and prompt across multiple requests.
Here's what I did:
1. I made a request with the code above, and the output was quite solid: a sensible description of an old sacred painting. (ASSISTANT: The image features a painting of Jesus on a cross, surrounded by a group of people. The painting is set against a blue background, and the people are depicted in various positions, some standing and others kneeling. There are at least five people in the scene, with one person standing close to the left side of the painting, another person standing near the center, and three more people positioned around the right side of the painting. The scene captures the emotions and reactions of the people as they witness the crucifixion of Jesus.)
2. I made a request with the same prompt but a different image, without reloading the model. The output was quite generic. (ASSISTANT: The image features a close-up of a textured, patterned wall with a mix of green and brown colors. The wall appears to be made of a combination of wood and fabric, creating a unique and visually interesting texture. The close-up view of the wall highlights the intricate patterns and textures, making it a captivating focal point in the scene.)
3. I made a request with the first image again. The output was completely different and not relevant to the image. (ASSISTANT: The image features a large, colorful, and abstract painting of a mountain. The mountain is depicted in a vibrant and artistic manner, with a mix of colors and patterns. The painting is displayed on a wall, and it appears to be a part of a larger artwork or a series of paintings. The mountain's shape and colors create a visually striking and engaging piece of art.)
I find this behavior strange. Even allowing for probabilistic sampling of output tokens, the same inputs should produce similar outputs, and since generation here uses do_sample=False (greedy decoding), identical inputs should produce identical outputs.
Do you have any insights into why this happens and suggestions on how to address it?
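For reference, here is a minimal determinism check (a sketch that reuses the model, processor, prompt, and raw_image defined in the script above); with greedy decoding I would expect it to print True:

# Run generation twice on the exact same processed inputs and compare the decoded text
inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to(0, torch.float16)
out_a = model.generate(**inputs, max_new_tokens=200, do_sample=False)
out_b = model.generate(**inputs, max_new_tokens=200, do_sample=False)
text_a = processor.decode(out_a[0][2:], skip_special_tokens=True)
text_b = processor.decode(out_b[0][2:], skip_special_tokens=True)
print(text_a == text_b)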
Hey! I tried the script you provided and got identical results for the same image across runs:
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
prompt = "USER:\n Describe what is going on in image? <image>\nASSISTANT:"
image_file = "/raid/raushan/LLaVA-NeXT/playground/demo/examples/dog1.jpg"

model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)
processor = AutoProcessor.from_pretrained(model_id)

raw_image = Image.open(image_file)
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image_2 = Image.open(requests.get(url, stream=True).raw)

# Run 1: first image
inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))

# Run 2: different image, same prompt, model not reloaded
inputs = processor(text=prompt, images=image_2, return_tensors="pt").to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))

# Run 3: first image again; output should match run 1
inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
Given that you got nonsensical results on the third run, can you check that the image you're passing is the same file, and verify that the inputs from the first run are identical to the inputs from the third run?
If you still see the issue, please share reproduction code and your Transformers version.
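Something along these lines would do it (a sketch; image_file, prompt, and processor are the ones from your original script):

import hashlib

# Confirm the file on disk is the same between requests
print(hashlib.sha256(open(image_file, "rb").read()).hexdigest())

# Process the same image twice and compare the tensors the model actually sees
inputs_a = processor(text=prompt, images=Image.open(image_file), return_tensors="pt")
inputs_b = processor(text=prompt, images=Image.open(image_file), return_tensors="pt")
print(torch.equal(inputs_a["input_ids"], inputs_b["input_ids"]))        # expected: True
print(torch.equal(inputs_a["pixel_values"], inputs_b["pixel_values"]))  # expected: True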