RuntimeError: Could not infer dtype of numpy.float32 when converting to PyTorch tensor
Hello,
Thank you for releasing the Transformers-compatible version of this model. I am trying to run the base inference script provided on the model page, with one change: I've added padding=True to the processor call. I tried with and without this argument, but the same error persists. It occurs with both 8b and 8b-chatty.
System -
Linux clp-a100 6.5.0-26-generic #26~22.04.1-Ubuntu
transformers==4.43.1
torch==2.1.1
numpy==2.0.1
device=Nvidia A100 x4
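For anyone reproducing this, a minimal sketch for collecting the same version information programmatically, assuming only the Python standard library. `collect_versions` is a hypothetical helper written for illustration, not part of Transformers or PyTorch.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def collect_versions(packages=("transformers", "torch", "numpy")) -> dict:
    """Map each package name to its installed version string, or None if missing."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            # Package is not installed in the current environment.
            report[name] = None
    return report


if __name__ == "__main__":
    print(f"python=={platform.python_version()}")
    for name, ver in collect_versions().items():
        print(f"{name}=={ver if ver is not None else 'not installed'}")
```

Pasting this output into a report gives the exact installed versions rather than the ones pinned in a requirements file.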
Error -
Traceback (most recent call last):
  File "/project/kkoshti/envs/clembench/lib/python3.10/site-packages/transformers/feature_extraction_utils.py", line 183, in convert_to_tensors
    tensor = as_tensor(value)
  File "/project/kkoshti/envs/clembench/lib/python3.10/site-packages/transformers/feature_extraction_utils.py", line 142, in as_tensor
    return torch.tensor(value)
RuntimeError: Could not infer dtype of numpy.float32

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/project/kkoshti/clembench/backends/multimodal_utils/idefics3_utils.py", line 45, in <module>
    inputs = processor(text=prompt, images=[image1, image2], padding=True, return_tensors="pt")
  File "/project/kkoshti/envs/clembench/lib/python3.10/site-packages/transformers/models/idefics2/processing_idefics2.py", line 230, in __call__
    image_inputs = self.image_processor(images, return_tensors=return_tensors)
  File "/project/kkoshti/envs/clembench/lib/python3.10/site-packages/transformers/image_processing_utils.py", line 41, in __call__
    return self.preprocess(images, **kwargs)
  File "/project/kkoshti/envs/clembench/lib/python3.10/site-packages/transformers/models/idefics2/image_processing_idefics2.py", line 596, in preprocess
    return BatchFeature(data=data, tensor_type=return_tensors)
  File "/project/kkoshti/envs/clembench/lib/python3.10/site-packages/transformers/feature_extraction_utils.py", line 79, in __init__
    self.convert_to_tensors(tensor_type=tensor_type)
  File "/project/kkoshti/envs/clembench/lib/python3.10/site-packages/transformers/feature_extraction_utils.py", line 189, in convert_to_tensors
    raise ValueError(
ValueError: Unable to create tensor, you should probably activate padding with 'padding=True' to have batched tensors with the same length.
CODE -
import requests
import torch
from PIL import Image
from io import BytesIO
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
DEVICE = "cuda:0"
# Note that passing the image URLs (instead of the actual PIL images) to the processor is also possible
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
).to(DEVICE)
# Create inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty."},
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "And how about this image?"},
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], padding=True, return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
# ['User: What do we see in this image? \nAssistant: In this image, we can see the city of New York, and more specifically the Statue of Liberty. \nUser: And how about this image? \nAssistant: In this image we can see buildings, trees, lights, water and sky.']
Hey, we put a disclaimer on the model card:
/!!!!\ WARNING: Idefics2 will NOT work with Transformers version between 4.41.0 and 4.43.3 included. See the issue https://github.com/huggingface/transformers/issues/32271 and the fix https://github.com/huggingface/transformers/pull/32275.
I'm not sure it's related to your bug, but in any case, running an affected version might not help, or might add another silent bug on top of it.
Could you retry with version 4.40 first (or with a Transformers build that includes the fix) to see if that helps?
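A quick way to tell whether an environment falls inside the range from the model-card warning is sketched below. It is a minimal sketch using only the standard library: `release_tuple` and `in_broken_range` are hypothetical helpers, and the parser deliberately ignores suffixes such as `.dev0`, which is a simplification of full PEP 440 version ordering.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Range taken from the model-card warning: 4.41.0 to 4.43.3, both ends included.
BROKEN_MIN = (4, 41, 0)
BROKEN_MAX = (4, 43, 3)


def release_tuple(version_string: str) -> tuple:
    """Parse the leading MAJOR.MINOR[.PATCH] part, e.g. '4.43.1' -> (4, 43, 1).

    Any suffix such as '.dev0' or '.post1' is ignored (simplified ordering).
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def in_broken_range(version_string: str) -> bool:
    """Return True if the version lies inside the range the model card warns about."""
    return BROKEN_MIN <= release_tuple(version_string) <= BROKEN_MAX


if __name__ == "__main__":
    try:
        current = version("transformers")
    except PackageNotFoundError:
        print("transformers is not installed")
    else:
        verdict = "inside" if in_broken_range(current) else "outside"
        print(f"transformers {current} is {verdict} the 4.41.0-4.43.3 range")
```

With the versions reported above, `in_broken_range("4.43.1")` returns True, while the suggested `4.40` falls outside the range.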