Cannot run code snippet with an actual image
#1 opened by kkatodus
Hi, thank you for releasing the model! Very excited to use it. I have a slightly modified version of the code snippet from the model card, where I just feed in a (320, 256) image along with a text prompt. However, when I run it, I get an error about the channel counts not matching (see below). Could you point me to an example image that can be fed into the model to get some inference out?
```
RuntimeError: Given groups=1, weight of size [1152, 3, 14, 14], expected input[1, 6, 224, 224] to have 3 channels, but got 6 channels instead
```
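One thing I noticed while poking at this: my snippet loads the processor from `openvla/openvla-7b` but the weights from `openvla/openvla-v01-7b`. If I understand correctly, the former's fused DINOv2 + SigLIP backbone stacks two 3-channel crops into a 6-channel tensor, while the v01 model expects plain 3-channel SigLIP input, which would line up with the shapes in the error. A minimal check of the processor output on its own (no model needed) seems to confirm the 6 channels:

```python
from transformers import AutoProcessor
from PIL import Image

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
image = Image.new("RGB", (320, 256))  # blank stand-in for my (320, 256) image

inputs = processor("What action should the robot take?", image)
# Prints torch.Size([1, 6, 224, 224]) for me -- six channels, matching the error above
print(inputs["pixel_values"].shape)
```

Here is the full snippet I'm running: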
```python
# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, ...)
# > pip install -r https://raw.githubusercontent.com/openvla/openvla/main/requirements-min.txt
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

INSTRUCTION = 'pick up the red block and place it on the green block'

# Load Processor & VLA
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-v01-7b",
    attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

# Grab image input & format prompt (note inclusion of system prompt due to Vicuña base model)
image_path = "./your_file.jpeg"
image = Image.open(image_path).convert("RGB")  # ensure a 3-channel RGB image

system_prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)
prompt = f"{system_prompt} USER: What action should the robot take to {INSTRUCTION}? ASSISTANT:"

# Predict Action (7-DoF; un-normalize for BridgeV2)
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)
```
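For what it's worth, I don't think the image content matters here; presumably any RGB image of the same size reproduces it. This is the kind of synthetic file I've been testing with (writing to the hypothetical `./your_file.jpeg` path from the snippet above):

```python
# A minimal sketch to generate a (320, 256) test image for the snippet above
import numpy as np
from PIL import Image

array = np.random.randint(0, 256, size=(256, 320, 3), dtype=np.uint8)  # H x W x C
Image.fromarray(array).save("./your_file.jpeg")
```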