Cannot run code snippet with an actual image
#1 opened by kkatodus
Hi, thank you for releasing the model! Very excited to use it. I have a slightly modified version of the code snippet from the model card, where I just feed in a (320, 256) image along with a text prompt. However, when I run it, I get an error about the channel counts not matching (see below). Could you point me to an example image that can be fed into the model to get some inference out?
```
RuntimeError: Given groups=1, weight of size [1152, 3, 14, 14], expected input[1, 6, 224, 224] to have 3 channels, but got 6 channels instead
```
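One thing I noticed while poking at this: my snippet loads the processor from `openvla/openvla-7b` but the weights from `openvla/openvla-v01-7b`. If I understand correctly, the former's fused DINOv2 + SigLIP backbone stacks two 3-channel crops into a 6-channel tensor, while the v01 model expects plain 3-channel SigLIP input, which would line up with the shapes in the error. A minimal check of the processor output on its own (no model needed) seems to confirm the 6 channels:

```python
from transformers import AutoProcessor
from PIL import Image

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
image = Image.new("RGB", (320, 256))  # blank stand-in for my (320, 256) image

inputs = processor("What action should the robot take?", image)
# Prints torch.Size([1, 6, 224, 224]) for me -- six channels, matching the error above
print(inputs["pixel_values"].shape)
```

Here is the full snippet I'm running: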
```python
# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, ...)
# > pip install -r https://raw.githubusercontent.com/openvla/openvla/main/requirements-min.txt
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

INSTRUCTION = 'pick up the red block and place it on the green block'

# Load Processor & VLA
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-v01-7b",
    attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

# Grab image input & format prompt (note inclusion of system prompt due to Vicuña base model)
image_path = "./your_file.jpeg"
image = Image.open(image_path).convert("RGB")  # ensure a 3-channel RGB image

system_prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)
prompt = f"{system_prompt} USER: What action should the robot take to {INSTRUCTION}? ASSISTANT:"

# Predict Action (7-DoF; un-normalize for BridgeV2)
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)
```
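For what it's worth, I don't think the image content matters here; presumably any RGB image of the same size reproduces it. This is the kind of synthetic file I've been testing with (writing to the hypothetical `./your_file.jpeg` path from the snippet above):

```python
# A minimal sketch to generate a (320, 256) test image for the snippet above
import numpy as np
from PIL import Image

array = np.random.randint(0, 256, size=(256, 320, 3), dtype=np.uint8)  # H x W x C
Image.fromarray(array).save("./your_file.jpeg")
```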