Image embeddings differ from the official OpenAI CLIP model
The normalized image embeddings generated by this Hugging Face version of the CLIP model differ from those produced by the official OpenAI implementation.
I downloaded the following image: https://thumbs.dreamstime.com/b/lovely-cat-as-domestic-animal-view-pictures-182393057.jpg
I generated image embeddings using this model with the following code:
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

_model = CLIPModel.from_pretrained('openai/clip-vit-large-patch14')
_processor = CLIPProcessor.from_pretrained('openai/clip-vit-large-patch14')

img = Image.open('lovely-cat-as-domestic-animal-view-pictures-182393057.jpg').convert('RGB')
inputs = _processor(images=img, return_tensors='pt', padding=True)

with torch.no_grad():
    vision_outputs = _model.vision_model(**inputs)
    image_embeds = vision_outputs[1]  # pooled output
    image_embeds = _model.visual_projection(image_embeds)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
print(image_embeds[0, :10])
I get:
tensor([-0.0262, 0.0541, 0.0122, 0.0053, 0.0453, 0.0138, 0.0141, 0.0035,
0.0202, -0.0173])
When I use the official implementation with this code:
import clip
import torch
from PIL import Image

__model, __preprocess = clip.load("ViT-L/14", device='cpu')
__image = __preprocess(Image.open('lovely-cat-as-domestic-animal-view-pictures-182393057.jpg')).unsqueeze(0)

with torch.no_grad():
    __image_features = __model.encode_image(__image)
    __image_features /= __image_features.norm(dim=-1, keepdim=True)
print(__image_features[0, :10])
I get:
tensor([-0.0192, 0.0559, 0.0147, 0.0041, 0.0461, 0.0098, 0.0115, 0.0014,
0.0174, -0.0151])
You can see that the values are similar, but they're slightly off.
If I calculate the cosine similarity / dot product between the two normalized embeddings, I get:

image_embeds @ __image_features.t()
# tensor([[0.9971]])
I get the same result when I load the official OpenAI weights with the open_clip implementation as well.
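For reference, this is roughly how that open_clip check looks (a minimal sketch, assuming the open_clip package is installed; 'ViT-L-14' with pretrained='openai' loads the original OpenAI weights, and the oc_* names are my own):

import open_clip
import torch
from PIL import Image

# Load the original OpenAI ViT-L/14 weights through open_clip
oc_model, _, oc_preprocess = open_clip.create_model_and_transforms('ViT-L-14', pretrained='openai')

oc_image = oc_preprocess(Image.open('lovely-cat-as-domestic-animal-view-pictures-182393057.jpg')).unsqueeze(0)
with torch.no_grad():
    oc_features = oc_model.encode_image(oc_image)
    oc_features /= oc_features.norm(dim=-1, keepdim=True)
print(oc_features[0, :10])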
So, there's some subtle difference here.
I'm running transformers 4.20.0
Actually, I worked it out. The preprocessing differs between the Hugging Face CLIPProcessor and the default clip implementation, so the model was getting a slightly different version of the image.
From what I can tell so far, the center-cropping implementations differ, which changes the pixel values slightly.
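To see the input difference directly, you can compare the pixel tensors the two pipelines produce for the same image (a quick diagnostic sketch, reusing _processor, __preprocess, and img from the snippets above):

# Pixel values from the huggingface CLIPProcessor
hf_pixels = _processor(images=img, return_tensors='pt')['pixel_values']

# Pixel values from the official clip preprocessing
openai_pixels = __preprocess(img).unsqueeze(0)

print(hf_pixels.shape, openai_pixels.shape)     # both should be [1, 3, 224, 224]
print((hf_pixels - openai_pixels).abs().max())  # non-zero, so the two models see slightly different inputs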
TL;DR: if you need exactly the same input for a given image, use the OpenAI input-processing pipeline like this:
import torch
from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize
from PIL import Image

# Same preprocessing as the transform returned by openai/CLIP's clip.load()
image_processor = Compose([
    Resize(size=224, interpolation=Image.BICUBIC),
    CenterCrop(size=(224, 224)),
    lambda img: img.convert('RGB'),
    ToTensor(),
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
])

inputs = dict(pixel_values=image_processor(img).unsqueeze(0))

with torch.no_grad():
    vision_outputs = _model.vision_model(**inputs)
    image_embeds = vision_outputs[1]  # pooled output
    image_embeds = _model.visual_projection(image_embeds)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
print(image_embeds[0, :10])
tensor([-0.0192, 0.0559, 0.0147, 0.0041, 0.0461, 0.0098, 0.0115, 0.0014,
0.0174, -0.0151])
I found that the text embedding differs quite a lot. Does this make sense?
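One way to quantify that is to compare the normalized text embeddings from the two implementations directly (a sketch reusing _model, _processor, and __model from the snippets above; the prompt string is just an example):

import clip
import torch

text = "a photo of a cat"

# Huggingface text embedding
hf_text_inputs = _processor(text=[text], return_tensors='pt', padding=True)
with torch.no_grad():
    hf_text_embeds = _model.get_text_features(**hf_text_inputs)
hf_text_embeds = hf_text_embeds / hf_text_embeds.norm(dim=-1, keepdim=True)

# Official clip text embedding
with torch.no_grad():
    oa_text_embeds = __model.encode_text(clip.tokenize([text]))
oa_text_embeds = oa_text_embeds / oa_text_embeds.norm(dim=-1, keepdim=True)

# Cosine similarity between the two text embeddings
print((hf_text_embeds @ oa_text_embeds.t()).item())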
@eugeneware
I am not able to get consistent results between the HF interface and my local model. I did what you have done, but I'm getting different scores.
This is my code
Please let me know what is different between this and the HF preprocessing.
from PIL import Image
import requests
import torch
from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize
from transformers import CLIPProcessor, CLIPModel

# Load the CLIP model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Load the image
image = Image.open('medicine_mistake_download/3edf2e67-2e49-4ef0-a5d6-fbe224c35bf9.jpg')

# Define the image preprocessing pipeline
image_processor = Compose([
    Resize(size=224, interpolation=Image.BICUBIC),
    CenterCrop(size=(224, 224)),
    lambda img: img.convert('RGB'),
    ToTensor(),
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
])

# Preprocess the image
processed_image = image_processor(image).unsqueeze(0)

# Get image embeddings
with torch.no_grad():
    vision_outputs = model.vision_model(pixel_values=processed_image)
    image_embeds = vision_outputs.last_hidden_state
    image_embeds = model.visual_projection(image_embeds)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)

# Print the first 10 dimensions of the first image embedding
print("First 10 dimensions of the image embedding:", image_embeds[0, :10])

# Process the text inputs
texts = ["other", "prescription document", 'medicine image']
text_inputs = processor(text=texts, return_tensors="pt", padding=True)

# Combine text and image inputs
inputs = {
    "input_ids": text_inputs["input_ids"],
    "attention_mask": text_inputs["attention_mask"],
    "pixel_values": processed_image
}

# Get the model outputs
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # Image-text similarity scores
probs = logits_per_image.softmax(dim=1)      # Probabilities

# Print the results
print("Logits per image:", logits_per_image)
print("Probabilities:", probs)