Model card for ViT-B-16-SigLIP-i18n-256
A SigLIP (Sigmoid loss for Language-Image Pre-training) model trained on WebLI.
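SigLIP replaces CLIP's softmax contrastive objective with a pairwise sigmoid loss over image-text pairs. Below is a minimal PyTorch sketch of that objective for reference only; `image_embeds`, `text_embeds`, `logit_scale`, and `logit_bias` are hypothetical placeholders, and the converted checkpoint in this repo only exposes the vision tower.

```python
import torch
import torch.nn.functional as F

def siglip_sigmoid_loss(image_embeds, text_embeds, logit_scale, logit_bias):
    # Cosine-similarity logits between every image and every text in the batch.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() * logit_scale + logit_bias
    # +1 on the diagonal (matching pairs), -1 off the diagonal (non-matching pairs).
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # Independent binary (sigmoid) loss per pair, averaged over the batch;
    # no batch-wide softmax as in the standard CLIP loss.
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```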
This model has been converted from the OpenCLIP checkpoint timm/ViT-B-16-SigLIP-i18n-256 to a Hugging Face CLIPVisionModel. Example usage:
from transformers import CLIPVisionModel, CLIPImageProcessor
from PIL import Image
import requests

# Load an example image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the converted vision tower and its image processor
image_processor = CLIPImageProcessor.from_pretrained('ikala/ViT-B-16-SigLIP-i18n-256-hf')
vision_tower = CLIPVisionModel.from_pretrained('ikala/ViT-B-16-SigLIP-i18n-256-hf')

# Preprocess the image and run the vision encoder
inputs = image_processor(images=image, return_tensors="pt")
outputs = vision_tower(**inputs)
pooled_output = outputs.pooler_output  # pooled image embedding ([CLS] token), not an image-text similarity score
There is still a slight difference: Hugging Face's CLIPVisionModel uses the [CLS] token embedding as the pooled output, while SigLIP uses a global attention pooler to produce the final latent feature.
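If features closer to SigLIP's original pooled representation are needed, one option is to pool the patch tokens from last_hidden_state directly instead of relying on pooler_output. A minimal sketch, using simple mean pooling as a rough stand-in (the original checkpoint's attention-pooling head is not part of the converted vision tower, and the first token is assumed to be the class embedding added by CLIPVisionModel):

```python
import torch

# Reuses `vision_tower` and `inputs` from the usage example above.
with torch.no_grad():
    outputs = vision_tower(**inputs)

cls_pooled = outputs.pooler_output               # [CLS]-based pooled output from CLIPVisionModel
patch_tokens = outputs.last_hidden_state[:, 1:]  # per-patch features, class token dropped
mean_pooled = patch_tokens.mean(dim=1)           # mean pooling as a rough stand-in for SigLIP's attention pooler
```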