Edit model card

Convolutional Vision Transformer (CvT)

CvT-13 model pre-trained on ImageNet-1k at resolution 224x224. It was introduced in the paper CvT: Introducing Convolutions to Vision Transformers by Wu et al. and first released in this repository.

Disclaimer: The team releasing CvT did not write a model card for this model so this model card has been written by the Hugging Face team.

Usage

Here is how to use this model to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes:

from transformers import AutoFeatureExtractor, CvtForImageClassification
from PIL import Image
import requests

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = AutoFeatureExtractor.from_pretrained('microsoft/cvt-13')
model = CvtForImageClassification.from_pretrained('microsoft/cvt-13')

inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
# model predicts one of the 1000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
Downloads last month
98,300
Safetensors
Model size
20M params
Tensor type
F32
·
Inference API
Drag image file here or click to browse from your device

Model tree for microsoft/cvt-13

Finetunes
5 models

Dataset used to train microsoft/cvt-13

Spaces using microsoft/cvt-13 2