---
language: ja
license: apache-2.0
tags:
- clip
- japanese-clip
pipeline_tag: feature-extraction
---
# clip-japanese-base
This is a Japanese CLIP (Contrastive Language-Image Pre-training) model developed by LY Corporation. The model was trained on ~1B web-collected image-text pairs and is applicable to various vision tasks, including zero-shot image classification and text-to-image / image-to-text retrieval.
## How to use

- Install packages

```bash
pip install pillow requests sentencepiece transformers torch timm
```
- Run

```python
import io

import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

HF_MODEL_PATH = 'line-corporation/clip-japanese-base'

tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Example image and candidate labels: "犬" (dog), "猫" (cat), "象" (elephant).
image = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = processor(image, return_tensors="pt").to(device)
text = tokenizer(["犬", "猫", "象"]).to(device)

# Encode both modalities and compare them with a temperature-scaled softmax.
with torch.no_grad():
    image_features = model.get_image_features(**image)
    text_features = model.get_text_features(**text)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# [[1., 0., 0.]]
```
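The same encoders can be reused for retrieval. Below is a minimal text-to-image retrieval sketch that continues from the snippet above; the `image_paths` list is a hypothetical placeholder, and the sketch assumes the processor accepts a list of PIL images (as standard `transformers` image processors do) and that the features are L2-normalized for cosine-similarity ranking.

```python
import torch.nn.functional as F

# Hypothetical local images to search over; replace with your own files.
image_paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]

candidates = processor([Image.open(p) for p in image_paths], return_tensors="pt").to(device)
query = tokenizer(["草原を走る犬"]).to(device)  # "a dog running in a meadow"

with torch.no_grad():
    # L2-normalize so the dot product below is a cosine similarity.
    cand_features = F.normalize(model.get_image_features(**candidates), dim=-1)
    query_features = F.normalize(model.get_text_features(**query), dim=-1)

# Rank every candidate image against the text query, best match first.
scores = (query_features @ cand_features.T).squeeze(0)
for rank, idx in enumerate(scores.argsort(descending=True).tolist(), start=1):
    print(f"{rank}. {image_paths[idx]} (score={scores[idx]:.3f})")
```

Image-to-text retrieval is symmetric: encode one image as the query and a list of captions as the candidates, then rank with the same cosine similarity.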
## Model architecture

The model uses an Eva02-B Transformer architecture as the image encoder and a 12-layer BERT as the text encoder. The text encoder was initialized from rinna/japanese-clip-vit-b-16.
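If you want to confirm the encoder sizes reported in the table below, the loaded model can be inspected directly. This is a minimal sketch that assumes only the standard PyTorch `nn.Module` interface; the submodule names it prints are whatever the remote-code implementation defines, not names guaranteed by this card.

```python
# Print each top-level submodule of the loaded model with its parameter count.
for name, module in model.named_children():
    n_params = sum(p.numel() for p in module.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")

print(f"total: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```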
## Evaluation

### Dataset

- STAIR Captions (v2014 val set of MSCOCO) for image-to-text (i2t) and text-to-image (t2i) retrieval. We measure performance using R@1, averaged over i2t and t2i retrieval.
- Recruit Datasets for image classification.
- ImageNet-1K for image classification. We translated all classnames into Japanese; the classnames and prompt templates can be found in ja-imagenet-1k-classnames.txt and ja-imagenet-1k-templates.txt, and a zero-shot classification sketch using them follows this list.
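The ImageNet-1K evaluation follows the usual CLIP zero-shot protocol: embed every class name under several Japanese prompt templates, average the text features per class, and pick the class whose averaged embedding is most similar to the image embedding. The sketch below reuses `model`, `tokenizer`, `device`, and the preprocessed `image` from the "How to use" snippet; it assumes, without guarantee, that both text files contain one entry per line and that each template inserts the class name via a `{}` placeholder.

```python
import torch
import torch.nn.functional as F

# Assumed file format: one classname / template per line, "{}" as placeholder.
with open("ja-imagenet-1k-classnames.txt", encoding="utf-8") as f:
    classnames = [line.strip() for line in f if line.strip()]
with open("ja-imagenet-1k-templates.txt", encoding="utf-8") as f:
    templates = [line.strip() for line in f if line.strip()]

# Build one text embedding per class by averaging over all prompt templates.
class_embeddings = []
with torch.no_grad():
    for classname in classnames:
        prompts = tokenizer([t.format(classname) for t in templates]).to(device)
        feats = F.normalize(model.get_text_features(**prompts), dim=-1)
        class_embeddings.append(F.normalize(feats.mean(dim=0), dim=0))
classifier = torch.stack(class_embeddings, dim=1)  # shape: (embed_dim, num_classes)

# Classify a single preprocessed image against all classes.
with torch.no_grad():
    image_features = F.normalize(model.get_image_features(**image), dim=-1)
probs = (100.0 * image_features @ classifier).softmax(dim=-1)
print("Predicted class:", classnames[probs.argmax(dim=-1).item()])
```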
### Result

| Model | Image Encoder Params | Text Encoder Params | STAIR Captions (R@1) | Recruit Datasets (acc@1) | ImageNet-1K (acc@1) |
| --- | --- | --- | --- | --- | --- |
| Ours | 86M (Eva02-B) | 100M (BERT) | 0.30 | 0.89 | 0.58 |
| Stable-ja-clip | 307M (ViT-L) | 100M (BERT) | 0.24 | 0.77 | 0.68 |
| Rinna-ja-clip | 86M (ViT-B) | 100M (BERT) | 0.13 | 0.54 | 0.56 |
| Laion-clip | 632M (ViT-H) | 561M (XLM-RoBERTa) | 0.30 | 0.83 | 0.58 |
| Hakuhodo-ja-clip | 632M (ViT-H) | 100M (BERT) | 0.21 | 0.82 | 0.46 |
## Licenses

The Apache License, Version 2.0
## Citation

```bibtex
@misc{clip-japanese-base,
    title = {CLIP Japanese Base},
    author = {Shuhei Yokoo, Shuntaro Okada, Peifei Zhu, Shuhei Nishimura and Naoki Takayama},
    url = {https://huggingface.co/line-corporation/clip-japanese-base},
}
```