Model Card for LoTLIP ViT-B/16

Model Details

Model Description

A LoTLIP ViT-B/16 model pre-trained on a 100M-scale dataset.

Direct Use

Zero-shot long text-image retrieval, short text-image retrieval, and image classification, among others.
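As a rough illustration of how zero-shot retrieval and classification work with an image-text embedding model, the sketch below ranks candidate texts by cosine similarity to an image embedding. The embeddings are toy values standing in for encoder outputs, and the helper names (`l2_normalize`, `cosine_similarity`, `rank_candidates`) are hypothetical, not part of the LoTLIP code base.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def rank_candidates(query_emb, candidate_embs):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_emb, c) for c in candidate_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy embeddings (hypothetical values); in practice these would come from
# the model's image and text encoders.
image_emb = [0.9, 0.1, 0.0]
text_embs = [
    [0.0, 1.0, 0.0],  # caption 0
    [1.0, 0.2, 0.0],  # caption 1
    [0.0, 0.0, 1.0],  # caption 2
]

ranking = rank_candidates(image_emb, text_embs)
print(ranking)
```

For zero-shot classification the same ranking applies, with one text embedding per class prompt and the top-ranked index taken as the predicted class.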

How to Get Started with the Model

Use the code at https://github.com/wuw2019/LoTLIP to get started with the model.

Training Details

Training Data

The model is trained on a 100M-scale dataset that contains long text-image pairs.

Evaluation

Please refer to https://github.com/wuw2019/LoTLIP.

Testing Details

Testing Data

Testing is performed on DCI, IIW, and ShareGPT4V for long text-image retrieval and on ImageNet1k for classification.

Results

| Model | Pre-training Data Scale | DCI I2T | DCI T2I | IIW I2T | IIW T2I | SV-10k I2T | SV-10k T2I |
|---|---|---|---|---|---|---|---|
| LoTLIP-ViT-B-16 | 100M | 64.11 | 62.63 | 94.28 | 92.65 | 88.40 | 82.72 |
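I2T and T2I denote image-to-text and text-to-image retrieval. This card does not state the exact metric behind the scores, so the sketch below only assumes a recall@k-style measure, a common choice for retrieval benchmarks. It computes both directions from an image-by-text similarity matrix, and the function names and toy scores are hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Percentage of queries whose ground-truth match is in the top k.

    similarity[i][j] is the score of query i against candidate j, and the
    ground-truth match for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return 100.0 * hits / len(similarity)

def transpose(matrix):
    """Swap rows and columns so texts become the queries."""
    return [list(col) for col in zip(*matrix)]

# Toy similarity matrix (hypothetical values): rows are images, columns are texts.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]

i2t = recall_at_k(sim, k=1)             # image queries, text candidates
t2i = recall_at_k(transpose(sim), k=1)  # text queries, image candidates
print(f"I2T R@1: {i2t:.2f}  T2I R@1: {t2i:.2f}")
```

The two directions are computed separately because the top-ranked text for an image need not have that image as its own top-ranked match.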

Citation

BibTeX:

@inproceedings{LoTLIP,
  title={LoTLIP: Improving Language-Image Pre-training for Long Text Understanding},
  author={Wu, Wei and Zheng, Kecheng and Ma, Shuailei and Lu, Fan and Guo, Yuxin and Zhang, Yifei and Chen, Wei and Guo, Qingpei and Shen, Yujun and Zha, Zheng-Jun},
  booktitle={arXiv},
  year={2024}
}