metadata
license: apache-2.0
language:
  - en
  - zh
tags:
  - multimodal
library_name: transformers
datasets:
  - BAAI/Infinity-MM
  - BAAI/Infinity-Instruct
  - BAAI/Infinity-Preference
base_model:
  - Qwen/Qwen2.5-1.5B-Instruct
  - google/siglip-so400m-patch14-384
pipeline_tag: image-text-to-text

Introduction

Aquila-VL-2B is a vision-language model (VLM) trained with the LLaVA-OneVision framework. Qwen2.5-1.5B-Instruct was chosen as the LLM, and siglip-so400m-patch14-384 serves as the vision tower.
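As a rough illustration of what the vision tower hands to the LLM, the sketch below counts the patches produced for a single square view, using only the two numbers in the checkpoint name (`patch14`, `384`). It assumes non-overlapping patches and one 384x384 view per image. The constant and function names are illustrative, and any multi-crop or pooling logic applied by the training framework is not modelled here.

```python
# Values read off the checkpoint name "siglip-so400m-patch14-384".
IMAGE_SIZE = 384  # input resolution in pixels (square view)
PATCH_SIZE = 14   # side length of each non-overlapping patch


def patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of patches for one square view, assuming non-overlapping patches."""
    per_side = image_size // patch_size  # 384 // 14 = 27 patches per side
    return per_side * per_side


print(patch_tokens(IMAGE_SIZE, PATCH_SIZE))  # 729
```

Under these assumptions a single view yields 27 x 27 = 729 patch embeddings. The number of visual tokens the LLM actually receives may differ depending on how the framework crops and resamples each image.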

The model was trained on our self-built Infinity-MM dataset, which contains approximately 40 million image-text pairs. The dataset combines open-source data collected from the internet with synthetic instruction data generated by open-source VLMs.

We have open-sourced the Infinity-MM dataset and related resources. We hope you enjoy using them!

News

Evaluation

We evaluated the model with VLMEvalKit. For test sets that support API-based evaluation, we used the OpenAI API whenever possible.

| Benchmark | MiniCPM-V-2 | InternVL2-2B | XinYuan-VL-2B | Qwen2-VL-2B-Instruct | Aquila-VL-2B |
|---|---|---|---|---|---|
| MMBench-EN (test) | 69.4 | 73.4 | 78.9 | 74.9 | 78.8 |
| MMBench-CN (test) | 65.9 | 70.9 | 76.1 | 73.9 | 76.4 |
| MMBench_V1.1 (test) | 65.2 | 69.7 | 75.4 | 72.7 | 75.2 |
| MMT-Bench (test) | 54.5 | 53.3 | 57.2 | 54.8 | 58.2 |
| RealWorldQA | 55.4 | 57.3 | 63.9 | 62.6 | 63.9 |
| HallusionBench | 36.8 | 38.1 | 36.0 | 41.5 | 43.0 |
| SEEDBench2plus | 51.8 | 60.0 | 63.0 | 62.4 | 63.0 |
| LLaVABench | 66.1 | 64.8 | 42.4 | 52.5 | 68.4 |
| MMStar | 41.6 | 50.2 | 51.9 | 47.8 | 54.9 |
| POPE | 86.6 | 85.3 | 89.4 | 88.0 | 83.6 |
| MMVet | 44.0 | 41.1 | 42.7 | 50.7 | 44.3 |
| MMMU (val) | 39.6 | 34.9 | 43.6 | 41.7 | 47.4 |
| ScienceQA (test) | 80.4 | 94.1 | 86.6 | 78.1 | 95.2 |
| AI2D (test) | 64.8 | 74.4 | 74.2 | 74.6 | 75.0 |
| MathVista (testmini) | 39.0 | 45.0 | 47.1 | 47.9 | 59.0 |
| MathVerse (testmini) | 19.8 | 24.7 | 22.2 | 21.0 | 26.2 |
| MathVision | 15.4 | 12.6 | 16.3 | 17.5 | 18.4 |
| DocVQA (test) | 71.0 | 86.9 | 87.6 | 89.9 | 85.0 |
| InfoVQA (test) | 40.0 | 59.5 | 59.1 | 65.4 | 58.3 |
| ChartQA (test) | 59.6 | 71.4 | 57.1 | 73.5 | 76.5 |
| TextVQA (val) | 74.3 | 73.5 | 77.6 | 79.9 | 76.4 |
| OCRVQA (testcore) | 54.4 | 40.2 | 67.6 | 68.7 | 64.0 |
| VCR (en easy) | 27.6 | 51.6 | 67.7 | 68.3 | 70.0 |
| OCRBench | 613 | 784 | 782 | 810 | 772 |
| Average | 53.5 | 58.8 | 60.9 | 62.1 | 64.1 |

For comparison models, evaluations were conducted in a local environment, so the scores may differ slightly from those reported in papers or on the official VLMEvalKit leaderboard.
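The Average row is not defined explicitly here. As a sanity check, the sketch below assumes it is the unweighted mean over all 24 benchmarks, with OCRBench (scored out of 1000) divided by 10 to bring it onto the same 0-100 scale as the rest. Under that assumption the reported averages are reproduced for the two columns shown. The scaling rule is inferred from the numbers rather than stated in this card, and the helper name `table_average` is illustrative.

```python
# Scores copied from the evaluation table, in row order
# (MMBench-EN ... VCR, then OCRBench as the last entry).
SCORES = {
    "Qwen2-VL-2B-Instruct": [
        74.9, 73.9, 72.7, 54.8, 62.6, 41.5, 62.4, 52.5, 47.8, 88.0, 50.7, 41.7,
        78.1, 74.6, 47.9, 21.0, 17.5, 89.9, 65.4, 73.5, 79.9, 68.7, 68.3, 810,
    ],
    "Aquila-VL-2B": [
        78.8, 76.4, 75.2, 58.2, 63.9, 43.0, 63.0, 68.4, 54.9, 83.6, 44.3, 47.4,
        95.2, 75.0, 59.0, 26.2, 18.4, 85.0, 58.3, 76.5, 76.4, 64.0, 70.0, 772,
    ],
}


def table_average(scores, ocrbench_index=-1):
    """Unweighted mean after rescaling OCRBench from 0-1000 to 0-100 (assumed rule)."""
    scaled = list(scores)
    scaled[ocrbench_index] = scaled[ocrbench_index] / 10.0
    return round(sum(scaled) / len(scaled), 1)


for name, scores in SCORES.items():
    print(name, table_average(scores))
```

This prints 62.1 for Qwen2-VL-2B-Instruct and 64.1 for Aquila-VL-2B, matching the Average row.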

Future Plan

  • We plan to train models of various sizes.
  • Future training will incorporate multi-image and video data.

Citation

If you find this model or dataset useful, please cite the following work:

@misc{gu2024infinitymmscalingmultimodalperformance,
      title={Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data}, 
      author={Shuhao Gu and Jialing Zhang and Siyuan Zhou and Kevin Yu and Zhaohu Xing and Liangdong Wang and Zhou Cao and Jintao Jia and Zhuoyi Zhang and Yixuan Wang and Zhenchong Hu and Bo-Wen Zhang and Jijie Li and Dong Liang and Yingli Zhao and Yulong Ao and Yaoqi Liu and Fangxiang Feng and Guang Liu},
      year={2024},
      eprint={2410.18558},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.18558}, 
}


# Disclaimer
The resources associated with this project, including code, data, and model weights, are restricted to academic research and may not be used for commercial purposes. The content produced by the model is influenced by uncontrollable variables such as randomness, so this project cannot guarantee the accuracy of its output. This project accepts no legal liability for the content of the model output and assumes no responsibility for any losses incurred through the use of the associated resources or output results.