---
license: apache-2.0
language:
- en
tags:
- multimodal
library_name: transformers
---
# Introduction
The **Aquila-VL-2B** model is a vision-language model (VLM) trained with the [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/) framework. [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) was chosen as the LLM, and [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) is used as the vision tower.
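As a rough illustration of what the vision tower's name implies, the sketch below computes how many patch tokens a single 384x384 input yields with 14x14 patches. It assumes non-overlapping patches with floor division; the helper `patch_tokens` is hypothetical, and the number of visual tokens the full model sees per image may differ depending on how the framework tiles or pools its inputs.

```python
def patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    per_side = image_size // patch_size  # leftover pixels at the edge are dropped
    return per_side * per_side


# Values read off the checkpoint name "siglip-so400m-patch14-384".
tokens = patch_tokens(image_size=384, patch_size=14)
print(tokens)  # 27 * 27 = 729
```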
The model was trained on our self-built Infinity-MM dataset, which contains approximately 40 million image-text pairs. The dataset combines open-source data collected from the internet with synthetic instruction data generated by open-source VLMs.
We have open-sourced the [Infinity-MM](https://huggingface.co/datasets/BAAI/Infinity-MM) dataset and trained the [Aquila-VL-2B](https://huggingface.co/BAAI/Aquila-VL-2B-llava-qwen) model on NVIDIA GPUs using this dataset.
The Aquila-VL-2B-CG model in this repository was trained using different GPUs and will be open-sourced soon.
## News
- `2024/10/25`: The [Aquila-VL-2B](https://huggingface.co/BAAI/Aquila-VL-2B-llava-qwen) model and [Infinity-MM](https://huggingface.co/datasets/BAAI/Infinity-MM) dataset are now available. The Aquila-VL-2B-CG model will be open-sourced soon. We have also released the [technical report](https://arxiv.org/abs/2410.18558).
# Evaluation
We evaluated the model with the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) tool. For test sets that support API-based evaluation, we used the OpenAI API whenever possible.
| Benchmark | MiniCPM-V-2 | InternVL2-2B | XinYuan-VL-2B | Qwen2-VL-2B-Instruct | Aquila-VL-2B | Aquila-VL-2B-CG |
| :--------------------------- | :---------: | :----------: | :-----------: | :------------------: | :----------: | :-------------: |
| MMBench-EN<sub>test</sub> | 69.4 | 73.4 | **78.9** | 74.9 | 78.8 | |
| MMBench-CN<sub>test</sub> | 65.9 | 70.9 | 76.1 | 73.9 | **76.4** | |
| MMBench_V1.1<sub>test</sub> | 65.2 | 69.7 | **75.4** | 72.7 | 75.2 | |
| MMT-Bench<sub>test</sub> | 54.5 | 53.3 | 57.2 | 54.8 | **58.2** | |
| RealWorldQA | 55.4 | 57.3 | 63.9 | 62.6 | **63.9** | |
| HallusionBench | 36.8 | 38.1 | 36.0 | 41.5 | **43.0** | |
| SEEDBench2<sub>plus</sub> | 51.8 | 60.0 | 63.0 | 62.4 | **63.0** | |
| LLaVABench | 66.1 | 64.8 | 42.4 | 52.5 | **68.4** | |
| MMStar | 41.6 | 50.2 | 51.9 | 47.8 | **54.9** | |
| POPE | 86.6 | 85.3 | **89.4** | 88.0 | 83.6 | |
| MMVet | 44.0 | 41.1 | 42.7 | **50.7** | 44.3 | |
| MMMU<sub>val</sub> | 39.6 | 34.9 | 43.6 | 41.7 | **47.4** | |
| ScienceQA<sub>test</sub> | 80.4 | 94.1 | 86.6 | 78.1 | **95.2** | |
| AI2D<sub>test</sub> | 64.8 | 74.4 | 74.2 | 74.6 | **75.0** | |
| MathVista<sub>testmini</sub> | 39.0 | 45.0 | 47.1 | 47.9 | **59.0** | |
| MathVerse<sub>testmini</sub> | 19.8 | 24.7 | 22.2 | 21.0 | **26.2** | |
| MathVision | 15.4 | 12.6 | 16.3 | 17.5 | **18.4** | |
| DocVQA<sub>test</sub> | 71.0 | 86.9 | 87.6 | **89.9** | 85.0 | |
| InfoVQA<sub>test</sub> | 40.0 | 59.5 | 59.1 | **65.4** | 58.3 | |
| ChartQA<sub>test</sub> | 59.6 | 71.4 | 57.1 | 73.5 | **76.5** | |
| TextVQA<sub>val</sub> | 74.3 | 73.5 | 77.6 | **79.9** | 76.4 | |
| OCRVQA<sub>testcore</sub> | 54.4 | 40.2 | 67.6 | **68.7** | 64.0 | |
| VCR<sub>en easy</sub> | 27.6 | 51.6 | 67.7 | 68.3 | **70.0** | |
| OCRBench | 613 | 784 | 782 | **810** | 772 | |
| Average | 53.5 | 58.8 | 60.9 | 62.1 | **64.1** | |
For comparison models, evaluations were conducted in a local environment, so the scores may differ slightly from those reported in papers or on the official VLMEvalKit leaderboard.
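The Average row can be reproduced from the table. The sketch below does this for the Aquila-VL-2B column, assuming that OCRBench (reported on a 0-1000 scale) is divided by 10 before averaging. The card does not state this scaling; it is an assumption that happens to match the reported 64.1. Variable names are illustrative.

```python
# Aquila-VL-2B scores copied from the table above (all rows except OCRBench).
aquila_scores = {
    "MMBench-EN": 78.8, "MMBench-CN": 76.4, "MMBench_V1.1": 75.2,
    "MMT-Bench": 58.2, "RealWorldQA": 63.9, "HallusionBench": 43.0,
    "SEEDBench2_plus": 63.0, "LLaVABench": 68.4, "MMStar": 54.9,
    "POPE": 83.6, "MMVet": 44.3, "MMMU": 47.4, "ScienceQA": 95.2,
    "AI2D": 75.0, "MathVista": 59.0, "MathVerse": 26.2, "MathVision": 18.4,
    "DocVQA": 85.0, "InfoVQA": 58.3, "ChartQA": 76.5, "TextVQA": 76.4,
    "OCRVQA": 64.0, "VCR": 70.0,
}
OCRBENCH_RAW = 772  # reported out of 1000

# Assumption: OCRBench is rescaled to 0-100 so it is comparable to the rest.
scores = list(aquila_scores.values()) + [OCRBENCH_RAW / 10]
average = sum(scores) / len(scores)
print(f"{len(scores)} benchmarks, average = {average:.1f}")
# prints "24 benchmarks, average = 64.1"
```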
# Future Plan
* We plan to train models of various sizes.
* Future training will incorporate multi-image and video data.
# Citation
If you find this model or the Infinity-MM dataset useful, please cite the following work:
```
@misc{gu2024infinitymmscalingmultimodalperformance,
title={Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data},
author={Shuhao Gu and Jialing Zhang and Siyuan Zhou and Kevin Yu and Zhaohu Xing and Liangdong Wang and Zhou Cao and Jintao Jia and Zhuoyi Zhang and Yixuan Wang and Zhenchong Hu and Bo-Wen Zhang and Jijie Li and Dong Liang and Yingli Zhao and Yulong Ao and Yaoqi Liu and Fangxiang Feng and Guang Liu},
year={2024},
eprint={2410.18558},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.18558},
}
```