gsh33 committed on
Commit
0c052b0
1 Parent(s): 31527d1

Update README.md

Files changed (1)
  1. README.md +18 -4
README.md CHANGED
@@ -10,12 +10,15 @@ library_name: transformers
10
 
11
  # Introduction
12
 
13
- The **Aquila-VL-2B-CG** model is a vision-language model (VLM) trained using the [LLava-one-vision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/) framework. The [Qwen2.5-1.5B-instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) model is chosen as the LLM, while [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) is used as the vision tower.
14
 
15
  The model was trained on our self-built Infinity-MM dataset, which contains approximately 40 million image-text pairs. This dataset is a combination of open-source data collected from the internet and synthetic instruction data generated using open-source VLM models.
16
 
17
 
18
- We have open-sourced [Infinity-MM](https://huggingface.co/datasets/BAAI/Infinity-MM) dataset and related resources. The Aquila-VL-2B-CG was trained using domestic GPUs, and the model files will be updated in early November 2024. We hope you enjoy using them!
19
 
20
  ## News
21
 
@@ -63,5 +66,16 @@ For comparison models, evaluations were conducted in a local environment, so the
63
  * Future training will incorporate multi-image and video data.
64
 
65
 
66
- # Disclaimer
67
- The resources associated with this project, including code, data, and model weights, are restricted to academic research purposes and cannot be used for commercial purposes. The content produced by the model is influenced by uncontrollable variables such as randomness, so this project cannot guarantee the accuracy of the output. This project does not accept any legal liability for the content of the model output, nor does it assume responsibility for any losses incurred due to the use of associated resources and output results.
10
 
11
  # Introduction
12
 
13
+ The **Aquila-VL-2B** model is a vision-language model (VLM) trained using the [LLava-one-vision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/) framework. The [Qwen2.5-1.5B-instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) model is chosen as the LLM, while [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) is used as the vision tower.
14
 
15
  The model was trained on our self-built Infinity-MM dataset, which contains approximately 40 million image-text pairs. This dataset is a combination of open-source data collected from the internet and synthetic instruction data generated using open-source VLM models.
16
 
17
 
18
+
19
+ We have open-sourced the [Infinity-MM](https://huggingface.co/datasets/BAAI/Infinity-MM) dataset and trained the [Aquila-VL-2B](https://huggingface.co/BAAI/Aquila-VL-2B-llava-qwen) model on NVIDIA GPUs using this dataset.
20
+
21
+ The Aquila-VL-2B-CG model in this repository was trained using different GPUs and will be open-sourced soon.
22
 
23
  ## News
24
 
 
66
  * Future training will incorporate multi-image and video data.
67
 
68
 
69
+ ## **Citation**
70
+ If you find this dataset useful, please cite the following work:
71
+ ```
72
+ @misc{gu2024infinitymmscalingmultimodalperformance,
73
+ title={Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data},
74
+ author={Shuhao Gu and Jialing Zhang and Siyuan Zhou and Kevin Yu and Zhaohu Xing and Liangdong Wang and Zhou Cao and Jintao Jia and Zhuoyi Zhang and Yixuan Wang and Zhenchong Hu and Bo-Wen Zhang and Jijie Li and Dong Liang and Yingli Zhao and Yulong Ao and Yaoqi Liu and Fangxiang Feng and Guang Liu},
75
+ year={2024},
76
+ eprint={2410.18558},
77
+ archivePrefix={arXiv},
78
+ primaryClass={cs.CL},
79
+ url={https://arxiv.org/abs/2410.18558},
80
+ }
81
+ ```