	---
	license: apache-2.0
	datasets:
	- FreedomIntelligence/ALLaVA-4V
	language:
	- en
	pipeline_tag: text-generation
	---


	# ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model



	<p align="center">
	⚡ALLaVA is a project that provides a large-scale GPT4V-synthesized dataset for training LVLMs.⚡
	</p>

	<!-- <p align="center">

	![Python 3.10](https://img.shields.io/badge/Python-3.10-lightblue) ![Pytorch 1.13.0](https://img.shields.io/badge/PyTorch-2.1.1-lightblue) ![transformers](https://img.shields.io/badge/transformers-4.37.0-lightblue)
	</p> -->



	<p align="center">
	📃 <a href="https://arxiv.org/abs/2402.11684" target="_blank">Paper</a> • 🌐 <a href="https://allava.freedomai.cn/#/" target="_blank">Demo</a> • 👨🏻‍💻 <a href="https://github.com/FreedomIntelligence/ALLaVA" target="_blank">Github</a>
	</p>

	<p align="center">
	🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V" target="_blank">ALLaVA-4V Dataset</a>
	</p>

	<p align="center">
	🤗 <a href="https://huggingface.co/FreedomIntelligence/ALLaVA-Phi3-mini-128k" target="_blank">ALLaVA-Phi3-mini-128k</a>
	• 🤗 <a href="https://huggingface.co/FreedomIntelligence/ALLaVA-StableLM2-1_6B" target="_blank">ALLaVA-StableLM2-1_6B</a>
	• 🤗 <a href="https://huggingface.co/FreedomIntelligence/ALLaVA-Phi2-2_7B" target="_blank">ALLaVA-Phi2-2_7B</a>
	</p>

	<!-- <p align="center">
	📃 <a href="https://arxiv.org/abs/2402.11684" target="_blank">Paper</a> • 🌐 <a href="https://allava.freedomai.cn/#/" target="_blank">Demo</a> • 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V" target="_blank">ALLaVA-4V Dataset</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/ALLaVA-3B-Longer" target="_blank">ALLaVA-3B-Longer</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/ALLaVA-3B" target="_blank">ALLaVA-3B</a>
	<br> <a href="https://github.com/FreedomIntelligence/CMB/blob/main/README_zh.md"> 中文</a> \| <a href="https://github.com/FreedomIntelligence/CMB/blob/main/README.md"> English
	</p> -->

	## Benchmark Results

	Our models [ALLaVA-Phi3-mini-128k](https://huggingface.co/FreedomIntelligence/ALLaVA-Phi3-mini-128k),
	[ALLaVA-StableLM2-1_6B](https://huggingface.co/FreedomIntelligence/ALLaVA-StableLM2-1_6B)
	and [ALLaVA-Phi2-2_7B](https://huggingface.co/FreedomIntelligence/ALLaVA-Phi2-2_7B)
	achieve competitive results on 17 benchmarks.


	| Models | Vicuna-80 | GQA | HallusionBench | MME-P | MMVP | TouchStone | TextVQA | MME-C | MathVista | MM-Vet | MMMU-val | SQA (img) | LLaVA (In-the-Wild) | MLLM-Bench | MMB-en | MMB-cn | SEEDBench (img, v1) |
	|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
	| Large VLMs | | | | | | | | | | | | | | | | | |
	| BLIP-2 | - | - | - | - | - | - | - | - | - | 22.4 | 34.4 | - | - | 3.0* | - | - | 49.7 |
	| InstructBLIP | - | 49.5 | - | - | - | - | - | - | - | 25.6 | - | - | 58.2 | - | 44.0 | - | - |
	| Qwen-VL-Chat | - | 57.5 | - | 1487.6 | - | - | 61.5 | 360.7 | - | 31.1 | - | 68.2 | - | - | 60.6 | 56.7 | 65.4 |
	| LLaVA-1.5-7B | 13.8* | 62.0 | 36.6* | 1504.4 | 24.7 | 594.9 | 58.2 | 324.6 | 25.0 | 31.1 | 35.1 | 66.8 | 65.4 | 23.0* | 64.3 | 58.3 | 66.1 |
	| LLaVA-1.5-13B | 22.5 | 63.3 | 36.5* | 1531.3 | 38.0 | 617.7 | 61.3 | 295.4 | 28.3 | 35.4 | 34.4 | 71.6 | 72.5 | - | 67.7 | 63.6 | 68.2 |
	| LVIS-7B | - | 62.6 | - | - | - | - | 58.7 | - | - | 31.5 | - | - | 67.0 | 29.0* | 66.2 | - | - |
	| LVIS-13B | - | 63.6 | - | - | - | - | 62.5 | - | - | 37.4* | - | - | 71.3* | - | 68.0* | - | - |
	| ShareGPT4V-7B | 13.8* | 63.3 | 36.0* | 1540.1 | 34.0 | 637.2 | 60.4 | 346.1 | 24.7 | 37.6 | 35.4 | 68.4 | 72.6 | 30.2 | 68.8 | 61.0* | 69.7 |
	| ShareGPT4V-13B | 17.5* | 64.8 | 39.0* | 1576.1 | 35.3 | 648.7 | 62.2 | 309.3 | 28.8 | 43.1 | 35.6 | 70.0 | 79.9 | 35.5 | 71.2 | 61.7* | 70.8 |
	| 4B-scale Lite VLMs | | | | | | | | | | | | | | | | | |
	| MobileVLM-v2 | 5.0* | 61.1 | 30.8* | 1440.5 | 18.7 | 541.0 | 57.5 | 261.8 | 28.3 | 26.1 | 30.8 | 70.0 | 53.2 | 15.7 | 63.2 | 43.2 | 64.5 |
	| Mipha-3B | 16.2* | 63.9 | 34.3 | 1488.9 | 32.0 | 619.0 | 56.6 | 285.0 | 27.8 | 33.5 | 35.8 | 70.9 | 64.7 | 23.1 | 69.7 | 42.9 | 71.2* |
	| TinyLLaVA | 15.6* | 62.1 | 37.2* | 1465.5 | 33.3 | 663.5 | 60.3 | 281.1 | 30.3 | 37.5 | 38.4 | 73.0 | 70.8 | 29.8 | 69.7* | 42.8 | 70.4* |
	| Ours | | | | | | | | | | | | | | | | | |
	| ALLaVA-Phi2 | 49.4 | 48.8 | 24.8 | 1316.2 | 36.0 | 632.0 | 49.5 | 301.8 | 27.4 | 32.2 | 35.3 | 67.6 | 69.4 | 43.6 | 64.0 | 40.8 | 65.2 |
	| ALLaVA-StableLM2 | 38.8 | 49.8 | 25.3 | 1311.7 | 34.0 | 655.2 | 51.7 | 257.9 | 27.7 | 31.7 | 33.3 | 64.7 | 72.0 | 39.3 | 64.6 | 49.8 | 65.7 |
	| ALLaVA-Phi3 | 56.9 | 52.2 | 48.1 | 1382.3 | 32.7 | 667.8 | 53.0 | 347.1 | 32.9 | 37.8 | 41.1 | 64.0 | 68.5 | 54.8 | 68.1 | 55.3 | 69.0 |


	> \* denotes results from our own evaluation. Bold numbers are the best results among all 4B-scale LVLMs. Details of each benchmark are given in Table 4 of our [technical report](https://arxiv.org/pdf/2402.11684.pdf).



	## 🏭 Inference

	All models can be loaded from 🤗 with `.from_pretrained()`.
	Check out the [example scripts](https://github.com/FreedomIntelligence/ALLaVA/tree/main/allava/serve) and verify that your outputs match those shown in the scripts.
	<!-- ### Load from 🤗 (Recommended)
	See the [example script](https://github.com/FreedomIntelligence/ALLaVA/blob/main/allava/serve/huggingface_inference.py). -->

	<!-- ### CLI
	See [here](https://github.com/FreedomIntelligence/ALLaVA/tree/main?tab=readme-ov-file#cli) for CLI code snippet. -->
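
As a rough illustration of loading with `.from_pretrained()`, the sketch below wraps the standard `transformers` auto classes. It is not the project's official loader: the helper names `load_kwargs` and `load_allava` are hypothetical, and the keyword values (`trust_remote_code=True`, `torch_dtype="auto"`, `device_map`) are assumptions that may need adjusting to match the linked example scripts. Passing `device_map` also requires the `accelerate` package. The prompting and generation interface is model-specific, so it is left to the example scripts.

```python
# Model id taken from the links at the top of this README.
MODEL_ID = "FreedomIntelligence/ALLaVA-Phi3-mini-128k"


def load_kwargs(device="cuda"):
    """Keyword arguments for from_pretrained (assumed values, not from the repo)."""
    return {
        # Assumption: the checkpoint ships custom modeling code on the Hub.
        "trust_remote_code": True,
        # Place the whole model on one device; needs `accelerate` installed.
        "device_map": device,
        # Let transformers pick the dtype stored in the checkpoint.
        "torch_dtype": "auto",
    }


def load_allava(model_id=MODEL_ID, device="cuda"):
    """Hypothetical helper: return (model, tokenizer) for an ALLaVA checkpoint."""
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs(device))
    return model.eval(), tokenizer
```

Calling `load_allava()` downloads the weights on first use; swap `MODEL_ID` for one of the other checkpoints listed above to load a different model.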



	## 🏋️‍♂️ Training

	### Data
	<div align=center>
	<img src="training_datasets_by_stage.jpg" width = "640" alt="training_datasets" align=center />
	</div>

	ALLaVA uses 1.0M samples for pretraining (PT) and 1.5M samples for fine-tuning (FT).


	### Code
	The training code is largely based on [LLaVA-v1.5](https://github.com/haotian-liu/LLaVA).
	We wholeheartedly express our gratitude for their invaluable contributions to open-sourcing LVLMs.

	<!-- ### Cost
	We train our models on 8*A800 GPUs.
	[ALLaVA-3B-Longer](https://huggingface.co/FreedomIntelligence/ALLaVA-3B-Longer) takes 8.3h for PT and 21.3h for FT.
	[ALLaVA-3B](https://huggingface.co/FreedomIntelligence/ALLaVA-3B) takes 8.3h for PT and 10.6h for FT.
	These two models share the same PT procedure. -->


	### Hyperparameters

	| Global Batch Size | ZeRO Stage | Optimizer | Max LR | Min LR | Scheduler | Weight Decay |
	| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
	| 256 (PT) / 128 (FT) | 1 | AdamW | 2e-5 | 2e-6 | CosineAnnealingWarmRestarts | 0 |
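
To make the schedule in the table concrete, here is a minimal pure-Python sketch of the cosine-annealing-with-warm-restarts formula, using the Max LR and Min LR values above. The restart period `t_0` and multiplier `t_mult` are not stated in this README, so they are hypothetical parameters here; in PyTorch they correspond to the `T_0` and `T_mult` arguments of `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`, with `eta_min` playing the role of Min LR.

```python
import math

MAX_LR = 2e-5  # "Max LR" from the table
MIN_LR = 2e-6  # "Min LR" from the table


def cosine_warm_restart_lr(step, t_0, t_mult=1, max_lr=MAX_LR, min_lr=MIN_LR):
    """Learning rate at `step` under cosine annealing with warm restarts.

    t_0 is the length of the first cycle in steps and t_mult scales each
    following cycle. Both are assumptions, not values from the ALLaVA setup.
    """
    if step < 0 or t_0 <= 0 or t_mult < 1:
        raise ValueError("need step >= 0, t_0 > 0 and t_mult >= 1")
    t_cur, t_i = step, t_0
    # Walk through completed cycles to find the position in the current one.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    # Cosine decay from max_lr (cycle start) towards min_lr (cycle end).
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t_cur / t_i))
```

With `t_0=100`, the rate starts at 2e-5, reaches the midpoint 1.1e-5 at step 50, decays towards 2e-6 by step 99, and jumps back to 2e-5 at step 100.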

	The LM backbone and the projector are trainable, while the vision encoder is kept frozen.
	The trainability of each module is the same in both stages.
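
The freezing rule above can be sketched as a small helper that toggles `requires_grad` by parameter name. This is an illustration only, not the project's training code: the function name `set_trainable` and the substring `"vision_tower"` are hypothetical, chosen to resemble LLaVA-style naming, and should be checked against the actual parameter names of the model being trained. The helper works on any object exposing `named_parameters()`, such as a `torch.nn.Module`.

```python
def set_trainable(model, frozen_keys=("vision_tower",)):
    """Freeze parameters whose name contains a frozen key; train all others.

    `frozen_keys` holds hypothetical name fragments for the vision encoder.
    Returns the names of the parameters left trainable (LM backbone, projector).
    """
    trainable = []
    for name, param in model.named_parameters():
        frozen = any(key in name for key in frozen_keys)
        # Frozen parameters receive no gradient updates in either stage.
        param.requires_grad = not frozen
        if not frozen:
            trainable.append(name)
    return trainable
```

Because the same modules are trainable in both stages, one call before building the optimizer would cover PT and FT alike.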


	## 📚 ALLaVA-4V Data

	Most of the training data comes from [ALLaVA-4V](https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V). See [here](https://github.com/FreedomIntelligence/ALLaVA/tree/main?tab=readme-ov-file#data-preparation) for how to prepare it for training.


	## 🙌 Contributors

	- Project Leader: [Guiming Hardy Chen](https://g-h-chen.github.io/)

	- Data: Shunian Chen, [Junying Chen](https://jymchen.github.io/), Xiangbo Wu

	- Evaluation: [Ruifei Zhang](https://scholar.google.com/citations?user=W4zOhmEAAAAJ&hl=zh-CN)

	- Deployment: Xiangbo Wu, Zhiyi Zhang

	- Advising: [Zhihong Chen](https://zhjohnchan.github.io/), [Benyou Wang](https://wabyking.github.io/old.html)

	- Others: Jianquan Li, [Xiang Wan](https://scholar.google.com/citations?user=e3_kWigAAAAJ&hl=zh-CN)





	## 📝 Citation
	If you find our data useful, please consider citing our work! We are FreedomIntelligence from the [Shenzhen Research Institute of Big Data](http://sribd.cn/en) and [The Chinese University of Hong Kong, Shenzhen](https://sds.cuhk.edu.cn/en).
	```
	@article{chen2024allava,
	title={ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model},
	author={Chen, Guiming Hardy and Chen, Shunian and Zhang, Ruifei and Chen, Junying and Wu, Xiangbo and Zhang, Zhiyi and Chen, Zhihong and Li, Jianquan and Wan, Xiang and Wang, Benyou},
	journal={arXiv preprint arXiv:2402.11684},
	year={2024}
	}
	```