---
license: apache-2.0
datasets:
- FreedomIntelligence/ALLaVA-4V
language:
- en
pipeline_tag: text-generation
---
# ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
<p align="center">
⚡ ALLaVA is a project that provides a large-scale GPT4V-synthesized dataset for training LVLMs. ⚡
</p>
<!-- <p align="center">
![Python 3.10](https://img.shields.io/badge/Python-3.10-lightblue) ![PyTorch 2.1.1](https://img.shields.io/badge/PyTorch-2.1.1-lightblue) ![transformers 4.37.0](https://img.shields.io/badge/transformers-4.37.0-lightblue)
</p> -->
<p align="center">
📃 <a href="https://arxiv.org/abs/2402.11684" target="_blank">Paper</a> • 🌐 <a href="https://allava.freedomai.cn/#/" target="_blank">Demo</a> • 👨🏻‍💻 <a href="https://github.com/FreedomIntelligence/ALLaVA" target="_blank">Github</a>
</p>
<p align="center">
🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V" target="_blank">ALLaVA-4V Dataset</a>
</p>
<p align="center">
🤗 <a href="https://huggingface.co/FreedomIntelligence/ALLaVA-Phi3-mini-128k" target="_blank">ALLaVA-Phi3-mini-128k</a>
• 🤗 <a href="https://huggingface.co/FreedomIntelligence/ALLaVA-StableLM2-1_6B" target="_blank">ALLaVA-StableLM2-1_6B</a>
• 🤗 <a href="https://huggingface.co/FreedomIntelligence/ALLaVA-Phi2-2_7B" target="_blank">ALLaVA-Phi2-2_7B</a>
</p>
## Benchmark Result
Our models [**ALLaVA-Phi3-mini-128k**](https://huggingface.co/FreedomIntelligence/ALLaVA-Phi3-mini-128k),
[**ALLaVA-StableLM2-1_6B**](https://huggingface.co/FreedomIntelligence/ALLaVA-StableLM2-1_6B)
and [**ALLaVA-Phi2-2_7B**](https://huggingface.co/FreedomIntelligence/ALLaVA-Phi2-2_7B)
achieve competitive results on 17 benchmarks.
| Models | Vicuna-80 | GQA | HallusionBench | MME-P | MMVP | TouchStone | TextVQA | MME-C | MathVista | MM-Vet | MMMU-val | SQA (img) | LLaVA (In-the-Wild) | MLLM-Bench | MMB-en | MMB-cn | SEEDBench (img, v1) |
|---------------------------|-----------|-----|-------|-------|------|----|---------|-------|----|--------|-----------------|---------|---------------|----|--------|--------|--------------------|
| **Large VLMs** | | | | | | | | | | | | | | | | | |
| BLIP-2 | - | - | - | - | - | - | - | - | - | 22.4 | 34.4 | - | - | 3.0*| - | - | 49.7 |
| InstructBLIP | - | 49.5| - | - | - | - | - | - | - | 25.6 | - | - | 58.2 | - | 44.0 | - | - |
| Qwen-VL-Chat | - | 57.5| - | 1487.6| - | - | 61.5 | 360.7 | - | 31.1 | - | 68.2 | - | - | 60.6 | 56.7 | 65.4 |
| LLaVA-1.5-7B | 13.8* | 62.0| 36.6* | 1504.4*| 24.7*| 594.9*| 58.2| 324.6*| 25.0*| 31.1| 35.1*| 66.8| 65.4| 23.0*| 64.3| 58.3| 66.1|
| LLaVA-1.5-13B | 22.5 | 63.3| 36.5* | 1531.3 | 38.0*| 617.7*| 61.3| 295.4| 28.3*| 35.4| 34.4*| 71.6| 72.5| -| 67.7| 63.6| 68.2|
| LVIS-7B | - | 62.6| - | - | - | - | 58.7 | - | - | 31.5 | - | - | 67.0 | 29.0*| 66.2 | - | - |
| LVIS-13B | - | 63.6*| - | - | - | - | 62.5* | - | - | 37.4* | - | - | 71.3* | - | 68.0* | - | - |
| ShareGPT4V-7B | 13.8* | 63.3| 36.0* | 1540.1*| 34.0*| 637.2*| 60.4| 346.1*| 24.7*| 37.6| 35.4*| 68.4*| 72.6| 30.2*| 68.8| 61.0*| 69.7|
| ShareGPT4V-13B | 17.5* | 64.8| 39.0* | 1576.1*| 35.3*| 648.7*| 62.2| 309.3*| 28.8*| 43.1| 35.6*| 70.0*| 79.9| 35.5*| 71.2| 61.7*| 70.8|
| **4B-scale Lite VLMs** | | | | | | | | | | | | | | | | | |
| MobileVLM-v2 | 5.0* | 61.1| 30.8* | 1440.5 | 18.7*| 541.0*| 57.5| 261.8*| 28.3*| 26.1*| 30.8*| 70.0| 53.2*| 15.7*| 63.2| 43.2*| 64.5*|
| Mipha-3B | 16.2* | **63.9**| 34.3*| **1488.9**| 32.0*| 619.0*| 56.6| 285.0*| 27.8*| 33.5*| 35.8*| 70.9| 64.7*| 23.1*| **69.7**| 42.9*| **71.2***|
| TinyLLaVA | 15.6* | 62.1| 37.2* | 1465.5*| 33.3*| 663.5*| **60.3**| 281.1*| 30.3*| 37.5| 38.4| **73.0**| 70.8*| 29.8*| **69.7***| 42.8*| 70.4*|
| **Ours** | | | | | | | | | | | | | | | | | |
| **ALLaVA-Phi2** | 49.4 | 48.8| 24.8 | 1316.2| **36.0**| 632.0| 49.5| 301.8| 27.4| 32.2| 35.3| 67.6| 69.4| 43.6| 64.0| 40.8| 65.2|
| **ALLaVA-StableLM2** | 38.8 | 49.8| 25.3 | 1311.7| 34.0 | 655.2| 51.7| 257.9| 27.7| 31.7| 33.3| 64.7| **72.0**| 39.3| 64.6| 49.8| 65.7|
| **ALLaVA-Phi3** | **56.9**| 52.2| **48.1**| 1382.3| 32.7| **667.8**| 53.0| **347.1**| **32.9**| **37.8**| **41.1**| 64.0| 68.5| **54.8**| 68.1| **55.3**| 69.0|
> \* denotes results from our own evaluation. **Bold numbers** are the best results among all 4B-scale LVLMs. Detailed information on each benchmark is given in Table 4 of our [technical report](https://arxiv.org/pdf/2402.11684.pdf).
## Inference
All models can be loaded from 🤗 Hugging Face with `.from_pretrained()`.
Check out the [example scripts](https://github.com/FreedomIntelligence/ALLaVA/tree/main/allava/serve) and verify that your outputs match those shown in the scripts.
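As a minimal loading sketch (assuming a recent `transformers` release with remote-code support; the multimodal chat/generation interface itself is defined by the custom code shipped with each checkpoint, so follow the example scripts above for the exact call):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any of the three released checkpoints can be substituted here.
model_id = "FreedomIntelligence/ALLaVA-Phi3-mini-128k"

# trust_remote_code=True is needed because the model class ships with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # assumption: bf16 inference on a recent GPU
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Prompt formatting and image preprocessing are handled by the remote code and
# the linked example scripts; this sketch only covers loading weights and tokenizer.
```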
<!-- ### Load from 🤗 (Recommended)
See the [example script](https://github.com/FreedomIntelligence/ALLaVA/blob/main/allava/serve/huggingface_inference.py). -->
<!-- ### CLI
See [here](https://github.com/FreedomIntelligence/ALLaVA/tree/main?tab=readme-ov-file#cli) for CLI code snippet. -->
## Training
### Data
<div align=center>
<img src="training_datasets_by_stage.jpg" width = "640" alt="training_datasets" align=center />
</div>
ALLaVA uses 1.0M samples for pre-training (PT) and 1.5M samples for fine-tuning (FT).
### Code
The training code is largely based on [LLaVA-v1.5](https://github.com/haotian-liu/LLaVA).
We wholeheartedly express our gratitude for their invaluable contributions to open-sourcing LVLMs.
<!-- ### Cost
We train our models on 8*A800 GPUs.
[ALLaVA-3B-Longer](https://huggingface.co/FreedomIntelligence/ALLaVA-3B-Longer) takes 8.3h for PT and 21.3h for FT.
[ALLaVA-3B](https://huggingface.co/FreedomIntelligence/ALLaVA-3B) takes 8.3h for PT and 10.6h for FT.
These two models share the same PT procedure. -->
### Hyperparameters
| Global Batch Size | ZeRO Stage | Optimizer | Max LR | Min LR | Scheduler | Weight decay |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 256 (PT) / 128 (FT) | 1| AdamW | 2e-5 | 2e-6 | CosineAnnealingWarmRestarts | 0 |
The LM backbone and the projector are trainable, while the vision encoder is kept frozen.
**The trainability of each module is the same in both stages.**
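For reference, a minimal PyTorch sketch of an equivalent optimizer/scheduler setup. The stand-in modules and the restart period `T_0` are placeholders for illustration, not values from the paper; only the learning rates, weight decay, and the frozen/trainable split come from this card.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

# Stand-in modules so the sketch runs on its own; in practice these are the
# LLaVA-style vision encoder, projector, and LM backbone.
vision_encoder = nn.Linear(1024, 1024)
projector = nn.Linear(1024, 2048)
lm_backbone = nn.Linear(2048, 2048)

# Vision encoder frozen; projector and LM backbone trainable (same in PT and FT).
for p in vision_encoder.parameters():
    p.requires_grad = False

trainable = list(projector.parameters()) + list(lm_backbone.parameters())

optimizer = AdamW(trainable, lr=2e-5, weight_decay=0.0)  # max LR and weight decay from the table
scheduler = CosineAnnealingWarmRestarts(
    optimizer,
    T_0=1000,      # placeholder restart period (in steps); not specified in this card
    eta_min=2e-6,  # min LR from the table
)
```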
## ALLaVA-4V Data
The majority of the training data comes from [ALLaVA-4V](https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V). See [here](https://github.com/FreedomIntelligence/ALLaVA/tree/main?tab=readme-ov-file#data-preparation) to prepare it for training.
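To merely inspect the data outside the training pipeline, a minimal sketch with the `datasets` library (the configuration name below is an assumption for illustration; the dataset card lists the actual subsets and splits):

```python
from datasets import load_dataset

# "allava_laion" is an assumed configuration name used for illustration only;
# check the ALLaVA-4V dataset card for the real subset and split names.
ds = load_dataset("FreedomIntelligence/ALLaVA-4V", name="allava_laion")
print(ds)  # DatasetDict showing the available splits
```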
## Contributors
- Project Leader: [Guiming Hardy Chen](https://g-h-chen.github.io/)
- Data: Shunian Chen, [Junying Chen](https://jymchen.github.io/), Xiangbo Wu
- Evaluation: [Ruifei Zhang](https://scholar.google.com/citations?user=W4zOhmEAAAAJ&hl=zh-CN)
- Deployment: Xiangbo Wu, Zhiyi Zhang
- Advising: [Zhihong Chen](https://zhjohnchan.github.io/), [Benyou Wang](https://wabyking.github.io/old.html)
- Others: Jianquan Li, [Xiang Wan](https://scholar.google.com/citations?user=e3_kWigAAAAJ&hl=zh-CN)
## Citation
If you find our data useful, please consider citing our work! We are FreedomIntelligence from the [Shenzhen Research Institute of Big Data](http://sribd.cn/en) and [The Chinese University of Hong Kong, Shenzhen](https://sds.cuhk.edu.cn/en).
```
@article{chen2024allava,
title={ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model},
author={Chen, Guiming Hardy and Chen, Shunian and Zhang, Ruifei and Chen, Junying and Wu, Xiangbo and Zhang, Zhiyi and Chen, Zhihong and Li, Jianquan and Wan, Xiang and Wang, Benyou},
journal={arXiv preprint arXiv:2402.11684},
year={2024}
}
```