---
license: mit
datasets:
- nyu-visionx/Cambrian-10M
language:
- en
base_model:
- tiiuae/falcon-mamba-7b-instruct
pipeline_tag: image-text-to-text
---
# Viper: Open Mamba-based Vision-Language Models
**Yufan Zhuang<sup>1,2</sup>, Pierce Chuang<sup>2</sup>, Yichao Lu<sup>2</sup>, Abhay Harpale<sup>2</sup>, Vikas Bhardwaj<sup>2</sup>, Jingbo Shang<sup>1</sup>**
**<sup>1</sup>UC San Diego**, **<sup>2</sup>Meta**
[Viper-Jamba-52B](https://huggingface.co/ViperVLM/Viper-Jamba-52B) || [Viper-Mamba-7B](https://huggingface.co/ViperVLM/Viper-Mamba-7B) || [Evaluation](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard) || [Github](https://github.com/EvanZhuang/viper)
![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/6438ccbb3b46237de3d052e8/RFArMOH2TMI_G9bZTZr8_.jpeg)
(Logo Created by ChatGPT-4o)
* Viper VLMs are built on the Mamba architecture, which offers efficiency and strong performance in handling long-range dependencies compared to Transformers.
* The models process visual tokens from the entire image, leveraging Mamba's linear-time complexity and long-range reasoning for vision tasks. They are trained on the Cambrian-7M dataset and natively support up to 2K-resolution inputs.
* Viper VLMs demonstrate competitive performance on diverse benchmarks, setting the stage for potential future shifts in vision-language model architectures.
## Introduction
We introduce *Viper*, a series of open vision language models (VLMs) built on the Mamba architecture.
Since Mamba's inception, it has been regarded as a promising alternative to the Transformer as the foundational architecture for large language models.
Mamba offers a significant advantage in linear-time complexity with respect to input sequence length, while also outperforming Transformers on tasks that require understanding long-range dependencies.
In Viper VLMs, we feed all visual tokens into the model and run inference over the entire image, relying on Mamba's efficiency and long-range reasoning power to comprehend the visual input.
The models are trained on Cambrian-7M and natively support resolutions up to 2K.
We show that Viper VLMs are competitive with open-source VLMs across diverse benchmarks.
This work lays the groundwork for potential architectural shifts in future vision-language models, highlighting Mamba's promising role in advancing this field.
## Model Architecture
We use a single-encoder design with a linear projector connecting the vision encoder to the LLM backbone; a minimal sketch of this connector follows the table below.
| Model | Encoder | LLM backbone | Arch | Input Resolution (Training) |
|----------|----------|----------|----------|----------|
| Viper-Jamba-52B | [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) | [Jamba-1.5-Mini](https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini) | MoE-Jamba | Up to 1344x1344 pixels |
| Viper-Mamba-7B | [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) | [falcon-mamba-7b-instruct](https://huggingface.co/tiiuae/falcon-mamba-7b-instruct) | Dense-Mamba | Up to 2352x2352 pixels |
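For intuition, the projector can be thought of as a small learned mapping from CLIP patch features into the LLM's embedding space. The sketch below is a minimal illustration only: the 1024-dimensional CLIP output is standard for ViT-L/14, but the LLM hidden size (4096 here) and the single-linear-layer design are illustrative assumptions rather than the released checkpoints' exact configuration.
```
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """Minimal sketch of a vision-to-LLM connector (dimensions are illustrative)."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the CLIP encoder
        # returns:        (batch, num_patches, llm_dim) visual tokens for the Mamba/Jamba backbone
        return self.proj(patch_features)

# Example: one 336x336 tile yields (336 / 14)^2 = 576 patch tokens
dummy_patches = torch.randn(1, 576, 1024)
visual_tokens = LinearProjector()(dummy_patches)
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
```
A lightweight connector like this adds only a small number of trainable parameters on top of the pretrained encoder and backbone.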
We use AnyRes to support high-resolution inputs.
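To see why long visual sequences arise, the back-of-the-envelope sketch below counts visual tokens under an AnyRes-style tiling with 336x336 tiles (576 tokens per tile for CLIP ViT-L/14-336). The grid-selection logic and the single downscaled overview tile are assumptions about AnyRes in general, not the exact implementation in this repository.
```
def visual_token_count(width: int, height: int, tile: int = 336, patch: int = 14) -> int:
    """Rough visual token count for AnyRes-style tiling (illustrative only)."""
    tokens_per_tile = (tile // patch) ** 2  # 24 * 24 = 576 for ViT-L/14-336
    grid_w = -(-width // tile)              # ceiling division
    grid_h = -(-height // tile)
    overview = 1                            # one downscaled full-image tile (assumed)
    return (grid_w * grid_h + overview) * tokens_per_tile

print(visual_token_count(672, 672))      # 2x2 grid + overview -> 2880 tokens
print(visual_token_count(2352, 2352))    # 7x7 grid + overview -> 28800 tokens
```
At the 2352x2352 training resolution of Viper-Mamba-7B this amounts to tens of thousands of visual tokens, which is where Mamba's linear-time scaling pays off.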
## Evaluation
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6438ccbb3b46237de3d052e8/qs5uJXAgUUE1qL1XeWghH.png)
## Usage
Environment Configuration
```
git clone https://github.com/EvanZhuang/viper.git
cd ./viper
```
Create the conda environment
```
conda create --name viper python=3.10
conda activate viper
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install mamba-ssm[causal-conv1d]
```
The environment depends on [flash-attn](https://github.com/Dao-AILab/flash-attention), [causal-conv1d](https://github.com/Dao-AILab/causal-conv1d), and [mamba-ssm](https://github.com/state-spaces/mamba).
Then install the Viper package:
```
pip install vipervlm
```
Then you can use the Viper VLMs in the following way:
```
import copy
import torch
from PIL import Image

from viper.model.builder import load_pretrained_model
from viper.conversation import conv_templates
from viper.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token

model_path = "Viper-Mamba-7B"
model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, model_name, use_flash_attn=True)
model.eval()

conv_mode = 'system_jamba'
DEFAULT_IMAGE_TOKEN = '<image>'
IMAGE_TOKEN_INDEX = -200

# `message` is your chat-format input: a list of text and image entries, e.g.
# message = [{'type': 'image', 'value': 'example.jpg'}, {'type': 'text', 'value': 'Describe this image.'}]
content, images = '', []
image_sizes = []  # Store image sizes

# Process input in chat format
for msg in message:
    if msg['type'] == 'text':
        content += msg['value']
    else:
        img = Image.open(msg['value']).convert('RGB')
        images.append(img)
        image_sizes.append(img.size)  # Store the size of each image
        content += (DEFAULT_IMAGE_TOKEN + '\n')

# Preprocess the images
image_tensor = process_images(images, image_processor, model.config)[0]

# Build the prompt from the conversation template
conv = copy.deepcopy(conv_templates[conv_mode])
conv.append_message(conv.roles[0], content)
prompt_question = conv.get_prompt(add_generation_prompt=True)

input_ids = tokenizer_image_token(prompt_question,
                                  tokenizer,
                                  IMAGE_TOKEN_INDEX,
                                  return_tensors='pt')
input_ids = input_ids.unsqueeze(0).to(device='cuda', non_blocking=True)
image_tensor = image_tensor.unsqueeze(0).to(dtype=torch.bfloat16, device='cuda', non_blocking=True)

# Pass image sizes along with the other generation parameters
with torch.inference_mode():
    cont = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=image_sizes,
        do_sample=False,
        max_new_tokens=4096,
        temperature=0,
        pad_token_id=tokenizer.pad_token_id,
        use_cache=True,
    )
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0]
```
## Throughput Analysis
Viper-Jamba-52B's active parameter size is only 12B.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6438ccbb3b46237de3d052e8/9WMOvMv24vJTLTFTHTzBW.png)
## Dataset
We train our models on [Cambrian-7M](https://github.com/cambrian-mllm/cambrian).
This dataset provides a wide variety of high-quality image-conversation pairs sourced from diverse environments and contexts, enabling robust multi-modal learning.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6438ccbb3b46237de3d052e8/xgK6Bg8TuFbWzB4BephZn.png)
## Training Recipe
We employ a progressive three-stage training procedure designed to optimize performance across varying levels of input complexity and resolution.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6438ccbb3b46237de3d052e8/vQHSIf3PRYab1g8c-owzJ.png)
The training process begins with low-resolution inputs, allowing the model to focus on basic structural and semantic relationships without the computational overhead of detailed features.
In the second stage, we introduce medium-resolution inputs, expanding the model’s capacity to capture more nuanced patterns while gradually increasing sequence length.
Finally, in the high-resolution stage, the model is trained on longer sequences with a broader range of input variability, enhancing its ability to generalize to diverse, complex visual and linguistic tasks.
This staged approach ensures a smooth transition from coarse to fine-grained learning while preserving the model's existing capabilities.
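The schematic below restates this curriculum as a configuration sketch. All numeric values (per-stage resolutions and sequence lengths) are hypothetical placeholders; only the low-to-medium-to-high resolution progression with growing sequence length is described in this card.
```
# Schematic of the three-stage, coarse-to-fine curriculum described above.
# All numbers are illustrative placeholders, not the released recipe.
stages = [
    {"name": "stage1_low_res",    "max_resolution": 336,  "max_seq_len": 4096},
    {"name": "stage2_medium_res", "max_resolution": 1008, "max_seq_len": 16384},
    {"name": "stage3_high_res",   "max_resolution": 2352, "max_seq_len": 32768},
]

for stage in stages:
    print(f"{stage['name']}: images up to {stage['max_resolution']}px, "
          f"sequences up to {stage['max_seq_len']} tokens")
```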
| Training Config | |
| -------- | ------- |
| GPUs | 128× H100-80GB |
| Training time | 14 days |
| Training data | Cambrian-7M |
## Acknowledgment
This project is built upon the following awesome projects: [LLaVA](https://github.com/haotian-liu/LLaVA) and [Open-LLaVA-NeXT](https://github.com/xiaoachen98/Open-LLaVA-NeXT).
We thank AI21 Labs and the Technology Innovation Institute for open-sourcing their powerful LLMs.
We also thank the [Cambrian-1](https://cambrian-mllm.github.io/) project for providing such high-quality vision-language datasets.
## Citation
The paper is coming soon. Meanwhile, please use the following to cite:
```
@article{vipervlm,
title={Viper: Open Mamba-based Vision-Language Models},
author={Zhuang, Yufan and Chuang, Pierce and Lu, Yichao and Harpale, Abhay and Bhardwaj, Vikas and Shang, Jingbo},
year={2024}
}
```