SEED-X
We introduce SEED-X, a unified and versatile foundation model that can serve as various real-world multimodal AI assistants after different instruction tuning. By unifying multi-granularity comprehension and generation, it can respond to a wide variety of user needs.
All models and inference code are released!
News
2024-04-22 :hugs: We release the models, including the pre-trained foundation model SEED-X, the general instruction-tuned model SEED-X-I, the editing model SEED-X-Edit, and our de-tokenizer, which can generate realistic images from ViT features (with or without a condition image).
2024-04-22 :hugs: We release an online gradio demo of the general instruction-tuned model SEED-X-I. SEED-X-I can follow multimodal instructions (including images with dynamic resolutions) and respond with images, text, and bounding boxes in multi-turn conversations. SEED-X-I does not support image manipulation; for high-precision image editing, the inference code and model of SEED-X-Edit will be released soon.
TODOs
- Release the multimodal foundation model SEED-X.
- Release the instruction-tuned model SEED-X-Edit for high-precision image editing.
- Release the 3.7M in-house image editing data.
Usage
Dependencies
- Python >= 3.8 (Anaconda is recommended; see the environment sketch after this list)
- PyTorch >= 2.0.1
- NVIDIA GPU + CUDA
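A minimal environment setup sketch, assuming conda; the environment name seedx is arbitrary:
# Create and activate an isolated environment for SEED-X
conda create -n seedx python=3.9 -y
conda activate seedx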
Installation
Clone the repo and install the required packages:
git clone https://github.com/AILab-CVC/SEED-X.git
cd SEED-X
pip install -r requirements.txt
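Optionally, a quick sanity check that the installed PyTorch meets the version requirement and can see the GPU:
# Should print a version >= 2.0.1 and True
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"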
Model Weights
We release the pre-trained de-tokenizer, the pre-trained foundation model SEED-X, the general instruction-tuned model SEED-X-I, and the editing model SEED-X-Edit in SEED-X-17B on Hugging Face.
You can also download them separately:
- Check the SEED-X de-tokenizer weights in AILab-CVC/seed-x-17b-de-tokenizer
- Check the pre-trained foundation model SEED-X weights in AILab-CVC/seed-x-17b-pretrain
- Check the general instruction-tuned model SEED-X-I weights in AILab-CVC/seed-x-17b-instruct
- Check the editing model SEED-X-Edit weights in AILab-CVC/seed-x-17b-edit
Please download the checkpoints and save them under the folder ./pretrained, e.g., ./pretrained/seed_x. Alternatively, you can fetch them programmatically, as sketched below.
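A minimal download sketch using huggingface_hub's snapshot_download; only pretrained/seed_x follows the example above, so the other target folder names are assumptions to adjust to what the inference scripts expect:
# Fetch the released checkpoints into ./pretrained (target folder names are assumptions)
from huggingface_hub import snapshot_download

snapshot_download(repo_id="AILab-CVC/seed-x-17b-de-tokenizer", local_dir="pretrained/seed_x_detokenizer")
snapshot_download(repo_id="AILab-CVC/seed-x-17b-pretrain", local_dir="pretrained/seed_x")
snapshot_download(repo_id="AILab-CVC/seed-x-17b-instruct", local_dir="pretrained/seed_x_i")
snapshot_download(repo_id="AILab-CVC/seed-x-17b-edit", local_dir="pretrained/seed_x_edit")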
You also need to download stable-diffusion-xl-base-1.0 and Qwen-VL-Chat and save them under the folder ./pretrained. Then use the following script to extract the weights of the visual encoder in Qwen-VL-Chat:
python3 src/tools/reload_qwen_vit.py
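The two external checkpoints can be fetched the same way; stabilityai/stable-diffusion-xl-base-1.0 and Qwen/Qwen-VL-Chat are the standard Hugging Face repo IDs, and the target folder names are assumptions:
# Fetch the external dependencies into ./pretrained (target folder names are assumptions)
from huggingface_hub import snapshot_download

snapshot_download(repo_id="stabilityai/stable-diffusion-xl-base-1.0", local_dir="pretrained/stable-diffusion-xl-base-1.0")
snapshot_download(repo_id="Qwen/Qwen-VL-Chat", local_dir="pretrained/Qwen-VL-Chat")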
Inference with SEED-X De-tokenizer
# For image reconstruction with ViT image features
python3 src/inference/eval_seed_x_detokenizer.py
# For image reconstruction with ViT image features and conditional image
python3 src/inference/eval_seed_x_detokenizer_with_condition.py
Inference with the pre-trained model SEED-X
# For image comprehension and detection
python3 src/inference/eval_img2text_seed_x.py
# For image generation
python3 src/inference/eval_text2img_seed_x.py
Inference with the general instruction-tuned model SEED-X-I
# For image comprehension and detection
python3 src/inference/eval_img2text_seed_x_i.py
# For image generation
python3 src/inference/eval_text2img_seed_x_i.py
Inference with the editing model SEED-X-Edit
# For image editing
python3 src/inference/eval_img2edit_seed_x_edit.py
Citation
If you find the work helpful, please consider citing:
@article{ge2024seed,
  title={SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation},
  author={Ge, Yuying and Zhao, Sijie and Zhu, Jinguo and Ge, Yixiao and Yi, Kun and Song, Lin and Li, Chen and Ding, Xiaohan and Shan, Ying},
  journal={arXiv preprint arXiv:2404.14396},
  year={2024}
}
License
SEED is licensed under the Apache License Version 2.0 except for the third-party components listed in License.
When training SEED-X, we freeze the original parameters of LLaMA2 and optimize only the LoRA module, as sketched below.
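A minimal sketch of this freeze-base-plus-LoRA recipe using the peft library; it is illustrative only, not the repo's actual training code, and the model ID and adapter placement are assumptions:
# Illustrative freeze + LoRA setup with peft (not the repo's actual training code)
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Model ID is an assumption for illustration; SEED-X builds on LLaMA2
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])  # which projections get adapters is an assumption
model = get_peft_model(base, lora_cfg)  # freezes the base weights; only LoRA params stay trainable
model.print_trainable_parameters()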