SPHINX-V Model Card
Model type:
SPHINX-V is a new multimodal large language model designed for visual prompting, equipped with a novel visual prompt encoder and a two-stage training strategy. SPHINX-V supports multiple visual prompts of various types simultaneously, significantly enhancing user flexibility and achieving a fine-grained, open-world understanding of visual prompts.
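As a rough illustration of the interaction pattern (not the repository's actual API), the sketch below shows how a user query might be paired with several visual prompts at once. The module path, class name, method signature, and prompt encoding here are hypothetical placeholders; consult the Draw-and-Understand GitHub repository for the real inference entry points.

```python
# Hypothetical usage sketch. The model call and the visual-prompt encoding below are
# illustrative placeholders only, not the actual Draw-and-Understand API.
from PIL import Image


def ask_with_visual_prompts(image_path: str, question: str):
    image = Image.open(image_path).convert("RGB")

    # SPHINX-V accepts multiple visual prompts of different types in one query;
    # a clicked point and a dragged box are shown here as plain coordinate dicts.
    visual_prompts = [
        {"type": "point", "xy": (320, 240)},           # a single clicked point
        {"type": "box", "xyxy": (50, 60, 400, 380)},   # a bounding box
    ]

    # Placeholder for the real model call, e.g. something along the lines of:
    #   model = SPHINXV.from_pretrained(...)           # hypothetical loader
    #   return model.chat(image, visual_prompts, question)
    raise NotImplementedError("Replace with the repository's inference code.")
```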
Paper or resources for more information:
Project Page: Draw-and-Understand
Paper: https://arxiv.org/abs/2403.20271
Code: https://github.com/AFeng-x/Draw-and-Understand
Dataset: MDVP-Data & MDVP-Bench
Intended use
Primary intended uses: The principal application of SPHINX-V is research on visual prompting for large multimodal models and chatbots.
Primary intended users: The model is primarily designed for use by researchers and enthusiasts specializing in fields such as computer vision, natural language processing, and interactive artificial intelligence.
License
Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
Citations
@misc{lin2024drawandunderstand,
  title={Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want},
  author={Weifeng Lin and Xinyu Wei and Ruichuan An and Peng Gao and Bocheng Zou and Yulin Luo and Siyuan Huang and Shanghang Zhang and Hongsheng Li},
  year={2024},
  eprint={2403.20271},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}