arXiv:2402.11530

Efficient Multimodal Learning from Data-centric Perspective

Published on Feb 18, 2024

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated notable capabilities in general visual understanding and reasoning tasks. However, their deployment is hindered by substantial computational costs in both training and inference, limiting accessibility to the broader research and user communities. A straightforward remedy is to use smaller pre-trained vision and language models, but this inevitably causes a significant performance drop. In this paper, we demonstrate that it is possible to beat the scaling law and train a smaller yet stronger MLLM by curating more informative training data. Specifically, we introduce Bunny, a family of lightweight MLLMs with flexible vision and language backbones for efficient multimodal learning from condensed training data. Remarkably, our Bunny-3B outperforms state-of-the-art large MLLMs, notably LLaVA-v1.5-13B, on multiple benchmarks. The code, models, and data are available at https://github.com/BAAI-DCAI/Bunny.
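
Since the abstract points to the GitHub repository for code and checkpoints, below is a minimal sketch of pulling a Bunny checkpoint from the Hugging Face Hub with the transformers library. This is not the authors' official snippet: the repo id "BAAI/Bunny-v1_0-3B", the use of trust_remote_code=True, and text-only generation are assumptions to verify against the repository's README, which also documents the image preprocessing and prompt format.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub id; consult https://github.com/BAAI-DCAI/Bunny for the
# checkpoints actually released by the authors.
model_id = "BAAI/Bunny-v1_0-3B"

# The Bunny architecture is presumably not part of stock transformers,
# so loading relies on the custom modeling code shipped with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Text-only smoke test (assumes the custom model accepts prompts without
# images); image-conditioned prompting follows the format in the README.
inputs = tokenizer("Describe what a multimodal LLM does.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))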

Models citing this paper: 13
Datasets citing this paper: 2
Spaces citing this paper: 1
Collections including this paper: 1