Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models

Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, Xiang Bai

Mini-Monkey is a lightweight MLLM that incorporates a plug-and-play method called the multi-scale adaptive cropping strategy (MSAC). MSAC adaptively generates multi-scale representations, allowing the model to select non-segmented objects from various scales. To mitigate the computational overhead introduced by MSAC, we propose a Scale Compression Mechanism (SCM), which effectively compresses image tokens. Mini-Monkey achieves state-of-the-art performance among 2B-parameter MLLMs. It demonstrates leading performance on a variety of general multimodal understanding tasks and shows consistent improvements in document understanding. On OCRBench, Mini-Monkey achieves a score of 802, outperforming the 8B-parameter state-of-the-art model InternVL2-8B. In addition, our model and training strategy are efficient: Mini-Monkey can be trained with only eight RTX 3090 GPUs.
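
To make the multi-scale idea more concrete, here is a minimal sketch of how two crop grids at different scales might be chosen for a single image. It is not the repository's implementation: the function names, the tile budgets, and the rule used to keep the coarse grid's cut lines from coinciding with the detailed grid's are all illustrative assumptions.

```python
from typing import List, Tuple

Grid = Tuple[int, int]  # (columns, rows)


def candidate_grids(min_tiles: int, max_tiles: int) -> List[Grid]:
    """Enumerate every (cols, rows) grid whose tile count fits the budget."""
    return [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    ]


def closest_grid(width: int, height: int, grids: List[Grid]) -> Grid:
    """Pick the grid whose aspect ratio is closest to the image's.

    Ties are broken in favour of the grid with more tiles.
    """
    image_ratio = width / height
    return min(
        grids,
        key=lambda g: (abs(image_ratio - g[0] / g[1]), -g[0] * g[1]),
    )


def multi_scale_grids(
    width: int,
    height: int,
    detail_budget: Tuple[int, int] = (4, 12),  # assumed tile budget
    coarse_budget: Tuple[int, int] = (2, 6),   # assumed tile budget
) -> Tuple[Grid, Grid]:
    """Return a (detail, coarse) pair of grids for one image."""
    detail = closest_grid(width, height, candidate_grids(*detail_budget))

    # Assumed rule: drop coarse grids whose cut lines are a subset of the
    # detail grid's, so an object split at one scale can stay whole at the other.
    coarse_options = [
        g
        for g in candidate_grids(*coarse_budget)
        if detail[0] % g[0] != 0 or detail[1] % g[1] != 0
    ]
    if not coarse_options:  # fall back if the filter removed everything
        coarse_options = candidate_grids(*coarse_budget)

    coarse = closest_grid(width, height, coarse_options)
    return detail, coarse


print(multi_scale_grids(800, 600))  # ((4, 3), (3, 2))
```

For an 800x600 image, the sketch picks a 4x3 detailed grid and a 3x2 coarse grid, so the vertical and horizontal cut lines of the two scales do not all overlap.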

TODO

Open source code, weights, and data
Support training using 3090 GPUs (24 GB video memory)
Mini-Monkey with different LLMs

Model Zoo

Mini-Monkey was trained on a dataset using eight RTX 3090 GPUs.

Model	#param	MME	RWQA	AI2D	CCB	SEED	HallB	POPE	MathVista	DocVQA	ChartQA	InfoVQA	TextVQA	OCRBench
Mini-Gemini	35B	2141.0	-	-	-	-	-	-	43.3	-	-	-	-	-
LLaVA-NeXT	35B	2028.0	-	74.9	49.2	75.9	34.8	89.6	46.5	-	-	-	-	-
InternVL 1.2	40B	2175.4	67.5	79.0	59.2	75.6	47.6	88.0	47.7	-	-	-	-	-
InternVL 1.5	26B	2187.8	66.0	80.7	69.8	76.0	49.3	88.3	53.5	90.9	83.8	72.5	80.6	724
DeepSeek-VL	1.7B	1531.6	49.7	51.5	37.6	43.7	27.6	85.9	29.4	-	-	-	-	-
Mini-Gemini	2.2B	1653.0	-	-	-	-	-	-	29.4	-	-	-	-	-
Bunny-StableLM-2	2B	1602.9	-	-	-	58.8	-	85.9	-	-	-	-	-	-
MiniCPM-V-2	2.8B	1808.6	55.8	62.9	48.0	-	36.1	86.3	38.7	71.9	55.6	-	74.1	605
InternVL 2	2B	1876.8	57.3	74.1	74.7	70.9	37.9	85.2	46.3	86.9	76.2	58.9	73.4	784
Mini-Monkey (ours)	2B	1881.9	57.5	74.7	75.5	71.3	38.7	86.7	47.3	87.4	76.5	60.1	75.7	802
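
As a quick way to read the last two rows, the following sketch computes the per-benchmark difference between Mini-Monkey and InternVL 2 (2B). The scores are copied from the table above; the helper name `score_deltas` is only illustrative.

```python
# Scores copied from the InternVL 2 (2B) and Mini-Monkey (2B) rows of the table.
BENCHMARKS = [
    "MME", "RWQA", "AI2D", "CCB", "SEED", "HallB", "POPE",
    "MathVista", "DocVQA", "ChartQA", "InfoVQA", "TextVQA", "OCRBench",
]
INTERNVL2_2B = [1876.8, 57.3, 74.1, 74.7, 70.9, 37.9, 85.2,
                46.3, 86.9, 76.2, 58.9, 73.4, 784]
MINI_MONKEY_2B = [1881.9, 57.5, 74.7, 75.5, 71.3, 38.7, 86.7,
                  47.3, 87.4, 76.5, 60.1, 75.7, 802]


def score_deltas(ours, baseline):
    """Return {benchmark: ours - baseline}, rounded to one decimal place."""
    return {
        name: round(a - b, 1)
        for name, a, b in zip(BENCHMARKS, ours, baseline)
    }


deltas = score_deltas(MINI_MONKEY_2B, INTERNVL2_2B)
for name, delta in deltas.items():
    print(f"{name:>10}: {delta:+.1f}")
```

Every difference is positive, with the largest absolute gain on OCRBench (+18 points).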

Environment

conda create -n minimonkey python=3.10
conda activate minimonkey
git clone https://github.com/Yuliang-Liu/Monkey.git
cd ./Monkey/project/mini_monkey
pip install -r requirements.txt

Install flash-attn==2.3.6:

pip install flash-attn==2.3.6 --no-build-isolation

Alternatively, you can compile from source:

git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v2.3.6
python setup.py install
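
After installation, a short check like the one below can confirm that the interpreter and flash-attn versions match the ones used above. This is a minimal sketch using only the standard library; the helper names are hypothetical and not part of the repository.

```python
import sys
from importlib import metadata
from typing import List, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected_flash_attn: str = "2.3.6") -> List[str]:
    """Collect human-readable problems with the current environment."""
    problems = []

    # The conda environment above is created with python=3.10.
    if sys.version_info[:2] != (3, 10):
        found = f"{sys.version_info[0]}.{sys.version_info[1]}"
        problems.append(f"expected Python 3.10, found {found}")

    version = installed_version("flash-attn")
    if version is None:
        problems.append("flash-attn is not installed")
    elif version != expected_flash_attn:
        problems.append(
            f"expected flash-attn {expected_flash_attn}, found {version}"
        )
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

An empty result means both versions match; any warnings point at the step that needs to be repeated.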

Evaluate

We use the VLMEvalKit repository for model evaluation.

Inference

We provide an example of inference code here

Train

Prepare Training Datasets

Inspired by InternVL 1.2, we adopted LLaVA-ZH, DVQA, ChartQA, AI2D, DocVQA, GeoQA+, and SynthDoG-EN as training data. Most of the data remains consistent with InternVL 1.2.

First, download the annotation files and place them in the playground/opensource/ folder.

Second, download all the images we used.

AI2D: ai2d_images (provided by InternLM-XComposer)
ChartQA: ChartQA Dataset
COCO: train2017
DocVQA: train, val, test
DVQA: images
LLaVA-Pretrain: images
SynthDoG-EN: We only use the 00000~00004 parquet files for now, a total of 30K images. We provide the converted images.
GeoQA+: GeoQA+ images

Then, organize the data as follows in playground/data:

playground/
├── opensource
│   ├── ai2d_train_12k.jsonl
│   ├── chartqa_train_18k.jsonl
│   ├── docvqa_train_10k.jsonl
│   ├── dvqa_train_200k.jsonl
│   ├── geoqa+.jsonl
│   ├── llava_instruct_150k_zh.jsonl
│   └── synthdog_en.jsonl
├── data
│   ├── ai2d
│   │   ├── abc_images
│   │   └── images
│   ├── chartqa
│   │   ├── test
│   │   ├── train
│   │   └── val
│   ├── coco
│   │   └── train2017
│   ├── docvqa
│   │   ├── test
│   │   ├── train
│   │   └── val
│   ├── dvqa
│   │   └── images
│   ├── llava
│   │   └── llava_pretrain
│   │       └── images
│   ├── synthdog-en
│   │   └── images
│   ├── geoqa+
│   │   └── images
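
Before launching training, it can help to verify that the layout above is in place. The sketch below lists any expected file or folder that is missing; the path names are taken from the tree above, while the function name `missing_paths` is only illustrative and not part of the repository.

```python
from pathlib import Path
from typing import List

# Annotation files expected under playground/opensource/.
ANNOTATION_FILES = [
    "ai2d_train_12k.jsonl",
    "chartqa_train_18k.jsonl",
    "docvqa_train_10k.jsonl",
    "dvqa_train_200k.jsonl",
    "geoqa+.jsonl",
    "llava_instruct_150k_zh.jsonl",
    "synthdog_en.jsonl",
]

# Image folders expected under playground/data/.
IMAGE_DIRS = [
    "ai2d/abc_images",
    "ai2d/images",
    "chartqa/test",
    "chartqa/train",
    "chartqa/val",
    "coco/train2017",
    "docvqa/test",
    "docvqa/train",
    "docvqa/val",
    "dvqa/images",
    "llava/llava_pretrain/images",
    "synthdog-en/images",
    "geoqa+/images",
]


def missing_paths(root: str = "playground") -> List[str]:
    """Return every expected annotation file or image folder that is absent."""
    root_path = Path(root)
    expected = [root_path / "opensource" / name for name in ANNOTATION_FILES]
    expected += [root_path / "data" / sub for sub in IMAGE_DIRS]
    return [str(path) for path in expected if not path.exists()]


if __name__ == "__main__":
    for path in missing_paths():
        print("missing:", path)
```

Run it from the directory that contains playground/; no output means every expected path exists.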

Execute the training code:

sh shell/minimonkey/minimonkey_finetune_full.sh

Citing Mini-Monkey

If you wish to refer to the baseline results published here, please use the following BibTeX entry:

@article{huang2024mini,
  title={Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models},
  author={Huang, Mingxin and Liu, Yuliang and Liang, Dingkang and Jin, Lianwen and Bai, Xiang},
  journal={arXiv preprint arXiv:2408.02034},
  year={2024}
}

Copyright

We welcome suggestions to help us improve Mini-Monkey. For any queries, please contact Dr. Yuliang Liu: ylliu@hust.edu.cn. If you find something interesting, please feel free to share it with us by email or open an issue.