Edit model card

Model Card for Model ID

Llama-3.2V-11B-cot is the first version of LLaVA-o1, which is a visual language model capable of spontaneous, systematic reasoning.

The model was proposed in LLaVA-o1: Let Vision Language Models Reason Step-by-Step.

Model Details

  • License: apache-2.0
  • Finetuned from model: meta-llama/Llama-3.2-11B-Vision-Instruct

Benchmark Results

MMStar MMBench MMVet MathVista AI2D Hallusion Average
57.6 75.0 60.3 54.8 85.7 47.8 63.5

Reproduction

To reproduce our results, you should use VLMEvalKit and the following settings.

Parameter Value
do_sample True
temperature 0.6
top_p 0.9
max_new_tokens 2048

You may change them in this file, line 80-83, and modify the max_new_tokens throughout the file.

Note: We follow the same settings as Llama-3.2-11B-Vision-Instruct, except that we extend the max_new_tokens to 2048.

After you get the results, you should filter the model output and only keep the outputs between <CONCLUSION> and </CONCLUSION>.

This shouldn't have any difference in theory, but empirically we observe some performance difference because the jugder GPT-4o can be inaccurate sometimes.

By keeping the outputs between <CONCLUSION> and </CONCLUSION>, most answers can be direclty extracted using VLMEvalKit system, which can be much less biased.

How to Get Started with the Model

You can use the inference code for Llama-3.2-11B-Vision-Instruct.

Training Details

Training Data

The model is trained on the LLaVA-o1-100k dataset (to be released).

Training Procedure

The model is finetuned on llama-recipes with the following settings. Using the same setting should accurately reproduce our results.

Parameter Value
FSDP enabled
lr 1e-5
num_epochs 3
batch_size_training 4
use_fast_kernels True
run_validation False
batching_strategy padding
context_length 4096
gradient_accumulation_steps 1
gradient_clipping False
gradient_clipping_threshold 1.0
weight_decay 0.0
gamma 0.85
seed 42
use_fp16 False
mixed_precision True

Bias, Risks, and Limitations

The model may generate biased or offensive content, similar to other VLMs, due to limitations in the training data. Technically, the model's performance in aspects like instruction following still falls short of leading industry models.

Downloads last month
5,235
Safetensors
Model size
10.7B params
Tensor type
F32
Β·
Inference Examples
Inference API (serverless) does not yet support transformers models for this pipeline type.

Model tree for Xkev/Llama-3.2V-11B-cot

Finetuned
(60)
this model
Finetunes
1 model

Spaces using Xkev/Llama-3.2V-11B-cot 3

Collection including Xkev/Llama-3.2V-11B-cot