---
license: apache-2.0
datasets:
- omkarthawakar/VRC-Bench
- Xkev/LLaVA-CoT-100k
language:
- en
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
pipeline_tag: question-answering
---

## LlamaV-o1
## Overview

**LlamaV-o1** is an advanced multimodal large language model (MLLM) designed for complex visual reasoning tasks. Built on a foundation of cutting-edge curriculum learning and optimized with techniques like beam search, LlamaV-o1 demonstrates exceptional performance across diverse benchmarks. It is fine-tuned for step-by-step reasoning, enabling it to tackle tasks in domains such as visual perception, mathematical reasoning, social and cultural contexts, medical imaging, and document understanding.

The model is designed with a focus on interpretability and precision. By leveraging a structured reasoning approach, LlamaV-o1 provides coherent and accurate explanations for its decisions, making it an excellent tool for research and applications requiring high levels of reasoning. With over 4,000 manually verified reasoning steps in its benchmark evaluations, LlamaV-o1 sets a new standard for multimodal reasoning, delivering consistent and reliable results across challenging scenarios.

### Key Features:
- **Model Size:** 11 billion parameters.
- **Architecture:** Based on the Llama 3.2 Vision family (Llama-3.2-11B-Vision-Instruct).
- **Fine-Tuning:** Enhanced for instruction-following, chain-of-thought reasoning, and robust generalization across tasks.
- **Applications:** Ideal for use cases such as conversational agents, educational tools, content creation, and more.

---

## Model Details

- **Developed By:** MBZUAI
- **Model Version:** v0.1
- **Release Date:** 13th January 2025
- **Training Dataset:** Diverse multilingual corpus, including high-quality sources for instruction tuning, chain-of-thought datasets, and general-purpose corpora.
- **Framework:** PyTorch

---

## Intended Use

**LlamaV-o1** is designed for a wide range of language and multimodal reasoning tasks, including but not limited to:

- Text Generation
- Sentiment Analysis
- Text Summarization
- Question Answering
- Chain-of-Thought Reasoning

### Out-of-Scope Use

The model should not be used in applications requiring high-stakes decision-making, such as healthcare diagnosis, financial predictions, or any scenario involving potential harm.

---

## Training Procedure

- **Fine-Tuning:** The model was fine-tuned on a dataset optimized for reasoning, coherence, and diversity, leveraging instruction-tuning techniques to enhance usability in downstream applications.
- **Optimizations:** Includes inference scaling optimizations to balance performance and computational efficiency.

---

## Evaluation

### Benchmarks

LlamaV-o1 has been evaluated on a suite of benchmark tasks:

- **Reasoning:** [VRC-Bench](https://huggingface.co/datasets/omkarthawakar/VRC-Bench)

### Limitations

While the model performs well on a broad range of tasks, it may struggle with:

- Highly technical, domain-specific knowledge outside the training corpus.
- Generating accurate outputs for ambiguous or adversarial prompts.

---

## Usage

```python
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "omkarthawakar/LlamaV-o1"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```

Please refer to [llamav-o1.py](https://github.com/mbzuai-oryx/LlamaV-o1/blob/main/eval/llamav-o1.py) for inference.
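For a quick start, below is a minimal single-image inference sketch using the standard Llama-3.2-Vision chat-template API of `MllamaForConditionalGeneration`. The image path, prompt wording, and generation settings are illustrative assumptions, not the exact setup used in our evaluations; for the step-by-step prompting and beam-search configuration used in the paper, follow the `llamav-o1.py` script linked above.

```python
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "omkarthawakar/LlamaV-o1"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder path: substitute any RGB image you want the model to reason over.
image = Image.open("example.jpg")

# Prompt wording is illustrative; the model is tuned to produce step-by-step reasoning
# followed by a final answer.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Reason step by step about the image, then give your final answer: what is shown?"},
        ],
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

# Generation settings here (max_new_tokens) are assumptions, not the paper's exact configuration.
output = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```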
### Results

**Table 1:** Comparison of models based on Final Answer accuracy and Reasoning Steps performance on the proposed VRC-Bench. The best results in each case (closed-source and open-source) are in bold.

Our LlamaV-o1 achieves superior performance compared to its open-source counterpart (Llava-CoT) while also being competitive against the closed-source models.

| **Model** | **GPT-4o** | **Claude-3.5** | **Gemini-2.0** | **Gemini-1.5 Pro** | **Gemini-1.5 Flash** | **GPT-4o Mini** | **Llama-3.2 Vision** | **Mulberry** | **Llava-CoT** | **LlamaV-o1 (Ours)** |
|-------------|------------|----------------|----------------|-------------------|--------------------|----------------|--------------------|-------------|--------------|-------------------|
| **Final Answer** | 59.28 | **61.35** | 61.16 | **61.35** | 54.99 | 56.39 | 48.40 | 51.90 | 54.09 | **56.49** |
| **Reasoning Steps** | **76.68** | 72.12 | 74.08 | 72.12 | 71.86 | 74.05 | 58.37 | 63.86 | 66.21 | **68.93** |

---

### Training Data

LlamaV-o1 is trained on the [LLaVA-CoT-100k dataset](https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k). We have formatted the training samples for multi-step reasoning.

### Training Procedure

The LlamaV-o1 model is fine-tuned using [llama-recipes](https://github.com/Meta-Llama/llama-recipes). A detailed training procedure will be released soon.

### Citation

If you find this paper useful, please consider starring 🌟 our [GitHub](https://github.com/mbzuai-oryx/LlamaV-o1) repo and citing 📑 our paper:

```
@misc{thawakar2025llamavo1,
      title={LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs},
      author={Omkar Thawakar and Dinura Dissanayake and Ketan More and Ritesh Thawkar and Ahmed Heakl and Noor Ahsan and Yuhao Li and Mohammed Zumri and Jean Lahoud and Rao Muhammad Anwer and Hisham Cholakkal and Ivan Laptev and Mubarak Shah and Fahad Shahbaz Khan and Salman Khan},
      year={2025},
      eprint={2501.06186},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.06186},
}
```