LlamaV-o1
Overview
LlamaV-o1 is an advanced multimodal large language model (LLM) designed for complex visual reasoning tasks. Trained with a curriculum learning strategy and paired with inference-time optimizations such as beam search, LlamaV-o1 demonstrates strong performance across diverse benchmarks. It is fine-tuned for step-by-step reasoning, enabling it to tackle tasks in domains such as visual perception, mathematical reasoning, social and cultural contexts, medical imaging, and document understanding.
The model is designed with a focus on interpretability and precision. By leveraging a structured reasoning approach, LlamaV-o1 provides coherent and accurate explanations for its decisions, making it an excellent tool for research and applications requiring high levels of reasoning. With over 4,000 manually verified reasoning steps in its benchmark evaluations, LlamaV-o1 sets a new standard for multimodal reasoning, delivering consistent and reliable results across challenging scenarios.
Key Features:
- Model Size: 11 billion parameters.
- Architecture: Based on the Llama 3.2 Vision architecture (meta-llama/Llama-3.2-11B-Vision-Instruct).
- Fine-Tuning: Enhanced for instruction-following, chain-of-thought reasoning, and robust generalization across tasks.
- Applications: Ideal for use cases such as conversational agents, educational tools, content creation, and more.
Model Details
- Developed By: MBZUAI
- Model Version: v0.1
- Release Date: 13th January 2025
- Training Dataset: Diverse multilingual corpus, including high-quality sources for instruction tuning, chain-of-thought datasets, and general-purpose corpora.
- Framework: PyTorch
Intended Use
LlamaV-o1 is designed for a wide range of multimodal and NLP tasks, including but not limited to:
- Text Generation
- Sentiment Analysis
- Text Summarization
- Question Answering
- Chain-of-Thought Reasoning
Out-of-Scope Use
The model should not be used in applications requiring high-stakes decision-making, such as healthcare diagnosis, financial predictions, or any scenarios involving potential harm.
Training Procedure
- Fine-Tuning: The model was fine-tuned on a dataset optimized for reasoning, coherence, and diversity, leveraging instruction-tuning techniques to enhance usability in downstream applications.
- Optimizations: Includes inference scaling optimizations to balance performance and computational efficiency.
Evaluation
Benchmarks
LlamaV-o1 has been evaluated on a suite of benchmark tasks:
- Reasoning: VRC-Bench
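To take a first look at the benchmark, the sketch below loads it with the Hugging Face datasets library. The Hub ID used here (omkarthawakar/VRC-Bench) is an assumption; check the paper or repository for the exact dataset location, and inspect the printed example to see its actual fields.

from datasets import load_dataset

# Assumption: VRC-Bench is hosted on the Hugging Face Hub under this ID; adjust if the repo uses a different one.
bench = load_dataset("omkarthawakar/VRC-Bench")
print(bench)                 # list the available splits
split = next(iter(bench))    # pick the first split
print(bench[split][0])       # inspect one example to see its fields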
Limitations
While the model performs well on a broad range of tasks, it may struggle with:
- Highly technical, domain-specific knowledge outside the training corpus.
- Generating accurate outputs for ambiguous or adversarial prompts.
Usage
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "omkarthawakar/LlamaV-o1"

# Load the model in bfloat16 and let Accelerate place it across available devices.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
Please refer to llamav-o1.py for inference.
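Building on the loading snippet above, here is a minimal end-to-end sketch for a single image-question pair. The image URL and question are placeholders, and the beam width is an illustrative value for the inference-time scaling mentioned earlier, not a recommended setting; llamav-o1.py remains the reference inference script.

import requests
from PIL import Image

# Placeholder inputs: any RGB image and question will do.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)
question = "How many bars in this chart exceed 50? Explain step by step."

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]}
]

# Apply the chat template, then encode image and text together.
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

# num_beams > 1 enables beam-search decoding (inference-time scaling); the value here is illustrative.
output = model.generate(**inputs, max_new_tokens=1024, num_beams=4)
print(processor.decode(output[0], skip_special_tokens=True))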
Results
Table 1: Comparison of models based on Final Answer accuracy and Reasoning Steps performance on the proposed VRC-Bench. The best results in each case (closed-source and open-source) are in bold. Our LlamaV-o1 achieves superior performance compared to its open-source counterpart (Llava-CoT) while also being competitive against the closed-source models.
| Model | GPT-4o | Claude-3.5 | Gemini-2.0 | Gemini-1.5 Pro | Gemini-1.5 Flash | GPT-4o Mini | Llama-3.2 Vision | Mulberry | Llava-CoT | LlamaV-o1 (Ours) |
|---|---|---|---|---|---|---|---|---|---|---|
| Final Answer | 59.28 | 61.35 | 61.16 | 61.35 | 54.99 | 56.39 | 48.40 | 51.90 | 54.09 | 56.49 |
| Reasoning Steps | 76.68 | 72.12 | 74.08 | 72.12 | 71.86 | 74.05 | 58.37 | 63.86 | 66.21 | 68.93 |
Training Data
LlamaV-o1 is trained on the LLaVA-CoT-100k dataset. We have reformatted the training samples for multi-step reasoning.
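For illustration only, here is a hypothetical sketch of what one reformatted multi-step reasoning sample could look like; the field names and step wording are assumptions made for readability, not the released schema (see LLaVA-CoT-100k and the paper for the actual format).

# Hypothetical sample layout; field names are illustrative, not the released schema.
sample = {
    "image": "train/000123.jpg",   # path to the associated image
    "question": "How many bars in the chart exceed 50?",
    "reasoning_steps": [
        "Step 1: Identify the chart type and read the y-axis scale.",
        "Step 2: Compare each bar's height against the 50 mark.",
        "Step 3: Count the bars that clearly exceed that threshold.",
    ],
    "final_answer": "3",
}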
Training Procedure
The LlamaV-o1 model is fine-tuned using llama-recipes. A detailed training procedure will be released soon.
Citation
If you find this work useful, please consider starring 🌟 our GitHub repo and citing 📑 our paper:
@misc{thawakar2025llamavo1,
title={LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs},
author={Omkar Thawakar and Dinura Dissanayake and Ketan More and Ritesh Thawkar and Ahmed Heakl and Noor Ahsan and Yuhao Li and Mohammed Zumri and Jean Lahoud and Rao Muhammad Anwer and Hisham Cholakkal and Ivan Laptev and Mubarak Shah and Fahad Shahbaz Khan and Salman Khan},
year={2025},
eprint={2501.06186},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.06186},
}
Base Model
- meta-llama/Llama-3.2-11B-Vision-Instruct