---
license: apache-2.0
datasets:
- omkarthawakar/VRC-Bench
- Xkev/LLaVA-CoT-100k
language:
- en
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
pipeline_tag: question-answering
---

## Overview
**LlamaV-o1** is an advanced multimodal large language model (LLM) designed for complex visual reasoning tasks. Built on a curriculum learning strategy and optimized with inference-time techniques such as beam search, LlamaV-o1 demonstrates strong performance across diverse benchmarks. It is fine-tuned for step-by-step reasoning, enabling it to tackle tasks in domains such as visual perception, mathematical reasoning, social and cultural contexts, medical imaging, and document understanding.

The model is designed with a focus on interpretability and precision. By leveraging a structured reasoning approach, LlamaV-o1 provides coherent and accurate explanations for its decisions, making it an excellent tool for research and applications that require reliable reasoning. Evaluated against over 4,000 manually verified reasoning steps in its benchmark evaluations, LlamaV-o1 sets a new standard for multimodal reasoning, delivering consistent and reliable results across challenging scenarios.

### Key Features:
- **Model Size:** 11 billion parameters.
- **Architecture:** Built on the Llama 3.2 Vision architecture (base model: meta-llama/Llama-3.2-11B-Vision-Instruct).
- **Fine-Tuning:** Enhanced for instruction following, chain-of-thought reasoning, and robust generalization across tasks.
- **Applications:** Ideal for use cases such as conversational agents, educational tools, content creation, and more.

---
## Model Details
- **Developed By:** MBZUAI
- **Model Version:** v0.1
- **Release Date:** 13th January 2025
- **Training Dataset:** Diverse multilingual corpus, including high-quality sources for instruction tuning, chain-of-thought datasets, and general-purpose corpora.
- **Framework:** PyTorch

---

## Intended Use
**LlamaV-o1** is designed for a wide range of NLP and vision-language tasks, including but not limited to:
- Text Generation
- Sentiment Analysis
- Text Summarization
- Question Answering
- Chain-of-Thought Reasoning

### Out-of-Scope Use
The model should not be used in applications requiring high-stakes decision-making, such as healthcare diagnosis, financial predictions, or any scenario involving potential harm.

---

## Training Procedure
- **Fine-Tuning:** The model was fine-tuned on a dataset optimized for reasoning, coherence, and diversity, leveraging instruction-tuning techniques to enhance usability in downstream applications.
- **Optimizations:** Includes inference scaling optimizations, such as beam search, to balance performance and computational efficiency (see the sketch after this list).
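
As one concrete example of such inference-time scaling, the sketch below enables beam search through the standard `transformers` `generate()` API. This is a minimal sketch: the beam width, prompt, and blank placeholder image are illustrative assumptions, not the authors' settings.

```python
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "omkarthawakar/LlamaV-o1"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Blank placeholder image; substitute a real one in practice.
image = Image.new("RGB", (560, 560), "white")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image step by step."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=256,
    num_beams=4,          # keep 4 candidate continuations alive at each step
    early_stopping=True,  # stop once every beam has produced an end token
)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```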

---

## Evaluation

### Benchmarks
LlamaV-o1 has been evaluated on a suite of benchmark tasks:
- **Reasoning:** [VRC-Bench](https://huggingface.co/datasets/omkarthawakar/VRC-Bench)

### Limitations
While the model performs well on a broad range of tasks, it may struggle with:
- Highly technical, domain-specific knowledge outside the training corpus.
- Generating accurate outputs for ambiguous or adversarial prompts.

---

## Usage
```python
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "omkarthawakar/LlamaV-o1"

# Load the model in bfloat16 and shard it across available devices.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```

Please refer to [llamav-o1.py](https://github.com/mbzuai-oryx/LlamaV-o1/blob/main/eval/llamav-o1.py) for inference.
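
As a quick start, the following is a minimal end-to-end inference sketch built on the standard `transformers` Mllama API. The image path, prompt, and generation settings are illustrative placeholders; the linked `llamav-o1.py` remains the reference pipeline.

```python
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "omkarthawakar/LlamaV-o1"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image and prompt; replace with your own inputs.
image = Image.open("example.jpg")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Answer step by step: what is happening in this image?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

# Generate, then decode only the newly produced tokens.
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```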

### Results
**Table 1:** Comparison of models on Final Answer accuracy and Reasoning Steps performance on the proposed VRC-Bench. The best results in each case (closed-source and open-source) are in bold. Our LlamaV-o1 achieves superior performance compared to its open-source counterpart (Llava-CoT) while also being competitive with the closed-source models.

| **Model** | **GPT-4o** | **Claude-3.5** | **Gemini-2.0** | **Gemini-1.5 Pro** | **Gemini-1.5 Flash** | **GPT-4o Mini** | **Llama-3.2 Vision** | **Mulberry** | **Llava-CoT** | **LlamaV-o1 (Ours)** |
|-------------|------------|----------------|----------------|-------------------|--------------------|----------------|--------------------|-------------|--------------|-------------------|
| **Final Answer** | 59.28 | **61.35** | 61.16 | **61.35** | 54.99 | 56.39 | 48.40 | 51.90 | 54.09 | **56.49** |
| **Reasoning Steps** | **76.68** | 72.12 | 74.08 | 72.12 | 71.86 | 74.05 | 58.37 | 63.86 | 66.21 | **68.93** |

---

### Training Data

LlamaV-o1 is trained on the [LLaVA-CoT-100k dataset](https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k). We have reformatted the training samples for multi-step reasoning, as sketched below.
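
The exact reformatted schema has not been published, so the field names and step structure below are assumptions meant only to convey what a multi-step reasoning sample in a LLaVA-style conversation format might look like.

```python
# Hypothetical multi-step reasoning sample (illustrative; not the released schema).
sample = {
    "image": "images/000000123456.jpg",  # placeholder path
    "conversations": [
        {"from": "human", "value": "<image>\nHow many apples are on the table?"},
        {"from": "gpt", "value": (
            "Step 1: Locate the table in the image.\n"
            "Step 2: Count the distinct apples visible on its surface.\n"
            "Step 3: Check for partially occluded apples.\n"
            "Final Answer: There are 3 apples."
        )},
    ],
}
```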

### Training Procedure

The LlamaV-o1 model is fine-tuned using [llama-recipes](https://github.com/Meta-Llama/llama-recipes). A detailed training procedure will be released soon!

### Citation

Coming soon!