---
license: apache-2.0
datasets:
- omkarthawakar/VRC-Bench
- Xkev/LLaVA-CoT-100k
language:
- en
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
pipeline_tag: question-answering
---

## Overview
**LlamaV-o1** is a multimodal large language model (LLM) designed for complex visual reasoning tasks.
Trained with curriculum learning and optimized with inference-time techniques such as beam search,
LlamaV-o1 demonstrates strong performance across diverse benchmarks.
It is fine-tuned for step-by-step reasoning, enabling it to tackle tasks in domains such as visual perception,
mathematical reasoning, social and cultural contexts, medical imaging, and document understanding.

The model is designed with a focus on interpretability and precision. By following a structured reasoning approach,
LlamaV-o1 provides coherent and accurate explanations for its decisions, making it well suited for research
and applications that require transparent, step-by-step reasoning. Evaluated against over 4,000 manually verified reasoning steps
in the proposed VRC-Bench, LlamaV-o1 sets a new standard for multimodal step-by-step reasoning, delivering consistent and reliable results across challenging scenarios.

### Key Features:
- **Model Size:** 11 billion parameters.
- **Architecture:** Based on the Llama 3.2 Vision (mllama) architecture.
- **Fine-Tuning:** Enhanced for instruction following, chain-of-thought reasoning, and robust generalization across tasks.
- **Applications:** Ideal for use cases such as conversational agents, educational tools, content creation, and more.

---
## Model Details
- **Developed By:** MBZUAI
- **Model Version:** v0.1
- **Release Date:** 13th January 2025
- **Training Dataset:** Diverse multimodal corpus, including high-quality sources for instruction tuning, chain-of-thought datasets, and general-purpose corpora.
- **Framework:** PyTorch

---

## Intended Use
**LlamaV-o1** is designed for a wide range of vision-language and NLP tasks, including but not limited to:
- Text Generation
- Sentiment Analysis
- Text Summarization
- Question Answering
- Chain-of-Thought Reasoning

### Out-of-Scope Use
The model should not be used in applications requiring high-stakes decision-making, such as healthcare diagnosis, financial predictions, or any scenarios involving potential harm.

---

## Training Procedure
- **Fine-Tuning:** The model was fine-tuned on a dataset optimized for reasoning, coherence, and diversity, leveraging instruction-tuning techniques to enhance usability in downstream applications.
- **Optimizations:** Includes inference-time scaling optimizations, such as beam search, to balance performance and computational efficiency.

---
## Evaluation

### Benchmarks
LlamaV-o1 has been evaluated on a suite of benchmark tasks:
- **Reasoning:** [VRC-Bench](https://huggingface.co/datasets/omkarthawakar/VRC-Bench)

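The benchmark can be pulled locally with the `datasets` library. The sketch below only loads and inspects the data; the split and column names are not documented here, so check the printed schema before writing evaluation code against it.

```python
# Minimal sketch: download VRC-Bench and inspect its structure.
# Split and field names are assumptions to verify, not part of this card.
from datasets import load_dataset

bench = load_dataset("omkarthawakar/VRC-Bench")
print(bench)  # shows the available splits and their features

# Peek at one example from the first available split.
first_split = next(iter(bench.values()))
print(first_split[0])
```
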
### Limitations
While the model performs well on a broad range of tasks, it may struggle with:
- Highly technical, domain-specific knowledge outside the training corpus.
- Generating accurate outputs for ambiguous or adversarial prompts.

---
## Usage
```python
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "omkarthawakar/LlamaV-o1"

# Load the model in bfloat16 and shard it across available devices.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```

Please refer to [llamav-o1.py](https://github.com/mbzuai-oryx/LlamaV-o1/blob/main/eval/llamav-o1.py) for inference.

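For a quick local check, the sketch below runs a single image–question pair through the model loaded above, following the standard Llama 3.2 Vision chat-template usage in `transformers`. The image path, question, and generation settings are placeholders, and the official evaluation script may use different prompting and decoding; `num_beams > 1` is shown as one simple way to apply the inference-time scaling mentioned earlier.

```python
# Illustrative inference sketch; assumes `model` and `processor` from the block above.
# The image path, prompt, and generation settings are placeholders.
from PIL import Image

image = Image.open("example_chart.png")  # placeholder image

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Which month has the highest sales? Reason step by step."},
        ],
    }
]

# Build the prompt with the processor's chat template, then encode text and image together.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

# num_beams > 1 enables beam search, a simple form of inference-time scaling.
output = model.generate(**inputs, max_new_tokens=512, num_beams=4)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
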
### Results
**Table 1:** Comparison of models on Final Answer accuracy and Reasoning Steps performance on the proposed VRC-Bench. The best closed-source and open-source results are shown in bold. LlamaV-o1 outperforms its open-source counterpart (Llava-CoT) while remaining competitive with closed-source models.

| **Model** | **GPT-4o** | **Claude-3.5** | **Gemini-2.0** | **Gemini-1.5 Pro** | **Gemini-1.5 Flash** | **GPT-4o Mini** | **Llama-3.2 Vision** | **Mulberry** | **Llava-CoT** | **LlamaV-o1 (Ours)** |
|-------------|------------|----------------|----------------|-------------------|--------------------|----------------|--------------------|-------------|--------------|-------------------|
| **Final Answer** | 59.28 | **61.35** | 61.16 | **61.35** | 54.99 | 56.39 | 48.40 | 51.90 | 54.09 | **56.49** |
| **Reasoning Steps** | **76.68** | 72.12 | 74.08 | 72.12 | 71.86 | 74.05 | 58.37 | 63.86 | 66.21 | **68.93** |

---

### Training Data

LlamaV-o1 is trained on the [LLaVA-CoT-100k dataset](https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k).
We have formatted the training samples for multi-step reasoning.

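As a purely illustrative sketch, a multi-step sample could look roughly like the structure below. The field names, step tags, and image path here are assumptions for illustration only; the actual format is defined by the LLaVA-CoT-100k schema and the project's data-preparation code.

```python
# Hypothetical layout of one multi-step reasoning training sample.
# Field names, step structure, and the image path are illustrative assumptions.
sample = {
    "image": "example_chart.png",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nWhich month shows the highest sales?",
        },
        {
            "from": "gpt",
            "value": (
                "Step 1: Read the axis labels to understand what the chart measures.\n"
                "Step 2: Compare the bar heights across all twelve months.\n"
                "Step 3: The tallest bar corresponds to July.\n"
                "Final answer: July."
            ),
        },
    ],
}
```
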
### Training Procedure

LlamaV-o1 is fine-tuned using [llama-recipes](https://github.com/Meta-Llama/llama-recipes).
A detailed training procedure will be released soon.

### Citation
Coming Soon!