---
license: apache-2.0
datasets:
- omkarthawakar/VRC-Bench
- Xkev/LLaVA-CoT-100k
language:
- en
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
pipeline_tag: question-answering
---

## Overview
**LlamaV-o1** is a multimodal large language model (LLM) designed for complex visual reasoning tasks.
Trained with curriculum learning and optimized with inference-time techniques such as beam search,
LlamaV-o1 demonstrates strong performance across diverse benchmarks.
It is fine-tuned for step-by-step reasoning, enabling it to tackle tasks in domains such as visual perception,
mathematical reasoning, social and cultural contexts, medical imaging, and document understanding.

The model is designed with a focus on interpretability and precision. By following a structured reasoning approach,
LlamaV-o1 provides coherent and accurate explanations for its decisions, making it well suited for research
and applications that require transparent, step-by-step reasoning. Evaluated against over 4,000 manually verified reasoning steps
in the proposed VRC-Bench, LlamaV-o1 sets a new standard for multimodal step-by-step reasoning, delivering consistent and reliable results across challenging scenarios.

### Key Features:
- **Model Size:** 11 billion parameters.
- **Architecture:** Based on the Llama 3.2 Vision (mllama) architecture.
- **Fine-Tuning:** Enhanced for instruction following, chain-of-thought reasoning, and robust generalization across tasks.
- **Applications:** Ideal for use cases such as conversational agents, educational tools, content creation, and more.

---
## Model Details
- **Developed By:** MBZUAI
- **Model Version:** v0.1
- **Release Date:** 13th January 2025
- **Training Dataset:** Diverse multimodal corpus, including high-quality sources for instruction tuning, chain-of-thought datasets, and general-purpose corpora.
- **Framework:** PyTorch

---

## Intended Use
**LlamaV-o1** is designed for a wide range of vision-language and NLP tasks, including but not limited to:
- Text Generation
- Sentiment Analysis
- Text Summarization
- Question Answering
- Chain-of-Thought Reasoning

### Out-of-Scope Use
The model should not be used in applications requiring high-stakes decision-making, such as healthcare diagnosis, financial predictions, or any scenarios involving potential harm.

---

## Training Procedure
- **Fine-Tuning:** The model was fine-tuned on a dataset optimized for reasoning, coherence, and diversity, leveraging instruction-tuning techniques to enhance usability in downstream applications.
- **Optimizations:** Includes inference-time scaling optimizations, such as beam search, to balance performance and computational efficiency.

---
## Evaluation

### Benchmarks
LlamaV-o1 has been evaluated on a suite of benchmark tasks:
- **Reasoning:** [VRC-Bench](https://huggingface.co/datasets/omkarthawakar/VRC-Bench)

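The benchmark can be pulled locally with the `datasets` library. The sketch below only loads and inspects the data; the split and column names are not documented here, so check the printed schema before writing evaluation code against it.

```python
# Minimal sketch: download VRC-Bench and inspect its structure.
# Split and field names are assumptions to verify, not part of this card.
from datasets import load_dataset

bench = load_dataset("omkarthawakar/VRC-Bench")
print(bench)  # shows the available splits and their features

# Peek at one example from the first available split.
first_split = next(iter(bench.values()))
print(first_split[0])
```
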
### Limitations
While the model performs well on a broad range of tasks, it may struggle with:
- Highly technical, domain-specific knowledge outside the training corpus.
- Generating accurate outputs for ambiguous or adversarial prompts.

---
## Usage
```python
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "omkarthawakar/LlamaV-o1"

# Load the model in bfloat16 and shard it across available devices.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```

Please refer to [llamav-o1.py](https://github.com/mbzuai-oryx/LlamaV-o1/blob/main/eval/llamav-o1.py) for inference.

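For a quick local check, the sketch below runs a single image–question pair through the model loaded above, following the standard Llama 3.2 Vision chat-template usage in `transformers`. The image path, question, and generation settings are placeholders, and the official evaluation script may use different prompting and decoding; `num_beams > 1` is shown as one simple way to apply the inference-time scaling mentioned earlier.

```python
# Illustrative inference sketch; assumes `model` and `processor` from the block above.
# The image path, prompt, and generation settings are placeholders.
from PIL import Image

image = Image.open("example_chart.png")  # placeholder image

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Which month has the highest sales? Reason step by step."},
        ],
    }
]

# Build the prompt with the processor's chat template, then encode text and image together.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

# num_beams > 1 enables beam search, a simple form of inference-time scaling.
output = model.generate(**inputs, max_new_tokens=512, num_beams=4)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
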
### Results
**Table 1:** Comparison of models on Final Answer accuracy and Reasoning Steps performance on the proposed VRC-Bench. The best closed-source and open-source results are shown in bold. LlamaV-o1 outperforms its open-source counterpart (Llava-CoT) while remaining competitive with closed-source models.

| **Model** | **GPT-4o** | **Claude-3.5** | **Gemini-2.0** | **Gemini-1.5 Pro** | **Gemini-1.5 Flash** | **GPT-4o Mini** | **Llama-3.2 Vision** | **Mulberry** | **Llava-CoT** | **LlamaV-o1 (Ours)** |
|-------------|------------|----------------|----------------|-------------------|--------------------|----------------|--------------------|-------------|--------------|-------------------|
| **Final Answer** | 59.28 | **61.35** | 61.16 | **61.35** | 54.99 | 56.39 | 48.40 | 51.90 | 54.09 | **56.49** |
| **Reasoning Steps** | **76.68** | 72.12 | 74.08 | 72.12 | 71.86 | 74.05 | 58.37 | 63.86 | 66.21 | **68.93** |

---

### Training Data

LlamaV-o1 is trained on the [LLaVA-CoT-100k dataset](https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k).
We have formatted the training samples for multi-step reasoning.

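As a purely illustrative sketch, a multi-step sample could look roughly like the structure below. The field names, step tags, and image path here are assumptions for illustration only; the actual format is defined by the LLaVA-CoT-100k schema and the project's data-preparation code.

```python
# Hypothetical layout of one multi-step reasoning training sample.
# Field names, step structure, and the image path are illustrative assumptions.
sample = {
    "image": "example_chart.png",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nWhich month shows the highest sales?",
        },
        {
            "from": "gpt",
            "value": (
                "Step 1: Read the axis labels to understand what the chart measures.\n"
                "Step 2: Compare the bar heights across all twelve months.\n"
                "Step 3: The tallest bar corresponds to July.\n"
                "Final answer: July."
            ),
        },
    ],
}
```
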
### Training Procedure

LlamaV-o1 is fine-tuned using [llama-recipes](https://github.com/Meta-Llama/llama-recipes).
A detailed training procedure will be released soon.

### Citation
Coming Soon!