Abstract
Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-o1, a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-o1 independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-o1 to achieve marked improvements in precision on reasoning-intensive tasks. To accomplish this, we compile the LLaVA-o1-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling. Remarkably, with only 100k training samples and a simple yet effective inference time scaling method, LLaVA-o1 not only outperforms its base model by 8.9% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.
Community
In this work, we introduce LLaVA-o1, a novel VLM designed to conduct autonomous multistage reasoning like GPT-o1. Our 11B model outperforms Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct. The key is training on structured data and a novel inference time scaling method—stage-level beam search
congrats, would be great to upload the model, here is the guide: https://huggingface.co/docs/hub/models-uploading
Paper summary is here: https://www.aimodels.fyi/papers/arxiv/llava-o1-let-vision-language-models-reason
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding (2024)
- Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning (2024)
- Vision-Language Models Can Self-Improve Reasoning via Reflection (2024)
- Large Language Models Can Self-Improve in Long-context Reasoning (2024)
- Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend