Vision Language Models - a kaizuberbuehler Collection

kaizuberbuehler 's Collections

Image Generation

Vision Language Models

Foundation Models

Synthetic Data and Self-Improvement

Agents

Video Generation

LM Prompt Engineering

LM Capabilities and Scaling

Music Generation

LM Architectures

Code Generation

Speech Synthesis

EXL2 Quantized Models

Vision Language Models

updated Oct 2

BLINK: Multimodal Large Language Models Can See but Not Perceive

Paper • 2404.12390 • Published Apr 18 • 24
TextSquare: Scaling up Text-Centric Visual Instruction Tuning

Paper • 2404.12803 • Published Apr 19 • 29
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

Paper • 2404.13013 • Published Apr 19 • 30
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

Paper • 2404.06512 • Published Apr 9 • 29
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

Paper • 2404.05719 • Published Apr 8 • 81
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Paper • 2404.05726 • Published Apr 8 • 20
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens

Paper • 2404.03413 • Published Apr 4 • 25
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Paper • 2311.16502 • Published Nov 27, 2023 • 35
Kosmos-2: Grounding Multimodal Large Language Models to the World

Paper • 2306.14824 • Published Jun 26, 2023 • 34
CogVLM: Visual Expert for Pretrained Language Models

Paper • 2311.03079 • Published Nov 6, 2023 • 23
Pegasus-v1 Technical Report

Paper • 2404.14687 • Published Apr 23 • 30
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Paper • 2404.16821 • Published Apr 25 • 53
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

Paper • 2404.16375 • Published Apr 25 • 16
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

Paper • 2404.16790 • Published Apr 25 • 7
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

Paper • 2404.16994 • Published Apr 25 • 35
What matters when building vision-language models?

Paper • 2405.02246 • Published May 3 • 100
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Paper • 2405.21075 • Published May 31 • 20
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

Paper • 2406.04325 • Published Jun 6 • 72
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Paper • 2406.09415 • Published Jun 13 • 50
OpenVLA: An Open-Source Vision-Language-Action Model

Paper • 2406.09246 • Published Jun 13 • 36
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

Paper • 2406.09411 • Published Jun 13 • 18
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

Paper • 2406.09403 • Published Jun 13 • 19
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus

Paper • 2406.08707 • Published Jun 13 • 15
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

Paper • 2406.11833 • Published Jun 17 • 61
VideoLLM-online: Online Video Large Language Model for Streaming Video

Paper • 2406.11816 • Published Jun 17 • 22
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation

Paper • 2406.09961 • Published Jun 14 • 54
Needle In A Multimodal Haystack

Paper • 2406.07230 • Published Jun 11 • 52
Wolf: Captioning Everything with a World Summarization Framework

Paper • 2407.18908 • Published Jul 26 • 31
Coarse Correspondence Elicit 3D Spacetime Understanding in Multimodal Language Model

Paper • 2408.00754 • Published Aug 1 • 21
OmniParser for Pure Vision Based GUI Agent

Paper • 2408.00203 • Published Aug 1 • 24
LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Paper • 2408.10188 • Published Aug 19 • 51
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Paper • 2408.12528 • Published Aug 22 • 50
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

Paper • 2409.01071 • Published Sep 2 • 27
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Paper • 2409.12191 • Published Sep 18 • 74
NVLM: Open Frontier-Class Multimodal LLMs

Paper • 2409.11402 • Published Sep 17 • 72
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

Paper • 2409.18125 • Published Sep 26 • 33
OmniBench: Towards The Future of Universal Omni-Language Models

Paper • 2409.15272 • Published Sep 23 • 26