OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation Paper • 2412.09585 • Published 14 days ago • 10
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension Paper • 2412.03704 • Published 22 days ago • 6
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation Paper • 2410.23277 • Published Oct 30 • 9
MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities Paper • 2408.00765 • Published Aug 1 • 12
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs Paper • 2404.16375 • Published Apr 25 • 16
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation Paper • 2311.07562 • Published Nov 13, 2023 • 13
MM-VID: Advancing Video Understanding with GPT-4V(ision) Paper • 2310.19773 • Published Oct 30, 2023 • 19
DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design Paper • 2310.15144 • Published Oct 23, 2023 • 13
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V Paper • 2310.11441 • Published Oct 17, 2023 • 26
OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation Paper • 2310.07749 • Published Oct 11, 2023 • 5
Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation Paper • 2310.08541 • Published Oct 12, 2023 • 17
Multimodal Foundation Models: From Specialists to General-Purpose Assistants Paper • 2309.10020 • Published Sep 18, 2023 • 40
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities Paper • 2308.02490 • Published Aug 4, 2023 • 16
DisCo: Disentangled Control for Referring Human Dance Generation in Real World Paper • 2307.00040 • Published Jun 30, 2023 • 25