Agent S: An Open Agentic Framework that Uses Computers Like a Human • Paper • 2410.08164 • Published Oct 10 • 24
NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models • Paper • 2407.12366 • Published Jul 17 • 4
Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding • Paper • 2406.19263 • Published Jun 27 • 9
Muffin or Chihuahua? Challenging Large Vision-Language Models with Multipanel VQA • Paper • 2401.15847 • Published Jan 29 • 2
VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing • Paper • 2406.12831 • Published Jun 18 • 5
Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation • Paper • 2406.09305 • Published Jun 13 • 4
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos • Paper • 2406.08407 • Published Jun 12 • 24
Discriminative Diffusion Models as Few-shot Vision and Language Learners • Paper • 2305.10722 • Published May 18, 2023 • 3
SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing • Paper • 2404.05717 • Published Apr 8 • 24