MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents Paper • 2410.03450 • Published Oct 4 • 32
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling Paper • 2409.19291 • Published Sep 28 • 18
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models Paper • 2409.17146 • Published Sep 25 • 99
Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection Paper • 2409.08513 • Published Sep 13 • 10
ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds Paper • 2409.09213 • Published Sep 13 • 11
MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh Tokenization Paper • 2408.02555 • Published Aug 5 • 28