FreeInit: Bridging Initialization Gap in Video Diffusion Models Paper • 2312.07537 • Published Dec 12, 2023 • 26
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents Paper • 2311.05437 • Published Nov 9, 2023 • 47
Möbius Transform for Mitigating Perspective Distortions in Representation Learning Paper • 2405.02296 • Published Mar 7 • 4
NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields Paper • 2404.01300 • Published Apr 1 • 4
DriveLM: Driving with Graph Visual Question Answering Paper • 2312.14150 • Published Dec 21, 2023 • 4
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model Paper • 2408.11039 • Published Aug 20 • 56
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering Paper • 2408.09174 • Published Aug 17 • 51
Meltemi: The first open Large Language Model for Greek Paper • 2407.20743 • Published Jul 30 • 67
Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey Paper • 2407.21794 • Published Jul 31 • 5
Gemma 2: Improving Open Language Models at a Practical Size Paper • 2408.00118 • Published Jul 31 • 75
SpaceVLMs Collection Features VLMs fine-tuned for enhanced spatial reasoning using a synthetic data pipeline similar to Spatial VLM. • 3 items • Updated Jul 26 • 1
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding Paper • 2407.12594 • Published Jul 17 • 19
SEED-Story: Multimodal Long Story Generation with Large Language Model Paper • 2407.08683 • Published Jul 11 • 22
MambaVision: A Hybrid Mamba-Transformer Vision Backbone Paper • 2407.08083 • Published Jul 10 • 27
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models Paper • 2407.07895 • Published Jul 10 • 40