Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration Paper • 2412.13180 • Published 9 days ago • 12
Apollo: An Exploration of Video Understanding in Large Multimodal Models Paper • 2412.10360 • Published 13 days ago • 131
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos Paper • 2409.19603 • Published Sep 29 • 18
Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation Paper • 2410.00890 • Published Oct 1 • 18
SyntheOcc: Synthesize Geometric-Controlled Street View Images through 3D Semantic MPIs Paper • 2410.00337 • Published Oct 1 • 10
Embodied-RAG: General non-parametric Embodied Memory for Retrieval and Generation Paper • 2409.18313 • Published Sep 26 • 3
What the Harm? Quantifying the Tangible Impact of Gender Bias in Machine Translation with a Human-centered Study Paper • 2410.00545 • Published Oct 1 • 5
ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer Paper • 2410.00086 • Published Sep 30 • 10
Posterior-Mean Rectified Flow: Towards Minimum MSE Photo-Realistic Image Restoration Paper • 2410.00418 • Published Oct 1 • 9
Helpful DoggyBot: Open-World Object Fetching using Legged Robots and Vision-Language Models Paper • 2410.00231 • Published Sep 30 • 6
DressRecon: Freeform 4D Human Reconstruction from Monocular Video Paper • 2409.20563 • Published Sep 30 • 7
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision Paper • 2407.06189 • Published Jul 8 • 25
ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback Paper • 2404.07987 • Published Apr 11 • 47
VideoAgent: Long-form Video Understanding with Large Language Model as Agent Paper • 2403.10517 • Published Mar 15 • 32
Describing Differences in Image Sets with Natural Language Paper • 2312.02974 • Published Dec 5, 2023 • 13