steveyin's Collections
Good Papers
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Paper
•
2405.20340
•
Published
•
19
Spectrally Pruned Gaussian Fields with Neural Compensation
Paper
•
2405.00676
•
Published
•
8
Paint by Inpaint: Learning to Add Image Objects by Removing Them First
Paper
•
2404.18212
•
Published
•
27
LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report
Paper
•
2405.00732
•
Published
•
118
No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding
Paper
•
2405.08344
•
Published
•
12
LoRA Learns Less and Forgets Less
Paper
•
2405.09673
•
Published
•
87
Octo: An Open-Source Generalist Robot Policy
Paper
•
2405.12213
•
Published
•
23
FIFO-Diffusion: Generating Infinite Videos from Text without Training
Paper
•
2405.11473
•
Published
•
53
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
Paper
•
2406.02523
•
Published
•
9
Towards a Personal Health Large Language Model
Paper
•
2406.06474
•
Published
•
17
Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis
Paper
•
2406.06216
•
Published
•
18
Vript: A Video Is Worth Thousands of Words
Paper
•
2406.06040
•
Published
•
22
Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
Paper
•
2406.06469
•
Published
•
23
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Paper
•
2406.16860
•
Published
•
57
VideoLLM-online: Online Video Large Language Model for Streaming Video
Paper
•
2406.11816
•
Published
•
21
Octo-planner: On-device Language Model for Planner-Action Agents
Paper
•
2406.18082
•
Published
•
47
Paper
•
2406.09414
•
Published
•
92
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Paper
•
2406.09415
•
Published
•
50
OpenVLA: An Open-Source Vision-Language-Action Model
Paper
•
2406.09246
•
Published
•
36
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Paper
•
2406.09403
•
Published
•
19
Transformers meet Neural Algorithmic Reasoners
Paper
•
2406.09308
•
Published
•
43
MotionClone: Training-Free Motion Cloning for Controllable Video Generation
Paper
•
2406.05338
•
Published
•
39
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Paper
•
2406.07476
•
Published
•
32
Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion
Paper
•
2406.04338
•
Published
•
34
The Prompt Report: A Systematic Survey of Prompting Techniques
Paper
•
2406.06608
•
Published
•
53
An Image is Worth 32 Tokens for Reconstruction and Generation
Paper
•
2406.07550
•
Published
•
55
4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models
Paper
•
2406.07472
•
Published
•
10
Mixture-of-Agents Enhances Large Language Model Capabilities
Paper
•
2406.04692
•
Published
•
55
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
Paper
•
2406.01014
•
Published
•
30
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Paper
•
2406.02430
•
Published
•
29
Agentless: Demystifying LLM-based Software Engineering Agents
Paper
•
2407.01489
•
Published
•
42
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper
•
2407.02477
•
Published
•
21
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Paper
•
2407.03320
•
Published
•
92
TokenPacker: Efficient Visual Projector for Multimodal LLM
Paper
•
2407.02392
•
Published
•
21
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation
Paper
•
2407.02869
•
Published
•
18
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Paper
•
2407.04620
•
Published
•
27
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence
Paper
•
2407.07061
•
Published
•
26
RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models
Paper
•
2407.06938
•
Published
•
21
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
Paper
•
2406.18009
•
Published
•
18
ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning
Paper
•
2406.19741
•
Published
•
59
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person
Paper
•
2407.16224
•
Published
•
24
Self-Supervised Vision Transformer for Enhanced Virtual Clothes Try-On
Paper
•
2406.10539
•
Published
•
1
Cross Anything: General Quadruped Robot Navigation through Complex Terrains
Paper
•
2407.16412
•
Published
•
4
A Simulation Benchmark for Autonomous Racing with Large-Scale Human Data
Paper
•
2407.16680
•
Published
•
11
POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation
Paper
•
2407.14931
•
Published
•
20
EVLM: An Efficient Vision-Language Model for Visual Understanding
Paper
•
2407.14177
•
Published
•
42
The Vision of Autonomic Computing: Can LLMs Make It a Reality?
Paper
•
2407.14402
•
Published
•
13
Internal Consistency and Self-Feedback in Large Language Models: A Survey
Paper
•
2407.14507
•
Published
•
46
Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle
Paper
•
2407.13833
•
Published
•
11
3D Gaussian Editing with A Single Image
Paper
•
2408.07540
•
Published
•
10
Segment Anything with Multiple Modalities
Paper
•
2408.09085
•
Published
•
21
SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views
Paper
•
2408.10195
•
Published
•
12
NeuFlow v2: High-Efficiency Optical Flow Estimation on Edge Devices
Paper
•
2408.10161
•
Published
•
11
Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning
Paper
•
2408.07931
•
Published
•
18
Automated Design of Agentic Systems
Paper
•
2408.08435
•
Published
•
38
Building and better understanding vision-language models: insights and future directions
Paper
•
2408.12637
•
Published
•
116
gsplat: An Open-Source Library for Gaussian Splatting
Paper
•
2409.06765
•
Published
•
12
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Paper
•
2409.06666
•
Published
•
55
GST: Precise 3D Human Body from a Single Image with Gaussian Splatting Transformers
Paper
•
2409.04196
•
Published
•
11
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Paper
•
2408.16725
•
Published
•
52
SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners
Paper
•
2408.16768
•
Published
•
26
Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches
Paper
•
2408.04567
•
Published
•
23
Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey
Paper
•
2409.11564
•
Published
•
19
The Imperative of Conversation Analysis in the Era of LLMs: A Survey of Tasks, Techniques, and Trends
Paper
•
2409.14195
•
Published
•
11
Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction
Paper
•
2409.18121
•
Published
•
7
Disco4D: Disentangled 4D Human Generation and Animation from a Single Image
Paper
•
2409.17280
•
Published
•
9
Enhancing Structured-Data Retrieval with GraphRAG: Soccer Data Case Study
Paper
•
2409.17580
•
Published
•
7
LEOPARD : A Vision Language Model For Text-Rich Multi-Image Tasks
Paper
•
2410.01744
•
Published
•
25
TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
Paper
•
2410.00531
•
Published
•
28
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
Paper
•
2410.02073
•
Published
•
40
FAN: Fourier Analysis Networks
Paper
•
2410.02675
•
Published
•
24
Paper
•
2410.05258
•
Published
•
165
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
Paper
•
2410.12787
•
Published
•
30
Revealing the Barriers of Language Agents in Planning
Paper
•
2410.12409
•
Published
•
23
What Matters in Transformers? Not All Attention is Needed
Paper
•
2406.15786
•
Published
•
27
EchoPrime: A Multi-Video View-Informed Vision-Language Model for Comprehensive Echocardiography Interpretation
Paper
•
2410.09704
•
Published
•
11
Benchmarking Agentic Workflow Generation
Paper
•
2410.07869
•
Published
•
25
Agent S: An Open Agentic Framework that Uses Computers Like a Human
Paper
•
2410.08164
•
Published
•
24
HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale
Paper
•
2409.16299
•
Published
•
9
DynamicCity: Large-Scale LiDAR Generation from Dynamic Scenes
Paper
•
2410.18084
•
Published
•
12
ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding
Paper
•
2410.13924
•
Published
•
6
LLM-based Optimization of Compound AI Systems: A Survey
Paper
•
2410.16392
•
Published
•
13
Improve Vision Language Model Chain-of-thought Reasoning
Paper
•
2410.16198
•
Published
•
17
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
Paper
•
2410.11190
•
Published
•
20
Unbounded: A Generative Infinite Game of Character Life Simulation
Paper
•
2410.18975
•
Published
•
34
WAFFLE: Multi-Modal Model for Automated Front-End Development
Paper
•
2410.18362
•
Published
•
11
A Survey of Small Language Models
Paper
•
2410.20011
•
Published
•
36
AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant
Paper
•
2410.18603
•
Published
•
30
Paper
•
2410.21276
•
Published
•
76
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective
Paper
•
2410.23743
•
Published
•
57