matlok's Collections
Multimodal Papers
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
Paper • 2401.01885 • Published • 27
Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance
Paper • 2401.15687 • Published • 22
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Paper • 2312.17172 • Published • 26
MouSi: Poly-Visual-Expert Vision-Language Models
Paper • 2401.17221 • Published • 8
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Paper • 2401.15947 • Published • 49
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Paper • 2308.12966 • Published • 7
Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing
Paper • 2310.12404 • Published • 15
Multimodal Foundation Models: From Specialists to General-Purpose Assistants
Paper • 2309.10020 • Published • 40
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants
Paper • 2310.00653 • Published • 3
Sequence to Sequence Learning with Neural Networks
Paper • 1409.3215 • Published • 3
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Paper • 2402.05935 • Published • 15
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper • 2311.05437 • Published • 48
Qwen Technical Report
Paper • 2309.16609 • Published • 34
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Paper • 2311.07919 • Published • 9
Lumos: Empowering Multimodal LLMs with Scene Text Recognition
Paper • 2402.08017 • Published • 25
World Model on Million-Length Video And Language With RingAttention
Paper • 2402.08268 • Published • 37
A Touch, Vision, and Language Dataset for Multimodal Alignment
Paper • 2402.13232 • Published • 13
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Paper • 2403.05135 • Published • 42