SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
Abstract
Human beings are social animals. How to equip 3D autonomous characters with similar social intelligence that can perceive, understand, and interact with humans remains an open yet fundamental problem. In this paper, we introduce SOLAMI, the first end-to-end Social vision-Language-Action (VLA) Modeling framework for Immersive interaction with 3D autonomous characters. Specifically, SOLAMI builds 3D autonomous characters from three aspects: (1) Social VLA Architecture: We propose a unified social VLA framework that generates multimodal responses (speech and motion) from the user's multimodal input to drive the character in social interaction. (2) Interactive Multimodal Data: We present SynMSI, a synthetic multimodal social interaction dataset generated by an automatic pipeline using only existing motion datasets, addressing the issue of data scarcity. (3) Immersive VR Interface: We develop a VR interface that lets users immersively interact with characters driven by various architectures. Extensive quantitative experiments and user studies demonstrate that our framework produces more precise and natural character responses (in both speech and motion) that align with user expectations, at lower latency.
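As a rough illustration of the end-to-end idea in (1), the sketch below shows how such a character could map a user's speech and body motion to a spoken and embodied reply. Every class, method, and constant name here is an assumption made for exposition, not the paper's actual API or code release.

```python
# Minimal sketch of the unified social VLA loop: the character consumes the
# user's speech and body motion and returns its own speech and motion.
# All names (SocialVLACharacter, the backbone/codec objects, the vocabulary
# split) are illustrative assumptions, not SOLAMI's released interface.
from dataclasses import dataclass
from typing import Any, List, Tuple

SPEECH_VOCAB_SIZE = 4096  # assumed boundary between speech and motion token ids


@dataclass
class MultimodalTurn:
    speech_tokens: List[int]  # discretized speech for one turn
    motion_tokens: List[int]  # discretized body motion for the same turn


def split_by_modality(tokens: List[int]) -> Tuple[List[int], List[int]]:
    # Placeholder split: assumes the shared vocabulary reserves disjoint id ranges.
    speech = [t for t in tokens if t < SPEECH_VOCAB_SIZE]
    motion = [t for t in tokens if t >= SPEECH_VOCAB_SIZE]
    return speech, motion


class SocialVLACharacter:
    def __init__(self, backbone: Any, speech_codec: Any, motion_codec: Any, persona: str):
        self.backbone = backbone          # autoregressive token model (e.g. an LLM)
        self.speech_codec = speech_codec  # waveform <-> discrete speech tokens
        self.motion_codec = motion_codec  # body motion <-> discrete motion tokens
        self.persona = persona            # character setting / system prompt
        self.history: List[MultimodalTurn] = []

    def respond(self, user_audio: Any, user_motion: Any) -> Tuple[Any, Any]:
        # 1. Tokenize both input modalities into the model's shared vocabulary.
        self.history.append(MultimodalTurn(
            speech_tokens=self.speech_codec.encode(user_audio),
            motion_tokens=self.motion_codec.encode(user_motion),
        ))
        # 2. Generate an interleaved speech/motion token reply conditioned on the
        #    persona and the full interaction history.
        out_tokens = self.backbone.generate(persona=self.persona, turns=self.history)
        speech_tokens, motion_tokens = split_by_modality(out_tokens)
        self.history.append(MultimodalTurn(speech_tokens, motion_tokens))
        # 3. Decode back to audio (for playback) and motion (to drive the avatar).
        return self.speech_codec.decode(speech_tokens), self.motion_codec.decode(motion_tokens)
```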
Community
SOLAMI: 3D C.AI in VR powered by a social VLA model.
SOLAMI enables the user to interact with 3D autonomous characters through speech and body language in an immersive VR environment, via an end-to-end social vision-language-action model. Characters can understand the user's body language, act on their commands, and even play simple games.
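On the VR side, a per-turn client loop might look roughly like the sketch below; the headset/avatar helpers and the character object are hypothetical stand-ins for illustration, not the released interface.

```python
# Rough sketch of a per-turn VR client loop, assuming a SocialVLACharacter-like
# object (see the sketch under the abstract) plus hypothetical headset/avatar
# helpers; none of these names come from the SOLAMI release.
def interaction_loop(character, headset, avatar, turn_seconds: float = 4.0) -> None:
    while headset.session_active():
        # 1. Record one user turn: microphone audio plus tracked body/hand motion.
        audio, motion = headset.capture_turn(duration=turn_seconds)
        # 2. Query the end-to-end social VLA model for the character's reply.
        reply_audio, reply_motion = character.respond(audio, motion)
        # 3. Render the reply: play the speech and animate the avatar together,
        #    which is where response latency is most noticeable to the user.
        headset.play_audio(reply_audio)
        avatar.play_motion(reply_motion)
```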
Project page: https://solami-ai.github.io/
Full video demo: https://www.youtube.com/watch?v=P0juJl2Y4So
ArXiv: https://arxiv.org/abs/2412.00174
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding (2024)
- Versatile Motion Language Models for Multi-Turn Interactive Agents (2024)
- MotionLLaMA: A Unified Framework for Motion Synthesis and Comprehension (2024)
- LLM Gesticulator: Leveraging Large Language Models for Scalable and Controllable Co-Speech Gesture Synthesis (2024)
- InterMask: 3D Human Interaction Generation via Collaborative Masked Modelling (2024)
- MotionGlot: A Multi-Embodied Motion Generation Model (2024)
- OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation (2024)