UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing
Abstract
Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework that employs Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance fine-grained pose perception, we equip UniPose with a mixture of visual encoders, among them a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.
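For intuition, below is a minimal sketch of the pose-tokenization idea described in the abstract: a VQ-VAE-style encoder quantizes continuous SMPL pose parameters against a learned codebook, and each codebook index is exposed to the LLM as a new vocabulary token. The class names, dimensions, codebook size, and `<pose_i>` token format are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketch of a discrete pose tokenizer (VQ-VAE style).
# All names and sizes here are illustrative, not UniPose's actual architecture.
import torch
import torch.nn as nn

class PoseTokenizer(nn.Module):
    def __init__(self, pose_dim=72, hidden_dim=256, codebook_size=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(pose_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Learned codebook of discrete pose embeddings.
        self.codebook = nn.Embedding(codebook_size, hidden_dim)

    def forward(self, pose):  # pose: (batch, pose_dim) axis-angle SMPL params
        z = self.encoder(pose)                        # (batch, hidden_dim)
        # Nearest-neighbor quantization: pick the closest codebook entry.
        dists = torch.cdist(z, self.codebook.weight)  # (batch, codebook_size)
        return dists.argmin(dim=-1)                   # discrete pose token ids

def pose_ids_to_tokens(ids):
    """Map codebook indices to strings added to the LLM vocabulary."""
    return [f"<pose_{i}>" for i in ids.tolist()]

tokenizer = PoseTokenizer()
smpl_pose = torch.randn(1, 72)  # 24 joints x 3 axis-angle params
print(pose_ids_to_tokens(tokenizer(smpl_pose)))  # e.g. ['<pose_317>']
```

Because a pose becomes a short sequence of vocabulary tokens, the LLM can read and emit poses with the same autoregressive machinery it uses for text, which is what enables the unified vocabulary the abstract describes.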
Community
This paper introduces UniPose, a unified framework that utilizes an LLM to comprehend, generate, and edit human poses across diverse modalities (images, text, and 3D SMPL poses). UniPose employs a pose tokenizer to convert 3D poses into discrete tokens, enabling seamless integration into the LLM's vocabulary. Additionally, it incorporates a mixture of visual encoders, including a pose-specific encoder, to enhance fine-grained pose perception. Through a unified learning strategy, UniPose effectively transfers knowledge across pose-related tasks and adapts to unseen ones. As the first general-purpose framework for pose comprehension, generation, and editing, UniPose achieves competitive performance across various pose-related tasks.
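The mixture-of-visual-encoders idea can likewise be sketched in a few lines. The two backbones below are stand-ins (e.g., a general-purpose CLIP-style encoder alongside a pose-specific one), and the simple concatenate-and-project fusion into the LLM embedding space is an assumption for illustration, not the paper's reported design.

```python
# Hypothetical sketch of fusing a general visual encoder with a pose-specific
# one before projecting into the LLM embedding space. Both backbones are
# stand-ins (simple linear layers) for real image encoders.
import torch
import torch.nn as nn

class MixedVisualEncoder(nn.Module):
    def __init__(self, general_dim=768, pose_dim=512, llm_dim=4096):
        super().__init__()
        # Stand-ins for, e.g., a CLIP-style backbone and a pose-estimation backbone.
        self.general_encoder = nn.Linear(3 * 224 * 224, general_dim)
        self.pose_encoder = nn.Linear(3 * 224 * 224, pose_dim)
        # Project the concatenated features into the LLM's embedding space.
        self.proj = nn.Linear(general_dim + pose_dim, llm_dim)

    def forward(self, image):  # image: (batch, 3, 224, 224)
        flat = image.flatten(1)
        feats = torch.cat([self.general_encoder(flat),
                           self.pose_encoder(flat)], dim=-1)
        return self.proj(feats)  # (batch, llm_dim) visual features for the LLM

encoder = MixedVisualEncoder()
image = torch.randn(2, 3, 224, 224)
print(encoder(image).shape)  # torch.Size([2, 4096])
```

The design intuition is that a general encoder captures scene-level semantics while the pose-specific encoder contributes the fine-grained body-part cues the summary above refers to.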
The following similar papers were recommended by the Semantic Scholar API:
- MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding (2024)
- PUMA: Empowering Unified MLLM with Multi-granular Visual Generation (2024)
- FoPru: Focal Pruning for Efficient Large Vision-Language Models (2024)
- JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation (2024)
- Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset (2024)
- UniMuMo: Unified Text, Music and Motion Generation (2024)
- KinMo: Kinematic-aware Human Motion Understanding and Generation (2024)