# (CVPR 2023) T2M-GPT

PyTorch implementation of the paper "T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations".

[[Project Page]](https://mael-zys.github.io/T2M-GPT/) [[Paper]](https://arxiv.org/abs/2301.06052) [[Notebook Demo]](https://colab.research.google.com/drive/1Vy69w2q2d-Hg19F-KibqG0FRdpSj3L4O?usp=sharing) [[HuggingFace]](https://huggingface.co/vumichien/T2M-GPT) [[Space Demo]](https://huggingface.co/spaces/vumichien/generate_human_motion)


If our project is helpful for your research, please consider citing:

```
@inproceedings{zhang2023generating,
  title={T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations},
  author={Zhang, Jianrong and Zhang, Yangsong and Cun, Xiaodong and Huang, Shaoli and Zhang, Yong and Zhao, Hongwei and Lu, Hongtao and Shen, Xi},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2023},
}
```

## Table of Contents

* [1. Visual Results](#1-visual-results)
* [2. Installation](#2-installation)
* [3. Quick Start](#3-quick-start)
* [4. Train](#4-train)
* [5. Evaluation](#5-evaluation)
* [6. SMPL Mesh Rendering](#6-smpl-mesh-rendering)
* [7. Acknowledgement](#7-acknowledgement)
* [8. ChangLog](#8-changlog)

## 1. Visual Results

(More results can be found on our [project page](https://mael-zys.github.io/T2M-GPT/).)

Text: a man steps forward and does a handstand.

GT T2M MDM MotionDiffuse Ours

gif gif gif gif gif

Text: A man rises from the ground, walks in a circle and sits back down on the ground.

GT T2M MDM MotionDiffuse Ours

gif gif gif gif gif


## 2. Installation

### 2.1. Environment

Our model can be trained on a **single V100-32G GPU**.

```bash
conda env create -f environment.yml
conda activate T2M-GPT
```

The code was tested on Python 3.8 and PyTorch 1.8.1.

### 2.2. Dependencies

```bash
bash dataset/prepare/download_glove.sh
```

### 2.3. Datasets

We use two 3D human motion-language datasets: HumanML3D and KIT-ML. For both datasets, you can find the details as well as the download link [[here]](https://github.com/EricGuo5513/HumanML3D).

Taking HumanML3D as an example, the file directory should look like this:

```
./dataset/HumanML3D/
├── new_joint_vecs/
├── texts/
├── Mean.npy # same as in [HumanML3D](https://github.com/EricGuo5513/HumanML3D)
├── Std.npy # same as in [HumanML3D](https://github.com/EricGuo5513/HumanML3D)
├── train.txt
├── val.txt
├── test.txt
├── train_val.txt
└── all.txt
```

### 2.4. Motion & text feature extractors:

We use the same extractors provided by [t2m](https://github.com/EricGuo5513/text-to-motion) to evaluate our generated motions. Please download the extractors.

```bash
bash dataset/prepare/download_extractor.sh
```

### 2.5. Pre-trained models

The pretrained model files will be stored in the 'pretrained' folder:

```bash
bash dataset/prepare/download_model.sh
```

### 2.6. Render SMPL mesh (optional)

If you want to render the generated motion, you need to install:

```bash
sudo sh dataset/prepare/download_smpl.sh
conda install -c menpo osmesa
conda install h5py
conda install -c conda-forge shapely pyrender trimesh mapbox_earcut
```

## 3. Quick Start

A quick start guide on how to use our code is available in [demo.ipynb](https://colab.research.google.com/drive/1Vy69w2q2d-Hg19F-KibqG0FRdpSj3L4O?usp=sharing).
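Before training or evaluating, it can help to confirm that the dataset folder matches the layout listed in Section 2.3. The following is a minimal sketch, not part of the repository: the helper name `missing_entries` is hypothetical, and the expected entries are simply copied from the HumanML3D directory tree in this README.

```python
from pathlib import Path

# Entries copied from the HumanML3D directory tree in Section 2.3.
EXPECTED_DIRS = ["new_joint_vecs", "texts"]
EXPECTED_FILES = [
    "Mean.npy", "Std.npy",
    "train.txt", "val.txt", "test.txt", "train_val.txt", "all.txt",
]


def missing_entries(root):
    """Return the expected entries that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    missing = [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    missing = missing_entries("./dataset/HumanML3D")
    if missing:
        print("Missing entries:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```

The check only looks at names, not at file contents, so it catches a misplaced or incomplete download but not a corrupted one.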


## 4. Train

Note that for the KIT dataset, you just need to set `--dataname kit`.

### 4.1. VQ-VAE

The results are saved in the folder `output`.

VQ training

```bash
python3 train_vq.py \
--batch-size 256 \
--lr 2e-4 \
--total-iter 300000 \
--lr-scheduler 200000 \
--nb-code 512 \
--down-t 2 \
--depth 3 \
--dilation-growth-rate 3 \
--out-dir output \
--dataname t2m \
--vq-act relu \
--quantizer ema_reset \
--loss-vel 0.5 \
--recons-loss l1_smooth \
--exp-name VQVAE
```

### 4.2. GPT

The results are saved in the folder `output`.

GPT training

```bash
python3 train_t2m_trans.py \
--exp-name GPT \
--batch-size 128 \
--num-layers 9 \
--embed-dim-gpt 1024 \
--nb-code 512 \
--n-head-gpt 16 \
--block-size 51 \
--ff-rate 4 \
--drop-out-rate 0.1 \
--resume-pth output/VQVAE/net_last.pth \
--vq-name VQVAE \
--out-dir output \
--total-iter 300000 \
--lr-scheduler 150000 \
--lr 0.0001 \
--dataname t2m \
--down-t 2 \
--depth 3 \
--quantizer ema_reset \
--eval-iter 10000 \
--pkeep 0.5 \
--dilation-growth-rate 3 \
--vq-act relu
```

## 5. Evaluation

### 5.1. VQ-VAE

VQ eval

```bash
python3 VQ_eval.py \
--batch-size 256 \
--lr 2e-4 \
--total-iter 300000 \
--lr-scheduler 200000 \
--nb-code 512 \
--down-t 2 \
--depth 3 \
--dilation-growth-rate 3 \
--out-dir output \
--dataname t2m \
--vq-act relu \
--quantizer ema_reset \
--loss-vel 0.5 \
--recons-loss l1_smooth \
--exp-name TEST_VQVAE \
--resume-pth output/VQVAE/net_last.pth
```

### 5.2. GPT

GPT eval

Following the evaluation setting of [text-to-motion](https://github.com/EricGuo5513/text-to-motion), we evaluate our model 20 times and report the average result. Because the multimodality metric requires generating 30 motions from the same text, the evaluation takes a long time.

```bash
python3 GPT_eval_multi.py \
--exp-name TEST_GPT \
--batch-size 128 \
--num-layers 9 \
--embed-dim-gpt 1024 \
--nb-code 512 \
--n-head-gpt 16 \
--block-size 51 \
--ff-rate 4 \
--drop-out-rate 0.1 \
--resume-pth output/VQVAE/net_last.pth \
--vq-name VQVAE \
--out-dir output \
--total-iter 300000 \
--lr-scheduler 150000 \
--lr 0.0001 \
--dataname t2m \
--down-t 2 \
--depth 3 \
--quantizer ema_reset \
--eval-iter 10000 \
--pkeep 0.5 \
--dilation-growth-rate 3 \
--vq-act relu \
--resume-trans output/GPT/net_best_fid.pth
```
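To illustrate what "evaluate 20 times and report the average" amounts to, here is a minimal sketch of how per-run metric values could be summarized. It is not taken from `GPT_eval_multi.py`: the function name `summarize_runs` is hypothetical, and the normal-approximation 95% interval is an assumption about how the spread might be reported alongside the mean.

```python
import math
import statistics


def summarize_runs(values, z=1.96):
    """Return (mean, half-width of a normal-approximation 95% interval).

    `values` holds one metric value per repeated evaluation run.
    Assumption: the interval is z * sample_std / sqrt(n); the repository's
    own script may summarize its runs differently.
    """
    n = len(values)
    mean = statistics.fmean(values)
    if n < 2:
        # A single run has no spread to estimate.
        return mean, 0.0
    std = statistics.stdev(values)  # sample standard deviation
    return mean, z * std / math.sqrt(n)
```

With 20 repeats, `values` would contain 20 numbers for a given metric, one per run, and the first element of the returned tuple is the average that gets reported.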

## 6. SMPL Mesh Rendering

SMPL Mesh Rendering

Provide the path to the folder containing the npy files and the names of the motions to render. Here is an example:

```bash
python3 render_final.py --filedir output/TEST_GPT/ --motion-list 000019 005485
```

## 7. Acknowledgement

We appreciate help from:

* Public code such as [text-to-motion](https://github.com/EricGuo5513/text-to-motion), [TM2T](https://github.com/EricGuo5513/TM2T), [MDM](https://github.com/GuyTevet/motion-diffusion-model), [MotionDiffuse](https://github.com/mingyuan-zhang/MotionDiffuse), etc.
* Mathis Petrovich, Yuming Du, Yingyi Chen, Dexiong Chen and Xuelin Chen for inspiring discussions and valuable feedback.
* Minh Chien Vu for the Hugging Face Space demo.

## 8. ChangLog

* 2023/02/19: added the Hugging Face Space demo for both skeleton and SMPL mesh visualization.