Executing your Commands via Motion Diffusion in Latent Space

2Fudan University
3ShanghaiTech University


Human motions are highly diverse, and their distribution differs markedly from that of conditional modalities such as natural-language textual descriptors, so it is hard to learn a probabilistic mapping from the desired conditional modality to human motions. Moreover, raw motion data from a motion capture system can be redundant across a sequence and contain noise; directly modeling the joint distribution over raw motion sequences and conditional modalities would therefore incur heavy computational overhead and might introduce artifacts from the captured noise.

Our proposed Motion Latent Diffusion model (MLD) produces vivid motion sequences (left) that conform to the given conditional inputs while substantially reducing computational overhead (right) in both the training and inference stages. Extensive experiments show that MLD achieves significant improvements over other state-of-the-art methods across a wide range of human motion generation tasks, while being two orders of magnitude faster than previous diffusion models.


To learn a better representation of diverse human motion sequences, we first design a powerful Variational AutoEncoder (VAE) that compresses a human motion sequence into a representative, low-dimensional latent code.
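To make the idea concrete, here is a minimal numpy sketch of such a motion VAE. The linear encoder/decoder, the toy dimensions (`T` frames, `D` pose features, `Z` latent size), and all function names are hypothetical stand-ins for the paper's transformer-based architecture; only the overall pattern (compress a sequence to a distribution over a small latent, sample via the reparameterization trick, decode back) reflects the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: T frames, D pose features per frame, Z latent dims.
T, D, Z = 60, 75, 8

# Linear stand-ins for the learned encoder/decoder networks.
W_enc = rng.normal(0, 0.1, size=(D, 2 * Z))   # projects to (mu, logvar)
W_dec = rng.normal(0, 0.1, size=(Z, D))

def encode(motion):
    """Pool the sequence over time, then project to mean and log-variance."""
    pooled = motion.mean(axis=0)      # (D,)
    h = pooled @ W_enc                # (2Z,)
    return h[:Z], h[Z:]               # mu, logvar

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, exp(logvar)) with the reparameterization trick."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z, n_frames):
    """Expand the single latent code back into a full motion sequence."""
    return np.tile(z @ W_dec, (n_frames, 1))  # (n_frames, D)

motion = rng.normal(size=(T, D))               # a fake captured sequence
mu, logvar = encode(motion)
z = reparameterize(mu, logvar)
recon = decode(z, T)
print(z.shape, recon.shape)                    # (8,) (60, 75)
```

Note how the whole 60-frame sequence is represented by a single 8-dimensional vector; the diffusion model in the next step only ever has to operate on codes of this size.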

Then, instead of using a diffusion model to establish the connection between raw motion sequences and the conditional inputs, we propose a motion latent-based diffusion model that learns the probabilistic mapping from the conditions to the representative motion latent codes.
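The latent diffusion step can be sketched as a standard DDPM-style reverse process, but run over the low-dimensional latent code rather than the raw motion. In this numpy sketch, `predict_noise` is a hypothetical stand-in for the learned conditional denoiser, and the conditioning vector `cond` stands in for a projected text or action embedding; the noise schedule is an assumed linear beta schedule, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

Z, STEPS = 8, 50                          # latent size and diffusion steps (toy values)
betas = np.linspace(1e-4, 0.02, STEPS)    # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(z_t, t, cond):
    """Stand-in for the learned conditional denoiser eps_theta(z_t, t, c)."""
    return 0.1 * (z_t + cond)

def sample(cond):
    """DDPM ancestral sampling in latent space, from pure noise to a clean code."""
    z = rng.normal(size=Z)
    for t in reversed(range(STEPS)):
        eps = predict_noise(z, t, cond)
        a, ab = alphas[t], alpha_bars[t]
        z = (z - (1.0 - a) / np.sqrt(1.0 - ab) * eps) / np.sqrt(a)
        if t > 0:                          # no noise is added at the final step
            z += np.sqrt(betas[t]) * rng.normal(size=Z)
    return z

cond = rng.normal(size=Z)                  # e.g. a text embedding projected to Z dims
z0 = sample(cond)
print(z0.shape)                            # (8,)
```

The sampled latent `z0` would then be passed through the VAE decoder to produce the final motion sequence; because every denoising step touches only `Z` numbers instead of a full `T x D` sequence, inference is far cheaper than diffusing over raw motion.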

Example generated motions from texts

Example generated motions from action labels

Text-to-Motion comparisons

Action-to-Motion comparisons



@inproceedings{chen2023executing,
  title     = {Executing your Commands via Motion Diffusion in Latent Space},
  author    = {Chen, Xin and Jiang, Biao and Liu, Wen and Huang, Zilong and Fu, Bin and Chen, Tao and Yu, Gang},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages     = {18000--18010},
  year      = {2023},
}