Abstract
Recent advances in transformer-based text-to-motion generation have significantly improved motion quality. However, achieving both real-time performance and long-horizon scalability remains an open challenge. In this paper, we present Hi-RQCT (Hierarchical Residual-Quantized Causal Transformer), which generates high-quality, lifelike 3D human motions by training a single transformer model. Hi-RQCT consists of two main components: 1) an RVQ-VAE, a hierarchical residual vector quantization variational autoencoder that discretizes continuous motion sequences with high precision; and 2) a Hierarchical Causal Transformer that generates the base motion sequence autoregressively while simultaneously inferring the residuals across the different layers. Experimental results demonstrate that Hi-RQCT can generate smooth and continuous motion sequences of up to 260 frames (13 seconds), surpassing the 196-frame (roughly 10-second) length limit of existing datasets such as HumanML3D. On the HumanML3D test set, our model achieves the best quantitative performance, and the generated motions are also highly realistic and expressive in qualitative evaluations.
© 2025 Copyright held by the owner/author(s).
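The residual vector quantization idea named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the codebooks here are random rather than learned, the helper names (`make_codebooks`, `rvq_encode`, `rvq_decode`) are hypothetical, and each codebook is assumed to contain a zero vector so that adding a layer can never increase the reconstruction error. Each layer quantizes whatever residual the previous layers left behind.

```python
import random

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Pick one code per layer; each layer encodes the remaining residual."""
    residual = list(vec)
    indices, errors = [], []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
        indices.append(idx)
        errors.append(sum(r * r for r in residual))  # squared error so far
    return indices, errors

def rvq_decode(indices, codebooks):
    """Reconstruct the vector by summing the selected code from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def make_codebooks(num_layers, codes_per_layer, dim, seed=0):
    """Random stand-in codebooks (assumption: a real model would learn these)."""
    rng = random.Random(seed)
    books = []
    for layer in range(num_layers):
        scale = 0.5 ** layer          # deeper layers model finer residuals
        book = [[0.0] * dim]          # zero code: a layer may leave the residual as is
        book += [[rng.gauss(0, scale) for _ in range(dim)]
                 for _ in range(codes_per_layer - 1)]
        books.append(book)
    return books

codebooks = make_codebooks(num_layers=4, codes_per_layer=32, dim=8)
rng = random.Random(1)
latent = [rng.gauss(0, 1) for _ in range(8)]   # stand-in for one motion latent
indices, errors = rvq_encode(latent, codebooks)
reconstruction = rvq_decode(indices, codebooks)
```

In this sketch, `indices[0]` plays the role of the base token and the remaining entries are the residual tokens for the deeper layers; `errors` records how the squared reconstruction error shrinks (or stays flat) as layers are added.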
| Original language | English |
|---|---|
| Title of host publication | CVMP '25 |
| Subtitle of host publication | Proceedings of the 22nd ACM SIGGRAPH European Conference on Visual Media Production |
| Publisher | Association for Computing Machinery |
| Number of pages | 11 |
| ISBN (Print) | 9798400721175 |
| Publication status | Published - 2025 |
| Externally published | Yes |
| Event | 22nd ACM SIGGRAPH European Conference on Visual Media Production (CVMP 2025), London, United Kingdom; Duration: 3 Dec 2025 → 4 Dec 2025 |
Publication series
| Name | Proceedings CVMP - The ACM SIGGRAPH European Conference on Visual Media Production |
|---|---|
Conference
| Conference | 22nd ACM SIGGRAPH European Conference on Visual Media Production (CVMP 2025) |
|---|---|
| Place | United Kingdom |
| City | London |
| Period | 3 Dec 2025 → 4 Dec 2025 |
Funding
This work was supported by the EPSRC Programme Grant Immersive Audio-Visual 3D Scene Reproduction (EP/V03538X/1).
Publisher's Copyright Statement
- This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/
Title
Hi-RQCT: Hierarchical Residual-Quantized Causal Transformer for High-Quality 3D Human Motion Generation