
Hi-RQCT: Hierarchical Residual-Quantized Causal Transformer for High-Quality 3D Human Motion Generation

  • Dongjie Fu (Co-first Author)
  • Tengjiao Sun (Co-first Author)
  • Pengcheng Fang (Co-first Author)
  • Yiyang Zhang
  • Hansung Kim*

*Corresponding author for this work

Research output: Chapters, Conference Papers, Creative and Literary Works; RGC 32 - Refereed conference paper (with host publication); peer-review

Abstract

Recent advances in transformer-based text-to-motion generation have significantly improved motion quality. However, achieving both real-time performance and long-horizon scalability remains an open challenge. In this paper, we present Hi-RQCT (Hierarchical Residual-Quantized Causal Transformer), which generates high-quality lifelike 3D human motions by training a single transformer model. Hi-RQCT consists of only two main components: 1) RVQ-VAE, a hierarchical residual vector quantization variational autoencoder, which discretizes continuous motion sequences with high precision; 2) Hierarchical Causal Transformer, responsible for generating the base motion sequences in an autoregressive manner while simultaneously inferring residuals across different layers. Experimental results demonstrate that Hi-RQCT can generate smooth and continuous motion sequences up to 260 frames (13 seconds), surpassing the 196 frames (10 seconds) length limitation of existing datasets like HumanML3D. On the HumanML3D test set, our model achieves the best quantitative performance, and the generated motions also exhibit highly realistic and expressive visual quality in qualitative evaluations.

© 2025 Copyright held by the owner/author(s).
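
The residual quantization step named in the abstract can be illustrated with a minimal sketch. This is generic residual vector quantization in NumPy, not the authors' RVQ-VAE: the function names `rvq_encode` and `rvq_decode`, the layer count, codebook size, and feature dimension are all hypothetical, and the codebooks here are random rather than learned. Each layer picks the nearest code for whatever the previous layers left unexplained, so the first layer carries the coarse "base" signal and later layers carry residuals.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Quantize each row of x (N, D) with a stack of codebooks, one per layer.

    Returns an integer array of shape (num_layers, N) holding code indices.
    """
    residual = np.asarray(x, dtype=float).copy()
    indices = []
    for codebook in codebooks:  # each codebook has shape (K, D)
        # Squared Euclidean distance from every residual row to every code: (N, K)
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        indices.append(idx)
        # The next layer only sees what this layer failed to represent.
        residual = residual - codebook[idx]
    return np.stack(indices, axis=0)


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected code from every layer."""
    return sum(codebook[idx] for codebook, idx in zip(codebooks, indices))


# Hypothetical sizes for illustration: 3 layers, 16 codes each, 8-dim features, 5 frames.
rng = np.random.default_rng(0)
codebooks = [rng.normal(scale=1.0 / (2 ** layer), size=(16, 8)) for layer in range(3)]
frames = rng.normal(size=(5, 8))

codes = rvq_encode(frames, codebooks)
approx = rvq_decode(codes, codebooks)
print(codes.shape, np.abs(frames - approx).mean())
```

In a setup like the one the abstract describes, the first row of `codes` would play the role of the base token sequence produced autoregressively, while the remaining rows correspond to the residual layers; how the actual model predicts them is not specified here.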
Original language: English
Title of host publication: CVMP '25
Subtitle of host publication: Proceedings of the 22nd ACM SIGGRAPH European Conference on Visual Media Production
Publisher: Association for Computing Machinery
Number of pages: 11
ISBN (Print): 9798400721175
DOIs
Publication status: Published - 2025
Externally published: Yes
Event: 22nd ACM SIGGRAPH European Conference on Visual Media Production (CVMP 2025) - London, United Kingdom
Duration: 3 Dec 2025 to 4 Dec 2025

Publication series

Name: Proceedings CVMP - The ACM SIGGRAPH European Conference on Visual Media Production

Conference

Conference: 22nd ACM SIGGRAPH European Conference on Visual Media Production (CVMP 2025)
Place: United Kingdom
City: London
Period: 3/12/25 to 4/12/25

Funding

This work was supported by the EPSRC Programme Grant Immersive Audio-Visual 3D Scene Reproduction (EP/V03538X/1).

Publisher's Copyright Statement

  • This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/
