
ModeSeq: Taming Sparse Multimodal Motion Prediction with Sequential Mode Modeling

Zikang Zhou*, Hengjian Zhou, Haibo Hu, Zihao Wen, Jianping Wang, Yung-Hui Li, Yu-Kai Huang

*Corresponding author for this work

Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review

Abstract

Anticipating the multimodality of future events lays the foundation for safe autonomous driving. However, multimodal motion prediction for traffic agents has been clouded by the lack of multimodal ground truth. Existing works predominantly adopt the winner-take-all training strategy to tackle this challenge, yet still suffer from limited trajectory diversity and uncalibrated mode confidence. While some approaches address these limitations by generating excessive trajectory candidates, they necessitate a post-processing stage to identify the most representative modes, a process lacking universal principles and compromising trajectory accuracy. We are thus motivated to introduce ModeSeq, a new multimodal prediction paradigm that models modes as sequences. Unlike the common practice of decoding multiple plausible trajectories in one shot, ModeSeq requires motion decoders to infer the next mode step by step, thereby more explicitly capturing the correlation between modes and significantly enhancing the ability to reason about multimodality. Leveraging the inductive bias of sequential mode prediction, we also propose the Early-Match-Take-All (EMTA) training strategy to diversify the trajectories further. Without relying on dense mode prediction or heuristic post-processing, ModeSeq considerably improves the diversity of multimodal output while attaining satisfactory trajectory accuracy, resulting in balanced performance on motion prediction benchmarks. Moreover, ModeSeq naturally emerges with the capability of mode extrapolation, which supports forecasting more behavior modes when the future is highly uncertain. © 2025 IEEE
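The sequential decoding idea in the abstract can be illustrated with a minimal sketch, assuming a PyTorch-style transformer decoder; all module names, tensor shapes, and hyperparameters below are hypothetical illustrations, not the authors' implementation. Rather than emitting all K trajectory modes in one shot, the decoder is unrolled K times, and each step conditions on the modes already produced before attending to the scene encoding.

```python
import torch
import torch.nn as nn

class SequentialModeDecoder(nn.Module):
    """Minimal sketch of sequential mode decoding (hypothetical, not the authors' code).

    Instead of predicting all K modes in parallel, the decoder is unrolled K
    times; each step attends to the embeddings of previously decoded modes and
    to the scene encoding, so later modes can account for earlier ones.
    """

    def __init__(self, d_model: int = 128, horizon: int = 60, num_modes: int = 6):
        super().__init__()
        self.num_modes = num_modes
        self.mode_query = nn.Parameter(torch.randn(1, 1, d_model))  # shared learnable query
        self.mode_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.scene_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.traj_head = nn.Linear(d_model, horizon * 2)  # (x, y) per future step
        self.conf_head = nn.Linear(d_model, 1)            # per-mode confidence logit

    def forward(self, scene_tokens: torch.Tensor):
        # scene_tokens: [B, N, d_model] encoded agent and map elements
        B = scene_tokens.size(0)
        prev_modes, trajs, logits = [], [], []
        for _ in range(self.num_modes):
            q = self.mode_query.expand(B, 1, -1)
            if prev_modes:  # condition on the modes decoded so far
                memory = torch.cat(prev_modes, dim=1)              # [B, k, d_model]
                q = q + self.mode_attn(q, memory, memory)[0]
            h = q + self.scene_attn(q, scene_tokens, scene_tokens)[0]  # [B, 1, d_model]
            trajs.append(self.traj_head(h).view(B, 1, -1, 2))
            logits.append(self.conf_head(h).squeeze(-1))
            prev_modes.append(h)
        # trajectories: [B, K, horizon, 2], confidences: [B, K]
        return torch.cat(trajs, dim=1), torch.cat(logits, dim=1)
```

In this sketch, an EMTA-style loss as described in the abstract would back-propagate the regression loss only through the earliest decoded mode that matches the ground-truth future, which is what pushes later decoding steps toward covering alternative behavior modes.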
Original language: English
Title of host publication: Proceedings - 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Subtitle of host publication: CVPR 2025
Publisher: IEEE
Pages: 1612-1621
ISBN (Electronic): 979-8-3315-4364-8
ISBN (Print): 979-8-3315-4365-5
DOIs
Publication status: Published - 2025

Publication series

Name
ISSN (Print): 1063-6919
ISSN (Electronic): 2575-7075

Bibliographical note

Full text of this publication does not contain sufficient affiliation information. With consent from the author(s) concerned, the Research Unit(s) information for this record is based on the existing academic department affiliation of the author(s).

Funding

This project is supported by a grant from the Hong Kong Research Grants Council under CRF C1042-23G.

RGC Funding Information

  • RGC-funded

