
HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval

  • Zhiwei Chen
  • Yupeng Hu*
  • Zixu Li
  • Zhiheng Fu
  • Haokun Wen
  • Weili Guan

*Corresponding author for this work

Research output: Chapters, Conference Papers, Creative and Literary Works; RGC 32 - Refereed conference paper (with host publication); peer-review

Abstract

Composed Video Retrieval (CVR) is a challenging video retrieval task that uses multi-modal queries, each consisting of a reference video and a modification text, to retrieve the desired target video. The core of this task lies in understanding the multi-modal composed query and achieving accurate composed feature learning. Within multi-modal queries, the video modality typically carries richer semantic content than the textual modality. However, previous works have largely overlooked this disparity in information density between the two modalities. This limitation can lead to two critical issues: 1) modification subject referring ambiguity and 2) limited detailed semantic focus, both of which degrade the performance of CVR models. To address these issues, we propose a novel CVR framework, namely the Hierarchical Uncertainty-aware Disambiguation network (HUD). HUD is the first framework that leverages the disparity in information density between video and text to enhance multi-modal query understanding. It comprises three key components: (a) Holistic Pronoun Disambiguation, (b) Atomistic Uncertainty Modeling, and (c) Holistic-to-Atomistic Alignment. By exploiting overlapping semantics through holistic cross-modal interaction and fine-grained semantic alignment via atomistic-level cross-modal interaction, HUD enables effective object disambiguation and enhances the focus on detailed semantics, thereby achieving precise composed feature learning. Moreover, HUD is also applicable to the Composed Image Retrieval (CIR) task and achieves state-of-the-art performance across three benchmark datasets for both CVR and CIR tasks. The code is available at https://zivchen-ty.github.io/HUD.github.io/. © 2025 ACM.
Original language: English
Title of host publication: MM '25 - Proceedings of the 33rd ACM International Conference on Multimedia
Publisher: Association for Computing Machinery
Pages: 6143-6152
ISBN (Print): 9798400720352
Publication status: Published - Oct 2025
Event: 33rd ACM International Conference on Multimedia (MM '25) - Royal Dublin Convention Centre, Dublin, Ireland
Duration: 27 Oct 2025 to 31 Oct 2025
https://acmmm2025.org/

Publication series

Name: MM - Proceedings of the ACM International Conference on Multimedia

Conference

Conference: 33rd ACM International Conference on Multimedia (MM '25)
Abbreviated title: ACM Multimedia 2025
Place: Ireland
City: Dublin
Period: 27/10/25 to 31/10/25

Funding

This work was supported in part by the National Natural Science Foundation of China (Nos. 62276155, 62476071, U24A20328, and 624B2047); in part by the Guangdong Basic and Applied Basic Research Foundation (No. 2025A1515011732); and in part by the China National University Student Innovation & Entrepreneurship Development Program (No. 202410422071).

Research Keywords

  • composed video retrieval
  • multimodal query composition
