Fine-grained Cross-modal Alignment Network for Text-Video Retrieval

Research output: Chapters, Conference Papers, Creative and Literary Works; RGC 32 - Refereed conference paper (with host publication); peer-review

Abstract

Despite recent progress in cross-modal text-to-video retrieval techniques, their performance is still unsatisfactory. Most existing works learn a joint embedding space to measure the distance between global-level or local-level textual and video representations. The fine-grained interactions between video segments and phrases are usually neglected in cross-modal learning, which results in suboptimal retrieval performance. To tackle this problem, we propose a novel Fine-grained Cross-modal Alignment Network (FCA-Net), which considers the interactions between visual semantic units (i.e., sub-actions/sub-events) in videos and phrases in sentences for cross-modal alignment. Specifically, the interactions between visual semantic units and phrases are formulated as a link prediction problem, optimized by a graph auto-encoder, to obtain the explicit relations between them and enhance the aligned feature representation for fine-grained cross-modal alignment. Experimental results on the MSR-VTT, YouCook2, and VATEX datasets demonstrate the superiority of our model compared to state-of-the-art methods.
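The link-prediction formulation described in the abstract can be illustrated with a minimal sketch. This is not the authors' FCA-Net implementation: it assumes a generic two-layer graph auto-encoder with an inner-product decoder, random stand-in features, and a hypothetical cosine-similarity rule for the initial unit-phrase edges. All names, sizes, and thresholds below are illustrative assumptions, and no training step is shown.

```python
import numpy as np


def normalize_adjacency(adj):
    """Symmetrically normalize A + I, as in a standard GCN layer."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gae_link_probabilities(features, adj, w0, w1):
    """Two-layer GCN encoder followed by an inner-product decoder."""
    a_norm = normalize_adjacency(adj)
    hidden = np.maximum(a_norm @ features @ w0, 0.0)   # ReLU hidden layer
    z = a_norm @ hidden @ w1                           # node embeddings
    logits = z @ z.T                                   # pairwise scores
    return 1.0 / (1.0 + np.exp(-logits))               # sigmoid


# Assumed toy setup: 3 visual semantic units and 2 phrases, random features.
rng = np.random.default_rng(0)
n_units, n_phrases, dim, hid, emb = 3, 2, 8, 6, 4
n_nodes = n_units + n_phrases
features = rng.normal(size=(n_nodes, dim))

# Hypothetical initial edges: link a unit and a phrase when their
# cosine similarity is positive (an assumption, not the paper's rule).
unit_f, phrase_f = features[:n_units], features[n_units:]
cos = (unit_f @ phrase_f.T) / (
    np.linalg.norm(unit_f, axis=1)[:, None]
    * np.linalg.norm(phrase_f, axis=1)[None, :]
)
cross = (cos > 0).astype(float)

# Bipartite adjacency: edges only between units and phrases.
adj = np.zeros((n_nodes, n_nodes))
adj[:n_units, n_units:] = cross
adj[n_units:, :n_units] = cross.T

w0 = rng.normal(scale=0.1, size=(dim, hid))
w1 = rng.normal(scale=0.1, size=(hid, emb))

probs = gae_link_probabilities(features, adj, w0, w1)
unit_phrase_probs = probs[:n_units, n_units:]   # predicted unit-phrase links
print(unit_phrase_probs.shape)
```

In a trained model, the weights would be fitted so that the decoded probabilities reconstruct the observed unit-phrase links; here they are random, so the scores carry no meaning beyond showing the data flow.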
Original language: English
Title of host publication: MM ’21
Subtitle of host publication: Proceedings of the 29th ACM International Conference on Multimedia
Place of Publication: New York, NY
Publisher: Association for Computing Machinery
Pages: 3826-3834
ISBN (Print): 9781450386517
Publication status: Published - Oct 2021
Event: 29th ACM International Conference on Multimedia (MM 2021) - Hybrid, Chengdu, China
Duration: 20 Oct 2021 to 24 Oct 2021
https://2021.acmmm.org/

Publication series

Name: MM - Proceedings of the ACM International Conference on Multimedia

Conference

Conference: 29th ACM International Conference on Multimedia (MM 2021)
Abbreviated title: MM '21
Place: China
City: Chengdu
Period: 20/10/21 to 24/10/21

Research Keywords

  • fine-grained cross-modal alignment
  • graph auto-encoder
  • text-video retrieval
