Abstract
Despite recent progress in cross-modal text-to-video retrieval techniques, their performance is still unsatisfactory. Most existing works learn a joint embedding space in which to measure the distance between global-level or local-level textual and video representations. The fine-grained interactions between video segments and phrases are usually neglected in cross-modal learning, which results in suboptimal retrieval performance. To tackle this problem, we propose a novel Fine-grained Cross-modal Alignment Network (FCA-Net), which considers the interactions between visual semantic units (i.e., sub-actions/sub-events) in videos and phrases in sentences for cross-modal alignment. Specifically, these interactions are formulated as a link prediction problem, optimized by a graph auto-encoder, to obtain explicit relations between visual semantic units and phrases and to enhance the aligned feature representations for fine-grained cross-modal alignment. Experimental results on the MSR-VTT, YouCook2, and VATEX datasets demonstrate the superiority of our model over state-of-the-art methods.
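To make the link-prediction formulation concrete, the following is a minimal sketch of a generic graph auto-encoder forward pass over a graph whose nodes are visual semantic units and phrases. It is not the FCA-Net implementation: the random node features, the similarity-thresholded initial graph, the one-layer encoder, the untrained weights, and all function and variable names are assumptions made only for illustration.

```python
import numpy as np


def normalize_adjacency(adj):
    """Symmetrically normalize A + I, as in a standard GCN layer."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops, so every degree >= 1
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gae_link_probabilities(features, adj, weight):
    """One-layer graph-convolution encoder followed by an inner-product decoder."""
    z = np.tanh(normalize_adjacency(adj) @ features @ weight)  # node embeddings
    logits = z @ z.T                                           # pairwise link scores
    return 1.0 / (1.0 + np.exp(-logits))                       # sigmoid -> probabilities


# Hypothetical toy input: 3 visual semantic units and 4 phrases with random features.
rng = np.random.default_rng(0)
num_units, num_phrases, feat_dim, hidden_dim = 3, 4, 8, 5
num_nodes = num_units + num_phrases
features = rng.normal(size=(num_nodes, feat_dim))

# Assumed initial graph: link a unit and a phrase when their features correlate positively.
similarity = features[:num_units] @ features[num_units:].T
bipartite = (similarity > 0).astype(float)
adj = np.zeros((num_nodes, num_nodes))
adj[:num_units, num_units:] = bipartite
adj[num_units:, :num_units] = bipartite.T

# Untrained random weights; a real model would learn them with a reconstruction loss.
weight = rng.normal(scale=0.1, size=(feat_dim, hidden_dim))

probs = gae_link_probabilities(features, adj, weight)
cross = probs[:num_units, num_units:]   # unit-to-phrase link probabilities
print(cross.shape)
```

The `cross` block holds one predicted link probability per (visual unit, phrase) pair. In a trained model, scores of this kind would be the explicit relations used for alignment; here they only show how data flows through the encoder and decoder.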
| Original language | English |
|---|---|
| Title of host publication | MM ’21 |
| Subtitle of host publication | Proceedings of the 29th ACM International Conference on Multimedia |
| Place of Publication | New York, NY |
| Publisher | Association for Computing Machinery |
| Pages | 3826-3834 |
| ISBN (Print) | 9781450386517 |
| Publication status | Published - Oct 2021 |
| Event | 29th ACM International Conference on Multimedia (MM 2021), Hybrid, Chengdu, China. Duration: 20 Oct 2021 → 24 Oct 2021. https://2021.acmmm.org/ |
Publication series
| Name | MM - Proceedings of the ACM International Conference on Multimedia |
|---|---|
Conference
| Conference | 29th ACM International Conference on Multimedia (MM 2021) |
|---|---|
| Abbreviated title | MM '21 |
| Place | China |
| City | Chengdu |
| Period | 20/10/21 → 24/10/21 |
| Internet address | https://2021.acmmm.org/ |
Research Keywords
- fine-grained cross-modal alignment
- graph auto-encoder
- text-video retrieval
Fingerprint
Dive into the research topics of 'Fine-grained Cross-modal Alignment Network for Text-Video Retrieval'. Together they form a unique fingerprint.