Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval

Research output: Journal Publications and Reviews (RGC: 21, 22, 62)21_Publication in refereed journalpeer-review

View graph of relations


Related Research Unit(s)


Original languageEnglish
Article number63
Journal / PublicationACM Transactions on Multimedia Computing, Communications and Applications
Issue number2
Online publishedFeb 2022
Publication statusPublished - May 2022


Cross-modal retrieval between texts and videos has received consistent research interest in the multimedia community. Existing studies follow a trend of learning a joint embedding space to measure the distance between text and video representations. In common practice, video representation is constructed by feeding clips into 3D convolutional neural networks for a coarse-grained global visual feature extraction. In addition, several studies have attempted to align the local objects of video with the text. However, these representations share a drawback of neglecting rich fine-grained relation features capturing spatial-temporal object interactions that benefits mapping textual entities in the real-world retrieval system. To tackle this problem, we propose an adversarial multi-grained embedding network (AME-Net), a novel cross-modal retrieval framework that adopts both fine-grained local relation and coarse-grained global features in bridging text-video modalities. Additionally, with the newly proposed visual representation, we also integrate an adversarial learning strategy into AME-Net, to further narrow the domain gap between text and video representations. In summary, we contribute AME-Net with an adversarial learning strategy for learning a better joint embedding space, and experimental results on MSR-VTT and YouCook2 datasets demonstrate that our proposed framework consistently outperforms the state-of-the-art method.

Research Area(s)

  • Multi-grained fusion, Spatial-temporal object relationships, Text-video retrieval