Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations

Hao Wu*, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, Wei-Ying Ma

*Corresponding author for this work

Research output: Chapters, Conference Papers, Creative and Literary WorksRGC 32 - Refereed conference paper (with host publication)peer-review

Abstract

We propose the Unified Visual-Semantic Embeddings (Unified VSE) for learning a joint space of visual representation and textual semantics. The model unifies the embeddings of concepts at different levels: objects, attributes, relations, and full scenes. We view the sentential semantics as a combination of different semantic components such as objects and relations; their embeddings are aligned with different image regions. A contrastive learning approach is proposed for the effective learning of this fine-grained alignment from only image-caption pairs. We also present a simple yet effective approach that enforces the coverage of caption embeddings on the semantic components that appear in the sentence. We demonstrate that the Unified VSE outperforms baselines on cross-modal retrieval tasks; the enforcement of the semantic coverage improves the model's robustness in defending text-domain adversarial attacks. Moreover, our model empowers the use of visual cues to accurately resolve word dependencies in novel sentences. © 2019 IEEE.
Original languageEnglish
Title of host publicationProceedings - 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019
PublisherIEEE Computer Society
Pages6602-6611
Number of pages10
Volume2019-June
ISBN (Print)9781728132938
DOIs
Publication statusPublished - 1 Jun 2019
Externally publishedYes
Event32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019 - Long Beach, United States
Duration: 16 Jun 201920 Jun 2019

Publication series

NameProceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume2019-June
ISSN (Print)1063-6919

Conference

Conference32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019
PlaceUnited States
CityLong Beach
Period16/06/1920/06/19

Bibliographical note

Publication details (e.g. title, author(s), publication statuses and dates) are captured on an “AS IS” and “AS AVAILABLE” basis at the time of record harvesting from the data source. Suggestions for further amendments or supplementary information can be sent to [email protected].

Funding

This research is supported in part by the National Key Research and Development Program of China under grant 2018YFB0505000 and the National Natural Science Foundation of China under grant 61772138.

Research Keywords

  • Representation Learning
  • Vision + Language

Fingerprint

Dive into the research topics of 'Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations'. Together they form a unique fingerprint.

Cite this