Abstract
We propose Unified Visual-Semantic Embeddings (Unified VSE) for learning a joint space of visual representations and textual semantics. The model unifies the embeddings of concepts at different levels: objects, attributes, relations, and full scenes. We view sentential semantics as a combination of semantic components such as objects and relations; their embeddings are aligned with different image regions. A contrastive learning approach is proposed for effectively learning this fine-grained alignment from only image-caption pairs. We also present a simple yet effective approach that enforces the coverage of caption embeddings over the semantic components appearing in the sentence. We demonstrate that Unified VSE outperforms baselines on cross-modal retrieval tasks, and that enforcing semantic coverage improves the model's robustness against text-domain adversarial attacks. Moreover, our model empowers the use of visual cues to accurately resolve word dependencies in novel sentences. © 2019 IEEE.
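The contrastive objective mentioned in the abstract can be illustrated with a generic max-margin loss over a batch of paired image and caption embeddings, where each matching pair must score above every non-matching pair by a margin. This is a minimal sketch of that family of losses, not the paper's actual component-level objective (which additionally aligns object/attribute/relation embeddings with image regions); the function name and the `margin` default are illustrative assumptions.

```python
import numpy as np

def contrastive_alignment_loss(img_emb, txt_emb, margin=0.2):
    """Max-margin contrastive loss for a batch of matched image/caption pairs.

    A generic sketch of the kind of objective the abstract describes;
    row i of img_emb is assumed to match row i of txt_emb.
    """
    # L2-normalize rows so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    scores = img @ txt.T                      # (B, B) similarity matrix
    diag = np.diag(scores).reshape(-1, 1)     # scores of the matching pairs

    # Hinge terms: every mismatched pair should score at least `margin`
    # below its matching pair, in both retrieval directions.
    cost_txt = np.clip(margin + scores - diag, 0, None)    # image -> caption
    cost_img = np.clip(margin + scores - diag.T, 0, None)  # caption -> image

    # Zero out the diagonal (the positive pairs incur no cost).
    mask = np.eye(scores.shape[0], dtype=bool)
    cost_txt[mask] = 0
    cost_img[mask] = 0
    return (cost_txt.sum() + cost_img.sum()) / scores.shape[0]
```

With orthonormal, perfectly aligned embeddings the loss is zero; permuting the captions relative to the images makes it positive, which is the signal the embedding model is trained to remove.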
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019 |
| Publisher | IEEE Computer Society |
| Pages | 6602-6611 |
| Number of pages | 10 |
| Volume | 2019-June |
| ISBN (Print) | 9781728132938 |
| DOIs | |
| Publication status | Published - 1 Jun 2019 |
| Externally published | Yes |
| Event | 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019 - Long Beach, United States, 16 Jun 2019 → 20 Jun 2019 |
Publication series
| Name | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition |
|---|---|
| Volume | 2019-June |
| ISSN (Print) | 1063-6919 |
Conference
| Conference | 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019 |
|---|---|
| Place | United States |
| City | Long Beach |
| Period | 16/06/19 → 20/06/19 |
Funding
This research is supported in part by the National Key Research and Development Program of China under grant 2018YFB0505000 and the National Natural Science Foundation of China under grant 61772138.
Research Keywords
- Representation Learning
- Vision + Language