Describing like Humans: on Diversity in Image Captioning
Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review
Author(s)
Wang, Qingzhong; Chan, Antoni B.
Related Research Unit(s)
Detail(s)
Original language | English |
---|---|
Title of host publication | Proceedings - 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019 |
Publisher | Institute of Electrical and Electronics Engineers, Inc. |
Pages | 4190-4198 |
ISBN (print) | 9781728132938 |
Publication status | Published - Jun 2019 |
Publication series
Name | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition |
---|---|
Volume | 2019-June |
ISSN (Print) | 1063-6919 |
ISSN (electronic) | 2575-7075 |
Conference
Title | 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019) |
---|---|
Place | United States |
City | Long Beach |
Period | 16 - 20 June 2019 |
Link(s)
Link to Scopus | https://www.scopus.com/record/display.uri?eid=2-s2.0-85078735321&origin=recordpage |
---|---|
Permanent Link | https://scholars.cityu.edu.hk/en/publications/publication(dcfd163a-97d7-413e-92b8-3a92cf723b09).html |
Abstract
Recently, state-of-the-art models for image captioning have overtaken human performance on the most popular metrics, such as BLEU, METEOR, ROUGE and CIDEr. Does this mean we have solved the task of image captioning? These metrics only measure the similarity of the generated caption to the human annotations, which reflects its accuracy. However, an image contains many concepts and multiple levels of detail, and thus there is a variety of captions that express different concepts and details that might be interesting to different humans. Therefore, evaluating accuracy alone is not sufficient for measuring the performance of captioning models: the diversity of the generated captions should also be considered. In this paper, we propose a new metric for measuring the diversity of image captions, which is derived from latent semantic analysis and kernelized to use CIDEr similarity. We conduct extensive experiments to re-evaluate recent captioning models in the context of both diversity and accuracy. We find that there is still a large gap between model and human performance in terms of both accuracy and diversity, and that models optimized for accuracy (CIDEr) have low diversity. We also show that balancing the cross-entropy loss and the CIDEr reward during reinforcement-learning training can effectively control the tradeoff between diversity and accuracy of the generated captions.
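To make the idea of an LSA-derived diversity metric concrete, the following is a minimal sketch, not the authors' released code: it builds a pairwise similarity (kernel) matrix over a set of captions generated for one image, eigendecomposes it, and measures how strongly a single component dominates. The paper kernelizes LSA with CIDEr similarity; the `similarity` function below is a plain unigram-cosine stand-in so the sketch stays self-contained, and the normalization of the final score is an illustrative assumption rather than the paper's exact formula.

```python
import numpy as np
from collections import Counter


def similarity(a: str, b: str) -> float:
    """Unigram-cosine similarity; a stand-in for CIDEr similarity in this sketch."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in set(ca) & set(cb))
    norm = np.sqrt(sum(v * v for v in ca.values())) * np.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm > 0 else 0.0


def diversity_score(captions: list[str]) -> float:
    """LSA-style diversity of a caption set: eigendecompose the pairwise
    similarity (kernel) matrix and measure how much the top component dominates.
    Near-identical captions give one dominant eigenvalue (score near 0);
    genuinely different captions spread the spectrum (score approaches 1)."""
    m = len(captions)
    K = np.array([[similarity(ci, cj) for cj in captions] for ci in captions])
    eigvals = np.clip(np.linalg.eigvalsh(K), 0.0, None)   # symmetric, PSD kernel
    r = eigvals.max() / eigvals.sum()                      # dominance of the top component
    return float(-np.log(r) / np.log(m))                   # normalized to lie in [0, 1]


if __name__ == "__main__":
    repeated = ["a dog runs on the grass"] * 5
    varied = [
        "a dog runs on the grass",
        "a brown puppy chases a ball in the park",
        "two people watch their pet play outside",
        "a sunny lawn with an energetic dog",
        "an animal sprints across a green field",
    ]
    print(f"identical captions: {diversity_score(repeated):.3f}")  # ~0.0
    print(f"varied captions:    {diversity_score(varied):.3f}")    # much closer to 1.0
```

As the example suggests, a caption set that merely paraphrases one sentence scores near 0, while a set describing different concepts and details scores higher. The accuracy–diversity tradeoff discussed in the abstract is then controlled during training by weighting the cross-entropy loss against the CIDEr reward; in a sketch like this, that would amount to mixing the two objectives with a single interpolation coefficient.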
Research Area(s)
- Vision + Language
Citation Format(s)
Describing like Humans: on Diversity in Image Captioning. / Wang, Qingzhong; Chan, Antoni B.
Proceedings - 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019. Institute of Electrical and Electronics Engineers, Inc., 2019. p. 4190-4198 8954161 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2019-June).
Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review