Self-critical n-step Training for Image Captioning

Junlong Gao, Shiqi Wang, Shanshe Wang*, Siwei Ma, Wen Gao

*Corresponding author for this work

Research output: Refereed conference paper (with host publication), peer-reviewed

Abstract

Existing methods for image captioning are usually trained with a cross-entropy loss, which leads to exposure bias and to an inconsistency between the optimized objective and the evaluation metrics. It has recently been shown that both issues can be addressed by incorporating techniques from reinforcement learning; one popular technique is the advantage actor-critic algorithm, which computes a per-token advantage by estimating the state value with a parametrized estimator, at the cost of introducing estimation bias. In this paper, we estimate the state value without a parametrized value estimator. Owing to two properties of image captioning, namely the deterministic state transition function and the sparse reward, the state value is equivalent to its preceding state-action value, and we reformulate the advantage function by simply replacing the former with the latter. Moreover, the reformulated advantage is extended to n steps, which generally increases the absolute value of its mean while lowering its variance. Two kinds of rollout are then adopted to estimate the state-action value, which we call self-critical n-step training. Empirically, our method obtains better performance than the state-of-the-art methods that use a sequence-level advantage and a parametrized estimator, respectively, on the widely used MSCOCO benchmark.
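The key step described in the abstract can be sketched in code. Under a deterministic state transition and a sparse terminal reward, the state value V(s_t) equals the preceding state-action value Q(s_{t-1}, a_{t-1}), so the advantage A(s_t, a_t) = Q(s_t, a_t) - V(s_t) becomes a difference of two Q estimates; the n-step variant widens that difference span. The function below is a minimal illustrative sketch of that reformulation, assuming per-token Q estimates have already been obtained from rollouts; all names and the treatment of the initial positions are assumptions for illustration, not the paper's actual implementation.

```python
def reformulated_advantages(q_values, n=1, q_initial=0.0):
    """Per-token reformulated advantages A_t = Q_t - Q_{t-n}.

    q_values:  rollout-estimated state-action values, one per token
               (hypothetical inputs; the paper estimates these with
               two kinds of rollout).
    n:         step size of the advantage; n=1 recovers the one-step
               reformulation Q(s_t, a_t) - Q(s_{t-1}, a_{t-1}).
    q_initial: value used for positions before the sequence starts
               (an assumption made here for simplicity).
    """
    advantages = []
    for t in range(len(q_values)):
        # Deterministic transition + sparse reward let the preceding
        # Q value stand in for the state value at step t.
        prev = q_values[t - n] if t - n >= 0 else q_initial
        advantages.append(q_values[t] - prev)
    return advantages

# Toy usage: Q estimates that grow along the caption.
print(reformulated_advantages([1.0, 3.0, 6.0], n=1))  # one-step
print(reformulated_advantages([1.0, 3.0, 6.0], n=2))  # two-step
```

With a larger n, each advantage spans more decoding steps, which is the mechanism the abstract credits with increasing the mean of the advantage while lowering its variance.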
Original language: English
Title of host publication: Proceedings - 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Publisher: IEEE Computer Society
Pages: 6293-6301
Volume: 2019-June
ISBN (Print): 9781728132938
Publication status: Published - Jun 2019
Event: 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019) - Long Beach, United States
Duration: 16 Jun 2019 - 20 Jun 2019
http://cvpr2019.thecvf.com/

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume: 2019-June
ISSN (Print): 1063-6919
ISSN (Electronic): 2575-7075

Conference

Conference: 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019)
Place: United States
City: Long Beach
Period: 16/06/19 - 20/06/19

Research Keywords

  • Deep Learning
  • Vision + Language
  • Visual Reasoning
