TY - JOUR
T1 - Contrastive semantic similarity learning for image captioning evaluation
AU - Zeng, Chao
AU - Kwong, Sam
AU - Zhao, Tiesong
AU - Wang, Hanli
PY - 2022/9
Y1 - 2022/9
N2 - Automatically evaluating the quality of image captions is challenging because human language is highly flexible: the same meaning can be expressed in many different ways. Most current captioning metrics rely on token-level matching between the candidate caption and the ground-truth sentences, which usually neglects sentence-level information. Motivated by the auto-encoder mechanism and advances in contrastive representation learning, we propose a learning-based metric, I2CE (Intrinsic Image Captioning Evaluation). To learn the evaluation metric, we develop three progressively structured models that capture sentence-level representations: a single-branch model, a dual-branch model, and a triple-branch model. To evaluate the proposed metric, we select an automatic captioning model, collect human judgments of the quality of the generated captions, and apply a statistical test of the correlation between human scores and metric scores. The proposed I2CE metric achieves a Spearman correlation of 51.42, outperforming both a recently proposed BERT-based metric (41.95) and conventional rule-based metrics. Extensive results on the Composite-COCO and PASCAL-50S datasets further validate the effectiveness of the proposed metric. The proposed metric can serve as a novel indicator of the intrinsic information shared between captions, complementing existing metrics.
KW - Auto-encoder
KW - Contrastive learning
KW - Image captioning evaluation
KW - Sentence representations
UR - http://www.scopus.com/inward/record.url?scp=85135107631&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-85135107631&origin=recordpage
U2 - 10.1016/j.ins.2022.07.142
DO - 10.1016/j.ins.2022.07.142
M3 - RGC 21 - Publication in refereed journal
SN - 0020-0255
VL - 609
SP - 913
EP - 930
JO - Information Sciences
JF - Information Sciences
ER -