Bridging Vision and Language via Image Captioning


Student thesis: Doctoral Thesis



Award date: 16 Sept 2021


In Artificial Intelligence (AI), it is widely believed that an AI system should have the abilities that humans have, such as perception and cognition. Perception refers to how humans obtain information from their surroundings: for example, humans normally obtain more than 70% of their information via vision and the rest via hearing, touch, smell, taste and so on. Cognition, in contrast, refers to high-level processing of the perceived information, such as translating an object that we see into the corresponding concept that we can understand. Computer vision tasks such as recognition, multi-label classification, object detection and segmentation translate visual information into concepts, helping AI systems understand what they see. Humans, however, can also describe what they see using language, which carries more information than a single tag or a set of tags. When humans describe a scene, they may bring their own knowledge into the sentences, producing a personal understanding of the scene that is closely related to cognition.

The task of image captioning mimics the human ability to describe scenes using language, and thus bridges vision and language. In this dissertation, we concentrate on describing images in natural language and propose multiple models for image captioning, covering single captions, diverse captions and captioning with references. Neural image captioning models generally adopt long short-term memory (LSTM) networks as the language model. LSTMs cannot be computed in parallel, because the current state can only be computed after the previous state; hence the computational complexity is proportional to the length of the sentence. In contrast, the proposed convolutional image captioning model can be computed in parallel, and its computational complexity is proportional to the number of convolutional layers. Moreover, modeling language with convolutional neural networks (CNNs) provides multi-level representations of sentences, similar to a parse tree, whereas LSTMs normally have only a single-level representation.
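The contrast between recurrent and convolutional language models can be illustrated with a toy sketch. This is not the dissertation's model: a plain tanh recurrence stands in for an LSTM, a stack of causal 1-D convolutions stands in for the convolutional decoder, and all function names, shapes and weights are assumptions made for illustration. The point is only the dependency structure: the recurrence needs one sequential step per word, while the convolutional stack needs one per layer.

```python
import numpy as np


def recurrent_states(x, w_h, w_x):
    """Toy tanh recurrence (stand-in for an LSTM) over a (T, d) sequence."""
    h = np.zeros(w_h.shape[0])
    states = []
    for x_t in x:  # step t needs the state from step t-1: T sequential steps
        h = np.tanh(w_h @ h + w_x @ x_t)
        states.append(h)
    return np.stack(states)


def causal_conv_layer(x, kernel):
    """One causal 1-D convolution; kernel has shape (k, d_in, d_out)."""
    k, t = kernel.shape[0], x.shape[0]
    # Left-pad with zeros so output[t] only sees inputs at positions <= t.
    padded = np.concatenate([np.zeros((k - 1, x.shape[1])), x])
    # Every output position is computed from the layer input alone, so all
    # T positions can be evaluated at once (here as k shifted matrix products).
    out = sum(padded[i:i + t] @ kernel[i] for i in range(k))
    return np.tanh(out)


def conv_encoder(x, kernels):
    """Stack of causal layers: sequential depth is len(kernels), not T."""
    for kernel in kernels:
        x = causal_conv_layer(x, kernel)
    return x


def receptive_field(kernel_size, num_layers):
    """Number of input positions that can influence one output position."""
    return num_layers * (kernel_size - 1) + 1
```

Each added layer widens the receptive field, so deeper layers summarize longer spans of the sentence; this is one way to read the multi-level representation mentioned above.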

A framework for evaluating and generating diverse captions is also proposed in the dissertation. To evaluate the diversity of a set of captions, and inspired by latent semantic analysis (LSA), we find that the singular values of the term-frequency matrix reflect the diversity of the captions. In addition, we employ CIDEr similarity to kernelize LSA and develop a diversity metric that is highly correlated with human judgment of diversity. Furthermore, we employ reinforcement learning (RL) to directly optimize the accuracy and diversity scores, yielding both accurate and diverse captions for a candidate image. Inspired by the determinantal point process (DPP), we develop a selection algorithm that further improves accuracy and diversity.
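The idea that a spectrum can reflect caption diversity can be sketched as follows. The abstract does not give the exact formula, so the scoring rule here is an assumption: it maps the share of the largest singular value to a score between 0 (all captions identical) and 1 (captions share no words). The kernelized variant uses cosine similarity over word counts as a simple stand-in for CIDEr similarity, which is not implemented here. All function names are hypothetical.

```python
import numpy as np


def term_frequency_matrix(captions):
    """Rows are captions, columns are vocabulary words, entries are counts."""
    vocab = sorted({w for c in captions for w in c.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(captions), len(vocab)))
    for row, caption in enumerate(captions):
        for word in caption.lower().split():
            matrix[row, index[word]] += 1.0
    return matrix


def spectrum_diversity(values, num_captions):
    """Assumed scoring rule: 0 if one value dominates, 1 if all are equal."""
    if num_captions < 2:
        raise ValueError("need at least two captions")
    values = np.asarray(values, dtype=float)
    ratio = values.max() / values.sum()
    return float(-np.log(ratio) / np.log(num_captions))


def lsa_diversity(captions):
    """Diversity from the singular values of the term-frequency matrix."""
    singular = np.linalg.svd(term_frequency_matrix(captions), compute_uv=False)
    return spectrum_diversity(singular, len(captions))


def kernel_diversity(captions):
    """Kernelized variant; cosine similarity stands in for CIDEr here.

    Assumes every caption contains at least one word.
    """
    tf = term_frequency_matrix(captions)
    unit = tf / np.linalg.norm(tf, axis=1, keepdims=True)
    kernel = unit @ unit.T
    # Square roots of the kernel eigenvalues play the role of singular values.
    eigenvalues = np.clip(np.linalg.eigvalsh(kernel), 0.0, None)
    return spectrum_diversity(np.sqrt(eigenvalues), len(captions))
```

For a caption set such as `["a dog runs", "a dog sits", "two cats sleep"]`, both functions return a value strictly between 0 and 1, and the value rises as the captions overlap less.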

In addition, we incorporate the features of similar images to enhance co-occurring concepts for image captioning, developing an attention-in-attention (AiA) model. Given an image, we first retrieve its semantically similar images and then construct a k-nearest-neighbor (KNN) graph. The proposed AiA model is then applied to fuse the features, and the language decoder generates a caption from the fused features. Experimental results show that the proposed approach outperforms the baseline model and is competitive with the state of the art.
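The retrieve-then-fuse pipeline can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the AiA model itself: similarity is taken to be cosine similarity between feature vectors, and the fusion step is a single softmax-weighted average (plain dot-product attention) rather than the attention-in-attention mechanism, whose details are not given in the abstract. Function names and the toy features are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between rows of an (n, d) feature array."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    return unit @ unit.T


def knn_graph(features, k):
    """Return an (n, k) array: row i lists the k most similar other images."""
    sim = cosine_similarity_matrix(features)
    np.fill_diagonal(sim, -np.inf)  # an image is not its own neighbour
    return np.argsort(-sim, axis=1)[:, :k]


def attention_fuse(query, neighbour_features):
    """Softmax-weighted average of neighbour features (plain attention)."""
    scores = neighbour_features @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())  # subtract max for stability
    weights /= weights.sum()
    return weights, weights @ neighbour_features


# Toy features: images 0 and 1 are alike, images 2 and 3 are alike.
features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
neighbours = knn_graph(features, k=2)
weights, fused = attention_fuse(features[0], features[neighbours[0]])
```

In this toy case the fused vector for image 0 is dominated by its closest neighbour, image 1; a decoder would then condition on the fused vector rather than on the image feature alone.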

Finally, we investigate two-stage image captioning models and propose a multimodal auto-encoder for image captioning. The encoder fuses the information of images and texts. In the training phase, the texts can be ground-truth captions, retrieved captions and captions generated by other models, while in the testing phase, retrieved captions and generated captions are employed. Traditional image captioning models use only image features to generate captions; however, it can be difficult to learn the mapping from image features to words. Fortunately, we can introduce captions obtained from either retrieval models or other image captioning models to enhance word generation. The proposed multimodal auto-encoder framework significantly improves performance: for example, we achieve a CIDEr (Consensus-based Image Description Evaluation) score of 130.37±0.12 using an attention-on-attention decoder with cross-entropy loss, which outperforms other two-stage captioning models and most transformer-based decoders.

    Research areas

  • Image captioning, Reinforcement learning, Vision and language, Diverse image captions