Local Semantic Learning for Image Captioning
(Chinese title: Research on Image Captioning Based on Local Semantic Learning)
Student thesis: Doctoral Thesis
Award date | 28 Jun 2018
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(0dac5b1a-6a45-4d07-bf00-0d4283944a7e).html
Abstract
Image captioning is an interdisciplinary problem in artificial intelligence that connects computer vision and natural language processing. Automatically describing the content of an image in natural language is a challenging task that requires both comprehensive image understanding and language representation.
Recent state-of-the-art methods apply a convolutional neural network (CNN) to extract a feature of the entire image, followed by a recurrent neural network (RNN) that generates a description of the image content. This CNN-RNN pipeline has attracted much research interest due to the strong representational ability of CNNs, the strength of RNNs at processing sequential data, and the end-to-end training mechanism of neural networks.
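For concreteness, the sketch below illustrates such a CNN-RNN pipeline in PyTorch. The ResNet-50 backbone, layer sizes, and the teacher-forcing interface are illustrative assumptions, not the exact configuration used in this dissertation.

```python
# Minimal CNN-RNN captioning sketch (illustrative; backbone and sizes are assumptions).
import torch
import torch.nn as nn
import torchvision.models as models


class CNNEncoder(nn.Module):
    """CNN that maps an image to a fixed-length visual feature."""
    def __init__(self, embed_dim=512):
        super().__init__()
        backbone = models.resnet50(weights=None)               # any CNN backbone works
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(backbone.fc.in_features, embed_dim)

    def forward(self, images):                                  # (B, 3, H, W)
        feats = self.cnn(images).flatten(1)                     # (B, 2048)
        return self.fc(feats)                                   # (B, embed_dim)


class RNNDecoder(nn.Module):
    """LSTM language model that generates a caption conditioned on the visual feature."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_feat, captions):                   # teacher forcing at training time
        word_embs = self.embed(captions)                        # (B, T, embed_dim)
        # The image feature is fed as the first step of the sequence.
        inputs = torch.cat([visual_feat.unsqueeze(1), word_embs], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                                 # (B, T+1, vocab_size) word logits
```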
These successful approaches, however, integrate only global visual features; the local semantic concepts and the modality difference between the visual and language spaces are not considered. An image contains rich information from various aspects, yet existing image captioning approaches are limited to describing images with simple contextual information: they typically generate a single sentence for each image with only one contextual emphasis. In this dissertation, we address these problems from the perspective of local semantic learning. Three novel captioning frameworks are proposed to enhance image understanding and enrich the generated captions by introducing the semantic information of object regions or object labels.
In each topic, extensive experimental comparisons are presented to validate the proposed models and algorithms. The main contributions of this dissertation can be summarized as follows:
1. In contrast to previous image description methods that focus on describing the whole image, this dissertation presents a method for generating rich image descriptions from image regions. Local semantic learning is introduced via region captioning: the proposed model generates several sentence descriptions of regions in an image, which together are sufficient to represent the whole image and contain more information than the captions generated by existing methods. Compared with a general image-level description, generating more specific and accurate sentences for the different regions facilitates local semantic exploration (see the first sketch after this list).
2. We propose to enrich the local semantic representations of images and to update the language model with a semantic Element Embedding LSTM (EE-LSTM). A preliminary study is first conducted to generate local descriptions for the object regions and the full image. The predicted descriptions and categories are used to build semantic features, which not only contain detailed information but also share a word space with the descriptions. We further integrate the CNN features with these semantic features in the proposed model to predict an improved description (see the second sketch after this list).
3. We propose keyword-driven image captioning, which focuses local semantic learning on the object concepts of the image. A new language model, the Context-dependent Bilateral LSTM (CDB-LSTM), is used to predict a sentence driven by an additional keyword. CDB-LSTM contains two cascaded sub-models, which are unified and jointly optimized in an end-to-end training framework by modeling word dependence through a context transfer module (see the third sketch after this list).
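As a hypothetical illustration of contribution 1, the helper below greedily decodes one caption per region crop, reusing the CNNEncoder and RNNDecoder sketched earlier, so an image is described by several region-level sentences rather than a single global one. The bounding boxes are assumed to come from any region detector; the function name and the fixed 224x224 resize are illustrative.

```python
# Illustrative region captioning: decode one sentence per region crop.
import torch
import torch.nn.functional as F


def caption_regions(image, boxes, encoder, decoder, eos_id, max_len=20):
    """image: (1, 3, H, W) tensor; boxes: list of (x1, y1, x2, y2) pixel coordinates."""
    captions = []
    for x1, y1, x2, y2 in boxes:
        crop = F.interpolate(image[:, :, y1:y2, x1:x2], size=(224, 224))  # CNN input size
        feat = encoder(crop)                                    # (1, embed_dim)
        words, hidden = [], None
        inputs = feat.unsqueeze(1)                              # visual feature is fed first
        for _ in range(max_len):
            out, hidden = decoder.lstm(inputs, hidden)
            word = decoder.out(out[:, -1]).argmax(dim=-1)       # greedy next-word choice
            if word.item() == eos_id:
                break
            words.append(word.item())
            inputs = decoder.embed(word).unsqueeze(1)           # (1, 1, embed_dim)
        captions.append(words)
    return captions                                             # one word-id list per region
```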
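The next sketch is a hypothetical rendering of the idea in contribution 2: semantic features built from the predicted descriptions and categories are embedded in the same word space as the caption and fused with the CNN feature before entering the LSTM language model. The mean-pooling fusion and all layer sizes are assumptions for illustration, not the exact EE-LSTM formulation.

```python
# Hypothetical element-embedding language model: CNN feature + word-space semantic feature.
import torch
import torch.nn as nn


class ElementEmbeddingLSTM(nn.Module):
    def __init__(self, vocab_size, cnn_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)   # shared word space
        self.visual_fc = nn.Linear(cnn_dim, embed_dim)           # project the CNN feature
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def semantic_feature(self, element_word_ids):
        # element_word_ids: (B, N) word ids of predicted categories / region descriptions.
        # Averaging their embeddings keeps the semantic feature in the word space.
        return self.word_embed(element_word_ids).mean(dim=1)     # (B, embed_dim)

    def forward(self, cnn_feat, element_word_ids, captions):
        sem = self.semantic_feature(element_word_ids)             # (B, embed_dim)
        vis = self.visual_fc(cnn_feat)                            # (B, embed_dim)
        words = self.word_embed(captions)                         # (B, T, embed_dim)
        # Fused visual + semantic features are fed before the caption words.
        inputs = torch.cat([(vis + sem).unsqueeze(1), words], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                                   # (B, T+1, vocab_size)
```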
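Finally, a hypothetical sketch of the keyword-driven, bilateral idea in contribution 3: one sub-model generates the words before the keyword (in reverse order), its final state is passed through a context transfer step, and a second sub-model then generates the words after the keyword. The linear context transfer and the left/right decomposition are assumptions for illustration, not the exact CDB-LSTM.

```python
# Hypothetical bilateral, keyword-driven decoder with a context transfer step.
import torch
import torch.nn as nn


class BilateralKeywordDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.backward_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.forward_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Context transfer: map the backward sub-model's final state into the
        # initial state of the forward sub-model so the two halves stay dependent.
        self.transfer = nn.Linear(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, keyword_ids, left_ids, right_ids):
        # keyword_ids: (B,); left_ids: words before the keyword in reverse order (B, Tl);
        # right_ids: words after the keyword (B, Tr).
        kw = self.embed(keyword_ids).unsqueeze(1)                 # (B, 1, E)
        left_h, (h_n, c_n) = self.backward_lstm(
            torch.cat([kw, self.embed(left_ids)], dim=1))
        h0 = torch.tanh(self.transfer(h_n))                       # transferred context
        c0 = torch.tanh(self.transfer(c_n))
        right_h, _ = self.forward_lstm(
            torch.cat([kw, self.embed(right_ids)], dim=1), (h0, c0))
        return self.out(left_h), self.out(right_h)                # logits for both halves
```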
Keywords: Image captioning, Local semantic learning, EE-LSTM, CDB-LSTM