Image and Text Representation Learning based on Deep Neural Networks


Student thesis: Doctoral Thesis



Award date: 19 Oct 2022


Large amounts of data are now generated across society, and it has become important to develop proper methods for analyzing the data relevant to different businesses and events. Among those methods, representation learning plays a central role. With the rise of deep learning, deep representations learned by neural networks have come to benefit most learning tasks. Compared with conventional handcrafted features, deep representations are more flexible and scalable. In machine learning, they bring significant performance improvements to a wide range of downstream tasks, from computer vision to natural language processing. In this thesis, I concentrate on improving deep neural networks to learn deep representations for image and text understanding tasks.

(i) Contrastive sentence representation learning for image captioning evaluation: An intrinsic image captioning evaluation metric based on a recurrent neural network and contrastive learning is proposed. It consists of a bi-directional GRU encoder and an LSTM decoder and is trained with self-supervision and contrastive semantic learning.

(ii) Cross-modality representation learning for image captioning: An intrinsic cross-modality captioning model is proposed to improve image captioning. A cross-modality alignment module is designed to boost the language decoder of the captioning model. By aligning cross-modality features from the visual domain to the text domain, the model learns not only to decode from the visual features but also to grasp the intrinsic features for better performance.

(iii) Convolutional and transformer joint representation learning for image quality assessment: A hybrid framework that uses both deep CNN layers and a transformer encoder is proposed for image quality estimation. The proposed framework is compatible with both full-reference (FR) and no-reference (NR) settings. Deep CNNs are good at modeling local visual patterns, while the transformer encoder excels at modeling global context. By taking advantage of the hybrid network structure and introducing several training strategies, improved performance on image quality assessment tasks is achieved.

(iv) Dual Swin-Transformer representation learning for RGB-D salient object detection: A dual Swin-Transformer-based mutual interactive network is proposed for RGB-D saliency prediction. Swin-Transformer is adopted as the feature extractor for both the RGB and depth modalities to model long-range dependencies in the visual inputs. Attention-based modules are applied to enhance the features from each modality. A self-attention-based cross-modality interaction module and a gated modality attention module are proposed to leverage the complementary information between the two modalities. For saliency decoding, different stages enhanced with dense connections are proposed. A skip convolution module is presented to give more guidance from the RGB modality to the final saliency prediction. In addition, edge supervision is included as a regularization for better representation learning on this task.
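
As a rough illustration of how a gated module can weigh complementary information from the RGB and depth modalities, the sketch below fuses two feature vectors with a single learned scalar gate. It is not the thesis's architecture: the function name `gated_fusion`, the linear gate parameterization, and the weights `w_rgb`, `w_depth`, and `bias` are all assumptions introduced here for illustration, and real models operate on feature maps rather than short lists.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(rgb, depth, w_rgb, w_depth, bias=0.0):
    """Fuse two equal-length feature vectors with one scalar gate.

    Hypothetical parameterization (an assumption, not the thesis's module):
    the gate is a sigmoid over a linear score of both modalities.
    Returns the fused vector and the gate value.
    """
    if not (len(rgb) == len(depth) == len(w_rgb) == len(w_depth)):
        raise ValueError("all vectors must have the same length")
    # Linear score computed from both modalities.
    logit = bias
    logit += sum(w * x for w, x in zip(w_rgb, rgb))
    logit += sum(w * x for w, x in zip(w_depth, depth))
    gate = sigmoid(logit)
    # Convex combination: a gate near 1 favors RGB, near 0 favors depth.
    fused = [gate * r + (1.0 - gate) * d for r, d in zip(rgb, depth)]
    return fused, gate


if __name__ == "__main__":
    rgb_feat = [1.0, 3.0]
    depth_feat = [3.0, 1.0]
    zeros = [0.0, 0.0]
    fused, gate = gated_fusion(rgb_feat, depth_feat, zeros, zeros)
    print(gate, fused)
```

With all gate weights set to zero the gate sits at 0.5 and the output is the plain average of the two modalities. A strongly negative bias pushes the gate toward 0, so the fused vector follows the depth features, which is the kind of per-input reweighting a gating mechanism is meant to provide.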