Data-Driven Compact Feature Learning for Visual Retrieval


Student thesis: Doctoral Thesis





Awarding Institution
Award date: 1 Mar 2021


Over the past two decades, high-dimensional raw data has become central to many applications, such as raw images from digital cameras and high-definition video. Such data provides rich content and a higher-quality visual experience, but it also poses challenges for existing systems: there are strict requirements on storage cost and computational efficiency, and it is difficult to process and understand high-dimensional data at a semantic level, for example in image search over a photo collection or human tracking in a video surveillance system. It is therefore usually necessary to map high-dimensional data to low-dimensional vector representations for effective retrieval. A prominent approach to this goal is hash-based Approximate Nearest Neighbor (ANN) search, which projects high-dimensional data into low-dimensional binary vector embeddings. Binary-valued representations have many advantages, such as a small memory footprint and a fast search mechanism based on Hamming distance comparison. This thesis is therefore dedicated to developing several compact feature learning algorithms, based on deep hashing and deep quantization, for efficient visual data retrieval. Deep hashing projects raw image data into low-dimensional binary code embeddings, while deep quantization uses multiple codebooks to approximate the data points. Each approach has its own advantages, and both achieve compact feature representations that improve search speed.
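The speed advantage of binary codes comes from how cheaply Hamming distance can be computed. As a minimal illustration (a toy sketch, not the thesis's implementation), the snippet below compares two short binary codes standing in for learned hash embeddings:

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Number of bit positions at which two binary codes differ."""
    return int(np.count_nonzero(a != b))

# Toy 8-bit codes standing in for learned hash embeddings of two images.
query = np.array([1, 0, 1, 1, 0, 0, 1, 0])
db_item = np.array([1, 0, 0, 1, 0, 1, 1, 0])
print(hamming_distance(query, db_item))  # -> 2
```

In a real system the codes are packed into machine words, so this comparison reduces to an XOR followed by a population count, which is why ranking millions of items by Hamming distance is fast.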

Triplet loss has been widely used in hash-based retrieval tasks because it builds a discriminative feature space for data with different levels of variance. However, it requires a well-designed sample selection strategy, which is essential for learning high-quality feature representations and accelerating neural network convergence. In this thesis, it was found that the binary codes generated under the supervision of the triplet loss are redundant for searching related items; specifically, individual bits contribute unevenly in terms of robustness and uniqueness. To alleviate these problems, a deep supervised hashing method is proposed that combines 3D triplet selection with a novel binary code selection algorithm. The triplet selection process is improved by an online generator that uses all unique triplets in each mini-batch, which is more efficient and space-saving than offline selection. In addition, a unified binary code selection algorithm is developed to remove redundant bits, yielding a scalable hash code with minimal loss of accuracy. Experimental results show that the proposed method achieves promising performance on widely used datasets.
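The core of online triplet generation is that a labelled mini-batch already contains every valid (anchor, positive, negative) combination, so triplets can be enumerated on the fly rather than precomputed offline. The sketch below shows this enumeration in its simplest form (a generic illustration under that assumption, not the thesis's 3D selection algorithm):

```python
import numpy as np

def all_unique_triplets(labels):
    """Enumerate every valid (anchor, positive, negative) index triplet
    in a mini-batch: positive shares the anchor's label, negative does not."""
    labels = np.asarray(labels)
    n = len(labels)
    triplets = []
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue  # positive must be a different sample with the same label
            for n_idx in range(n):
                if labels[n_idx] != labels[a]:
                    triplets.append((a, p, n_idx))
    return triplets

# A batch of 4 samples from two classes yields 8 unique triplets.
print(len(all_unique_triplets([0, 0, 1, 1])))  # -> 8
```

Because the triplets are derived from the current batch only, nothing has to be stored between iterations, which is the space saving the paragraph above refers to.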

To avoid the expensive training costs and inevitable quantization errors that limit metric learning, a novel deep hashing algorithm called Angular Deep Supervised Hashing (ADSH) is proposed. This algorithm uses the A-softmax loss to explicitly improve the intra-class compactness and inter-class separability of learned features in hash space. We define a Hamming distance matrix to optimize the weight vectors of the softmax layer based on their mean and variance. Geometrically, this constrains the generated deep features by adjusting the angular distribution of features on the hypersphere to better fit the hash space. In addition, a dynamic softmax layer is designed to handle training in multi-label cases. Extensive experiments on the two well-known CIFAR-10 and NUS-WIDE datasets show that the proposed ADSH generates high-quality, compact binary codes and achieves accurate retrieval performance.
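The angular-margin idea behind the A-softmax loss can be sketched in a few lines: the target class logit is replaced by cos(m·θ) instead of cos(θ), which forces features of the same class into a tighter angular cone around their class weight vector. The snippet below is a simplified single-sample illustration (the margin function and all variable names are assumptions; the published A-softmax uses a monotone extension ψ(θ) so that the loss remains valid for θ > π/m):

```python
import numpy as np

def angular_softmax_loss(feat, weights, label, m=2):
    """Simplified A-softmax-style loss for one sample.
    Only valid for theta in [0, pi/m]; real A-softmax extends cos(m*theta)
    monotonically beyond that range."""
    f = feat / np.linalg.norm(feat)                      # normalize feature direction
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = w @ f                                          # cosine to each class weight
    theta = np.arccos(np.clip(cos[label], -1.0, 1.0))
    logits = cos.copy()
    logits[label] = np.cos(m * theta)                    # sharpened target logit
    logits *= np.linalg.norm(feat)                       # restore feature magnitude
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.log(p[label])

W = np.eye(2)                      # two toy class weight vectors
x = np.array([2.0, 0.5])           # a feature near class 0
# The margin (m=2) makes the same sample harder, i.e. the loss increases.
print(angular_softmax_loss(x, W, 0, m=1), angular_softmax_loss(x, W, 0, m=2))
```

With m = 1 the function reduces to an ordinary softmax cross-entropy, so comparing m = 1 against m = 2 shows exactly how the margin tightens the angular decision boundary.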

Inspired by this viewpoint on the softmax loss, the ADSH algorithm was re-examined, revealing a performance gap between models with and without quantization. Based on this observation, we argue that the linear weights of the softmax layer may prevent the network from learning a discriminative feature space, because the learning process occurs between the weight space and the feature space. To address the mismatch between the training and retrieval objectives, a non-parametric softmax loss function is developed to supervise the similarity between deep features directly. Specifically, the proposed loss adopts the idea of contrastive learning with positive and negative terms, thereby converting the multi-class classification problem into a binary classification problem. In addition, a quantization loss is developed to control the quantization error of the generated hash codes. Extensive experiments on large-scale benchmark datasets show that this model achieves better results than ADSH and other deep hashing methods.
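The key difference from a parametric softmax is that supervision acts on pairwise feature similarities rather than on class weight vectors. The sketch below illustrates that idea plus a simple quantization penalty (both functions, the temperature `tau`, and the squared-error form of the quantization term are illustrative assumptions, not the thesis's exact formulation):

```python
import numpy as np

def nonparam_softmax_loss(feats, labels, tau=0.1):
    """Non-parametric softmax over pairwise feature similarities:
    for each anchor, same-label features are positives; no linear class
    weights are involved, so the loss acts directly in feature space."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T / tau                       # temperature-scaled cosine similarities
    labels = np.asarray(labels)
    n = len(labels)
    total = 0.0
    for i in range(n):
        mask = np.arange(n) != i              # exclude the anchor itself
        logits = sim[i, mask]
        pos = labels[mask] == labels[i]
        log_den = np.log(np.exp(logits - logits.max()).sum()) + logits.max()
        total += -(logits[pos] - log_den).mean()
    return total / n

def quantization_loss(feats):
    """Penalize the gap between continuous outputs and their binary codes."""
    return float(np.mean((feats - np.sign(feats)) ** 2))
```

A vector that already lies exactly on the binary vertices (entries ±1) incurs zero quantization loss, which is precisely the behavior the quantization term is meant to encourage during training.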

Quantization techniques have been widely used in ANN similarity search, data compression, and related areas. However, most deep quantization methods train the codebook in an unsupervised manner, independently of the deep neural network. In addition, these methods are designed around Euclidean distance in the feature space, which may not perform well for Maximum Inner Product Search (MIPS). Heuristically, we find that the softmax loss can be viewed as vector quantization on a hyperspherical manifold. On this basis, a new Angular Deep Supervised Vector Quantization (ADSVQ) method is proposed, which uses the softmax loss to expose the direct connection between classification and retrieval tasks. Specifically, the linear weight vectors are regarded as the centroids of vector quantization on the hyperspherical manifold, and the asymmetric distance computation is reformulated within the classification stage. In addition, the multi-label case is handled with the sigmoid function. Our experimental results show that ADSVQ achieves state-of-the-art performance against recent deep quantization models on well-known datasets.
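The asymmetric distance computation mentioned above can be sketched concretely: database items are stored only as centroid indices, while the query stays continuous, so each query needs just one table of query-centroid inner products. This is a generic illustration of asymmetric MIPS scoring (all names are assumptions), not the ADSVQ training procedure itself:

```python
import numpy as np

def quantize_to_centroids(x, centroids):
    """Assign each vector to the centroid with the largest inner product --
    vector quantization viewed on the hypersphere."""
    return np.argmax(x @ centroids.T, axis=1)

def asymmetric_scores(query, centroids, codes):
    """Asymmetric MIPS: the query is NOT quantized; each database item is
    represented only by its centroid index, so scoring reduces to a lookup
    in a per-query table of query-centroid inner products."""
    table = centroids @ query     # computed once per query
    return table[codes]           # one table lookup per database item

centroids = np.eye(3)                                    # toy codebook
db = np.array([[0.9, 0.1, 0.0], [0.1, 0.95, 0.0]])       # two database items
codes = quantize_to_centroids(db, centroids)             # -> [0, 1]
print(asymmetric_scores(np.array([1.0, 0.0, 0.0]), centroids, codes))
```

Keeping the query unquantized is what makes the distance "asymmetric": only one side of the comparison suffers quantization error, which typically improves ranking accuracy at negligible extra cost.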

Recently, deep triplet quantization (DTQ) based on metric learning proposed a group-hard strategy to mine high-quality triplet samples. To further improve its performance, we propose a new deep triplet residual quantization (DTRQ) model, which integrates residual quantization (RQ) into both the triplet selection strategy and the quantization error control for MIPS. Specifically, instead of randomly grouping the samples as in DTQ, samples are grouped based on the geometric information provided by RQ, so that each group generates more high-quality triplets for faster convergence. In addition, the triplet quantization loss is decomposed into norm and angle terms, which significantly reduces code redundancy in MIPS ranking. By adding residual quantization to both the triplet selection stage and the quantization error control, DTRQ generates high-quality, compact binary codes and achieves promising image retrieval performance on the NUS-WIDE, CIFAR-10, and MS-COCO benchmark datasets. Cross-database evaluation shows that DTRQ trains faster and performs better than DTQ.
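Residual quantization itself is the building block here: each stage quantizes the residual left by the previous stage, so a point is approximated by a sum of one centroid per codebook, and the first-stage codes give a natural geometric grouping of samples. The following is a minimal RQ encoder sketch under those assumptions (toy codebooks, not learned ones):

```python
import numpy as np

def rq_encode(x, codebooks):
    """Residual quantization: stage t quantizes the residual left by
    stages 1..t-1, so the approximation is a sum of one centroid per
    codebook and the error shrinks stage by stage."""
    codes, approx = [], np.zeros_like(x)
    residual = x.copy()
    for cb in codebooks:
        # Nearest centroid (Euclidean) in this stage's codebook.
        idx = np.argmin(np.linalg.norm(residual[:, None, :] - cb[None], axis=2), axis=1)
        codes.append(idx)
        approx += cb[idx]
        residual = x - approx
    return codes, approx

x = np.array([[1.5, 0.0]])                               # one toy data point
cb1 = np.array([[1.0, 0.0], [0.0, 1.0]])                 # coarse codebook
cb2 = np.array([[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]])     # refinement codebook
codes, approx = rq_encode(x, [cb1, cb2])
print(codes, approx)    # the two stages together reconstruct x exactly here
```

In a DTRQ-style pipeline, the stage-one code `codes[0]` could serve as the group label for triplet mining: points sharing a coarse centroid are geometrically close, so hard positives and negatives are far more likely to be found within and across such groups than under random grouping.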