Deep Networks for Face Recognition and Retrieval Based on Learning Eigen-Filters and Binary Code Representations

基於特徵卷積核學習和二值化編碼表徵學習的深度網絡及在人臉識別和人臉圖像檢索上的應用

Student thesis: Doctoral Thesis

Award date: 14 Sept 2021

Abstract

A massive amount of visual content, a considerable portion of which consists of face images, is uploaded to social media every day, driven by the popularity of smartphones with advanced cameras. There is therefore an urgent need for more efficient and accurate face recognition and face retrieval methods. Although current state-of-the-art work on deep face recognition provides a promising and general solution to face recognition and retrieval tasks, it still suffers from several drawbacks and leaves much room for improvement. This thesis explores novel ideas concerning the trade-off between efficiency and accuracy in face recognition and retrieval from two perspectives. The first is obtaining convolutional filters that yield more efficient feature representations with better generalization. The second is learning binary code representations for large-scale face image retrieval.

In the first part of the thesis, we study a new face recognition method from the first perspective. Training deep convolutional neural networks (CNNs) with millions of learnable parameters typically requires high computational cost and large amounts of labeled data. In light of these problems, we compute predefined convolution kernels from the training data. To this end, we propose a novel unsupervised three-stage approach for filter learning. It learns filters of multiple structures, including standard, channel-wise, and point-wise filters, inspired by variants of the convolution operations used in CNNs. By analyzing how the learned filters linearly combine to reconstruct the original convolution kernels of pre-trained CNNs, we minimize the reconstruction error to select the most representative filters from the filter bank. These filters are used to build a network, followed by HOG-based feature extraction for the final representation. The proposed method is computationally efficient in both the training and inference stages, yet it performs comparably to deep learning-based face recognition methods. In particular, it is remarkably robust to variations in facial expression and illumination. The approach also provides a perspective for interpreting CNNs by introducing the concepts of advanced convolutional layers into unsupervised filter learning.
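To make the filter-selection idea concrete, the following is a minimal sketch, not the thesis' exact three-stage procedure: given a bank of candidate filters (for example, eigen-filters obtained from PCA on image patches) and flattened kernels taken from a pre-trained CNN layer, it greedily selects the candidates whose linear span reconstructs the pre-trained kernels with the smallest least-squares error. All names (`filter_bank`, `pretrained_kernels`, `n_select`) are illustrative.

```python
# Illustrative sketch only: greedy selection of representative filters from an
# unsupervised filter bank by minimizing the reconstruction error of pre-trained
# CNN kernels expressed as linear combinations of the selected filters.
import numpy as np

def select_filters(filter_bank, pretrained_kernels, n_select):
    """filter_bank: (n_candidates, k*k) candidate filters (e.g. eigen-filters);
    pretrained_kernels: (n_kernels, k*k) flattened kernels from a pre-trained CNN
    layer; returns indices of the n_select candidates whose linear span best
    reconstructs the pre-trained kernels."""
    selected = []
    remaining = list(range(len(filter_bank)))
    K = pretrained_kernels.T                       # (d, n_kernels)
    for _ in range(n_select):
        best_idx, best_err = None, np.inf
        for idx in remaining:
            F = filter_bank[selected + [idx]].T    # (d, m) trial basis
            # Least-squares coefficients of each pre-trained kernel in this basis.
            coeffs, *_ = np.linalg.lstsq(F, K, rcond=None)
            err = np.linalg.norm(K - F @ coeffs)   # reconstruction residual
            if err < best_err:
                best_idx, best_err = idx, err
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected

# Toy usage with random data standing in for real filters.
rng = np.random.default_rng(0)
bank = rng.normal(size=(32, 9))       # 32 candidate 3x3 filters
kernels = rng.normal(size=(64, 9))    # 64 flattened 3x3 kernels from a CNN layer
print(select_filters(bank, kernels, n_select=8))
```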

In the second part of the thesis, we focus on learning binary code representations with deep CNNs for efficient and discriminative face image retrieval. Computing the Euclidean or cosine distance between real-valued features over a large-scale database remains a heavy burden for mobile or embedded devices, where both latency tolerance and storage are limited. Learning compact binary representations that enable approximate nearest neighbor (ANN) search provides a workable solution, significantly improving query speed and reducing storage cost. We first investigate hashing, which maps high-dimensional data to lower-dimensional binary codes in Hamming space while preserving the original similarity. Recently, deep hashing methods that combine end-to-end feature learning and hash learning have surpassed traditional hashing methods by a large margin in image retrieval. However, owing to the challenging intra-class variations in face images, neither existing pairwise/triplet-label-based nor softmax-classification-loss-based deep hashing approaches can generate sufficiently compact and discriminative binary codes. To address these issues, we propose a center-based framework that integrates end-to-end hash learning and class-center learning. The framework minimizes intra-class variance by clustering samples of the same class around a learnable class center. In addition, we propose a novel regularization term that enlarges the Hamming distance between pairs of class centers to improve inter-class separability. Furthermore, an effective regression matrix is introduced to encourage intra-class samples to generate identical binary codes, thereby enhancing the compactness of the hash codes. Experiments on four large-scale datasets show that the proposed method outperforms state-of-the-art baselines across various code lengths and commonly used evaluation metrics.
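As an illustration of the center-based idea only, not the thesis' exact loss, the sketch below shows a PyTorch-style objective in which relaxed binary codes are pulled toward a learnable class center while pairs of centers are pushed apart by a margin on their relaxed Hamming distance; `margin` and `lambda_sep` are hypothetical hyper-parameters.

```python
# Minimal sketch of a center-based hashing objective (not the thesis' formulation).
import torch
import torch.nn as nn

class CenterHashingLoss(nn.Module):
    def __init__(self, num_classes, code_len, margin=None, lambda_sep=0.1):
        super().__init__()
        # Learnable class centers, kept near {-1, +1} through tanh.
        self.centers = nn.Parameter(torch.randn(num_classes, code_len))
        self.margin = margin if margin is not None else code_len / 2
        self.lambda_sep = lambda_sep

    def forward(self, codes, labels):
        """codes: (B, code_len) relaxed codes in [-1, 1]; labels: (B,) class ids."""
        centers = torch.tanh(self.centers)
        # Intra-class term: pull each sample's code toward its class center.
        intra = ((codes - centers[labels]) ** 2).sum(dim=1).mean()
        # Inter-class term: relaxed Hamming distance between pairs of centers.
        # For +-1 codes, Hamming distance = (L - <c_i, c_j>) / 2.
        sim = centers @ centers.t()
        ham = (centers.shape[1] - sim) / 2
        mask = ~torch.eye(centers.shape[0], dtype=torch.bool, device=codes.device)
        sep = torch.clamp(self.margin - ham[mask], min=0).mean()
        return intra + self.lambda_sep * sep

# Toy usage: 16-bit codes for 10 identities.
loss_fn = CenterHashingLoss(num_classes=10, code_len=16)
codes = torch.tanh(torch.randn(8, 16))    # stand-in for network outputs
labels = torch.randint(0, 10, (8,))
print(loss_fn(codes, labels).item())
```

In a complete pipeline one would typically add a quantization penalty that pushes the relaxed codes toward ±1 before binarizing them with the sign function at retrieval time.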

In the final chapter of the thesis, we adopt another well-known ANN technique, product quantization (PQ), for binary code representation in Euclidean subspaces. PQ can produce many more distinct values for describing the pairwise similarity between two samples, which gives it an advantage over hashing-based methods in generating informative binary codes. Although deep hashing has emerged as an effective solution for large-scale face image retrieval, its counterpart, deep quantization, which learns binary code representations with dictionary-related distance metrics, has seldom been explored for this task. To fill this gap, we make the first attempt to integrate product quantization into a deep learning framework for large-scale face image retrieval. Unlike prior deep quantization methods, in which the codewords are learned together with the network parameters, we propose a novel scheme that uses predefined orthonormal vectors as codewords. These predefined codewords, with a fixed 90-degree angular separation, aim to enhance the informativeness of the quantization and reduce codeword redundancy. To make the most of the discriminative information, we design a tailored loss function that maximizes identity discriminability in each quantization subspace for both the quantized and the original features. Furthermore, an entropy-based regularization term is imposed to reduce the quantization error. We conduct experiments on three commonly used datasets under both single-domain and cross-domain retrieval settings. The proposed method significantly outperforms all compared deep hashing and quantization methods under both settings, and the proposed orthonormal codewords consistently improve both standard retrieval performance and generalization ability. The proposed method is therefore better suited to scalable face image retrieval than deep hashing methods.
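For intuition, here is a minimal NumPy sketch of product quantization with predefined orthonormal codewords. It builds one orthonormal codebook per subspace via a QR decomposition of a random matrix (an assumption for illustration; the thesis does not necessarily construct its codewords this way) and assigns each subvector to the codeword with the largest inner product, which for unit-norm codewords coincides with the nearest codeword in Euclidean distance. With 16 codewords per subspace and 8 subspaces, each feature is stored in 8 × 4 = 32 bits.

```python
# Illustrative sketch of product quantization with predefined orthonormal codewords.
import numpy as np

def make_orthonormal_codebooks(num_subspaces, sub_dim, rng):
    """One (sub_dim, sub_dim) orthonormal codebook per subspace: columns are
    mutually orthogonal unit codewords (90-degree angular separation)."""
    books = []
    for _ in range(num_subspaces):
        q, _ = np.linalg.qr(rng.normal(size=(sub_dim, sub_dim)))
        books.append(q)
    return books

def pq_encode(features, codebooks):
    """features: (N, D) with D = num_subspaces * sub_dim.
    Returns (N, num_subspaces) integer codes, one codeword index per subspace."""
    num_subspaces = len(codebooks)
    sub_dim = codebooks[0].shape[0]
    codes = np.empty((features.shape[0], num_subspaces), dtype=np.int64)
    for m, book in enumerate(codebooks):
        sub = features[:, m * sub_dim:(m + 1) * sub_dim]   # (N, sub_dim)
        # Assign each subvector to its most similar codeword (max inner product,
        # equivalent to nearest unit-norm codeword in Euclidean distance).
        codes[:, m] = np.argmax(sub @ book, axis=1)
    return codes

# Toy usage: 128-d features split into 8 subspaces of 16 dimensions each.
rng = np.random.default_rng(0)
books = make_orthonormal_codebooks(num_subspaces=8, sub_dim=16, rng=rng)
feats = rng.normal(size=(5, 128))
print(pq_encode(feats, books))
```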