Towards Generalizable Blind Image Quality Assessment


Student thesis: Doctoral Thesis





Award date: 16 Mar 2022


In the past seven years, data-driven blind IQA (BIQA) models based on deep neural networks (DNNs) have come to outperform knowledge-driven models based on natural scene statistics (NSS) in terms of correlation with human data on existing IQA databases. Nevertheless, the impressive correlation numbers achieved by DNN-based BIQA models are questionable. First, model comparison has been performed on small sets of images that are not sufficiently representative of the enormous space and diversity of the natural image population. Second, the same test images have been reused to evaluate models for many years, raising the risk of overfitting through extensive adaptation to excessively recycled test sets.

In this thesis, we first leverage group maximum differentiation (gMAD) competition examples to improve BIQA performance. Based on convolutional neural networks (CNNs), we construct a top-performing BIQA model that serves as the baseline and performs favorably against previous BIQA methods in assessing the perceptual quality of synthetically distorted images. The baseline is then pitted against a group of stronger full-reference IQA methods in gMAD, seeking its counterexamples for subjective testing. We further let the model adapt to the selected gMAD examples without forgetting previously seen databases by fine-tuning on images from both sources. Finally, we iterate the entire process of gMAD example mining, subjective testing, and fine-tuning for several rounds, enabling active learning from gMAD examples for BIQA. However, when applying this progressive failure-identification pipeline to troubleshoot "best-performing" BIQA models in the wild, we face a practical challenge: obtaining stronger competing models for efficient failure-spotting is highly nontrivial. Inspired by recent findings that difficult samples for deep models may be exposed through network pruning, we construct a set of "self-competitors" as random ensembles of pruned versions of the target model to be improved. Diverse failures can then be efficiently identified via self-gMAD competition. Next, we fine-tune both the target and its pruned variants on the human-rated gMAD set, allowing all models to learn from their respective failures and preparing them for the next round of self-gMAD competition. Experimental results demonstrate that our method efficiently troubleshoots BIQA models and improves their generalizability.
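The gMAD example-mining step above can be sketched as follows. This is a minimal illustration, not the thesis's exact procedure: the function name `gmad_pairs`, the quantile-based quality levels, and the `num_levels` parameter are all assumptions. The idea is that, within each quality level of the defender (the model under test), the attacker's most divergent predictions expose candidate counterexamples for subjective testing.

```python
import numpy as np

def gmad_pairs(defender_scores, attacker_scores, num_levels=5):
    """Sketch of gMAD pair mining (illustrative, not the thesis's code).

    For each defender quality level, return the image pair that the
    attacker rates most differently, i.e., a candidate counterexample
    of the defender suitable for subjective testing.
    """
    d = np.asarray(defender_scores, dtype=float)
    a = np.asarray(attacker_scores, dtype=float)
    # Partition images into quality levels according to the defender.
    edges = np.quantile(d, np.linspace(0.0, 1.0, num_levels + 1))
    pairs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = np.where((d >= lo) & (d <= hi))[0]
        if len(idx) < 2:
            continue
        # Within a level the defender deems quality roughly equal,
        # so the attacker's extremes mark the largest disagreement.
        best = idx[np.argmax(a[idx])]
        worst = idx[np.argmin(a[idx])]
        pairs.append((int(worst), int(best), float(a[best] - a[worst])))
    return pairs
```

In the self-gMAD variant, `attacker_scores` would come from a random ensemble of pruned copies of the same target model rather than from an external full-reference method.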

We then explore semi-supervised learning (SSL) for DNN-based BIQA. BIQA models are generally developed in a supervised manner, optimized and tested against human ratings in the form of mean opinion scores (MOSs), which are labor-expensive to collect. The performance of these approaches depends heavily on the amount of labeled training data, and when human-rated data are insufficient, BIQA models may perform poorly. We therefore explore pseudo-labeling and semi-supervised negative correlation learning (NCL) to mitigate the dependency of DNN-based BIQA training on large-scale labeled datasets. In pseudo-labeling, we devise a deep ensemble-based BIQA model (termed the target to promote) with two heads for quality estimation and pseudo-label guessing, respectively. It is first trained on a small set of human-rated images, where the supervisory signals are binary labels indicating the pairwise ranking of perceptual quality within an image pair. We then use the pseudo-label guessing head to assign pseudo-binary labels to unlabeled pairs, and re-train the target on the combination of labeled and pseudo-labeled data. This process may be iterated, progressively improving the performance of the target.

In semi-supervised NCL, we train a multi-head convolutional network for quality prediction by maximizing the accuracy of the ensemble (as well as the base learners) on labeled data and the disagreement (i.e., diversity) among them on unlabeled data. We conduct extensive experiments to demonstrate the advantages of employing unlabeled data for BIQA, especially for model generalization and failure identification.
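The semi-supervised NCL objective can be sketched as below. This is a hedged illustration under two assumptions not stated in the abstract: a squared-error fidelity term on labeled data and a variance-style diversity term on unlabeled data, combined with a trade-off weight `lam`; the thesis's actual losses may differ.

```python
import numpy as np

def ssl_ncl_loss(labeled_preds, labels, unlabeled_preds, lam=0.5):
    """Sketch of a semi-supervised NCL objective (illustrative only).

    labeled_preds:   (K, Nl) per-head predictions on labeled images
    labels:          (Nl,)   MOS-like quality labels
    unlabeled_preds: (K, Nu) per-head predictions on unlabeled images
    lam:             assumed trade-off weight between fidelity and diversity
    """
    labeled_preds = np.asarray(labeled_preds, dtype=float)
    unlabeled_preds = np.asarray(unlabeled_preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Fidelity: every base learner (and hence the ensemble) should
    # match the human labels on the labeled set.
    fidelity = np.mean((labeled_preds - labels) ** 2)
    # Diversity: penalize heads that agree with the ensemble mean on
    # unlabeled images, encouraging disagreement among base learners.
    ensemble = unlabeled_preds.mean(axis=0, keepdims=True)
    diversity = np.mean((unlabeled_preds - ensemble) ** 2)
    # Minimizing this quantity maximizes accuracy on labeled data and
    # disagreement on unlabeled data simultaneously.
    return fidelity - lam * diversity
```

With perfectly fitting heads, a batch of unlabeled images on which the heads disagree yields a lower (better) objective than one on which they all agree, which is exactly the behavior the NCL diversity term rewards.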