Data-Driven Visual Quality Assessment and Copy Detection


Student thesis: Doctoral Thesis


Award date: 3 Sep 2019


Digital multimedia processing techniques such as image and video processing have many important applications, including entertainment, industrial manufacturing, and traffic monitoring. Multimedia quality assessment and copy detection play important roles at many stages of a multimedia system, including acquisition, compression, broadcasting, enhancement, and reproduction. Multimedia quality assessment techniques aim to estimate perceived image or video quality automatically. Data-driven quality assessment methods have recently drawn great attention because of the representational power of deep learning-based models. However, many existing metrics that rely on patches for training and testing still have room for improvement, because their inefficient patch selection step could be redesigned or removed. Content-based video copy detection techniques aim to quickly retrieve, from a large-scale video database, videos whose visual content is similar to that of a query video, using only the spatial content and no additional information. Such systems still face a trade-off between the robustness of local features to geometric attacks and the search efficiency of global features.

Many data-driven image quality assessment (IQA) algorithms use small image patches for training and testing; however, we have found that image patches from homogeneous regions are unreliable. In addition, image patches with complex structures are much more likely to yield accurate image quality predictions. Based on these findings, we enhance the conventional convolutional neural network (CNN)-based no-reference (NR)-IQA algorithm by proposing a novel patch selection algorithm that avoids the use of homogeneous patches for network training and quality score estimation. Moreover, we use a variance-based weighted average to bias the final image quality score toward patches with complex structure. Our experimental results show that the proposed approach achieves state-of-the-art performance relative to well-known NR-IQA algorithms.
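The patch selection and variance-weighted pooling described above can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the function names, the non-overlapping patch grid, and the fixed variance threshold are all assumptions, and the per-patch quality scores would in practice come from the trained CNN rather than being supplied by hand.

```python
from statistics import pvariance


def extract_patches(image, size):
    """Split a 2-D grayscale image (list of rows) into non-overlapping
    size x size patches, each returned as a flat list of pixel values."""
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(0, rows - size + 1, size):
        for c in range(0, cols - size + 1, size):
            patches.append(
                [image[r + i][c + j] for i in range(size) for j in range(size)]
            )
    return patches


def select_patches(patches, min_variance):
    """Keep (patch, variance) pairs whose pixel variance exceeds the
    threshold, discarding homogeneous patches (threshold is hypothetical)."""
    scored = [(p, pvariance(p)) for p in patches]
    return [(p, v) for p, v in scored if v > min_variance]


def pooled_quality(patch_scores, variances):
    """Variance-weighted average of per-patch quality scores, so that
    structurally complex patches contribute more to the image score."""
    total = sum(variances)
    if total == 0:
        raise ValueError("at least one patch must have non-zero variance")
    return sum(s * v for s, v in zip(patch_scores, variances)) / total
```

For example, on an image whose left half is flat and whose right half is textured, `select_patches` keeps only the textured patches, and `pooled_quality([2.0, 4.0], [1.0, 3.0])` weights the second score three times as heavily as the first.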

To further enhance the performance of the data-driven IQA metric, we propose a novel general-purpose full-reference (FR)-IQA framework that eliminates the patch generation step and consists of two quality map generation paths and a built-in spatial pooling module for nonlinear regression. The global dependencies of the feature map are first used to generate the sensitivity map for quality assessment. The sensitivity map assigns different weights to various areas of an objective error map, which simulates the attention mechanism of the human visual system, so we call our proposed framework attention-boosted deep IQA (ABD-IQA). The proposed system was evaluated with three single-distortion-based IQA databases (LIVE, CSIQ, and TID2013) and the well-known multidistortion-based IQA database (MDID). It demonstrated performance superior to that of the most recent FR-IQA metrics, and a complete cross-database evaluation showed sufficient generalizability between databases. It was also found that image data based on multiple distortions are more useful for training robust image quality metrics.
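The idea of a sensitivity map that reweights an objective error map can be illustrated with a small sketch. It assumes a softmax normalization of raw sensitivity values and a plain weighted sum as the pooling step; both are simplifications chosen for illustration, since the framework described above learns its sensitivity map and uses a built-in pooling module for nonlinear regression. The function names are hypothetical.

```python
import math


def softmax_2d(logits):
    """Normalize a 2-D map of raw sensitivity values so the weights are
    positive and sum to 1 (softmax is an assumed choice of normalization)."""
    peak = max(v for row in logits for v in row)
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def weighted_error(error_map, sensitivity_logits):
    """Pool an error map into one value, weighting each location by its
    normalized sensitivity."""
    weights = softmax_2d(sensitivity_logits)
    return sum(
        w * e
        for w_row, e_row in zip(weights, error_map)
        for w, e in zip(w_row, e_row)
    )
```

With uniform sensitivity the result reduces to the ordinary mean of the error map; when one location dominates the sensitivity map, the pooled value approaches the error at that location.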

We present a new fully convolutional deep network model for FR video quality assessment (VQA) that we call a deep pooling network (DPNet). The proposed network focuses on pooling all of the spatial quality scores of the frame patches into a single quality score for a video, which offers an alternative way of harnessing the ability of deep models. To provide the spatial-temporal information for VQA, an FR image quality metric, either the structural similarity index (SSIM) or the gradient magnitude similarity deviation (GMSD), is used to form the spatial quality map, whereas the local average and the local standard deviation of the optical flow are used for the two motion feature maps. These three handcrafted feature maps are then processed by parallel two-dimensional CNN structures and merged via elementwise multiplication to achieve a motion-masking effect. Using the merged feature map, CNN-based spatial-temporal pooling extracts a frame quality score for each input frame. To address the issue of temporal variation in frame quality, a one-dimensional CNN is applied to the frame quality score sequence to achieve long-term temporal pooling that matches human perception. Our experimental results show that the proposed FR VQA system achieves favorable performance on the LIVE and CSIQ VQA databases. Furthermore, the network can extract the quality attention map and the quality score of each frame from its hidden layers. The cross-database evaluation revealed that DPNet exhibits favorable generalizability.
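The two motion feature maps and the elementwise merge can be sketched as below. This is only an illustration under stated assumptions: the input is taken to be a precomputed 2-D map of optical flow magnitudes, the square window with border clipping is an arbitrary choice, and the maps are multiplied directly, whereas in the network described above each map first passes through its own CNN branch. The function names are hypothetical.

```python
import math


def local_stats(flow_mag, radius=1):
    """Local mean and standard deviation of a 2-D flow-magnitude map,
    using a square window that is clipped at the image borders."""
    rows, cols = len(flow_mag), len(flow_mag[0])
    mean_map = [[0.0] * cols for _ in range(rows)]
    std_map = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                flow_mag[i][j]
                for i in range(max(0, r - radius), min(rows, r + radius + 1))
                for j in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            mean = sum(window) / len(window)
            var = sum((v - mean) ** 2 for v in window) / len(window)
            mean_map[r][c] = mean
            std_map[r][c] = math.sqrt(var)
    return mean_map, std_map


def merge_maps(*maps):
    """Elementwise product of equally sized 2-D maps."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [math.prod(m[r][c] for m in maps) for c in range(cols)]
        for r in range(rows)
    ]
```

A region with uniform motion yields a constant local mean and zero local standard deviation, so multiplying the maps suppresses or emphasizes the spatial quality values depending on how the motion behaves locally.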

We propose an efficient video copy detection system based on a global video fingerprint and a search strategy that operates on an inverted file. In this system, the fast search strategy involves only simple table look-up and word counting operations for the fingerprint matching process. The similarity of video fragments is based on the number of matched fingerprints among all video candidates. The time offset is used, and fingerprints are sorted, to further select the matched fingerprints from the video candidates. Moreover, we propose a novel regional average fingerprint that is compatible with the proposed fast search strategy. The proposed system was compared with other state-of-the-art fingerprinting algorithms on the TRECVID 2011 dataset for various types of video distortions. In addition, the VCDB dataset was used to evaluate the accuracy and efficiency of the proposed inverted-file search strategy and to demonstrate its practicality for use with a large database. The proposed system achieved higher accuracy on the VCDB dataset and ran considerably faster than conventional inverted-file-based search methods.
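The look-up and counting scheme can be sketched with a dictionary as the inverted file. This is a minimal sketch under assumptions, not the system described above: each frame is assumed to be already quantized to a single integer fingerprint word, matches are counted per (video, time offset) pair so that temporally consistent matches reinforce each other, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(database):
    """Build an inverted file mapping each fingerprint word to the list of
    (video_id, frame_index) positions where it occurs.

    database: {video_id: [fingerprint word for each frame]}
    """
    index = defaultdict(list)
    for video_id, words in database.items():
        for t, word in enumerate(words):
            index[word].append((video_id, t))
    return index


def search(index, query_words):
    """Look up every query word and count matches per (video_id, offset),
    where offset = database frame index - query frame index.

    Returns (video_id, offset, match_count) for the best candidate,
    or None if no query word occurs in the index.
    """
    votes = Counter()
    for query_t, word in enumerate(query_words):
        for video_id, t in index.get(word, ()):
            votes[(video_id, t - query_t)] += 1
    if not votes:
        return None
    (video_id, offset), count = votes.most_common(1)[0]
    return video_id, offset, count
```

For a database `{"a": [1, 2, 3, 4, 5], "b": [9, 8, 7, 6, 5]}` and the query `[3, 4, 5]`, all three query words agree on video `"a"` at offset 2, while video `"b"` collects a single isolated vote.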

Our proposed binary object fingerprint-based video copy detection system improves robustness to geometric attacks because salient objects can be robustly detected with an advanced CNN-based object detector. We use the well-known RetinaNet to generate object regions from the input frame, which are then used to generate binary fingerprints for fast copy detection in the database. This approach maintains a compact representation of video frames and a rapid search speed through its binary fingerprint search scheme. Our experimental results show that the proposed approach can improve the recall rate by about 10% while sacrificing only 1% of the prediction rate on the VCDB dataset.
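As a rough illustration of binary fingerprints computed from object regions, the sketch below thresholds the pixels of a given bounding box against the region mean and compares fingerprints by Hamming distance. The thresholding rule, the box format `(top, left, bottom, right)`, and the function names are all assumptions made for this example; the abstract does not specify how the fingerprint bits are derived, and the object detector itself is not run here (the box is taken as given).

```python
def region_fingerprint(frame, box):
    """Binary fingerprint of one object region: one bit per pixel, set to 1
    when the pixel is brighter than the region mean (assumed rule).

    frame: 2-D grayscale image as a list of rows
    box:   (top, left, bottom, right), bottom/right exclusive
    """
    top, left, bottom, right = box
    pixels = [frame[r][c] for r in range(top, bottom) for c in range(left, right)]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")
```

In a practical system each region would first be resized to a fixed grid so that all fingerprints have the same length; because the bits depend only on the content inside the detected box, the fingerprint is unaffected by where the object sits in the frame.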