Visual Feature and Texture Coding for Machine Vision

面向機器視覺的視覺特徵和紋理編碼

Student thesis: Doctoral Thesis

Award date: 10 May 2023

Abstract

With the rapid development of artificial intelligence and the unprecedented growth of visual data, compression oriented towards machine vision is gradually replacing compression oriented towards human vision in various applications. This thesis focuses on improving the performance of visual data compression for machine vision. It consists of three parts: 1) scalable feature and texture compression for analysis-friendly face representation; 2) a unified optimization framework for deep-learning-based image compression towards machine vision; 3) a spatial-temporal adaptive (STA) compression scheme for visual data serving various machine vision tasks. The first part aims to enhance the coding efficiency for facial images. The second aims to design a unified compression scheme for natural images. The third aims to improve image and video compression performance across various machine vision tasks.

In the first part, we investigate the integration of feature and texture compression and show that a universal and collaborative representation of visual information can be achieved hierarchically. In particular, we study feature and texture compression in a scalable coding framework, where the base layer carries the deep learning features and the enhancement layer reconstructs the texture. Drawing on the strong generative capability of deep neural networks, the gap between the base feature layer and the enhancement layer is bridged by feature-level texture reconstruction, which builds a texture representation from the features. The residual between the original and the reconstructed texture is then conveyed in the enhancement layer. To improve the efficiency of the proposed framework, the base-layer neural network is trained in a multitask manner so that the learned features support both high-quality reconstruction and high-accuracy analysis. The framework and optimization strategies are then applied to face image compression.
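
The layering described above can be illustrated with a minimal sketch. It is not the thesis's implementation: block averaging stands in for the learned feature extractor, nearest-neighbour upsampling stands in for the generative texture decoder, and the names `encode_base`, `predict_texture`, `factor` and `step` are all hypothetical. Only the structure is taken from the abstract: a base layer of features, a texture prediction from those features, and a residual carried in the enhancement layer.

```python
import numpy as np

def encode_base(image, factor=4):
    """Hypothetical stand-in for the learned feature extractor: block averaging."""
    h, w = image.shape
    return image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def predict_texture(feature, factor=4):
    """Hypothetical stand-in for the generative decoder: nearest-neighbour upsampling."""
    return np.repeat(np.repeat(feature, factor, axis=0), factor, axis=1)

def encode_scalable(image, factor=4, step=8.0):
    """Return the base layer (features) and the enhancement layer (quantized residual)."""
    base = encode_base(image, factor)
    prediction = predict_texture(base, factor)
    # Uniform quantization of the residual; `step` trades bitrate for fidelity.
    residual = np.round((image - prediction) / step).astype(np.int32)
    return base, residual

def decode_scalable(base, residual, factor=4, step=8.0):
    """Texture = prediction from the base layer + dequantized residual."""
    return predict_texture(base, factor) + residual * step

rng = np.random.default_rng(0)
image = rng.uniform(0, 255, size=(16, 16))

base, residual = encode_scalable(image)
coarse = predict_texture(base)          # what a base-layer-only decoder would see
recon = decode_scalable(base, residual)  # base layer + enhancement layer
max_err = np.abs(recon - image).max()
```

A machine-vision consumer would read only `base`, while a viewer that needs the texture also decodes `residual`; with uniform quantization the reconstruction error is bounded by half of `step`.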

In the second part, we propose a deep image compression scheme towards machine vision, following the principle of "begin with the end in mind". In particular, a unified optimization scheme for end-to-end image compression towards machine vision is proposed, together with an inverted bottleneck encoder structure, purpose-designed variable-bitrate coding, and generalized rate-accuracy optimization. The framework jointly optimizes the compression network and the machine vision network, and thereby exploits the potential of robust machine vision on compressed images. Variable-bitrate modules for machine vision, which substantially reduce the storage needed for model parameters, are further developed to suit real-world applications. Moreover, an iterative algorithm is presented to optimize the generalized rate-accuracy performance for machine vision.
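
For orientation, a joint rate-accuracy objective of this kind is often written as a Lagrangian. The form below is an assumption for illustration only; the abstract does not state the thesis's exact formulation, and every symbol ($f_\theta$ for the encoder, $g_\phi$ for the decoder, $h_\psi$ for the machine vision network, $Q$ for quantization, $R$ for the estimated rate, $\lambda$ for the trade-off weight) is hypothetical.

```latex
\min_{\theta,\phi,\psi}\;
\mathbb{E}_{(x,y)}\Big[
  R\big(\hat{z}\big)
  + \lambda\, \mathcal{L}_{\mathrm{task}}\big(h_\psi(g_\phi(\hat{z})),\, y\big)
\Big],
\qquad \hat{z} = Q\big(f_\theta(x)\big)
```

Under this reading, a larger $\lambda$ favours task accuracy over bitrate, and a variable-bitrate design would let a single set of parameters cover several values of $\lambda$ rather than storing one model per rate point.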

In the third part, we propose a spatial-temporal adaptive (STA) compression scheme oriented towards machine vision. The STA scheme comprises pre-analysis and pre-processing modules designed for machine vision, a spatial resampling module, a temporal resampling module, and an adaptation algorithm, which together achieve high compression efficiency for diverse image and video content. Compared with Versatile Video Coding (VVC), the state-of-the-art visual data compression standard, the proposed STA scheme achieves substantial improvements in compression efficiency on various test datasets and on multiple machine vision tasks, including object detection, instance segmentation and object tracking.
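
As a rough illustration of how spatial and temporal resampling might be chosen per clip, consider the sketch below. It is an assumption-laden toy, not the STA algorithm: the adaptation rule (mean frame difference as a motion proxy, mean horizontal gradient as a detail proxy), the thresholds, and the function names are all hypothetical, and the actual codec stage is omitted.

```python
import numpy as np

def spatial_downsample(frame, factor):
    """Block-average a 2-D frame by `factor` (crops any remainder rows/columns)."""
    h, w = frame.shape
    h2, w2 = h - h % factor, w - w % factor
    blocks = frame[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor)
    return blocks.mean(axis=(1, 3))

def temporal_subsample(frames, stride):
    """Keep every `stride`-th frame."""
    return frames[::stride]

def choose_settings(frames, motion_threshold=10.0, detail_threshold=20.0):
    """Hypothetical adaptation rule: resample less when content is busy."""
    video = np.stack(frames)
    motion = np.abs(np.diff(video, axis=0)).mean() if len(frames) > 1 else 0.0
    detail = np.abs(np.diff(video, axis=2)).mean()
    stride = 1 if motion > motion_threshold else 2   # temporal resampling
    factor = 1 if detail > detail_threshold else 2   # spatial resampling
    return factor, stride

def resample(frames):
    """Apply the chosen spatial and temporal resampling before encoding."""
    factor, stride = choose_settings(frames)
    kept = temporal_subsample(frames, stride)
    return [spatial_downsample(f, factor) for f in kept], factor, stride

# A static, flat clip is resampled aggressively; a noisy clip is left untouched.
static_clip = [np.full((8, 8), 100.0) for _ in range(6)]
static_out, static_factor, static_stride = resample(static_clip)

rng = np.random.default_rng(0)
busy_clip = [rng.uniform(0, 255, size=(8, 8)) for _ in range(6)]
busy_out, busy_factor, busy_stride = resample(busy_clip)
```

In a real system the decision would be driven by the downstream task's accuracy rather than by simple pixel statistics, and the resampled frames would be passed to a codec such as VVC.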

In summary, this thesis contributes to improving visual data compression efficiency for machine vision in three respects: 1) the rate-distortion and rate-accuracy performance for facial images is improved with a scalable framework; 2) the rate-accuracy performance for natural images is improved with a unified end-to-end optimization scheme; 3) image and video compression efficiency is improved with a spatial-temporal adaptive compression scheme. Extensive experimental results verify the effectiveness of the proposed schemes.