Learning Generalized Spatial-Temporal Deep Feature Representation for No-Reference Video Quality Assessment

Baoliang Chen, Lingyu Zhu, Guo Li, Fangbo Lu, Hongfei Fan, Shiqi Wang*

*Corresponding author for this work

Research output: Journal Publications and Reviews · RGC 21 - Publication in refereed journal · peer-reviewed

75 Citations (Scopus)

Abstract

In this work, we propose a no-reference video quality assessment method, aiming to achieve high generalization capability in cross-content, cross-resolution, and cross-frame-rate quality prediction. In particular, we evaluate the quality of a video by learning effective feature representations in the spatial-temporal domain. In the spatial domain, to tackle resolution and content variations, we impose Gaussian distribution constraints on the quality features. The unified distribution can significantly reduce the domain gap between different video samples, resulting in a more generalized quality feature representation. Along the temporal dimension, inspired by the mechanism of visual perception, we propose a pyramid temporal aggregation module that involves short-term and long-term memory to aggregate frame-level quality. Experiments show that our method outperforms the state-of-the-art methods in cross-dataset settings and achieves comparable performance on intra-dataset configurations, demonstrating the high generalization capability of the proposed method. The code is released at https://github.com/Baoliang93/GSTVQA
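To make the temporal aggregation idea concrete, here is a minimal, hypothetical sketch of pyramid-style pooling of frame-level quality scores. The function name, window scheme, and the use of min-pooling (mimicking the perceptual tendency to weight poor-quality moments heavily) are illustrative assumptions and do not reproduce the authors' exact module:

```python
import numpy as np

def pyramid_temporal_aggregation(frame_scores, levels=3):
    """Aggregate per-frame quality scores over a temporal pyramid.

    Hypothetical sketch (not the paper's implementation): each pyramid
    level pools scores over windows of increasing length (short-term
    memory), and a global average supplies the long-term memory term.
    """
    scores = np.asarray(frame_scores, dtype=float)
    level_means = []
    for lvl in range(levels):
        win = 2 ** lvl  # window length grows with pyramid level
        # min-pool consecutive windows: worst frames dominate perception
        pooled = [scores[i:i + win].min() for i in range(0, len(scores), win)]
        level_means.append(float(np.mean(pooled)))
    long_term = float(scores.mean())  # long-term memory: global average
    return float(np.mean(level_means + [long_term]))
```

Because of the min-pooling, a short burst of low-quality frames pulls the aggregated score below the plain temporal mean, which is one plausible way short-term memory effects could be modeled.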
Original language: English
Pages (from-to): 1903-1916
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 32
Issue number: 4
Online published: 11 Jun 2021
DOIs
Publication status: Published - Apr 2022

Research Keywords

  • Feature extraction
  • Quality assessment
  • Training
  • Video recording
  • Image quality
  • Streaming media
  • Nonlinear distortion
  • Video quality assessment
  • generalization capability
  • deep neural networks
  • temporal aggregation
  • IMAGE
  • STATISTICS
  • DATABASE
