
Multi-Dimensional Quality Assessment for UGC Videos via Modular Multi-Modal Vision-Language Models

Weixia Zhang, Bingkun Zheng, Junlin Chen, Zhihua Wang

Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review

Abstract

Recent advances in video processing and the growth of social media have led to a surge in user-generated content (UGC) videos. However, various factors can degrade their quality, underscoring the need for robust video quality assessment (VQA) models to optimize devices, monitor quality, and enhance recommendation systems. While current VQA models can accurately evaluate the overall quality of UGC videos, they do not offer fine-grained assessments, making it difficult to pinpoint the sources of quality issues. In this work, we introduce a VQA model that evaluates UGC videos along six quality dimensions: color, noise, artifacts, blur, temporal consistency, and overall quality. We formulate the multi-dimensional VQA task as modeling the joint distribution of all quality dimensions, encouraging our model to learn the intrinsic mechanisms by which different factors influence perceived video quality. We use emerging multi-modal vision-language models as the base quality evaluators, supplementing them with two additional modules that deliver complementary information to deepen video quality understanding. Special care is also taken to handle UGC videos with various aspect ratios, enabling us to process them at their appropriate resolutions. Specifically, we adopt the NaFlex variant of the SigLIP2 model, which adaptively resizes video frames based on their original resolutions and aspect ratios. We also employ a multi-modal large language model (MLLM), a variant of Q-Align, as an additional quality predictor, which contributes further improvements through model ensembling in the final quality prediction. Experimental results show that the proposed model outperforms competing methods on the FineVQ dataset, confirming its effectiveness. The source code is available at https://github.com/zwx8981/NTIRE2025-XGC-Track1. © 2025 IEEE.
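
As a concrete illustration of the joint-distribution formulation described above, the following is a minimal PyTorch sketch that predicts a distribution over discrete quality levels for each of the six dimensions and trains them jointly. The dimension names, level count, and head design are illustrative assumptions, not the authors' implementation; the released code is in the repository linked above.

```python
import torch
import torch.nn as nn

# Hypothetical sketch (not the authors' code): one distribution over
# discrete quality levels per dimension, trained jointly.
QUALITY_DIMS = ["color", "noise", "artifacts", "blur",
                "temporal_consistency", "overall"]
NUM_LEVELS = 5  # e.g., "bad" ... "excellent", a Q-Align-style rating scale


class MultiDimHead(nn.Module):
    """Maps a shared video embedding to one level distribution per dimension."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.heads = nn.ModuleDict(
            {d: nn.Linear(embed_dim, NUM_LEVELS) for d in QUALITY_DIMS}
        )

    def forward(self, video_emb: torch.Tensor) -> dict:
        # video_emb: (batch, embed_dim) pooled video representation
        return {d: head(video_emb) for d, head in self.heads.items()}


def joint_loss(logits: dict, targets: dict) -> torch.Tensor:
    # Summing per-dimension cross-entropies optimizes all dimensions
    # together, one simple way to couple their training signals.
    ce = nn.CrossEntropyLoss()
    return sum(ce(logits[d], targets[d]) for d in QUALITY_DIMS)


def expected_score(logits: torch.Tensor) -> torch.Tensor:
    # Read out a scalar score as the expectation over quality levels,
    # mirroring how Q-Align-style models derive continuous quality.
    probs = logits.softmax(dim=-1)
    levels = torch.arange(1, NUM_LEVELS + 1, dtype=probs.dtype,
                          device=probs.device)
    return (probs * levels).sum(dim=-1)
```

In use, one would pool frame features from the vision-language backbone into `video_emb`, apply `MultiDimHead`, and convert each dimension's level distribution to a scalar score with `expected_score`.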
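
The NaFlex-style preprocessing can likewise be sketched: resize each frame so that its patch grid fits within a fixed token budget while preserving the native aspect ratio, rather than warping the frame to a square. The patch size and token budget below are assumed placeholders, not SigLIP2's actual configuration.

```python
import math

from PIL import Image

# Simplified sketch of NaFlex-style aspect-ratio-preserving resizing
# (assumed parameters; not the actual SigLIP2 preprocessing).
def naflex_resize(frame: Image.Image, patch: int = 16,
                  max_tokens: int = 1024) -> Image.Image:
    w, h = frame.size
    # Largest uniform scale whose patch grid stays within the token budget.
    scale = min(1.0, math.sqrt(max_tokens * patch * patch / (w * h)))
    # Floor each side to a patch multiple so the grid tiles exactly
    # (clamped to at least one patch per side for tiny inputs).
    new_w = max(patch, int(w * scale) // patch * patch)
    new_h = max(patch, int(h * scale) // patch * patch)
    return frame.resize((new_w, new_h), Image.BICUBIC)
```

A 1080×1920 portrait frame, for instance, keeps its 9:16 aspect ratio while being mapped to a patch grid of at most `max_tokens` tokens, so portrait and landscape UGC videos are handled without distortion.
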
Original language: English
Title of host publication: Proceedings - 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2025)
Publisher: IEEE
Pages: 1548-1557
Number of pages: 10
ISBN (Electronic): 979-8-3315-9994-2
DOIs
Publication status: Published - 2025
Event: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2025) - Music City Center, Nashville, United States
Duration: 11 Jun 2025 – 15 Jun 2025

Publication series

Name: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
ISSN (Print): 2160-7508
ISSN (Electronic): 2160-7516

Conference

Conference: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2025)
Place: United States
City: Nashville
Period: 11/06/25 – 15/06/25

Funding

This work was supported in part by the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), the Fundamental Research Funds for the Central Universities, and the National Natural Science Foundation of China under Grant 62371283.

Research Keywords

  • fine-grained UGC video quality assessment
  • multi-dimensional distribution learning
  • vision-language model
