Abstract
Recent advances in video processing and the growth of social media have led to a surge in user-generated content (UGC) videos. However, various factors can degrade their quality, underscoring the need for robust video quality assessment (VQA) models to optimize devices, monitor quality, and enhance recommendation systems. While current VQA models can accurately evaluate the overall quality of UGC videos, they do not offer fine-grained assessments, making it difficult to pinpoint the sources of quality issues. In this work, we introduce a VQA model that evaluates UGC videos along six quality dimensions: color, noise, artifacts, blur, temporal consistency, and overall quality. We formulate the multi-dimensional VQA task as modeling the joint distribution of all quality dimensions, encouraging our model to learn the intrinsic mechanisms by which different factors influence perceived video quality. We utilize emerging multi-modal vision-language models as the base quality evaluators, supplementing them with two additional modules that deliver complementary information to deepen video quality understanding. Special care is also taken to handle UGC videos with various aspect ratios, enabling us to process them at appropriate resolutions. Specifically, we adopt the NaFlex variant of the SigLIP2 model, which adaptively resizes video frames based on their original resolutions and aspect ratios. We also employ a multi-modal large language model (MLLM), a variant of Q-Align, as the base quality predictor, which contributes additional improvements to the final quality prediction through model ensembling. Experimental results show that the proposed model outperforms competing methods on the FineVQ dataset, confirming its effectiveness. The source code is available at https://github.com/zwx8981/NTIRE2025-XGC-Track1. © 2025 IEEE.
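The fusion step described in the abstract — per-dimension scores from a vision-language evaluator combined with an MLLM-based predictor via model ensembling — might be sketched as follows. This is a minimal illustration, not the authors' implementation: the placeholder scores and the equal-weight average are assumptions, standing in for the actual SigLIP2 (NaFlex) based evaluator and the Q-Align-style MLLM.

```python
# Hypothetical sketch of multi-dimensional score fusion via model ensembling.
# The two score dictionaries below are placeholders; in the paper, one set
# would come from the vision-language evaluator and the other from the MLLM.

DIMENSIONS = ["color", "noise", "artifacts", "blur", "temporal", "overall"]

def ensemble_scores(vlm_scores: dict, mllm_scores: dict, w: float = 0.5) -> dict:
    """Weighted average of two predictors' scores for each quality dimension."""
    return {d: w * vlm_scores[d] + (1.0 - w) * mllm_scores[d] for d in DIMENSIONS}

# Placeholder per-dimension scores on a 1-5 scale (assumed range).
vlm = {d: 3.0 for d in DIMENSIONS}
mllm = {d: 4.0 for d in DIMENSIONS}

fused = ensemble_scores(vlm, mllm)
print(fused["overall"])  # 3.5
```

An equal weight (w=0.5) is the simplest choice; in practice the weight could be tuned on a validation split so that the stronger predictor contributes more.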
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2025) |
| Publisher | IEEE |
| Pages | 1548-1557 |
| Number of pages | 10 |
| ISBN (Electronic) | 979-8-3315-9994-2 |
| DOIs | |
| Publication status | Published - 2025 |
| Event | 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2025), Music City Center, Nashville, United States; 11 Jun 2025 → 15 Jun 2025 |
Publication series
| Name | IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops |
|---|---|
| ISSN (Print) | 2160-7508 |
| ISSN (Electronic) | 2160-7516 |
Conference
| Conference | 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2025) |
|---|---|
| Place | United States |
| City | Nashville |
| Period | 11/06/25 → 15/06/25 |
Funding
This work was supported in part by the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), the Fundamental Research Funds for the Central Universities, and the National Natural Science Foundation of China under Grant 62371283.
Research Keywords
- fine-grained UGC video quality assessment
- multi-dimensional distribution learning
- vision-language model
Title
Multi-Dimensional Quality Assessment for UGC Videos via Modular Multi-Modal Vision-Language Models