Abstract
Text-to-image models are increasingly applied to human image generation, leveraging multimodal conditioning to produce high-quality human images. Despite their ability to generate detailed images, these models often struggle to maintain perceptual consistency across multiple viewpoints. To address this limitation, we propose Multi-View Human Diffusion (MVHDiff), a novel framework that integrates 3D human model priors and text prompts to generate high-quality, multi-view-consistent human images. MVHDiff separately acquires textual descriptions of human appearance and pose, together with spatial information about the subject's orientation relative to the camera. A perceptual fusion module then aligns these text features with the visual features extracted from the human image, enabling joint learning of prior information and image features. Furthermore, MVHDiff fine-tunes both the appearance descriptions and the viewpoint-related textual inputs, enabling precise text-based control over human attributes while preserving semantic consistency across spatial viewpoints. Experimental results demonstrate that MVHDiff significantly outperforms existing methods in generating text-guided human attributes with consistent multi-view representations, offering a robust solution for high-quality, text-driven human image generation. © 2026 Elsevier B.V.
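The abstract does not detail the internals of the perceptual fusion module, but the text-visual alignment it describes is commonly realized with cross-attention between prompt tokens and image tokens. The sketch below illustrates that general pattern only; it is a minimal, hypothetical PyTorch example, not the authors' implementation, and every name and dimension in it (`PerceptualFusion`, `visual_dim`, `text_dim`) is an assumption.

```python
import torch
import torch.nn as nn

class PerceptualFusion(nn.Module):
    """Hypothetical sketch of a text-visual fusion block.

    Aligns text features (e.g., appearance/pose descriptions and
    viewpoint-related tokens) with visual features via cross-attention.
    All dimensions and names are illustrative assumptions; the paper's
    actual module may differ.
    """

    def __init__(self, visual_dim: int = 1024, text_dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Project text features into the visual feature space.
        self.text_proj = nn.Linear(text_dim, visual_dim)
        self.attn = nn.MultiheadAttention(visual_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(visual_dim)

    def forward(self, visual_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, N_patches, visual_dim) image tokens
        # text_feats:   (B, N_tokens, text_dim) encoded prompt tokens
        text = self.text_proj(text_feats)
        # Visual tokens act as queries; text tokens supply keys/values,
        # so each image patch gathers the text evidence relevant to it.
        fused, _ = self.attn(query=visual_feats, key=text, value=text)
        # Residual connection preserves the original visual signal.
        return self.norm(visual_feats + fused)


# Usage sketch with dummy tensors.
fusion = PerceptualFusion()
visual = torch.randn(2, 256, 1024)  # e.g., 16x16 patch grid
text = torch.randn(2, 77, 768)      # e.g., CLIP-length prompt encoding
out = fusion(visual, text)          # (2, 256, 1024)
```

Using the image tokens as queries keeps the output spatially aligned with the visual features, which is one plausible way a fusion module could inject appearance and viewpoint cues into the diffusion backbone without disrupting image structure.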
| Field | Value |
|---|---|
| Original language | English |
| Article number | 133057 |
| Journal | Neurocomputing |
| Volume | 676 |
| Online published | 13 Feb 2026 |
| DOIs | |
| Publication status | Online published - 13 Feb 2026 |
Funding
This work was supported in part by the GuangDong Basic and Applied Basic Research Foundation (Project No. 2024A1515011437), in part by the Guangdong Provincial Science and Technology Program (Project No. 2025A050508003), in part by the National Key Research and Development Program of China (Project No. 2024YFE0105400), in part by the National Natural Science Foundation of China (Project Nos. 62372186 and 62472179), and in part by the Science and Technology Planning Project of Guangdong Province (Project No. 2025A0505020016).
Research Keywords
- 3D prior understanding
- Diffusion models
- Image generation