Multi-Text Guidance Is Important: Multi-Modality Image Fusion via Large Generative Vision-Language Model

Zeyu Wang, Libo Zhao*, Jizheng Zhang, Rui Song, Haiyu Song, Jiana Meng, Shidong Wang

*Corresponding author for this work

Research output: Journal Publications and Reviews › RGC 21 - Publication in refereed journal › peer-review

5 Citations (Scopus)

Abstract

Multi-modality image fusion aims to extract complementary features from multiple source images of different modalities, generating a fused image that inherits their advantages. To address challenges in cross-modality shared feature (CMSF) extraction, single-modality specific feature (SMSF) fusion, and the absence of ground truth (GT) images, we propose MTG-Fusion, a multi-text guided model. We leverage the capabilities of large vision-language models to generate text descriptions tailored to the input images, providing novel insights for these challenges. Our model introduces a text-guided CMSF extractor (TGCE) and a text-guided SMSF fusion module (TGSF). TGCE transforms visual features into the text domain using manifold-isometric domain transform techniques and provides effective visual-text interaction based on text-vision and text-text distances. TGSF fuses each dimension of visual features with corresponding text features, creating a weight matrix utilized for SMSF fusion. We also incorporate the constructed textual GT into the loss function for collaborative training. Extensive experiments demonstrate that MTG-Fusion achieves state-of-the-art performance on infrared and visible image fusion and medical image fusion tasks. The code is available at: https://github.com/zhaolb4080/MTG-Fusion. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
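The abstract states that TGSF fuses each dimension of the visual features with corresponding text features to produce a weight matrix for SMSF fusion. The paper's actual architecture is not reproduced here; the following is only a minimal NumPy sketch of the general idea of per-channel, text-derived fusion weights. The function name `text_guided_fusion`, the alignment scores, and the softmax weighting are illustrative assumptions, not the published method.

```python
import numpy as np

def text_guided_fusion(feat_a, feat_b, text_a, text_b):
    """Hypothetical sketch of per-channel text-guided fusion.

    feat_a, feat_b: visual feature maps of shape (C, H, W), one per modality.
    text_a, text_b: per-channel text features of shape (C,).
    """
    h, w = feat_a.shape[1], feat_a.shape[2]
    # Score each channel by its alignment with the corresponding text feature
    # (spatially averaged inner product; an assumed stand-in for the paper's
    # text-vision interaction).
    s_a = np.einsum('chw,c->c', feat_a, text_a) / (h * w)
    s_b = np.einsum('chw,c->c', feat_b, text_b) / (h * w)
    # A two-way softmax over the modalities yields a per-channel weight matrix.
    w_a = np.exp(s_a) / (np.exp(s_a) + np.exp(s_b))
    # Weighted combination of the two modality-specific feature maps.
    return w_a[:, None, None] * feat_a + (1.0 - w_a)[:, None, None] * feat_b
```

Under this sketch, channels whose visual content aligns more strongly with their text description dominate the fused output, which is one plausible reading of "creating a weight matrix utilized for SMSF fusion."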
Original language: English
Pages (from-to): 4646-4668
Journal: International Journal of Computer Vision
Volume: 133
Issue number: 7
Online published: 17 Mar 2025
DOIs
Publication status: Published - Jul 2025
Externally published: Yes

Research Keywords

  • Manifold-domain transform
  • Multi-modality image fusion
  • Multi-text guidance
  • Vision-language generative model
  • Vision-text interaction
