
MM-Prompt: Multi-modality and Multi-granularity Prompts for Few-Shot Segmentation

  • Hang Xiong
  • Runmin Cong*
  • Jinpeng Chen
  • Chen Zhang
  • Feng Li
  • Huihui Bai
  • Sam Kwong
  • *Corresponding author for this work

Research output: Chapters, Conference Papers, Creative and Literary Works; RGC 32 - Refereed conference paper (with host publication); peer-review

Abstract

Despite the effectiveness of Segment Anything Model (SAM) based methods in Few-Shot Segmentation (FSS) tasks, our closer examination of their prompt encoding mechanism reveals that these methods rely solely on visual information to generate a single type of prompt. Consequently, they suffer from semantic granularity representation bias and a loss of spatial information. To address these limitations, this paper introduces an innovative multi-modal prompt encoder, enabling SAM to leverage both annotated reference images and textual descriptions of class names as segmentation prompts. This approach generates text prompts, dense visual prompts, and sparse visual prompts, spanning multiple modalities and granularities. These prompts provide enhanced representations of the target class, capturing both abstract semantics and specific details, while ensuring granularity appropriateness. When our multi-modal prompt encoder is integrated with SAM's image encoder and mask decoder, the overall model is referred to as MM-Prompt. To validate its effectiveness, we conducted extensive empirical studies on the PASCAL-5i and COCO-20i datasets. The experimental results demonstrate that MM-Prompt achieves state-of-the-art performance in FSS tasks, highlighting its substantial potential and value in this domain. © 2025 ACM.
Original language: English
Title of host publication: MM '25 - Proceedings of the 33rd ACM International Conference on Multimedia
Publisher: Association for Computing Machinery
Pages: 3067-3075
ISBN (Print): 9798400720352
Publication status: Published - Oct 2025
Event: 33rd ACM International Conference on Multimedia (MM '25) - Royal Dublin Convention Centre, Dublin, Ireland
Duration: 27 Oct 2025 to 31 Oct 2025
https://acmmm2025.org/

Publication series

Name: MM - Proceedings of the ACM International Conference on Multimedia

Conference

Conference: 33rd ACM International Conference on Multimedia (MM '25)
Abbreviated title: ACM Multimedia 2025
Place: Ireland
City: Dublin
Period: 27/10/25 to 31/10/25

Bibliographical note

Full text of this publication does not contain sufficient affiliation information. The Research Unit(s) information for this record is based on the then academic department affiliation of the author(s).

Funding

This work was supported in part by the National Natural Science Foundation of China under Grants 62471278, 62302141, and 62331003; in part by the Taishan Scholar Project of Shandong Province under Grant tsqn202306079; in part by the Research Grants Council of the Hong Kong Special Administrative Region, China, under Grant STG5/E-103/24-R; and in part by the Fundamental Research Funds for the Central Universities under Grant JZ2024HGTB0255.

Research Keywords

  • few-shot learning
  • multi-modal
  • segment anything
  • segmentation

RGC Funding Information

  • RGC-funded
