Abstract
Despite the effectiveness of Segment Anything Model (SAM) based methods in Few-Shot Segmentation (FSS) tasks, our closer examination of their prompt encoding mechanism reveals that these methods rely solely on visual information to generate a single type of prompt. Consequently, they suffer from semantic granularity representation bias and a loss of spatial information. To address these limitations, this paper introduces an innovative multi-modal prompt encoder, enabling SAM to leverage both annotated reference images and textual descriptions of class names as segmentation prompts. This approach generates text prompts, dense visual prompts, and sparse visual prompts, spanning multiple modalities and granularities. These prompts provide enhanced representations of the target class, capturing both abstract semantics and specific details, while ensuring granularity appropriateness. When our multi-modal prompt encoder is integrated with SAM's image encoder and mask decoder, the overall model is referred to as MM-Prompt. To validate its effectiveness, we conducted extensive empirical studies on the PASCAL-5i and COCO-20i datasets. The experimental results demonstrate that MM-Prompt achieves state-of-the-art performance in FSS tasks, highlighting its substantial potential and value in this domain. © 2025 ACM.
| Original language | English |
|---|---|
| Title of host publication | MM '25 - Proceedings of the 33rd ACM International Conference on Multimedia |
| Publisher | Association for Computing Machinery |
| Pages | 3067-3075 |
| ISBN (Print) | 9798400720352 |
| Publication status | Published - Oct 2025 |
| Event | 33rd ACM International Conference on Multimedia (MM '25) - Royal Dublin Convention Centre, Dublin, Ireland. Duration: 27 Oct 2025 → 31 Oct 2025. https://acmmm2025.org/ |
Publication series
| Name | MM - Proceedings of the ACM International Conference on Multimedia |
|---|---|
Conference
| Conference | 33rd ACM International Conference on Multimedia (MM '25) |
|---|---|
| Abbreviated title | ACM Multimedia 2025 |
| Place | Ireland |
| City | Dublin |
| Period | 27/10/25 → 31/10/25 |
Bibliographical note
Full text of this publication does not contain sufficient affiliation information. The Research Unit(s) information for this record is based on the then academic department affiliation of the author(s).
Funding
This work was supported in part by the National Natural Science Foundation of China under Grant 62471278, Grant 62302141, and Grant 62331003; in part by the Taishan Scholar Project of Shandong Province under Grant tsqn202306079; in part by the Research Grants Council of the Hong Kong Special Administrative Region, China, under Grant STG5/E-103/24-R; and in part by the Fundamental Research Funds for the Central Universities under Grant JZ2024HGTB0255.
Research Keywords
- few-shot learning
- multi-modal
- segment anything
- segmentation
RGC Funding Information
- RGC-funded
Fingerprint
Dive into the research topics of 'MM-Prompt: Multi-modality and Multi-granularity Prompts for Few-Shot Segmentation'. Together they form a unique fingerprint.