
Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model

Zheng Gu, Shiyuan Yang, Jing Liao*, Jing Huo*, Yang Gao

*Corresponding author for this work

Research output: Journal Publications and Reviews › RGC 21 - Publication in refereed journal › peer-review

Abstract

Visual In-Context Learning (ICL) has emerged as a promising research area due to its capability to accomplish various tasks with limited example pairs through analogical reasoning. However, training-based visual ICL has limitations in its ability to generalize to unseen tasks and requires the collection of a diverse task dataset. On the other hand, existing methods in the inference-based visual ICL category solely rely on textual prompts, which fail to capture fine-grained contextual information from given examples and can be time-consuming when converting from images to text prompts. To address these challenges, we propose Analogist, a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques using a text-to-image diffusion model pretrained for image inpainting. For visual prompting, we propose a self-attention cloning (SAC) method to guide the fine-grained structural-level analogy between image examples. For textual prompting, we leverage GPT-4V's visual reasoning capability to efficiently generate text prompts and introduce a cross-attention masking (CAM) operation to enhance the accuracy of semantic-level analogy guided by text prompts. Our method is out-of-the-box and does not require fine-tuning or optimization. It is also generic and flexible, enabling a wide range of visual tasks to be performed in an in-context manner. Extensive experiments demonstrate the superiority of our method over existing approaches, both qualitatively and quantitatively. Our project webpage is available at https://analogist2d.github.io. © 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
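The two prompting operations described above can be illustrated with a minimal sketch. This is not the authors' implementation; it is a hedged toy example in numpy, assuming a flattened token layout in which index sets mark the four grid regions (example pair A → A', query B, and the inpainted target B'). Self-attention cloning (SAC) copies the A → A' attention pattern onto the B → B' block, and cross-attention masking (CAM) zeroes the text-token attention outside the target region:

```python
import numpy as np

def self_attention_cloning(attn, idx_A, idx_Ap, idx_B, idx_Bp):
    """Toy SAC: clone the A->A' self-attention block onto the B->B' block.

    attn: (num_tokens, num_tokens) self-attention map over flattened image tokens.
    idx_*: token indices of each grid region (hypothetical layout, for illustration).
    """
    attn = attn.copy()
    # Guide the structural analogy: B' attends to B the way A' attends to A.
    attn[np.ix_(idx_Bp, idx_B)] = attn[np.ix_(idx_Ap, idx_A)]
    return attn

def cross_attention_masking(attn_map, target_mask):
    """Toy CAM: let text tokens influence only the target (B') region.

    attn_map: (num_image_tokens, num_text_tokens) cross-attention map.
    target_mask: boolean (num_image_tokens,), True inside the B' region.
    """
    return attn_map * target_mask[:, None]
```

Both operations edit attention maps at inference time, which is consistent with the paper's claim that the method needs no fine-tuning or optimization; the region indexing here is an assumption made for the sketch.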
Original language: English
Article number: 130
Pages (from-to): 1-15
Number of pages: 15
Journal: ACM Transactions on Graphics
Volume: 43
Issue number: 4
DOIs
Publication status: Published - 19 Jul 2024

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62276128 and Grant 62192783, in part by the Collaborative Innovation Center of Novel Software Technology and Industrialization, and in part by a GRF grant from the Research Grants Council (RGC) of the Hong Kong Special Administrative Region, China [Project No. CityU 11216122].

Research Keywords

  • Visual In-Context Learning
  • Diffusion Models
  • Image Transformation

RGC Funding Information

  • RGC-funded

