NoteLLM-2: Multimodal Large Representation Models for Recommendation

Chao Zhang, Haoxin Zhang, Shiwei Wu, Di Wu, Tong Xu*, Xiangyu Zhao*, Yan Gao, Yao Hu, Enhong Chen*

*Corresponding author for this work

Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review

4 Citations (Scopus)

Abstract

Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks. However, their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored. While leveraging existing Multimodal Large Language Models (MLLMs) for such tasks is promising, challenges arise from their delayed release relative to the corresponding LLMs and their inefficiency in representation tasks. To address these issues, we propose an end-to-end fine-tuning method that customizes the integration of any existing LLM and vision encoder for efficient multimodal representation. Preliminary experiments revealed that fine-tuned LLMs often neglect image content. To counteract this, we propose NoteLLM-2, a novel framework that enhances visual information. Specifically, we propose two approaches: first, a prompt-based method that segregates visual and textual content, employing a multimodal In-Context Learning strategy to balance the focus across modalities; second, a late fusion technique that directly integrates visual information into the final representations. Extensive experiments, both online and offline, demonstrate the effectiveness of our approach. Code is available at https://github.com/Applied-Machine-Learning-Lab/NoteLLM. © 2025 Copyright held by the owner/author(s).
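To make the second approach concrete, below is a minimal sketch of the late-fusion idea described in the abstract, assuming a decoder-only LLM used as a text embedder and a separately trained vision encoder. All module names, dimensions, and the specific fusion choice (summing linear projections) are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LateFusionNoteEncoder(nn.Module):
        """Sketch of late fusion: inject visual features directly into the
        final note representation instead of relying only on the LLM's
        text-side hidden state. Names and dimensions are hypothetical."""

        def __init__(self, llm, vision_encoder, d_text=4096, d_vision=1024, d_out=512):
            super().__init__()
            self.llm = llm                        # any existing decoder-only LLM
            self.vision_encoder = vision_encoder  # any existing vision encoder
            self.text_proj = nn.Linear(d_text, d_out)
            self.vision_proj = nn.Linear(d_vision, d_out)

        def forward(self, input_ids, attention_mask, images):
            # Text embedding: last-token hidden state, a common choice when a
            # decoder-only LLM is used as a representation model.
            out = self.llm(input_ids=input_ids, attention_mask=attention_mask,
                           output_hidden_states=True)
            text_emb = out.hidden_states[-1][:, -1, :]   # (B, d_text)

            # Visual embedding from the jointly fine-tuned vision encoder.
            vis_emb = self.vision_encoder(images)        # (B, d_vision)

            # Late fusion: integrate visual information directly into the
            # final representation; summing projections is one simple option.
            note_emb = self.text_proj(text_emb) + self.vision_proj(vis_emb)
            return F.normalize(note_emb, dim=-1)         # unit norm for I2I retrieval

In an I2I setting, such embeddings would typically be trained with a contrastive objective over related note pairs. The prompt-based alternative mentioned in the abstract would instead keep visual and textual content in separate prompt segments and rely on multimodal in-context examples to balance the model's attention across the two modalities.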
Original language: English
Title of host publication: KDD '25
Subtitle of host publication: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Publisher: Association for Computing Machinery
Pages: 2815-2826
Volume: 1
ISBN (Print): 979-8-4007-1245-6
DOIs
Publication status: Published - Jul 2025
Event: 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2025) - Toronto, Canada
Duration: 3 Aug 2025 - 7 Aug 2025
https://kdd2025.kdd.org/
https://dl.acm.org/conference/kdd/proceedings

Conference

Conference: 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2025)
Place: Canada
City: Toronto
Period: 3/08/25 - 7/08/25
Internet address: https://kdd2025.kdd.org/

Bibliographical note

Research Unit(s) information for this publication is provided by the author(s) concerned.

Funding

This work was supported in part by grants from the National Science and Technology Major Project (No. 2023ZD0121104) and the National Natural Science Foundation of China (No. 62222213, 62072423). Xiangyu Zhao was partially supported by the Research Impact Fund (No. R1015-23), APRC - CityU New Research Initiatives (No. 9610565, Start-up Grant for New Faculty of CityU), CityU - HKIDS Early Career Research Grant (No. 9360163), Hong Kong ITC Innovation and Technology Fund Midstream Research Programme for Universities Project (No. ITS/034/22MS), Hong Kong Environmental and Conservation Fund (No. 88/2022), SIRG - CityU Strategic Interdisciplinary Research Grant (No. 7020046), Huawei (Huawei Innovation Research Program), Tencent (CCF-Tencent Open Fund, Tencent Rhino-Bird Focused Research Program), Ant Group (CCF-Ant Research Fund, Ant Group Research Fund), Alibaba (CCF-Alimama Tech Kangaroo Fund No. 2024002), CCF-BaiChuan-Ebtech Foundation Model Fund, Collaborative Research Fund No. C1043-24GF, and Kuaishou.

Research Keywords

  • Multimodal Large Language Model
  • Recommendation
  • Multimodal Representation
