Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning

Dake Bu, Wei Huang*, Andi Han, Atsushi Nitanda, Taiji Suzuki, Qingfu Zhang, Hau-San Wong*

*Corresponding author for this work

Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review

Abstract

Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergent capabilities. Existing empirical studies have revealed a strong connection between these LLMs' impressive emergent abilities and their in-context learning (ICL) capacity, which allows them to solve new tasks using only task-specific prompts without further fine-tuning. On the other hand, existing empirical and theoretical studies also show that a linear regularity underlies the multi-concept encoded semantic representations of transformer-based LLMs. However, existing theoretical work fails to build an understanding of the connection between this regularity and the innovative power of ICL. Additionally, prior work often focuses on simplified, unrealistic scenarios involving linear transformers or impractical loss functions, and achieves only linear or sub-linear convergence rates. In contrast, this work provides a fine-grained mathematical analysis of how transformers leverage the multi-concept semantics of words to enable powerful ICL and excellent out-of-distribution ICL, offering insights into how transformers innovate solutions for certain unseen tasks encoded with multiple cross-concept semantics. Inspired by empirical studies on the linear latent geometry of LLMs, the analysis is based on a concept-based, low-noise sparse-coding prompt model. Leveraging advanced techniques, this work establishes exponential convergence of the 0-1 loss over the highly non-convex training dynamics, which, for the first time, jointly incorporates the challenges of softmax self-attention, ReLU-activated MLPs, and cross-entropy loss. Empirical simulations corroborate the theoretical findings. © 2024 Neural Information Processing Systems Foundation. All rights reserved.
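The abstract names the concrete ingredients of the analysis: a concept-based, low-noise sparse-coding prompt model processed by a transformer with softmax self-attention and a ReLU-activated MLP, trained with cross-entropy loss. The following is a minimal sketch of such a setup, not the authors' construction: the dimensions, the concept dictionary `M`, the label rule, and all parameter shapes are illustrative assumptions chosen only to make the described components concrete.

```python
# Minimal sketch (illustrative, not the paper's code): a one-layer transformer with
# softmax self-attention and a ReLU MLP, applied to a prompt whose tokens come from
# a hypothetical concept-based sparse-coding model with low noise.
import numpy as np

rng = np.random.default_rng(0)

d, K, L = 32, 8, 10                            # embedding dim, #concepts, prompt length (assumed)
M = rng.standard_normal((K, d)) / np.sqrt(d)   # concept dictionary (rows = concept vectors)

def sample_prompt():
    """Each token is a sparse combination of concepts plus low noise (toy label rule)."""
    codes = (rng.random((L, K)) < 0.25).astype(float)   # sparse concept activations
    noise = 0.01 * rng.standard_normal((L, d))          # low-noise regime
    X = codes @ M + noise                               # token embeddings
    y = np.sign(codes[:, 0] - 0.5)                      # label depends on concept 0 (illustrative)
    return X, y

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Randomly initialised parameters of the one-layer model (attention + ReLU MLP).
W_q = rng.standard_normal((d, d)) / np.sqrt(d)
W_k = rng.standard_normal((d, d)) / np.sqrt(d)
W_v = rng.standard_normal((d, d)) / np.sqrt(d)
W_1 = rng.standard_normal((d, 4 * d)) / np.sqrt(d)   # MLP hidden layer
w_2 = rng.standard_normal(4 * d) / np.sqrt(4 * d)    # scalar readout

def forward(X):
    """Softmax self-attention followed by a ReLU MLP; returns one logit per token."""
    A = softmax((X @ W_q) @ (X @ W_k).T / np.sqrt(d))  # attention weights
    H = A @ (X @ W_v)                                  # attended values
    return np.maximum(H @ W_1, 0.0) @ w_2              # ReLU MLP readout

X, y = sample_prompt()
logits = forward(X)
# Cross-entropy (logistic) loss on the in-context examples, matching the loss named above.
loss = np.mean(np.log1p(np.exp(-y * logits)))
print(f"prompt logits: {np.round(logits, 3)}\ncross-entropy loss: {loss:.3f}")
```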
Original language: English
Title of host publication: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
Editors: A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, C. Zhang
Publisher: Neural Information Processing Systems (NeurIPS)
Pages: 63342-63405
ISBN (Electronic): 9798331314385
Publication status: Published - Dec 2024
Event: 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024) - Vancouver Convention Center, Vancouver, Canada
Duration: 10 Dec 2024 - 15 Dec 2024
https://neurips.cc/
https://proceedings.neurips.cc/

Publication series

Name: Advances in Neural Information Processing Systems
Volume: 37
ISSN (Print): 1049-5258

Conference

Conference: 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024)
Abbreviated title: NeurIPS 2024
Country/Territory: Canada
City: Vancouver
Period: 10/12/24 - 15/12/24

Funding

We thank the anonymous reviewers for their instrumental comments. D.B. and H.W. are supported in part by the Research Grants Council of the Hong Kong Special Administrative Region (Project No. CityU 11206622). W.H. is supported in part by JSPS KAKENHI (24K20848). A.N. is supported in part by the National Research Foundation, Singapore and Infocomm Media Development Authority under its Trust Tech Funding Initiative, the Centre for Frontier Artificial Intelligence Research, Institute of High Performance Computing, A*STAR, and the College of Computing and Data Science at Nanyang Technological University. T.S. is supported in part by JSPS KAKENHI (24K02905) and JST CREST (JPMJCR2115, JPMJCR2015).
