TY - GEN
T1 - Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning
AU - Bu, Dake
AU - Huang, Wei
AU - Han, Andi
AU - Nitanda, Atsushi
AU - Suzuki, Taiji
AU - Zhang, Qingfu
AU - Wong, Hau-San
PY - 2024/12
Y1 - 2024/12
AB - Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergent capabilities. Existing empirical studies have revealed a strong connection between these LLMs' impressive emergent abilities and their in-context learning (ICL) capacity, which allows them to solve new tasks using only task-specific prompts without further fine-tuning. On the other hand, existing empirical and theoretical studies also show that there is a linear regularity in the multi-concept encoded semantic representations underlying transformer-based LLMs. However, existing theoretical work fails to establish an understanding of the connection between this regularity and the innovative power of ICL. Additionally, prior work often focuses on simplified, unrealistic scenarios involving linear transformers or unrealistic loss functions, and achieves only linear or sub-linear convergence rates. In contrast, this work provides a fine-grained mathematical analysis to show how transformers leverage the multi-concept semantics of words to enable powerful ICL and excellent out-of-distribution ICL abilities, offering insights into how transformers innovate solutions for certain unseen tasks encoded with multiple cross-concept semantics. Inspired by empirical studies of the linear latent geometry of LLMs, the analysis is based on a concept-based, low-noise, sparse-coding prompt model. Leveraging advanced techniques, this work demonstrates exponential 0-1 loss convergence over the highly non-convex training dynamics, which, for the first time, incorporates the challenges of softmax self-attention, ReLU-activated MLPs, and cross-entropy loss. Empirical simulations corroborate the theoretical findings. © 2024 Neural Information Processing Systems Foundation. All rights reserved.
UR - http://www.scopus.com/inward/record.url?scp=105000508273&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-105000508273&origin=recordpage
M3 - RGC 32 - Refereed conference paper (with host publication)
T3 - Advances in Neural Information Processing Systems
SP - 63342
EP - 63405
BT - 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
A2 - Globerson, A.
A2 - Mackey, L.
A2 - Belgrave, D.
A2 - Fan, A.
A2 - Paquet, U.
A2 - Tomczak, J.
A2 - Zhang, C.
PB - Neural Information Processing Systems (NeurIPS)
T2 - 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024)
Y2 - 10 December 2024 through 15 December 2024
ER -