RelayAttention for Efficient Large Language Model Serving with Long System Prompts

Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson Lau*

*Corresponding author for this work

Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review


Abstract

A practical large language model (LLM) service may involve a long system prompt, which specifies the instructions, examples, and knowledge documents of the task and is reused across requests. However, a long system prompt causes throughput/latency bottlenecks, as the cost of generating the next token grows with the sequence length. This paper aims to improve the efficiency of LLM services that involve long system prompts. Our key observation is that handling these system prompts requires heavily redundant memory accesses in existing causal attention algorithms. Specifically, for batched requests, the cached hidden states (i.e., key-value pairs) of the system prompt are transferred from off-chip DRAM to on-chip SRAM multiple times, once for each individual request. To eliminate this redundancy, we propose RelayAttention, an attention algorithm that reads these hidden states from DRAM exactly once for a batch of input tokens. RelayAttention is a free lunch: it maintains generation quality and requires no model retraining, as it is based on a mathematical reformulation of causal attention. We have observed significant performance improvements to a production-level system, vLLM, through integration with RelayAttention. The improvements are even more profound with longer system prompts. © 2024 Association for Computational Linguistics
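The reformulation the abstract refers to can be illustrated with the standard split-softmax identity: attention over the concatenation of the system-prompt KV cache and a request's own KV cache equals a convex combination of the two partial attention outputs, weighted by their log-sum-exp terms. Below is a minimal NumPy sketch of that identity (function names are illustrative, not the paper's API; the paper's speedup comes from a fused GPU kernel that reads the shared system-prompt KV from DRAM once per batch, which this sketch does not model):

```python
import numpy as np

def attn_with_lse(q, K, V):
    """Scaled dot-product attention over one KV segment.

    Returns the attention output and the log-sum-exp of the
    (scaled) scores, which is needed to merge partial results.
    """
    scores = q @ K.T / np.sqrt(q.shape[-1])       # (n_q, n_kv)
    m = scores.max(axis=-1, keepdims=True)        # max for numerical stability
    p = np.exp(scores - m)
    s = p.sum(axis=-1, keepdims=True)
    out = (p / s) @ V                             # normalized attention output
    lse = (m + np.log(s)).squeeze(-1)             # log-sum-exp per query
    return out, lse

def relay_combine(out_sys, lse_sys, out_req, lse_req):
    """Merge system-prompt and request-level partial attentions.

    The weight on the system-prompt part is
    exp(lse_sys) / (exp(lse_sys) + exp(lse_req)),
    computed in a numerically stable sigmoid form.
    """
    w = 1.0 / (1.0 + np.exp(lse_req - lse_sys))   # weight for the system part
    w = w[..., None]
    return w * out_sys + (1.0 - w) * out_req
```

Because the merge is exact (not an approximation), the combined output matches full attention over the concatenated KV cache, which is why the method preserves generation quality without retraining.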
Original language: English
Title of host publication: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Publisher: Association for Computational Linguistics
Pages: 4945-4957
Volume: 1 (Long Papers)
ISBN (Print): 9798891760943
DOIs
Publication status: Published - Aug 2024
Event: 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024) - Centara Grand and Bangkok Convention Centre, Bangkok, Thailand
Duration: 11 Aug 2024 - 16 Aug 2024
https://aclanthology.org/2024.acl-long
https://2024.aclweb.org/

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print): 0736-587X

Conference

Conference: 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)
Abbreviated title: ACL 2024
Place: Thailand
City: Bangkok
Period: 11/08/24 - 16/08/24
Internet address: https://2024.aclweb.org/

Bibliographical note

Research Unit(s) information for this publication is provided by the author(s) concerned.

Publisher's Copyright Statement

  • This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/
