Designing Effective Attention Mechanisms for Transformers
為Transformer設計有效的注意力機制
Student thesis: Doctoral Thesis
Detail(s)
Award date | 15 Jul 2024 |
---|---|
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(d6d32823-4d8a-4e36-a986-3ce5c6319223).html |
Abstract
Transformers, a family of deep neural networks, have gradually become the dominant architectures for both natural language processing and image understanding in recent years. As the core building block, attention is a powerful tool for capturing long-range dependencies and plays a vital role in the tremendous success of transformers. However, such power comes at a cost: vanilla attention is expensive to compute, especially for long sequences or high-resolution images. This thesis presents three attention mechanism designs that are both effective and efficient: two for vision backbones (BRA and GLMix) and one for language models (RelayAttention).
Bi-Level Routing Attention (BRA) exploits the locality and hierarchy of images to filter out irrelevant key-value pairs at the coarse region level. Unlike previous sparse attention mechanisms that rely on predefined, handcrafted sparse patterns such as local windows, axial stripes, and dilated windows, BRA allows a more flexible allocation of computation by creating dynamic, content-aware sparse patterns. As a result, we are able to build a family of powerful vision backbones, namely BiFormer, which achieves favorable performance over state-of-the-art vision backbones across a series of computer vision tasks.
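To make the two-level procedure concrete, the following is a simplified, single-head PyTorch sketch of the routing idea. It is not the BiFormer implementation: projections, multi-head handling, and the local-context enhancement term are omitted, and the region count and top-k value are illustrative choices.

```python
# Simplified sketch of bi-level routing: route at the region level first
# (keep only the top-k most relevant regions per query region), then run
# token-level attention on the gathered key-value pairs.
import torch
import torch.nn.functional as F

def bi_level_routing_attention(x, num_regions=7, topk=4):
    # x: (B, H, W, C) feature map; H and W are assumed divisible by num_regions.
    B, H, W, C = x.shape
    S = num_regions
    hr, wr = H // S, W // S                          # region size
    # Group tokens by region: (B, S*S, hr*wr, C)
    xr = x.view(B, S, hr, S, wr, C).permute(0, 1, 3, 2, 4, 5).reshape(B, S * S, hr * wr, C)
    q, k, v = xr, xr, xr                             # identity projections for brevity

    # Coarse level: region-to-region routing via pooled queries/keys.
    qr, kr = q.mean(dim=2), k.mean(dim=2)            # (B, S*S, C)
    affinity = qr @ kr.transpose(-1, -2)             # (B, S*S, S*S)
    idx = affinity.topk(topk, dim=-1).indices        # (B, S*S, topk) routed regions

    # Gather key-value pairs from the routed regions only.
    idx_exp = idx[..., None, None].expand(-1, -1, -1, hr * wr, C)
    kg = torch.gather(k[:, None].expand(-1, S * S, -1, -1, -1), 2, idx_exp)
    vg = torch.gather(v[:, None].expand(-1, S * S, -1, -1, -1), 2, idx_exp)
    kg = kg.reshape(B, S * S, topk * hr * wr, C)
    vg = vg.reshape(B, S * S, topk * hr * wr, C)

    # Fine level: token-to-token attention within the gathered regions.
    attn = F.softmax(q @ kg.transpose(-1, -2) / C ** 0.5, dim=-1)
    out = attn @ vg                                  # (B, S*S, hr*wr, C)
    return out.view(B, S, S, hr, wr, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
```

With the illustrative defaults above (a 7x7 region grid and top-4 routing), each query region attends to tokens from only 4 of the 49 regions, which is where the computational savings over vanilla attention come from.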
GLMix is an integration scheme that applies attention and convolution at different granularity levels. Specifically, in each layer, we use two different ways to represent an image: a fine-grained regular grid and a coarse-grained set of semantic slots. By offloading the burden of extracting fine-grained, location-preserving features to lightweight convolutions on the grid representation, we find that it is sufficient to apply attention to the small set of semantic slots for global inter-region relation modeling. At the core of GLMix is a pair of soft clustering and dispatching modules, which establish correspondences between the feature grid and the semantic slots and thus enable local-global fusion.
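The sketch below illustrates, in simplified PyTorch, how such a block could be wired up: a depthwise convolution on the grid, attention on the slots, and soft clustering/dispatching in between. The module sizes, the slot initialization, and the concatenation-based fusion are assumptions for illustration rather than the exact GLMix design.

```python
# Illustrative local-global mixing block: convolution on the fine-grained grid,
# attention on a small set of semantic slots, with soft clustering/dispatching
# establishing the correspondence between the two representations.
import torch
import torch.nn as nn

class LocalGlobalBlock(nn.Module):
    def __init__(self, dim, num_slots=64, num_heads=4):
        super().__init__()
        self.slot_init = nn.Parameter(torch.randn(num_slots, dim) * 0.02)  # learnable slot queries
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depthwise conv
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x):                          # x: (B, C, H, W)
        B, C, H, W = x.shape
        grid = self.local(x)                       # local branch: fine-grained grid features
        tokens = x.flatten(2).transpose(1, 2)      # (B, H*W, C)

        # Soft clustering: assign each grid token to the semantic slots.
        assign = (tokens @ self.slot_init.t()).softmax(dim=-1)     # (B, H*W, num_slots)
        slots = assign.transpose(1, 2) @ tokens                    # (B, num_slots, C)
        slots = slots / (assign.sum(dim=1).unsqueeze(-1) + 1e-6)   # normalize by cluster mass

        # Global branch: attention among the small set of slots only.
        slots, _ = self.attn(slots, slots, slots)

        # Dispatch slot features back to grid positions and fuse with the local branch.
        dispatched = (assign @ slots).transpose(1, 2).reshape(B, C, H, W)
        fused = torch.cat([grid, dispatched], dim=1).permute(0, 2, 3, 1)   # (B, H, W, 2C)
        return self.proj(fused).permute(0, 3, 1, 2)                        # (B, C, H, W)
```

The key cost property this sketch preserves is that attention runs only over the slots (a few dozen tokens) rather than over all H*W grid positions, while the convolution keeps per-pixel locality.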
Unlike BRA and GLMix, which are designed for vision models, RelayAttention is a post-training optimization that accelerates large language model (LLM) inference. In typical LLM serving scenarios, where a shared system prompt is prepended to each request, we observe that handling the system prompt incurs heavily redundant KV cache accesses: for batched requests, the cached key-value pairs of the system prompt (system KVs) are read from off-chip DRAM to on-chip SRAM multiple times, once per request. This redundancy severely slows down LLM inference and further increases hosting costs, which are already extremely high. RelayAttention eliminates this redundancy by reading the system KVs exactly once for a batch of input tokens. Because RelayAttention is based on a mathematical reformulation of causal attention, it maintains generation quality while significantly improving hardware utilization, both in theory and in practice.
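One way to see the underlying reformulation is with a small NumPy example: attention over the concatenated [system KVs; request KVs] equals a convex combination of two partial attentions, weighted by their softmax normalizers, so the system-prompt part can be computed once for the whole batch and merged cheaply with the per-request part. The function names and shapes below are illustrative, not the thesis implementation.

```python
# Sketch of the exact decomposition of softmax attention over a split KV cache.
import numpy as np

def partial_attention(q, k, v):
    """Softmax attention of queries q over (k, v); also return the softmax normalizer."""
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (num_q, num_kv)
    weights = np.exp(scores)                         # unnormalized attention weights
    normalizer = weights.sum(axis=-1, keepdims=True) # (num_q, 1)
    return (weights / normalizer) @ v, normalizer

def relay_merge(out_sys, z_sys, out_req, z_req):
    """Merge system-prompt and request-level partial attentions via their normalizers."""
    alpha = z_sys / (z_sys + z_req)                  # share of the total softmax mass
    return alpha * out_sys + (1.0 - alpha) * out_req

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=(4, d))                                          # 4 query tokens
k_sys, v_sys = rng.normal(size=(16, d)), rng.normal(size=(16, d))    # shared system KVs
k_req, v_req = rng.normal(size=(10, d)), rng.normal(size=(10, d))    # per-request KVs

# Reference: attention over the concatenated KV cache.
ref, _ = partial_attention(q, np.concatenate([k_sys, k_req]),
                           np.concatenate([v_sys, v_req]))

# Reformulated: two partial attentions plus a cheap merge. In serving, the
# system-KV pass can be batched so the system KVs are read from DRAM only once.
out_sys, z_sys = partial_attention(q, k_sys, v_sys)
out_req, z_req = partial_attention(q, k_req, v_req)
print(np.allclose(ref, relay_merge(out_sys, z_sys, out_req, z_req)))  # True
```

Because the decomposition is exact, the merged output matches the reference computation up to floating-point error, which is consistent with the claim that generation quality is preserved.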