The CUR Decomposition of Self-Attention Matrices in Vision Transformers

Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review

Abstract

Transformers have achieved great success in natural language processing and computer vision, and the self-attention mechanism is their core technique. Vanilla self-attention has quadratic complexity in the sequence length, which limits its application to vision tasks. Most existing linear self-attention mechanisms reduce complexity at some cost in performance. In this paper, we propose CURSA, a novel linear approximation of vanilla self-attention that achieves high performance and low complexity at the same time. CURSA uses the CUR decomposition to replace the multiplication of large matrices with the multiplication of several small matrices, achieving almost linear complexity. Experimental results on image classification, semantic segmentation, object detection, and the Long Range Arena benchmark show that CURSA outperforms state-of-the-art self-attention mechanisms, with better data efficiency, faster speed, and higher accuracy.
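As a rough illustration of the general idea behind a CUR decomposition (not the CURSA method itself, whose details are not given on this page), the sketch below approximates a matrix A by C U R, where C holds a few columns of A, R holds a few rows, and U is the pseudoinverse of their intersection block. The function names `cur_decompose` and `cur_matmul`, the uniform random index sampling, and the exactly low-rank stand-in for an attention matrix are all assumptions made for this example.

```python
import numpy as np


def cur_decompose(A, rows, cols, rcond=1e-10):
    """Return (C, U, R) such that A is approximated by C @ U @ R.

    Hypothetical helper for illustration; index selection is left to the caller.
    """
    C = A[:, cols]                      # n x c: selected columns of A
    R = A[rows, :]                      # c x n: selected rows of A
    W = A[np.ix_(rows, cols)]           # c x c: intersection of those rows and columns
    U = np.linalg.pinv(W, rcond=rcond)  # pseudoinverse links C and R
    return C, U, R


def cur_matmul(C, U, R, V):
    """Compute (C @ U @ R) @ V right to left, so no n x n matrix is formed."""
    return C @ (U @ (R @ V))


rng = np.random.default_rng(0)
n, d, r, c = 512, 64, 8, 16  # tokens, feature dim, true rank, sampled rows/cols

# Stand-in for an attention matrix: exactly rank r by construction.
# Real attention matrices are at best approximately low rank.
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
V = rng.standard_normal((n, d))

# Assumption: uniform sampling without replacement; other schemes are possible.
rows = rng.choice(n, size=c, replace=False)
cols = rng.choice(n, size=c, replace=False)

C, U, R = cur_decompose(A, rows, cols)
approx = cur_matmul(C, U, R, V)
exact = A @ V

rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(f"relative error: {rel_err:.2e}")
```

With c fixed, the three products in `cur_matmul` cost on the order of n·c·d operations, which is linear in n, whereas `A @ V` costs on the order of n²·d. The demo forms the full A only to check the error; a practical variant would compute just the selected rows and columns.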

© 2025 IEEE
Original language: English
Pages (from-to): 4792-4809
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 48
Issue number: 4
Online published: 19 Dec 2025
Publication status: Published - Apr 2026

Funding

This work is supported by the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA), the Institute of Digital Medicine, City University of Hong Kong (Projects 9229503 and 9610460), the National Natural Science Foundation of China (No. 12561095), and the Special Posts of Guizhou University (No. [2025]06).

Research Keywords

  • Attention mechanism
  • CUR decomposition
  • linear approximation
  • vision transformer
