Towards Robust and Efficient Large Language Models with Cross-Layer Optimizations

Student thesis: Doctoral Thesis

Abstract

Large Language Models (LLMs) face critical inference challenges in balancing effectiveness and efficiency. The exponential growth in model size, context length, and key-value (KV) cache size intensifies computational demands, which hinders real-world deployment. This thesis addresses these limitations through three novel cross-layer optimizations to achieve robust and efficient LLM inference.

First, we propose EvoP, a novel evolutionary pruning framework for removing redundancy in LLMs. Unlike heuristic pruning methods, EvoP first adopts cluster-based calibration dataset sampling (CCDS) to construct diverse calibration datasets. Then, EvoP employs an evolutionary pruning pattern search (EPPS) technique to search for globally optimal pruning patterns. Experimental results demonstrate the efficacy of EvoP and its ability to generalize to both in-domain and out-of-domain data.
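
As a rough illustration of what an evolutionary search over pruning patterns can look like, the sketch below evolves a set of layer indices to prune. It is a generic elitist evolutionary loop, not the EPPS algorithm from the thesis, whose details the abstract does not give. The function name `evolve_pruning_pattern`, the swap mutation, and the toy `loss_fn` (which stands in for a loss measured on a calibration dataset) are all assumptions made for illustration.

```python
import random


def evolve_pruning_pattern(n_layers, n_prune, loss_fn,
                           generations=200, pop_size=16, seed=0):
    """Toy evolutionary search for a set of layers to prune.

    A pattern is a frozenset of pruned layer indices. loss_fn(pattern)
    is a hypothetical stand-in for the calibration loss of the pruned model.
    Assumes 0 < n_prune < n_layers so that a swap is always possible.
    """
    rng = random.Random(seed)

    def random_pattern():
        return frozenset(rng.sample(range(n_layers), n_prune))

    def mutate(pattern):
        # Swap one pruned layer for one layer that is currently kept.
        pruned = sorted(pattern)
        kept = [i for i in range(n_layers) if i not in pattern]
        child = set(pattern)
        child.remove(rng.choice(pruned))
        child.add(rng.choice(kept))
        return frozenset(child)

    population = [random_pattern() for _ in range(pop_size)]
    for _ in range(generations):
        children = [mutate(p) for p in population]
        # Elitist selection: keep the lowest-loss distinct patterns.
        candidates = sorted(set(population + children),
                            key=lambda p: (loss_fn(p), sorted(p)))
        population = candidates[:pop_size]
    return population[0]


if __name__ == "__main__":
    # Hypothetical per-layer importance; pruning a layer costs its importance.
    importance = [5, 1, 4, 1, 3, 1, 2, 1]
    best = evolve_pruning_pattern(
        8, 3, lambda pruned: sum(importance[i] for i in pruned))
    print(sorted(best))
```

Because selection always keeps the best pattern found so far, the returned pattern can never be worse than the best member of the initial population. A real system would replace the toy loss with an evaluation of the pruned model on calibration data.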

To address long-context inefficiencies, we then develop ReFusion, a bi-level knowledge fusion framework for efficient long-context inference. ReFusion fuses knowledge representations directly into the hidden states of LLMs. In addition, ReFusion introduces two novel fusion schemes to rank different knowledge and an adaptive fusion integration framework to search for the optimal fusion combination. Experimental results demonstrate that ReFusion outperforms the baselines and offers a better trade-off between performance and efficiency.

Finally, although offloading the KV cache to main memory can alleviate KV cache explosion, performing sparse attention with approximate nearest neighbor (ANN) search remains the main time cost during decoding. Recent work on learned indexes has shown the potential to accelerate the ANN search process. Thus, we introduce NFL to accelerate access to the KV cache in main memory. Specifically, NFL optimizes the one-dimensional learned index. NFL first uses Numerical Normalizing Flows to transform the original keys into a nearly uniform distribution. Then, NFL uses an After-Flow Learned Index to offer robust and efficient indexing operations. Experimental results show that NFL can accelerate the in-memory indexing process.
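
To make the "transform, then index" idea concrete, the sketch below builds a toy one-dimensional learned index. It is not NFL: the learned normalizing flow is replaced by a fixed `math.log1p` transform, and the After-Flow Learned Index by a single linear model with an error-bounded binary search. The class name `TransformedLinearIndex` and its parameters are hypothetical.

```python
import bisect
import math


class TransformedLinearIndex:
    """Toy learned index: monotone transform + one linear position model.

    Assumes a non-empty collection of keys that the transform accepts
    (non-negative numbers for the default math.log1p).
    """

    def __init__(self, keys, transform=math.log1p):
        self.keys = sorted(set(keys))
        self.transform = transform
        t = [transform(k) for k in self.keys]
        n = len(self.keys)
        self.t_min = t[0]
        span = t[-1] - t[0]
        # Linear model: position is approximately slope * (t - t_min).
        self.slope = (n - 1) / span if span > 0 else 0.0
        # The worst prediction error bounds the local search window.
        self.max_err = max(abs(self._predict(x) - i) for i, x in enumerate(t))

    def _predict(self, t):
        return int(round(self.slope * (t - self.t_min)))

    def lookup(self, key):
        """Return the position of key in the sorted key array, or -1."""
        n = len(self.keys)
        pos = self._predict(self.transform(key))
        lo = min(n, max(0, pos - self.max_err))
        hi = max(lo, min(n, pos + self.max_err + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < n and self.keys[i] == key else -1


if __name__ == "__main__":
    skewed = [2 ** i for i in range(20)]  # heavily skewed key distribution
    with_transform = TransformedLinearIndex(skewed)
    without_transform = TransformedLinearIndex(skewed, transform=float)
    print(with_transform.max_err, without_transform.max_err)
```

On skewed keys such as powers of two, the transformed keys are much closer to uniform, so the linear model's worst-case error, and with it the search window, is smaller than when the same model is fit to the raw keys. That is the intuition behind normalizing the key distribution before indexing.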

Overall, we establish a fundamental understanding of the efficiency bottlenecks of LLM inference in terms of model size, long-context inputs, and the KV cache. With the methods proposed in this thesis, we holistically optimize the robustness and efficiency of LLM inference.

Date of Award: 11 Sept 2025
Original language: English
Awarding Institution
  • City University of Hong Kong
Supervisor: Nan GUAN (Supervisor), Tei-Wei KUO (External Co-Supervisor) & Jason XUE (External Co-Supervisor)
