Resource-efficient Execution of Deep Neural Networks on Mobile Edge Devices
A Study on Resource-efficient Execution of Deep Neural Networks for Mobile Edge Devices
Student thesis: Doctoral Thesis
Author(s)
Related Research Unit(s)
Detail(s)
Awarding Institution | |
---|---|
Supervisors/Advisors | |
Award date | 5 Aug 2024 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(b4265d97-a600-40d4-8a30-32133bc74d67).html |
---|---|
Abstract
The proliferation of deep neural network (DNN) execution on mobile edge devices heralds a transformative era in mobile computing, enabling a wide range of compelling applications. For instance, deploying DNNs for inference on the mobile edge devices of pedestrians and vehicles makes vehicle-to-everything (V2X) systems more efficient. Additionally, conducting training on these devices realizes the vision of federated learning, improving the performance of DNNs in V2X systems while safeguarding data privacy. Moreover, personalizing large language models (LLMs) for deployment on mobile edge devices can significantly improve user experience. However, the computational capabilities and memory resources of mobile edge devices are severely limited, making it challenging to implement these applications in real-world scenarios. This dissertation focuses on overcoming these limitations through resource-efficient neural network execution on mobile edge devices, covering efficient DNN inference, efficient training, and LLM deployment, thereby unlocking the full potential of edge AI technologies.
To realize the above prospects, this dissertation first proposes SwapNet, a novel middleware designed for resource-efficient inference on edge devices. Faced with the inherent memory constraints of edge devices, SwapNet partitions deep neural networks (DNNs) into manageable blocks and swaps them in and out of memory as needed, thereby enabling the inference of complex DNNs that exceed the device's memory capacity. By eliminating unnecessary memory copies during block swapping, SwapNet further reduces memory consumption. This approach preserves model accuracy, a critical advancement for memory-constrained edge AI applications.
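To make the swapping idea concrete, the sketch below shows block-wise swapped inference in PyTorch: blocks are parked on the host and moved to the accelerator one at a time. The class name `BlockSwappedModel` and the whole-block `.to()` granularity are illustrative assumptions; SwapNet's actual partitioning and copy-free swapping are more involved than this sketch.

```python
# A minimal, hypothetical sketch of block-wise swapped inference in PyTorch.
# It is not the SwapNet implementation; SwapNet's block partitioning and
# elimination of redundant memory copies go beyond whole-block .to() moves.
import torch
import torch.nn as nn

class BlockSwappedModel:
    """Run a model whose blocks collectively exceed device memory by keeping
    blocks on the host (CPU) and moving one block at a time to the
    accelerator for computation."""

    def __init__(self, blocks: list[nn.Module], device: str = "cpu"):
        self.blocks = [b.to("cpu") for b in blocks]  # parked off-device
        self.device = device

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.to(self.device)
        for block in self.blocks:
            block.to(self.device)        # swap the block in
            x = block(x)                 # compute with only this block resident
            block.to("cpu")              # swap the block out to free device memory
            if self.device == "cuda":
                torch.cuda.empty_cache() # return freed memory to the allocator
        return x

# Usage: split a sequential network into blocks and run swapped inference.
if __name__ == "__main__":
    blocks = [nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)]
    device = "cuda" if torch.cuda.is_available() else "cpu"
    runner = BlockSwappedModel(blocks, device=device)
    print(runner.forward(torch.randn(4, 1024)).shape)  # torch.Size([4, 1024])
```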
Building on the efficient DNN inference enabled by SwapNet, this dissertation then turns to resource-efficient training on edge AI devices and introduces LATTE, a system designed for federated learning on heterogeneous mobile edge devices. Federated training of the same model across edge devices with heterogeneous computational resources is challenging because large differences in training times slow the convergence of the federated model. LATTE accurately estimates the model training time on each device and then allocates sub-models of the federated model to the heterogeneous edge devices so that their training durations are similar, as sketched below. This approach significantly accelerates the convergence of the central model in the federated learning system.
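The sketch below illustrates the allocation idea under a strong simplifying assumption: local training time scales linearly with the sub-model width ratio. The `Client` dataclass, the `allocate_widths` function, and the profiled numbers are hypothetical; LATTE's training-time estimation and allocation policy are more sophisticated than this.

```python
# A simplified, hypothetical sketch of latency-aware sub-model allocation for
# heterogeneous federated clients. This is not the LATTE system; it assumes
# local training time scales linearly with the sub-model width ratio.
from dataclasses import dataclass

@dataclass
class Client:
    name: str
    full_model_secs: float  # profiled time to train the full-width model for one round

def allocate_widths(clients: list[Client], min_width: float = 0.25) -> dict[str, float]:
    """Assign each client a sub-model width ratio in [min_width, 1.0] so that
    estimated local training times are roughly equal across clients.

    The per-round budget is the time the slowest client needs for the
    minimum-width sub-model; faster clients receive wider sub-models."""
    budget = max(c.full_model_secs * min_width for c in clients)
    widths = {}
    for c in clients:
        width = min(1.0, budget / c.full_model_secs)  # fill the budget, cap at full width
        widths[c.name] = round(max(min_width, width), 3)
    return widths

# Usage: three devices with very different profiled training speeds.
if __name__ == "__main__":
    clients = [Client("board-a", 8.0), Client("phone-a", 4.0), Client("phone-b", 1.5)]
    print(allocate_widths(clients))  # {'board-a': 0.25, 'phone-a': 0.5, 'phone-b': 1.0}
```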
With the rapid advancement of LLMs, there is an urgent need to bring these powerful models to mobile edge devices. Therefore, after achieving efficient DNN inference and training, this dissertation finally focuses on deploying large language models (LLMs) on mobile edge devices. LLMs require far more resources than conventional DNNs, making direct deployment on edge devices impractical. In the final part of this dissertation, we study the feasibility of using efficient attention mechanisms to reduce computational complexity and memory footprint, enabling LLM inference on resource-constrained devices. By optimizing the key-value (KV) cache, we demonstrate effective LLM operation on edge devices without compromising performance, opening new possibilities for sophisticated AI applications and suggesting future research directions.
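As one concrete way of bounding KV-cache memory, the sketch below caps the cache with a sliding window so that its size stays constant during autoregressive decoding. The `SlidingWindowKVCache` class is a hypothetical illustration of the memory argument, not the specific efficient attention mechanism studied in the dissertation.

```python
# A minimal, hypothetical sketch of bounding the key-value (KV) cache with a
# sliding window during autoregressive decoding, so cache memory stays
# O(window) instead of growing with sequence length. This illustrates the
# memory argument only; it is not the attention mechanism studied here.
import torch

class SlidingWindowKVCache:
    """Keep at most `window` past key/value vectors (shape: seq, heads, head_dim)."""

    def __init__(self, window: int):
        self.window = window
        self.keys = None
        self.values = None

    def append(self, k: torch.Tensor, v: torch.Tensor):
        """Add the new position's keys/values and evict the oldest if over budget."""
        self.keys = k if self.keys is None else torch.cat([self.keys, k], dim=0)
        self.values = v if self.values is None else torch.cat([self.values, v], dim=0)
        if self.keys.shape[0] > self.window:
            self.keys = self.keys[-self.window:]
            self.values = self.values[-self.window:]
        return self.keys, self.values

# Usage: the cache never holds more than 512 positions, however long we decode.
if __name__ == "__main__":
    cache = SlidingWindowKVCache(window=512)
    for _ in range(2048):              # decode 2048 tokens
        k = torch.randn(1, 8, 64)      # one new position, 8 heads, head_dim 64
        v = torch.randn(1, 8, 64)
        keys, values = cache.append(k, v)
    print(keys.shape)                  # torch.Size([512, 8, 64])
```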
In summary, this dissertation presents a comprehensive suite of solutions, from SwapNet's memory-efficient DNN inference to LATTE's computation-efficient federated training, and finally a feasibility study of efficient LLM execution using efficient attention mechanisms, each contributing to the overarching goal of realizing the full potential of AI on mobile edge devices.