Efficient Scheduling of Distributed Deep Neural Network Workloads
分佈式深度神經網絡任務的高效調度
Student thesis: Doctoral Thesis
Detail(s)
Award date | 27 Feb 2024 |
---|---|
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(8deb5373-ff15-4579-85df-c097a089ad55).html |
---|---|
Abstract
Deep Neural Networks (DNNs) have become the cornerstone for a myriad of AI applications. However, the growing complexity and size of DNN models, along with the increasing scale of datasets, have precipitated a surge in computational resource requirements, elevating the costs of training and inference. To tackle these issues, DNNs are increasingly deployed across extensive GPU clusters. Yet, the design of systems to host distributed DNN workloads encounters significant challenges across the infrastructure, framework, and algorithmic layers.
This Ph.D. thesis contributes to the field of distributed DNN systems by addressing challenges on multiple fronts in a bottom-up approach. At the infrastructure layer, we design Lyra, a novel cluster scheduler that addresses the problem of separate training and inference clusters by introducing capacity loaning and elastic scaling. Lyra significantly reduces the queuing and completion times of DNN training jobs and improves cluster resource utilization. Moving to the model framework layer, Lina addresses the challenges of distributed training and inference of sparsely activated models, specifically Mixture-of-Experts (MoE) language models, by identifying and alleviating communication bottlenecks, yielding substantial reductions in training step time and inference latency. Lastly, the thesis introduces Adaptive Gating in MoE, a flexible training strategy that reduces the computation cost of each token based on its linguistic complexity. This algorithmic approach requires fewer training FLOPs and less training time while maintaining the same inference quality.
The advancements presented in this thesis mark a small but meaningful step toward improving the scalability and efficiency of the systems underpinning distributed DNN workloads.
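To make the token-level adaptive computation described above concrete, the following is a minimal, hypothetical PyTorch sketch of an adaptive MoE gate: tokens whose top-1 routing probability is already confident are processed by a single expert, while the remaining tokens keep conventional top-2 routing. The class name `AdaptiveGate`, the confidence-threshold criterion, and all parameters are illustrative assumptions, not the thesis's actual Adaptive Gating mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGate(nn.Module):
    """Hypothetical adaptive top-1/top-2 gate for an MoE layer (illustrative only)."""
    def __init__(self, d_model: int, num_experts: int, threshold: float = 0.5):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.threshold = threshold  # assumed confidence cutoff, not taken from the thesis

    def forward(self, x: torch.Tensor):
        # x: [num_tokens, d_model]
        probs = F.softmax(self.router(x), dim=-1)   # per-token expert probabilities
        top2_p, top2_idx = probs.topk(2, dim=-1)    # two best experts per token
        # Tokens with a confident top-1 gate keep only one expert (the second
        # weight is zeroed), roughly halving their expert computation; the rest
        # fall back to standard top-2 routing.
        use_one = top2_p[:, 0] > self.threshold
        second = torch.where(use_one, torch.zeros_like(top2_p[:, 1]), top2_p[:, 1])
        combined = torch.stack([top2_p[:, 0], second], dim=-1)
        weights = combined / combined.sum(dim=-1, keepdim=True)  # renormalize gate weights
        return top2_idx, weights


# Example: route a batch of 8 token embeddings across 4 experts.
gate = AdaptiveGate(d_model=16, num_experts=4)
idx, w = gate(torch.randn(8, 16))
print(idx.shape, w.shape)  # torch.Size([8, 2]) torch.Size([8, 2])
```

Under this sketch, the per-token expert count is data-dependent, which is the source of the FLOP savings; how "linguistic complexity" is actually measured in the thesis is not specified here.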