Accelerating Large Scale Training and Inference for Deep Learning

Student thesis: Doctoral Thesis

Award date: 25 Sep 2020

Abstract

Highly efficient training and inference play significant roles in AI-based applications such as image classification, speech recognition, and object detection, where complex models are leveraged to achieve higher prediction accuracy. However, these complex models slow down training and inference, especially in large-scale distributed training and on-device inference at the edge. A key reason for this inefficiency is that schedulers treat each task as a coarse-grained unit instead of analyzing the models inside the task.

In this dissertation, we propose frameworks that accelerate training and inference in deep learning systems. Deep learning models are typically ordered compositions of layers, and different layers exhibit different characteristics in computation cost, communication cost, function, and so on. Our core idea is to exploit these per-layer characteristics to optimize deep learning training and inference systems.

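To make the per-layer differences concrete, the following back-of-envelope sketch (in Python, with AlexNet-like layer shapes chosen by us as an example, not taken from the dissertation) compares a convolutional layer and a fully connected layer: the convolutional layer dominates computation, while the fully connected layer dominates parameters and hence gradient-synchronization traffic.

```python
# Back-of-envelope costs for two layer types (AlexNet-like shapes assumed).

def conv_cost(c_in, c_out, k, h_out, w_out):
    params = c_out * c_in * k * k
    flops = 2 * params * h_out * w_out   # multiply-adds over the output map
    return params, flops

def fc_cost(n_in, n_out):
    params = n_in * n_out
    flops = 2 * params
    return params, flops

# conv5 of AlexNet: 256 -> 256 channels, 3x3 kernel, 13x13 output map
print(conv_cost(256, 256, 3, 13, 13))   # ~0.6M params, ~199M FLOPs
# fc6 of AlexNet: 9216 -> 4096 units
print(fc_cost(9216, 4096))              # ~37.7M params, ~75M FLOPs
```

The fully connected layer holds roughly 64x the parameters of the convolutional layer at comparable compute, which is exactly the asymmetry the systems below exploit.
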
• Stanza: a new distributed deep learning system with efficient communication. Stanza exploits the fact that different layers in layer-based deep learning models have different computation and communication characteristics. Rather than placing the whole model on every worker, Stanza assigns layers to workers according to their communication and computation costs: the majority of the cluster nodes train only the convolutional layers, while the rest train only the fully connected layers.

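As a rough illustration of this layer-to-worker assignment, here is a minimal sketch in the spirit of Stanza; the function names, the fixed 75/25 worker split, and the type-based rule are our own assumptions, not Stanza's actual implementation.

```python
# A minimal sketch of layer-wise worker assignment (illustrative only).

def assign_layers(layers, n_workers, conv_fraction=0.75):
    """Partition workers into a conv group and an FC group, then map each
    layer to the group matching its type."""
    n_conv = max(1, int(n_workers * conv_fraction))
    conv_group = list(range(n_conv))            # compute-heavy layers
    fc_group = list(range(n_conv, n_workers))   # parameter-heavy layers
    return {name: (conv_group if kind == "conv" else fc_group)
            for name, kind in layers}

layers = [("conv1", "conv"), ("conv2", "conv"), ("fc1", "fc"), ("fc2", "fc")]
print(assign_layers(layers, n_workers=8))
# conv layers train on workers 0-5; fc layers train on workers 6-7, so the
# large fully connected gradients stay within the small FC group.
```
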
• Saec: a highly efficient embedding compression system for recommender systems. Embedding methods are commonly used in recommender systems to represent features of users and items. Embedding layers often hold billions of embedding vectors, which pushes their size to hundreds of gigabytes. Saec exploits the similarity among features within a field, since they represent the same attribute of users or items, and uses clustering to compress the embeddings. Its fast clustering method relies on the empirically heavy-tailed distribution of features to drastically reduce the clustering overhead.

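The sketch below illustrates clustering-based embedding compression in the spirit of Saec; the frequency-based centroid heuristic is our own illustration of exploiting heavy-tailed features, not Saec's exact clustering method.

```python
import numpy as np

def compress_field(embeddings, frequencies, n_centroids):
    # Heavy-tailed frequencies: a few features cover most lookups, so their
    # embeddings serve as centroids instead of running a full k-means.
    top = np.argsort(-frequencies)[:n_centroids]
    centroids = embeddings[top]                       # compressed codebook
    # Map every feature in the field to its nearest centroid (squared L2).
    d = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, d.argmin(axis=1)                # codebook + indices

rng = np.random.default_rng(0)
emb = rng.normal(size=(2000, 16)).astype(np.float32)  # one field's embeddings
freq = rng.zipf(2.0, size=2000)                       # heavy-tailed counts
codebook, idx = compress_field(emb, freq, n_centroids=64)
# Storage drops from 2000x16 floats to 64x16 floats plus 2000 small indices.
```
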
• Irina: a novel online inference task scheduling system in the cloud. Different combinations of layers cause models to incur different execution costs as the batch size grows, often leaving computation and memory underutilized. Irina therefore schedules inference tasks based on the models they contain, taking completion time under unpredictable workloads as its primary objective. It augments the design space of inference task scheduling with three new strategies, namely batching, stacking, and preemption, to schedule tasks more flexibly and reduce overall latency.

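To show how the three strategies interact, here is a toy scheduler sketch; the cost model (a fixed launch overhead plus a per-item cost) and the decision rules are simplified assumptions of ours, not Irina's actual policy.

```python
import collections

Task = collections.namedtuple("Task", "model items")

OVERHEAD = 5.0   # fixed launch cost per executed task (ms, assumed)
PER_ITEM = 1.0   # marginal cost per batched item (ms, assumed)
CAPACITY = 2     # tasks the device can stack concurrently (assumed)

def cost(t):
    return OVERHEAD + PER_ITEM * t.items

def batch(pending):
    """Batching: merge same-model requests so the fixed overhead is paid once."""
    merged = collections.Counter()
    for t in pending:
        merged[t.model] += t.items
    return [Task(m, n) for m, n in merged.items()]

def plan(pending, running):
    """Batch first, preempt a long running task when a shorter batch exists,
    then stack tasks up to the device capacity."""
    queue = sorted(batch(pending), key=cost)
    # Preemption: suspend a running task if some queued task finishes sooner.
    keep = [r for r in running if not queue or cost(r) <= cost(queue[0])]
    # Stacking: co-locate tasks on the device up to CAPACITY.
    while queue and len(keep) < CAPACITY:
        keep.append(queue.pop(0))
    return keep

pending = [Task("resnet", 1), Task("resnet", 3), Task("bert", 2)]
print(plan(pending, running=[Task("vgg", 40)]))
# The two resnet requests are merged into one batch, the long vgg task is
# preempted, and the two short tasks are stacked on the device together.
```
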
We implement Stanza and Saec and evaluate them on a testbed cluster. Compared to PS-based and Horovod-based distributed training systems, Stanza significantly reduces training time. Saec achieves a competitive compression rate on the model deployed for product recommendation at Tencent. The Irina prototype is still under implementation, but simulations of Irina with real inference tasks already show a substantial reduction in job completion time compared to TensorFlow Serving.