Preemptive GPU Inference for DNNs on Emerging Mobile AI Devices

Project: Research


Description

GPUs have become the primary executors and accelerators for DNN tasks on emerging mobile artificial intelligence (mAI) devices, e.g., the Nvidia Jetson family. Beyond accelerating inference for a single task, their significance increasingly lies in the efficient execution of the multiple DNN tasks required by recent applications, many of which have strict latency requirements.

Preemption is the main technique for ensuring the timeliness of multitasking: it allows high-priority tasks to interrupt ongoing low-priority tasks. It lets a processor fully exploit its inherent computing capability; hardware receives huge investment and improvement every year, yet the achieved task timeliness can easily be compromised by inefficient scheduling. Existing mAI GPUs expose only two priority levels in the device queue, so state-of-the-art methods achieve only coarse-grained preemption by dividing DNN tasks into two types, real-time and best-effort, and allowing a real-time task to preempt best-effort tasks (a minimal sketch of this baseline mechanism is given at the end of this description). However, their efficacy drops significantly when other real-time tasks run concurrently (as additional high-priority contenders), which is already common on mAI devices for autonomous vehicles, robots, UAVs, etc.

Can we reuse solutions from other platforms or processors? Due to different workload and hardware characteristics, they are not applicable. First, mAI is a new form of mobile device: traditional mobile devices usually run only one dominant foreground application and simply schedule the GPU in FIFO order. Other platforms (e.g., clouds) support preemption among concurrent tasks, but they focus on allocating one or more GPUs to each complex model, whereas on mAI devices multiple DNNs mainly compete for a single GPU. Second, preemption is mature on CPUs, but because of hardware differences, CPU preemption only requires saving a task's context from dozens of registers to memory; on GPUs this overhead grows to hundreds or thousands of registers, which slows context switching and thus increases DNN-inference latency.

In this project, we propose a middleware design that works directly with commodity mAI GPUs to provide general and fine-grained preemption, allowing real-time tasks to preempt each other as well as best-effort tasks. Our main finding is that efficient preemption can be achieved through software designs: carefully pruning the decision space, in-depth analysis of DNN structure, and innovative model adaptation. Overall, enabling fine-grained GPU-preemptive DNN inference for mAI devices is inherently novel. It provides a crucial service that accommodates the important trend toward more concurrent DNNs in emerging applications.
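For concreteness, the sketch below illustrates the existing coarse-grained baseline discussed above, assuming a CUDA-capable mAI GPU such as a Jetson. It is not the proposed middleware; the kernel names rt_infer_kernel and be_infer_kernel are hypothetical placeholders for DNN layer kernels. The only mechanism used is CUDA stream priorities, which lets pending high-priority work be scheduled ahead of pending low-priority work but cannot interrupt a kernel that is already running.

// Minimal sketch of two-level, priority-based GPU scheduling (baseline, not the proposed design).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void rt_infer_kernel() { /* placeholder for a real-time DNN layer kernel */ }
__global__ void be_infer_kernel() { /* placeholder for a best-effort DNN layer kernel */ }

int main() {
    int lowest, highest;
    // Query the device's stream-priority range; on current mAI GPUs this
    // typically spans only two levels, which is the limitation discussed above.
    cudaDeviceGetStreamPriorityRange(&lowest, &highest);
    printf("stream priority range: %d (lowest) .. %d (highest)\n", lowest, highest);

    cudaStream_t rt_stream, be_stream;
    // In CUDA, a numerically smaller value means a higher priority.
    cudaStreamCreateWithPriority(&rt_stream, cudaStreamNonBlocking, highest);
    cudaStreamCreateWithPriority(&be_stream, cudaStreamNonBlocking, lowest);

    // Best-effort work is launched first; the real-time stream's kernels are
    // scheduled ahead of still-pending best-effort kernels, but an already
    // running kernel is not interrupted -- hence "coarse-grained" preemption.
    be_infer_kernel<<<64, 256, 0, be_stream>>>();
    rt_infer_kernel<<<64, 256, 0, rt_stream>>>();

    cudaDeviceSynchronize();
    cudaStreamDestroy(rt_stream);
    cudaStreamDestroy(be_stream);
    return 0;
}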

Detail(s)

Project number: 9043679
Grant type: GRF
Status: Not started
Effective start/end date: 1/11/24 → …