Pantheon : Preemptible Multi-DNN Inference on Mobile Edge GPUs

Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review


Detail(s)

Original language: English
Title of host publication: MOBISYS '24
Subtitle of host publication: Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services
Publisher: Association for Computing Machinery
Pages: 465-478
Number of pages: 14
ISBN (print): 979-8-4007-0581-6
Publication status: Published - Jun 2024

Conference

Title: 22nd ACM International Conference on Mobile Systems, Applications, and Services (ACM MobiSys 2024)
Location: Toranomon Hills Forum
Place: Japan
City: Tokyo
Period: 3 - 7 June 2024

Abstract

GPUs are increasingly utilized for running DNN tasks on emerging mobile edge devices. Beyond accelerating single-task inference, their value is particularly apparent in efficiently executing multiple DNN tasks, which often have strict latency requirements in applications. Preemption is the main technology to ensure multitasking timeliness, but mobile edges primarily offer only two priorities for task queues, so existing methods achieve merely coarse-grained preemption by categorizing DNNs into real-time and best-effort classes, permitting a real-time task to preempt best-effort ones. However, this approach loses its efficacy when multiple real-time tasks run concurrently, which is already common in mobile edge applications. Owing to differing hardware characteristics, solutions from other platforms are unsuitable. For instance, GPUs on traditional mobile devices primarily assist CPU processing, lack special preemption support, and mainly follow FIFO order in GPU scheduling. Clouds handle concurrent task execution but focus on allocating one or more GPUs per complex model, whereas on mobile edges, DNNs mainly vie for a single GPU. This paper introduces Pantheon, designed to offer fine-grained preemption, enabling real-time tasks to preempt both each other and best-effort tasks. Our key observation is that the two-tier GPU stream priorities, while underexplored, are sufficient: efficient preemption can be realized in software through innovative scheduling and novel exploitation of the nested redundancy principle of DNN models. Evaluation on a diverse set of DNNs shows substantial improvements in deadline miss rate and accuracy for Pantheon over state-of-the-art methods.
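The abstract's key observation is that two-tier GPU stream priorities, though coarse, can suffice for fine-grained preemption among many tasks. As a minimal illustrative sketch only (not Pantheon's actual design, whose scheduling is far more involved), the snippet below shows one simple way a software scheduler could map arbitrarily many real-time deadlines onto just two hardware priorities: the earliest-deadline task is placed on the high-priority stream, everything else on the low-priority one. The function name, task names, and EDF-style policy are all assumptions for illustration.

```python
# Hypothetical sketch: mapping many real-time deadlines onto the two
# stream priorities that mobile edge GPUs typically expose. On CUDA
# devices the two tiers would come from cudaDeviceGetStreamPriorityRange
# and cudaStreamCreateWithPriority; here we simulate only the assignment
# logic in plain Python. (Following CUDA convention, a lower number
# means higher priority.)
HIGH, LOW = 0, 1  # the only two hardware stream priorities assumed available

def assign_streams(tasks):
    """EDF-style mapping: the task with the earliest deadline gets the
    high-priority stream; all others share the low-priority stream.
    `tasks` is a list of (name, deadline_ms) pairs."""
    if not tasks:
        return {}
    urgent = min(tasks, key=lambda t: t[1])[0]
    return {name: (HIGH if name == urgent else LOW) for name, _ in tasks}

mapping = assign_streams([("detector", 33), ("tracker", 16), ("segmenter", 100)])
print(mapping)  # {'detector': 1, 'tracker': 0, 'segmenter': 1}
```

In a real system the mapping would be re-evaluated as tasks arrive and complete; the point of the sketch is only that two priority tiers, driven by software scheduling, can express more than two urgency levels over time.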

© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.

Research Area(s)

  • Mobile Edge Systems, GPU Scheduling, Preemption, Deep Learning

Citation Format(s)

Pantheon: Preemptible Multi-DNN Inference on Mobile Edge GPUs. / Han, Lixiang; Zhou, Zimu; Li, Zhenjiang.
MOBISYS '24: Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services. Association for Computing Machinery, 2024. p. 465-478.
