Stitching Weight-Shared Deep Neural Networks for Efficient Multitask Inference on GPU

Research output: Chapters, Conference Papers, Creative and Literary Works / RGC 32 - Refereed conference paper (with host publication) / peer-review

3 Scopus Citations

Author(s)

  • Zeyu Wang
  • Xiaoxi He
  • Zimu Zhou
  • Xu Wang
  • Qiang Ma
  • Xin Miao
  • Zhuo Liu
  • Lothar Thiele
  • Zheng Yang

Detail(s)

Original language: English
Title of host publication: 2022 19th Annual IEEE International Conference on Sensing, Communication, and Networking, SECON 2022
Publisher: Institute of Electrical and Electronics Engineers, Inc.
Pages: 145-153
ISBN (electronic): 9781665486439
ISBN (print): 978-1-6654-8644-6
Publication status: Published - 2022
Externally published: Yes

Publication series

Name: Annual IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks workshops
ISSN (print): 2155-5486
ISSN (electronic): 2155-5494

Conference

Title: 19th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON 2022)
Location: Virtual
Place: Sweden
City: Stockholm
Period: 20 - 23 September 2022

Abstract

Intelligent personal and home applications demand multiple deep neural networks (DNNs) running on resource-constrained platforms for compound inference tasks, known as multitask inference. To fit multiple DNNs into low-resource devices, emerging techniques resort to weight sharing among DNNs to reduce their storage. However, such a reduction in storage fails to translate into efficient execution on common accelerators such as GPUs. Most DNN graph rewriters are blind to multi-DNN optimization, while GPU vendors provide inefficient APIs for parallel multi-DNN execution at runtime. A few prior graph rewriters suggest cross-model graph fusion for low-latency multi-DNN execution, yet they require duplication of the shared weights, erasing the memory savings of weight-shared DNNs. In this paper, we propose MTS, a novel graph rewriter for efficient multitask inference with weight-shared DNNs. MTS adopts a model stitching algorithm that outputs a single computational graph for weight-shared DNNs without duplicating any shared weight. MTS also utilizes a model grouping strategy to avoid overwhelming the GPU when co-running tens of DNNs. Extensive experiments show that MTS accelerates multitask inference by up to 6.0× compared to sequentially executing multiple weight-shared DNNs. MTS also yields up to 2.5× lower latency and 3.7× less memory usage compared with NETFUSE, a state-of-the-art multi-DNN graph rewriter. © 2022 IEEE.
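The core idea above, running several weight-shared DNNs as a single computational graph so the shared weights reside in GPU memory only once, can be illustrated with a minimal PyTorch sketch. The sketch below is a hypothetical simplification (one shared backbone with per-task heads, plus a naive fixed-size grouping helper), not the authors' MTS rewriter, which stitches arbitrary weight-shared graphs; all class and function names are invented for illustration.

```python
# Minimal sketch, assuming a simple sharing pattern: one backbone shared by
# all tasks. MTS handles arbitrary weight-shared graphs; this only shows why
# stitching avoids weight duplication. All names here are hypothetical.
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, x):
        return self.features(x)

class StitchedMultitaskModel(nn.Module):
    """One graph for all tasks: the shared weights are stored exactly once,
    and the shared activations are computed once per input."""
    def __init__(self, backbone, classes_per_task):
        super().__init__()
        self.backbone = backbone  # shared, never duplicated
        self.heads = nn.ModuleList([nn.Linear(32, n) for n in classes_per_task])

    def forward(self, x):
        feat = self.backbone(x)                      # shared computation, run once
        return [head(feat) for head in self.heads]   # per-task branches

def group_models(models, group_size=4):
    """Toy grouping: co-run at most `group_size` stitched graphs at a time so
    tens of DNNs do not overwhelm the GPU (MTS's strategy is more elaborate)."""
    return [models[i:i + group_size] for i in range(0, len(models), group_size)]

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = StitchedMultitaskModel(SharedBackbone(), [10, 5]).to(device).eval()
    with torch.no_grad():
        task_outputs = model(torch.randn(1, 3, 64, 64, device=device))
    print([o.shape for o in task_outputs])  # [(1, 10), (1, 5)]
```

By contrast, the sequential baseline from the abstract runs each task's full network separately, recomputing the shared layers per task, while duplication-based fused rewriters store a copy of the shared weights per model; stitching removes both costs.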

Research Area(s)

  • Deep Neural Networks, Model Acceleration, Multitask Inference

Citation Format(s)

Stitching Weight-Shared Deep Neural Networks for Efficient Multitask Inference on GPU. / Wang, Zeyu; He, Xiaoxi; Zhou, Zimu et al.
2022 19th Annual IEEE International Conference on Sensing, Communication, and Networking, SECON 2022. Institute of Electrical and Electronics Engineers, Inc., 2022. p. 145-153 (Annual IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks workshops).

Research output: Chapters, Conference Papers, Creative and Literary Works / RGC 32 - Refereed conference paper (with host publication) / peer-review