Zhihao Bai and Zhen Zhang, Johns Hopkins University; Yibo Zhu, ByteDance Inc.; Xin Jin, Johns Hopkins University
Deep learning (DL) workloads include throughput-intensive training tasks and latency-sensitive inference tasks. The dominant practice today is to provision dedicated GPU clusters for training and inference separately. Due to the need to meet strict Service-Level Objectives (SLOs), GPU clusters are often over-provisioned based on the peak load with limited sharing between applications and task types.
We present PipeSwitch, a system that enables unused cycles of an inference application to be filled by training or other inference applications. It allows multiple DL applications to time-share the same GPU with the entire GPU memory and millisecond-scale switching overhead. With PipeSwitch, GPU utilization can be significantly improved without sacrificing SLOs. We achieve so by introducing pipelined context switching. The key idea is to leverage the layered structure of neural network models and their layer-by-layer computation pattern to pipeline model transmission over the PCIe and task execution in the GPU with model-aware grouping. We also design unified memory management and active-standby worker switching mechanisms to accompany the pipelining and ensure process-level isolation.We have built a PipeSwitch prototype and integrated it with PyTorch. Experiments on a variety of DL models and GPU cards show that PipeSwitch only incurs a task startup overhead of 3.6–6.6 ms and a total overhead of 5.4–34.6 ms (10–50× better than NVIDIA MPS), and achieves near 100% GPU utilization.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.
author = {Zhihao Bai and Zhen Zhang and Yibo Zhu and Xin Jin},
title = {{PipeSwitch}: Fast Pipelined Context Switching for Deep Learning Applications},
booktitle = {14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)},
year = {2020},
isbn = {978-1-939133-19-9},
pages = {499--514},
url = {https://www.usenix.org/conference/osdi20/presentation/bai},
publisher = {USENIX Association},
month = nov
}