Hao Wu, Yue Yu, and Junxiao Deng, Huazhong University of Science and Technology; Shadi Ibrahim, Inria; Song Wu and Hao Fan, Huazhong University of Science and Technology and Jinyinhu Laboratory; Ziyue Cheng, Huazhong University of Science and Technology; Hai Jin, Huazhong University of Science and Technology and Jinyinhu Laboratory
The dynamic workload and latency sensitivity of DNN inference drive a trend toward exploiting serverless computing for scalable DNN inference serving. Usually, GPUs are spatially partitioned to serve multiple co-located functions. However, existing serverless inference systems isolate functions in separate monolithic GPU runtimes (e.g., CUDA contexts), which are too heavy for short-lived, fine-grained functions, leading to high startup latency, a large memory footprint, and expensive inter-function communication. In this paper, we present StreamBox, a new lightweight GPU sandbox for serverless inference workflows. StreamBox unleashes the potential of GPU streams and efficiently realizes them for serverless inference by implementing fine-grained, auto-scaling memory management; enabling transparent and efficient intra-GPU communication across functions; and sharing PCIe bandwidth among concurrent streams. Our evaluations over real-world workloads show that StreamBox reduces the GPU memory footprint by up to 82% and improves throughput by 6.7× compared to state-of-the-art serverless inference systems.
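To make the stream-versus-context contrast concrete, the sketch below (not StreamBox's actual code; the function names, buffer sizes, and the preprocess/infer kernels are illustrative assumptions) hosts two serverless "functions" as CUDA streams inside one shared context rather than one monolithic context per function. Stream-ordered allocation (cudaMallocAsync/cudaFreeAsync) stands in for fine-grained, auto-scaling memory management; an event ordering the two streams stands in for intra-GPU communication without a device-to-host bounce; and independent async copies on separate streams illustrate how concurrent streams could share PCIe bandwidth.

// Minimal sketch, assuming CUDA 11.2+ for stream-ordered allocation.
// Two "functions" (fnA, fnB) share one CUDA context via streams.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void preprocess(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 0.5f;
}

__global__ void infer(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;   // placeholder for real model math
}

int main() {
    const int n = 1 << 20;
    float *h_in = nullptr;
    cudaMallocHost(&h_in, n * sizeof(float));   // pinned host buffer for async copy
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    // Each serverless "function" gets a stream, not a whole CUDA context.
    cudaStream_t fnA, fnB;
    cudaStreamCreate(&fnA);
    cudaStreamCreate(&fnB);

    // Stream-ordered allocations: memory is requested per stream on demand
    // and returned to a shared pool, rather than reserved per context.
    float *d_in = nullptr, *d_out = nullptr;
    cudaMallocAsync(&d_in,  n * sizeof(float), fnA);
    cudaMallocAsync(&d_out, n * sizeof(float), fnB);

    // Function A: copy input over PCIe and preprocess it on the GPU.
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice, fnA);
    preprocess<<<(n + 255) / 256, 256, 0, fnA>>>(d_in, n);

    // Function B consumes A's output in-GPU: both streams share one address
    // space, so an event ordering the streams replaces inter-process copies.
    cudaEvent_t aDone;
    cudaEventCreate(&aDone);
    cudaEventRecord(aDone, fnA);
    cudaStreamWaitEvent(fnB, aDone, 0);
    infer<<<(n + 255) / 256, 256, 0, fnB>>>(d_in, d_out, n);

    // Release buffers back to the pool once B's work drains.
    cudaFreeAsync(d_in, fnB);
    cudaFreeAsync(d_out, fnB);
    cudaStreamSynchronize(fnB);
    printf("done\n");

    cudaEventDestroy(aDone);
    cudaStreamDestroy(fnA);
    cudaStreamDestroy(fnB);
    cudaFreeHost(h_in);
    return 0;
}

The key design point the sketch mirrors is that streams are cheap to create and destroy and draw from a shared memory pool, whereas per-function CUDA contexts each carry their own runtime state and reserved memory, which is the overhead the paper targets.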
@inproceedings{wu-streambox-atc24,
  author = {Hao Wu and Yue Yu and Junxiao Deng and Shadi Ibrahim and Song Wu and Hao Fan and Ziyue Cheng and Hai Jin},
  title = {{StreamBox}: A Lightweight {GPU} {SandBox} for Serverless Inference Workflow},
  booktitle = {2024 USENIX Annual Technical Conference (USENIX ATC 24)},
  year = {2024},
  isbn = {978-1-939133-41-0},
  address = {Santa Clara, CA},
  pages = {59--73},
  url = {https://www.usenix.org/conference/atc24/presentation/wu-hao},
  publisher = {USENIX Association},
  month = jul
}