ChonLam Lao, Tsinghua University; Yanfang Le and Kshiteej Mahajan, University of Wisconsin-Madison; Yixi Chen and Wenfei Wu, Tsinghua University; Aditya Akella and Michael Swift, University of Wisconsin-Madison
Awarded Best Paper!

Distributed deep neural network training (DT) systems are widely deployed in clusters where the network is shared across multiple tenants, i.e., multiple DT jobs. Each DT job computes and aggregates gradients. Recent advances in hardware accelerators have shifted the performance bottleneck of training from computation to communication. To speed up DT jobs' communication, we propose ATP, a service for in-network aggregation aimed at modern multi-rack, multi-job DT settings. ATP uses emerging programmable switch hardware to support in-network aggregation at multiple rack switches in a cluster to speed up DT jobs. ATP performs decentralized, dynamic, best-effort aggregation, enables efficient and equitable sharing of limited switch resources across simultaneously running DT jobs, and gracefully accommodates heavy contention for switch resources. ATP outperforms existing systems, accelerating training throughput by up to 38%–66% in a cluster shared by multiple DT jobs.
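To make the aggregation idea concrete, the following is a minimal, hypothetical Python sketch of best-effort in-network gradient aggregation. It is not ATP's implementation (ATP runs as a P4 pipeline on programmable switch hardware); the names here (BestEffortSwitch, Aggregator, on_packet) and the fixed pool of aggregator slots are illustrative assumptions. The sketch captures the behavior the abstract describes: a gradient fragment either claims a free slot, accumulates into an existing one, or, under heavy contention, is forwarded unaggregated rather than dropped.

from dataclasses import dataclass, field

@dataclass
class Aggregator:
    """One switch aggregation slot, accumulating one gradient fragment."""
    job_id: int
    seq: int                                 # fragment sequence number
    values: list = field(default_factory=list)
    workers_seen: set = field(default_factory=set)

class BestEffortSwitch:
    """Hypothetical model of one rack switch's aggregation logic."""

    def __init__(self, num_slots, num_workers):
        self.slots = {}                      # (job_id, seq) -> Aggregator
        self.num_slots = num_slots           # limited switch memory
        self.num_workers = num_workers       # workers per DT job

    def on_packet(self, job_id, seq, worker_id, grad):
        key = (job_id, seq)
        agg = self.slots.get(key)
        if agg is None:
            if len(self.slots) >= self.num_slots:
                # Contention: no free slot, so forward the fragment
                # unaggregated (best effort, no loss of correctness).
                return ("forward", grad)
            agg = Aggregator(job_id, seq, [0.0] * len(grad))
            self.slots[key] = agg
        if worker_id in agg.workers_seen:
            return ("drop_duplicate", None)  # ignore retransmission
        agg.workers_seen.add(worker_id)
        agg.values = [a + g for a, g in zip(agg.values, grad)]
        if len(agg.workers_seen) == self.num_workers:
            # All workers contributed: free the slot immediately so
            # other jobs can reuse it, and send the sum onward.
            del self.slots[key]
            return ("send_aggregated", agg.values)
        return ("accumulated", None)

# Example: two workers of job 1 send fragment 0; the second arrival
# completes the sum [4.0, 6.0] and releases the slot.
sw = BestEffortSwitch(num_slots=2, num_workers=2)
sw.on_packet(job_id=1, seq=0, worker_id=0, grad=[1.0, 2.0])
action, out = sw.on_packet(job_id=1, seq=0, worker_id=1, grad=[3.0, 4.0])
assert action == "send_aggregated" and out == [4.0, 6.0]

Freeing a slot as soon as a fragment completes is what makes the sharing in this sketch dynamic: slots are held per in-flight fragment, not reserved per job, which is the spirit of the decentralized, best-effort resource sharing the abstract claims.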
@inproceedings{lao2021atp,
author = {ChonLam Lao and Yanfang Le and Kshiteej Mahajan and Yixi Chen and Wenfei Wu and Aditya Akella and Michael Swift},
title = {{ATP}: In-network Aggregation for Multi-tenant Learning},
booktitle = {18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21)},
year = {2021},
isbn = {978-1-939133-21-2},
pages = {741--761},
url = {https://www.usenix.org/conference/nsdi21/presentation/lao},
publisher = {USENIX Association},
month = apr
}