Abhishek Vijaya Kumar and Muthian Sivathanu, Microsoft Research India
We introduce Quiver, an informed storage cache for deep learning training (DLT) jobs in a cluster of GPUs. Quiver employs domain-specific intelligence within the caching layer to achieve much higher efficiency than a generic storage cache. First, Quiver uses secure hash-based addressing to transparently reuse cached data across multiple jobs, and even across multiple users, operating on the same dataset. Second, by co-designing with the deep learning framework (e.g., PyTorch), Quiver employs a technique called substitutable cache hits to get more value from the existing contents of the cache, thus avoiding cache thrashing when cache capacity is much smaller than the working set. Third, Quiver dynamically prioritizes cache allocation to the jobs that benefit most from caching. With a prototype implementation in PyTorch, we show that Quiver can significantly improve the throughput of deep learning workloads.
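The first two ideas in the abstract can be illustrated with a small toy sketch. This is not Quiver's implementation: the names `content_key` and `next_batch` are hypothetical, SHA-256 is an assumed choice of hash, and the substitution rule assumes that a training job only needs to see every sample once per epoch in some random order, so any not-yet-seen cached sample can stand in for one that would miss.

```python
import hashlib
import random


def content_key(data: bytes) -> str:
    """Address an item by a digest of its bytes (assumed: SHA-256),
    so identical data maps to the same cache entry for any job or user."""
    return hashlib.sha256(data).hexdigest()


def next_batch(pending: set, cached: set, batch_size: int,
               rng: random.Random) -> list:
    """Choose the next batch from items not yet seen this epoch,
    preferring items already in the cache (a 'substitutable' hit)."""
    hits = sorted(k for k in pending if k in cached)
    misses = sorted(k for k in pending if k not in cached)
    rng.shuffle(hits)      # keep the order random within each group
    rng.shuffle(misses)
    batch = (hits + misses)[:batch_size]
    pending.difference_update(batch)  # each item is served once per epoch
    return batch


# Toy dataset: eight samples, three of which are already cached.
samples = [f"sample-{i}".encode() for i in range(8)]
keys = [content_key(s) for s in samples]
cached = set(keys[:3])

pending = set(keys)
rng = random.Random(0)
first = next_batch(pending, cached, 3, rng)

# Drain the rest of the epoch.
epoch = list(first)
while pending:
    epoch += next_batch(pending, cached, 3, rng)
```

In this sketch the first batch is served entirely from the cache, and the epoch still covers every sample exactly once; a real system would also have to fetch and admit the missing items in the background.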
FAST '20 Open Access Sponsored by NetApp
author = {Abhishek Vijaya Kumar and Muthian Sivathanu},
title = {Quiver: An Informed Storage Cache for Deep Learning},
booktitle = {18th USENIX Conference on File and Storage Technologies (FAST 20)},
year = {2020},
isbn = {978-1-939133-12-0},
address = {Santa Clara, CA},
pages = {283--296},
url = {https://www.usenix.org/conference/fast20/presentation/kumar},
publisher = {USENIX Association},
month = feb
}