Disdat: Bundle Data Management for Machine Learning Pipelines

Authors: 

Ken Yocum, Sean Rowan, and Jonathan Lunt, Intuit, Inc.; Theodore M. Wong, 23andMe, Inc.

Abstract: 

Modern machine learning pipelines can produce hundreds of data artifacts (such as features, models, and predictions) throughout their lifecycle. During that time, data scientists need to reproduce errors, update features, re-train on specific data, validate and inspect outputs, and share models and predictions. Doing so requires the ability to publish, discover, and version those artifacts.

This work introduces Disdat, a system to simplify ML pipelines by addressing these data management challenges. Disdat is built on two core data abstractions: bundles and contexts. A bundle is a versioned, typed, immutable collection of data. A context is a sharable set of bundles that can exist in both local and cloud storage environments. Disdat provides a bundle management API that we use to extend an existing workflow system to produce and consume bundles. This bundle-based approach to data management has simplified both authoring and deployment of our ML pipelines.
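
To make the two abstractions concrete, here is a minimal sketch of how a bundle and a context might be modeled. It is not Disdat's API: the class names `Bundle` and `Context`, the methods `add`, `latest`, and `get`, and the integer version scheme are all assumptions made for illustration, and the sketch keeps everything in memory rather than on local or cloud storage.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Bundle:
    """Hypothetical bundle: a named, versioned, immutable collection of data."""
    name: str
    version: int              # assumed scheme: 1, 2, 3, ... per bundle name
    data: Tuple[float, ...]   # a tuple, so the payload cannot be changed in place


class Context:
    """Hypothetical context: a named set of bundles, held in memory here."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._bundles: Dict[str, List[Bundle]] = {}

    def add(self, bundle_name: str, data: Iterable[float]) -> Bundle:
        # Never overwrite: each call appends a new version to the history.
        history = self._bundles.setdefault(bundle_name, [])
        bundle = Bundle(bundle_name, len(history) + 1, tuple(data))
        history.append(bundle)
        return bundle

    def latest(self, bundle_name: str) -> Bundle:
        return self._bundles[bundle_name][-1]

    def get(self, bundle_name: str, version: int) -> Bundle:
        return self._bundles[bundle_name][version - 1]


ctx = Context("experiments")
v1 = ctx.add("features", [0.1, 0.2])
v2 = ctx.add("features", [0.1, 0.2, 0.3])
```

Because every `add` creates a new version instead of modifying an old one, an earlier result such as `ctx.get("features", 1)` stays available for reproducing an error or re-training on specific data.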

BibTeX
@inproceedings {232973,
author = {Ken Yocum and Sean Rowan and Jonathan Lunt and Theodore M. Wong},
title = {Disdat: Bundle Data Management for Machine Learning Pipelines},
booktitle = {2019 USENIX Conference on Operational Machine Learning (OpML 19)},
year = {2019},
isbn = {978-1-939133-00-7},
address = {Santa Clara, CA},
pages = {35--37},
url = {https://www.usenix.org/conference/opml19/presentation/yocum},
publisher = {USENIX Association},
month = may
}