All the times listed below are in Pacific Daylight Time (PDT).
Join the USENIX OpML Slack workspace to participate in ask-me-anything (AMA) conversations on each of the sessions below. Already have a Slack account for another workspace? You'll need to create new login credentials for this workspace. View the OpML '20 attendee guide.
Papers and Proceedings
The full Proceedings published by USENIX for the conference are available for download below. Individual papers can also be downloaded from the presentation page. Copyright to the individual works is retained by the author[s].
Proceedings Front Matter
Proceedings Cover |
Title Page, Copyright Page, and List of Organizers |
Table of Contents |
Message from the Program Co-Chairs
Tuesday, July 28
9:00 am–10:30 am
Session 1: Deep Learning and GPU Accelerated Data Science
Deep Learning is a critical technology for real-world ML applications. However, DL presents unique challenges for ML Operations. New DL innovations are arriving rapidly, and DL pipelines tend to use expensive hardware like GPUs that benefit from careful optimization. ML Ops challenges range from model compatibility to GPU/CPU resource management.
This session contains a paper presentation and a talk that explore solutions to DL-specific operational problems. The paper presentation describes a compatibility specification that enables the sharing and reproduction of DL jobs. The talk describes GPU use within an end-to-end data science pipeline, and the challenges and solutions involved; GPUs are great for deep learning but can accelerate many other workflows as well.
Join the discussion on the Session 1 Slack channel.
DLSpec: A Deep Learning Task Exchange Specification
Abdul Dakkak and Cheng Li, University of Illinois at Urbana-Champaign; Jinjun Xiong, IBM; Wen-mei Hwu, University of Illinois at Urbana-Champaign
Deep Learning (DL) innovations are being introduced at a rapid pace. However, the current lack of standard specification of DL tasks makes sharing, running, reproducing, and comparing these innovations difficult. To address this problem, we propose DLSpec, a model-, dataset-, software-, and hardware-agnostic DL specification that captures the different aspects of DL tasks. DLSpec has been tested by specifying and running hundreds of DL tasks.
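As a rough illustration of what such a task specification might capture, here is a hypothetical sketch in Python; the field names and values are illustrative assumptions, not DLSpec's actual schema.

# Hypothetical sketch of a model-, dataset-, software-, and hardware-agnostic
# DL task description; field names are illustrative, not DLSpec's real schema.
dl_task_spec = {
    "model": {
        "name": "resnet50",
        "framework": "tensorflow==2.1.0",
        "weights_url": "https://example.com/resnet50.pb",  # placeholder URL
    },
    "dataset": {"name": "imagenet-val", "preprocessing": "resize_224_center_crop"},
    "software": {"container": "nvcr.io/nvidia/tensorflow:20.03-tf2-py3"},
    "hardware": {"gpu": "1x V100", "cpu_memory_gb": 32},
}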
End-to-End Data Science on GPUs with RAPIDS
John Zedlewski, NVIDIA
In this talk, we'll discuss the impact of GPUs on the complete workflow supporting data science, and how the open source RAPIDS stack unifies ETL, analytics, and model building all within GPU memory without round trips to CPU. We'll emphasize modern networking technologies, like Dask, UCX, and Infiniband that allow this stack to scale out.
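As a minimal sketch of the idea, assuming a machine with an NVIDIA GPU and the RAPIDS libraries (cuDF and cuML) installed, ETL and model fitting can stay in GPU memory; the file name and columns below are placeholders.

import cudf
from cuml.cluster import KMeans

# Load and transform data directly on the GPU (placeholder file and columns).
gdf = cudf.read_csv("transactions.csv")
gdf["amount_per_item"] = gdf["amount"] / gdf["quantity"]

# Fit a cuML model on the same GPU-resident DataFrame, with no copy back to CPU.
km = KMeans(n_clusters=8)
km.fit(gdf[["amount", "amount_per_item"]])
print(km.cluster_centers_)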
John Zedlewski, NVIDIA
John Zedlewski is the director of GPU-accelerated machine learning on the NVIDIA Rapids team. Previously, he worked on deep learning for self-driving cars at NVIDIA, deep learning for radiology at Enlitic, and machine learning for structured healthcare data at Castlight. He has an MA/ABD in economics from Harvard with a focus in computational econometrics and an AB in computer science from Princeton.
Wednesday, July 29
9:00 am–10:30 am
Session 2: Model Life Cycle
Training, or even running, your models in production efficiently isn't enough. Machine-learned models have complex dependencies, including the training data, versions of derived features and embeddings, skew between offline and online versions of features, and multiple versions of the same model running in production. Keeping all of this running smoothly and at scale is like conducting an orchestra. In this session, presenters from Walmart Labs discuss their approach to helping data scientists bring models from their local machines to their production platform, and Netflix discusses its system for managing over two thousand models in production, with capabilities spanning model discovery, monitoring, deployment safeguards, and rollbacks.
Join the discussion at the Session 2 Slack channel.
Finding Bottleneck in Machine Learning Model Life Cycle
Chandra Mohan Meena, Sarwesh Suman, and Vijay Agneeswaran, WalmartLabs
Our data scientists are adept at using machine learning algorithms and building models from them, and they are at ease doing so on their local machines. But when it comes to building the same model on the platform, they find it challenging and need assistance from the platform team. Based on survey results, the major challenge was platform complexity, but it is hard to deduce actionable items or accurate details to make the system simpler. The complexity feedback was very generic, so we decided to break it down into two logical challenges: education and training, and simplicity of the platform. We have developed a system, which we call the Analyzer, to surface these two challenges in our platform. In this paper, we explain how it was built and its impact on the evolution of our machine learning platform. Our work aims to address these challenges and provide guidelines on how to empower a machine learning platform team to understand data scientists' bottlenecks in building models.
Runway - Model Lifecycle Management at Netflix
Eugen Cepoi and Liping Peng, Netflix
In this talk, we are going to present Runway, Netflix's model lifecycle management system. A plethora of machine learning models drive Netflix personalization, and Runway is responsible for managing all of those models.
Eugen Cepoi, Netflix
Eugen is a Senior Software Engineer on Netflix's Personalization Infrastructure team. He has been working for the past couple of years on building infrastructure for ML applied to Netflix's product personalization.
Liping Peng, Netflix
Liping is a Senior Software Engineer on Netflix's Personalization Infrastructure team. She has been working for the past couple of years on building infrastructure for ML applied to Netflix's product personalization.
Thursday, July 30
9:00 am–10:30 am
Session 3: Features, Explainability, and Analytics
As production ML is used in more industries, businesses need to understand how the ML pipelines intersect with customer concerns such as data management, trust, and privacy. At the technical level, how features are built, evaluated, and managed is critical, as is the ability to monitor and explain ML in production.
In this session, four presentations cover topics of ML explainability, reproducibility, and feature management in production. Learn what it means to have explainable models in production, how to track, manage, and reproduce pipelines, and how to evaluate new ML pipelines!
Join the discussion at the Session 3 Slack channel.
Detecting Feature Eligibility Illusions in Enterprise AI Autopilots
Fabio Casati, Veeru Metha, Gopal Sarda, Sagar Davasam, and Kannan Govindarajan, ServiceNow
SaaS enterprise workflow companies, such as Salesforce and ServiceNow, facilitate AI adoption by making it easy for customers to train AI models on top of workflow data, once they know the problem they want to solve and how to formulate it. However, as we experience over and over, it is very hard for customers to reach that kind of knowledge for their processes, as it requires awareness of both the business and operational side of the process and of what AI could do on each with the specific data available. The challenge we address is how to take customers to that stage, and in this paper we focus on a specific aspect of that challenge: identifying which "useful inferences" AI could make and which process attributes can be leveraged as predictors, based on the data available for that customer.
Time Travel and Provenance for Machine Learning Pipelines
Alexandru A. Ormenisan, KTH - Royal Institute of Technology; Moritz Meister, Fabio Buso, and Robin Andersson, Logical Clocks AB; Seif Haridi and Jim Dowling, KTH - Royal Institute of Technology
Machine learning pipelines have become the de facto paradigm for productionizing machine learning applications, as they clearly abstract the processing steps involved in transforming raw data into the engineered features that are then used to train models. In this paper, we use a bottom-up method for capturing provenance information regarding the processing steps and artifacts produced in ML pipelines. Our approach is based on replacing traditional intrusive hooks in application code (to capture ML pipeline events) with standardized change-data-capture support in the systems involved in ML pipelines: the distributed file system, feature store, resource manager, and applications themselves. In particular, we leverage data versioning and time-travel capabilities in our feature store to show how provenance can enable model reproducibility and debugging.
An Experimentation and Analytics Framework for Large-Scale AI Operations Platforms
Thomas Rausch, TU Wien; Waldemar Hummer and Vinod Muthusamy, IBM Research AI
This paper presents a trace-driven experimentation and analytics framework that allows researchers and engineers to devise and evaluate operational strategies for large-scale AI workflow systems. Analytics data from a production-grade AI platform developed at IBM are used to build a comprehensive system and simulation model. Synthetic traces are made available for ad-hoc exploration as well as statistical analysis of experiments to test and examine pipeline scheduling, cluster resource allocation, or similar operational mechanisms.
Challenges Towards Production-Ready Explainable Machine Learning
Lisa Veiber, Kevin Allix, Yusuf Arslan, Tegawendé F. Bissyandé, and Jacques Klein, SnT – Univ. of Luxembourg
Machine Learning (ML) is increasingly prominent in organizations. While these algorithms can provide near-perfect accuracy, their decision-making process remains opaque. In a context of accelerating regulation of Artificial Intelligence (AI) and deepening user awareness, explainability has become a priority, notably in critical healthcare and financial environments. The various frameworks developed often overlook their integration into operational applications, as we discovered with our industrial partner. In this paper, we present explainability in ML and its relevance to our industrial partner. We then discuss the main challenges we have faced in integrating explainability frameworks in production. Finally, we provide recommendations given those challenges.
Friday, July 31
9:00 am–10:30 am
Session 4: Algorithms
Why are we covering new algorithms in a Production ML conference? Simple—new ML techniques emerge daily, and in our field, new innovations make it into production reality in record time!
This session covers new algorithmic innovations, from scalable AutoML with Ray to real-time incremental learning, in a practical and production context. From LinkedIn, Intel, and the University of California, Merced, the talks cover how to improve ML scale, iteration, and speed; how to increase automation and reduce reliance on humans, thereby deploying new models faster; how to make ML adaptive with incremental learning; and more!
Join the discussion at the Session 4 Slack channel.
RIANN: Real-time Incremental Learning with Approximate Nearest Neighbor on Mobile Devices
Jiawen Liu and Zhen Xie, University of California, Merced; Dimitrios Nikolopoulos, Virginia Tech; Dong Li, University of California, Merced
Approximate nearest neighbor (ANN) algorithms are the foundation for many applications on mobile devices. Real-time incremental learning with ANN on mobile devices is emerging. However, incremental learning with current ANN algorithms on mobile devices is difficult, because data is dynamically and incrementally generated on mobile devices, and as a result it is difficult to meet the high timing and recall requirements on indexing and search. Meeting the high timing requirements is critical on mobile devices because of short user response times and limited battery life.
We introduce RIANN, an indexing and search system for graph-based ANN on mobile devices. By constructing the ANN with dynamic construction properties, RIANN enables high flexibility for ANN construction to meet the high timing and recall requirements of incremental learning. To select an optimal ANN construction property, RIANN incorporates a statistical prediction model. RIANN further offers a novel analytical performance model to avoid runtime overhead and interaction with mobile devices. In our experiments, RIANN significantly outperforms the state-of-the-art ANN (2.42x speedup) on a Samsung S9 mobile phone without compromising search time or recall. Also, for incrementally indexing 100 batches of data, the state-of-the-art ANN satisfies 55.33% of batches on average while RIANN can satisfy 96.67% with minimal impact on recall.
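For context, a minimal sketch of incremental indexing and search with a generic graph-based ANN library (hnswlib) is shown below; it illustrates the kind of workload the paper targets, not RIANN itself, and the dimensions and batch sizes are arbitrary.

import numpy as np
import hnswlib

dim = 64
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=10_000, ef_construction=200, M=16)

# Data arrives in batches and is indexed incrementally.
for _ in range(10):
    batch = np.random.rand(100, dim).astype(np.float32)
    index.add_items(batch)

index.set_ef(50)  # search-time accuracy/latency knob
labels, distances = index.knn_query(np.random.rand(1, dim).astype(np.float32), k=5)
print(labels, distances)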
Rise of the Machines: Removing the Human-in-the-Loop
Viral Gupta and Yunbo Ouyang, LinkedIn
Most large-scale online recommender systems, such as notifications recommendation, newsfeed ranking, people recommendations, and job recommendations, often have multiple utilities or metrics that need to be simultaneously optimized. The machine learning models that are trained to optimize a single utility are combined through parameters to generate the final ranking function, and these combination parameters drive business metrics. Finding the right choice of parameters is often done through online A/B experimentation, which can be incredibly complex and time-consuming, especially considering the non-linear effects of these parameters on the metrics of interest. In this talk we will present how we built a generic solution to solve this problem at scale.
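As a simplified illustration of the setting (not LinkedIn's actual system), the final ranking score is often a parameterized combination of per-utility model scores, and those combination weights are what gets searched over; the utilities and weights below are made up.

import numpy as np

def final_score(scores, weights):
    # Linear combination of per-utility model scores; the weights are the
    # combination parameters normally tuned via A/B tests or, as in this talk,
    # automated optimization.
    return sum(weights[name] * s for name, s in scores.items())

scores = {
    "p_click": np.array([0.20, 0.05, 0.60]),        # per-item utility model outputs
    "p_unsubscribe": np.array([0.01, 0.10, 0.02]),
}
weights = {"p_click": 1.0, "p_unsubscribe": -5.0}    # candidate combination parameters
ranking = np.argsort(-final_score(scores, weights))  # rank items by combined score
print(ranking)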
Viral Gupta, LinkedIn
Viral Gupta works as a Relevance Tech Lead for near real-time notifications recommendation on the LinkedIn platform. He works on several problems, including efficiently onboarding new notifications onto the Relevance platform and improving the overall quality of recommendations by incorporating multi-objective optimization. He is one of the key players who designed and implemented the scalable hyper-parameter optimization library at LinkedIn. He spends time consulting several teams at LinkedIn on how the parameter-search needs of their specific use cases can be formulated as multi-objective optimization problems and solved using the hyper-parameter optimization library.
Yunbo Ouyang, LinkedIn
Dr. Yunbo Ouyang is a senior software engineer on LinkedIn's AI Foundations team with expertise in automatic hyperparameter tuning, which he works on extensively. He is one of the key players building LinkedIn's offline and online hyperparameter tuning libraries. He obtained his Ph.D. in Statistics from the University of Illinois at Urbana-Champaign. He has published papers in top conferences such as the SIAM International Conference on Data Mining and has served as a reviewer for multiple top conferences such as NeurIPS, KDD, and AAAI. He has taught and TAed multiple advanced statistics and machine learning courses at UIUC.
Cluster Serving: Distributed Model Inference using Big Data Streaming in Analytics Zoo
Jiaming Song, Dongjie Shi, Qiyuan Gong, Lei Xia, and Jason Dai, Intel
As deep learning projects evolve from experimentation to production, there is increasing demand to deploy deep learning models for large-scale, real-time distributed inference. While many tools are available for relevant tasks (such as model optimization, serving, cluster scheduling, and workflow management), it is still a challenging process for many deep learning engineers and scientists to develop and deploy distributed inference workflows that can scale out to large clusters in a transparent fashion.
To address this challenge, we have developed Cluster Serving, an automated and distributed serving solution that supports a wide range of deep learning models (such as TensorFlow, PyTorch, Caffe, BigDL, and OpenVINO). It provides simple publish-subscribe (pub/sub) and REST APIs, through which users can easily send their inference requests to the input queue using simple Python or HTTP APIs. Cluster Serving will then automatically manage the scale-out and real-time model inference across a large cluster, using distributed Big Data streaming frameworks (such as Apache Spark Streaming and Apache Flink).
In this talk, we will present the architecture design of Cluster Serving and discuss the underlying design patterns and trade-offs of deploying deep learning models on distributed Big Data streaming frameworks in production. In addition, we will share real-world experience and "war stories" from users who have adopted Cluster Serving to develop and deploy distributed inference workflows.
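A minimal sketch of what submitting an inference request over the REST API might look like is shown below; the endpoint, port, and payload shape are assumptions for illustration, not Cluster Serving's documented interface.

import base64
import requests

# Hypothetical frontend URL and payload shape, illustrating the
# enqueue-request / read-result pattern described in the talk.
with open("cat.jpg", "rb") as f:
    payload = {"instances": [{"image": base64.b64encode(f.read()).decode()}]}

resp = requests.post("http://cluster-serving-frontend:10020/predict",
                     json=payload, timeout=5)
print(resp.json())  # prediction produced by the streaming backend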
Jiaming Song, Intel
Jiaming Song is a Machine Learning engineer at Intel with over 2 years of experience in machine learning and big data. He is a key contributor to the open source Big Data + AI project Analytics Zoo and is now focusing on the development of Cluster Serving.
Jason Dai, Intel
Jason Dai is a senior principal engineer and CTO of Big Data Technologies at Intel, responsible for leading the global engineering teams (in both Silicon Valley and Shanghai) on the development of advanced data analytics and machine learning. He is the creator of BigDL and Analytics Zoo, a founding committer and PMC member of Apache Spark, and a mentor of Apache MXNet. For more details, please see https://jason-dai.github.io/.
Scalable AutoML for Time Series Forecasting using Ray
Shengsheng Huang and Jason Dai, Intel
Time series forecasting is widely used in real-world applications, such as network quality analysis in telcos, log analysis for data center operations, and predictive maintenance for high-value equipment. Recently there's a trend to apply machine learning and deep learning methods to such problems, and there's evidence that they can outperform traditional methods (such as autoregression and exponential smoothing) in several well-known competitions and real-world use cases.
However, building machine learning applications for time series forecasting can be a laborious and knowledge-intensive process. In order to provide an easy-to-use time series forecasting toolkit, we have applied Automated Machine Learning (AutoML) to time series forecasting. The toolkit is built on top of Ray (a distributed framework for emerging AI applications open-sourced by UC Berkeley RISELab) so as to automate the process of feature generation and selection, model selection, and hyper-parameter tuning in a distributed fashion. In this talk we will share how we built the AutoML toolkit for time series forecasting, as well as real-world experience and takeaways from early users.
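A minimal sketch of distributed hyper-parameter search with Ray Tune is shown below; the search space and the toy objective are placeholders, not the toolkit's actual forecasting pipeline.

from ray import tune

def train_forecaster(config):
    # Placeholder objective; a real trial would generate features, fit a
    # forecasting model, and evaluate it on a holdout window.
    loss = (config["lr"] - 0.01) ** 2 + config["num_layers"] * 0.001
    tune.report(loss=loss)

analysis = tune.run(
    train_forecaster,
    config={
        "lr": tune.loguniform(1e-4, 1e-1),
        "num_layers": tune.choice([1, 2, 3]),
    },
    num_samples=20,  # trials are scheduled in parallel across the Ray cluster
)
print(analysis.get_best_config(metric="loss", mode="min"))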
Shengsheng Huang, Intel
Shengsheng (Shane) Huang is a senior software architect of Big Data and AI technologies at Intel. She is an Apache Spark committer and PMC member, and is a key contributor to the open source Big Data + AI projects Analytics Zoo (https://github.com/intel-analytics/analytics-zoo) and BigDL (https://github.com/intel-analytics/BigDL). At Intel, she leads the development of algorithms and customer applications focusing on NLP, AutoML, and time series analysis.
Jason Dai, Intel
Jason Dai is a senior principal engineer and CTO of Big Data Technologies at Intel, responsible for leading the global engineering teams (in both Silicon Valley and Shanghai) on the development of advanced data analytics and machine learning. He is the creator of BigDL and Analytics Zoo, a founding committer and PMC member of Apache Spark, and a mentor of Apache MXNet. For more details, please see https://jason-dai.github.io/.
Tuesday, August 4
9:00 am–10:30 am
Session 5: Model Deployment Strategies
It doesn't matter how good your data science and machine learning engineering teams are if you can't run your models in production! Whether you're working with personalized customer data, real-time sensing, or search and ranking, rapidly changing features and combinatorial complexity often rule out just computing everything offline.
In this session, we have presenters from Intuit, Netflix, Adobe, the Air Force Research Laboratory, and Clarkson University on running and managing models in production. Topics include techniques for directly running PyTorch models as RESTful endpoints, trade-offs between model execution strategies and intermediate formats such as ONNX, an open source system enabling data scientists and others to bring their models to production without deep systems experience, and techniques for managing and running production models at scale.
Join the discussion at the Session 5 Slack channel.
FlexServe: Deployment of PyTorch Models as Flexible REST Endpoints
Edward Verenich, Clarkson University; Alvaro Velasquez, Air Force Research Laboratory; M. G. Sarwar Murshed and Faraz Hussain, Clarkson University
The integration of artificial intelligence capabilities into modern software systems is increasingly being simplified through the use of cloud-based machine learning services and representational state transfer (REST) architecture design. However, insufficient information regarding underlying model provenance and the lack of control over model evolution serve as an impediment to the more widespread adoption of these services in operational environments with strict security requirements. Furthermore, although tools such as TensorFlow Serving allow models to be deployed as RESTful endpoints, they require the error-prone process of converting PyTorch models into the static computational graphs needed by TensorFlow. To enable rapid deployment of PyTorch models without the need for intermediate transformations, we have developed FlexServe, a simple library for deploying multi-model ensembles with flexible batching.
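A minimal sketch of the general pattern, serving a small PyTorch model ensemble behind a REST endpoint with Flask, is shown below; it illustrates the idea rather than FlexServe's actual code, and the route and payload shape are assumptions.

import torch
import torchvision
from flask import Flask, request, jsonify

app = Flask(__name__)
models = {  # a small ensemble of pretrained PyTorch models
    "resnet18": torchvision.models.resnet18(pretrained=True).eval(),
    "mobilenet": torchvision.models.mobilenet_v2(pretrained=True).eval(),
}

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON batch of flattened 3x224x224 image tensors (assumed shape).
    batch = torch.tensor(request.json["instances"]).reshape(-1, 3, 224, 224)
    with torch.no_grad():
        preds = {name: m(batch).argmax(dim=1).tolist() for name, m in models.items()}
    return jsonify(preds)

if __name__ == "__main__":
    app.run(port=8080)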
Managing ML Models @ Scale - Intuit’s ML Platform
Srivathsan Canchi and Tobias Wenzel, Intuit Inc.
At Intuit, machine learning models are derived from huge, sensitive data sets that are continuously evolving, which in turn requires continuous model training and tuning with a high level of security and compliance. Intuit's Machine Learning Platform provides model lifecycle management capabilities that are scalable and secure, using GitOps, SageMaker, Kubernetes, and Argo Workflows.
In this talk, we'll go over the model management problem statement at Intuit, contrast data science/MLE needs with Intuit's enterprise needs, and introduce our model management interface and self-serve capabilities. The talk will cover aspects of our platform such as feature management and processing, bill-backs, collaboration, and the separation of operational concerns between platform and model. These capabilities have enabled model publishing velocity to increase by over 200%, and we will illustrate how we got there.
Srivathsan Canchi, Intuit Inc.
Srivathsan Canchi leads the machine learning platform engineering team at Intuit. The ML platform includes real-time distributed featurization, scoring, and feedback loops. He has a breadth of experience building high-scale, mission-critical platforms. Srivathsan also has extensive experience with K8s at Intuit and previously at eBay, where his team was responsible for building a PaaS on top of K8s and OpenStack.
Tobias Wenzel, Intuit Inc.
Tobias Wenzel is a Software Engineer for the Intuit Machine Learning Platform in Mountain View, California. He has been working on the platform since its inception in 2016 and has helped design and build it from the ground up. In his job he has focused on operational excellence of the platform and bringing it successfully through Intuit's seasonal business. In addition, he is passionate about continuously expanding the platform with the latest technologies.
Edge Inference on Unknown Models at Adobe Target
Georgiana Copil, Iulian Radu, and Akash Maharaj, Adobe
A customer's data scientist says: "I know my business much better; just give me the data and I'll create the model you should run." Making this a reality in our production systems comes with many challenges, which we will discuss in this talk.
In today's world, more and more companies build their own data science/ML departments. When their custom models need to run on different systems, the models must be converted to other frameworks or to a format interpretable by a representation standard for machine learning models (e.g., ONNX). In this talk we discuss challenges and approaches to using such models in real-time, low-latency systems. We discuss the limitations of existing frameworks, scoring runtimes, and model representations, and the solutions that exist to overcome them. We also discuss how these methods can be used today to build a solution that provides real-time scoring for high-throughput workloads.
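As a small sketch of the scoring side, assuming a customer model already exported to ONNX, inference with onnxruntime looks roughly like this; the model file and input shape are placeholders.

import numpy as np
import onnxruntime as ort

# "model.onnx" and the 1x16 feature vector are placeholders for a
# customer-supplied model and its expected input.
sess = ort.InferenceSession("model.onnx")
input_name = sess.get_inputs()[0].name
features = np.random.rand(1, 16).astype(np.float32)
outputs = sess.run(None, {input_name: features})
print(outputs[0])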
Georgiana Copil, Adobe
Georgiana Copil is a computer scientist working on Adobe Target, the experience selection engine for Adobe's Experience Cloud. Before that, she worked as a consultant for SAP on innovation projects and as a university assistant at the Vienna University of Technology (TU Wien) on research projects focused on applying machine learning models to cloud elasticity. With a PhD in informatics from TU Wien, she is currently focusing on topics at the intersection of machine learning, distributed systems, and operations.
Iulian Radu, Adobe
Iulian Radu is a senior computer scientist working on Adobe Target. In the past he was part of the big data processing team for Adobe Audience Manager, where he experimented with running machine learning and data processing jobs at scale. Before that, he wrote compilers, applied machine learning algorithms to game AI, and worked with image processing and object recognition algorithms.
Akash Maharaj, Adobe
Akash Maharaj is a senior data scientist working on Adobe Target, the experience selection engine for Adobe's Experience Cloud. For the past three years, he has worked on improving the quality and speed of several recommendation and personalization algorithms that power Adobe Target's B2B solutions. Before that, he completed his PhD in physics at Stanford.
More Data Science, Less Engineering: A Netflix Original
Savin Goyal, Netflix
Data science usage at Netflix goes well beyond our eponymous recommendation systems. It touches almost all aspects of our business, from optimizing content delivery to making our infrastructure more resilient to failures and beyond. Our unique culture affords our data scientists extraordinary freedom of choice in ML tools and libraries, which results in an ever-expanding set of interesting problem statements and a diverse set of ML approaches to tackle them. Our data scientists, at the same time, are expected to build, deploy, and operate complex ML workflows autonomously without needing to be significantly experienced with systems or data engineering.
In this talk, we discuss the infrastructure available to our data scientists, focused on providing an improved development and deployment experience for ML workflows. We focus on Metaflow (now open source at metaflow.org), our ML framework, which offers delightful abstractions to manage the model lifecycle end-to-end, and on how our culture and focus on human-centric design affect our data scientists' velocity.
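For a flavor of the abstractions involved, here is a minimal Metaflow flow; the steps are placeholders rather than a real Netflix workflow, and it is run locally with "python train_flow.py run".

from metaflow import FlowSpec, step

class TrainFlow(FlowSpec):

    @step
    def start(self):
        self.alpha = 0.01          # artifacts are versioned and persisted per run
        self.next(self.train)

    @step
    def train(self):
        # Placeholder for actual model training.
        self.score = 1.0 - self.alpha
        self.next(self.end)

    @step
    def end(self):
        print("score:", self.score)

if __name__ == "__main__":
    TrainFlow()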
Wednesday, August 5
9:00 am–10:30 am
Session 6: Applications and Experiences
Operational ML is a wide-ranging space where many technologies, platforms, and approaches are available. Some challenges cannot be seen until you actually try things out. Early adopters and their solutions provide great design patterns for others to follow and learn from!
This session contains four such examples of ML Ops in real life, with talks from Mercari, Google, VMware, and NVIDIA. You will learn how real-world ML Ops works at scale, in applications ranging from e-commerce to self-driving cars, along with the experiences, best practices, and solutions involved!
Join the discussion at the Session 6 Slack channel.
Auto Content Moderation in C2C e-Commerce
Shunya Ueta, Suganprabu Nagaraja, and Mizuki Sango, Mercari, Inc.
Consumer-to-consumer (C2C) e-commerce is a large and growing industry with millions of monthly active users. In this paper, we propose auto content moderation for C2C e-commerce to moderate items using machine learning (ML). We will also discuss practical knowledge gained from our auto content moderation system. The system has been deployed to production at Mercari since late 2017 and has significantly reduced the operational cost of detecting items violating our policies. The system has increased coverage by 554.8% over a rule-based approach.
Inside NVIDIA’s AI Infrastructure for Self-driving Cars
Clement Farabet and Nicolas Koumchatzky, NVIDIA
We'll discuss Project MagLev, NVIDIA's internal end-to-end AI platform for developing its self-driving car software, DRIVE. We'll explore the platform that supports continuous data ingest from multiple cars producing terabytes of data per hour. We'll also cover how the platform enables autonomous-AI designers to iterate training of new neural network designs across thousands of GPU systems and validate the behavior of these designs over multi-petabyte-scale data sets. We will talk about our overall architecture for everything from data center deployment to AI pipeline automation, as well as large-scale AI dataset management, AI training, and testing.
Clement Farabet, NVIDIA
Clement Farabet is vice president of AI infrastructure at NVIDIA. He received a Ph.D. from Université Paris-Est in 2013, co-advised by Laurent Najman and Yann LeCun. His thesis focused on real-time image understanding, introducing multi-scale convolutional neural networks and a custom hardware architecture for deep learning. Clement co-founded Madbits, a startup working on web-scale image understanding that was sold to Twitter in 2014. He is also a co-founder of Twitter Cortex, a team focused on building Twitter's deep learning platform for recommendations, search, spam, NSFW content, and ads.
Nicolas Koumchatzky, NVIDIA
Nicolas Koumchatzky is a Director of AI Infrastructure at NVIDIA. He currently manages an organization building a cloud AI platform to power the development of autonomous vehicles. Previously, he managed Twitter's centralized AI platform team, Twitter Cortex.
Challenges and Experiences with MLOps for Performance Diagnostics in Hybrid-Cloud Enterprise Software Deployments
Amitabha Banerjee, Chien-Chia Chen, Chien-Chun Hung, Xiaobo Huang, Yifan Wang, and Razvan Chevesaran, VMware Inc
This paper presents how VMware addressed the following challenges in operationalizing our ML-based performance diagnostics solution in enterprise hybrid-cloud environments: data governance, model serving and deployment, dealing with system performance drifts, selecting model features, centralized model training pipeline, setting the appropriate alarm threshold, and explainability. We also share the lessons and experiences we learned over the past four years in deploying ML operations at scale for enterprise customers.
Automating Operations with ML
Todd Underwood and Steven Ross, Google
Engineers have been attracted to the idea of using machine learning to control their applications and infrastructure. Unfortunately, the majority of proposed uses of ML for production engineering are unsuited for their stated purpose. They generally fail to account for several structural limitations of the proposed application, including error rates, the cost of failure, and, most commonly, an insufficient number of labeled examples.
We will review the commonly proposed applications of machine learning to production control, including anomaly detection, monitoring/alerting, capacity prediction, security, and resource scaling. For each, we will draw on experience to demonstrate the limitations of ML modeling techniques. We will identify the one application with the best results.
We will end with specific recommendations for how organizations can get ready to take advantage of ML for their production operations in the future.
Todd Underwood, Google
Todd Underwood is a lead Machine Learning for Site Reliability Engineering Director at Google and is a Site Lead for Google's Pittsburgh office. ML SRE teams build and scale internal and external ML services and are critical to almost every product area at Google. Todd was in charge of operations, security, and peering for Renesys's Internet intelligence services, which are now part of Oracle's cloud service. Before that, Todd was Chief Technology Officer of Oso Grande in New Mexico. Todd has a BA in philosophy from Columbia University and an MS in computer science from the University of New Mexico. He is interested in how to make computers and people work much, much better together.
Steven Ross, Google
Steven Ross is a tech lead in Site Reliability Engineering for Google in Pittsburgh, and has worked on Machine Learning at Google since Pittsburgh Pattern Recognition was acquired by Google in 2011. Before that he worked as a Software Engineer for Dart Communications, Fishtail Design Automation, and then Pittsburgh Pattern Recognition until Google acquired it. Steven has a B.S. from Carnegie Mellon University (1999) and an M.S. in Electrical and Computer Engineering from Northwestern University (2000). He is interested in mass-producing Machine Learning models.
Thursday, August 6
9:00 am–10:30 am
Session 7: Model Training
Both the complexity of machine learning models and the raw amount of data on which they are trained are constantly increasing. In this session, we will have presentations ranging from managing the raw Spark and Kubernetes infrastructure used to train models at Intuit, to Bayesian optimization approaches from SigOpt for continuously improving your model training efficiency over time, to insights gleaned at Google from fifteen years of model training and model execution outages.
Join the discussion at the Session 7 Slack channel.
How ML Breaks: A Decade of Outages for One Large ML Pipeline
Daniel Papasian and Todd Underwood, Google
Reliable management of continuous or periodic machine learning pipelines at large scale presents significant operational challenges. Drawing on experience from almost 15 years of operating some of the largest ML pipelines, we examine the characteristics of one of the largest and oldest continuous pipelines at Google. We look at actual outages experienced and try to understand what caused them.
We examine failures in detail, categorizing them as ML vs. non-ML and distributed vs. non-distributed. We demonstrate that a majority of the outages are not ML-centric and are more related to the distributed character of the pipeline.
Daniel Papasian, Google
Daniel Papasian is a Staff Software Engineer at Google, working in Site Reliability Engineering. He has spent ten years at Google working on large scale data processing and machine learning systems, in both Site Reliability Engineering roles and as a Software Engineer in the Ads Quality organization. Before Google, he worked as a Network System Engineer for Carnegie Mellon's Computing Services, writing software to automate network reconfiguration. Prior to that, he was staff for the Chronicle of Higher Education, in charge of all things technical for their website, chronicle.com. He holds a BS from Carnegie Mellon University with majors in the decision sciences and a minor in engineering.
Todd Underwood, Google
Todd Underwood is a lead Machine Learning for Site Reliability Engineering Director at Google and is a Site Lead for Google's Pittsburgh office. ML SRE teams build and scale internal and external ML services and are critical to almost every product area at Google. Todd was in charge of operations, security, and peering for Renesys's Internet intelligence services, which are now part of Oracle's cloud service. Before that, Todd was Chief Technology Officer of Oso Grande in New Mexico. Todd has a BA in philosophy from Columbia University and an MS in computer science from the University of New Mexico. He is interested in how to make computers and people work much, much better together.
SPOK - Managing ML/Big Data Spark Workloads at scale on Kubernetes
Nagaraj Janardhana and Mike Arov, Intuit
At Intuit, customer data sets are growing exponentially with the growth of the business and the capabilities offered. We built an elastic platform, SpoK (Spark on Kubernetes), to run Jupyter notebooks, data processing, feature engineering, distributed training jobs, batch model inference, and model evaluation workflows on Spark, using Kubernetes as the resource manager.
With the whole organization moving to Kubernetes for running its services workload, we saw an opportunity to run ML workloads on Kubernetes as well: it simplifies management of cluster operations, brings the goodness of containers to data processing, provides scalable infrastructure with cost and efficiency improvements, and lets us reuse the CI/CD and security certification tooling already built. This migration from EMR/YARN to Kubernetes has improved developer productivity by reducing the time to deploy from more than 7 days to less than a day, provided cost improvements in the range of 25-30%, and eased cluster operations management, as all types of workloads share the same cluster.
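As a rough sketch of the idea, a PySpark job can target Kubernetes as its resource manager directly; the API server address, container image, and paths below are placeholders, and real deployments typically go through spark-submit and the surrounding platform tooling.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.example.com:6443")                   # placeholder API server
    .appName("feature-engineering")
    .config("spark.kubernetes.container.image", "myrepo/spark-py:3.0.0")   # placeholder image
    .config("spark.kubernetes.namespace", "ml-workloads")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

events = spark.read.parquet("s3a://bucket/events/")                        # placeholder path
daily_counts = events.groupBy("user_id", "event_date").count()
daily_counts.write.mode("overwrite").parquet("s3a://bucket/features/daily_counts/")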
Nagaraj Janardhana, Intuit
Nagaraj is a Principal Engineer at Intuit in Mountain View, responsible for designing and developing ML and featurization platforms. In the past he has been involved with developing data ingestion and processing platforms and identity and subscription platforms at Intuit. He has contributed to the Spinnaker open source project.
Mikhail Arov, Intuit
Mike is a Staff ML Engineer at Intuit in Mountain View. He was responsible for the development and deployment of many ML models for cash flow forecasting, mileage and expense classification, and marketing propensity. A big advocate for K8s and Argo in ML, he pioneered the use of Spark on Kubernetes for the Intuit Data Platform and especially for ML.
Addressing Some of the Challenges When Optimizing Long-to-Train Models
Tobias Andreasen, SigOpt
As machine learning models become more complex and require longer training cycles, optimizing and maximizing performance can sometimes be seen as an intractable problem; this tends to leave a lot of performance unrealized.
The challenge is often that the most common methods for hyperparameter optimization are either sample-efficient or able to parallelize efficiently, but not both. This leads to a choice between a very long optimization process with good performance and a very short, efficient optimization process with suboptimal performance.
Further, another challenge is justifying the cost of optimizing these often long-to-train models, because in most situations this has to be done on a per-model basis, with none of the information gained being leveraged in the future.
This talk outlines ways in which these challenges can be addressed when thinking about bringing optimally performing models into production.
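A generic Bayesian-optimization loop (sketched here with scikit-optimize, not SigOpt's API) illustrates the trade-off: asking for several suggestions per round lets trials run in parallel, at some cost in sample efficiency. The search space and toy objective are placeholders.

from skopt import Optimizer

opt = Optimizer(dimensions=[(1e-4, 1e-1, "log-uniform"),  # learning rate
                            (16, 512)])                    # batch size

def train_and_eval(lr, batch_size):
    # Placeholder for a long-running training job returning validation loss.
    return (lr - 0.01) ** 2 + abs(batch_size - 128) / 1000.0

for _ in range(5):                        # 5 sequential rounds
    suggestions = opt.ask(n_points=4)     # 4 trials evaluated in parallel per round
    losses = [train_and_eval(lr, bs) for lr, bs in suggestions]
    opt.tell(suggestions, losses)

best_loss, best_params = min(zip(opt.yi, opt.Xi))
print(best_loss, best_params)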
Tobias Andreasen, SigOpt
Tobias Andreasen is a Machine Learning Specialist at SigOpt, a San Francisco-based startup with the vision to "accelerate and amplify the impact of modelers everywhere" through the development of software for optimization and experimentation. On a day-to-day basis, Tobias works with a wide range of companies on the best approaches to optimizing their machine learning models so they meet their business constraints and requirements.
Tobias holds master's and bachelor's degrees in applied mathematics from the Technical University of Denmark.
Friday, August 7
9:00 am–10:30 am
Session 8: Bias, Ethics, Privacy
It isn't enough to get great modeling performance and run your pipelines well; with great modeling power comes great responsibility. Who are our models leaving out? Who are they harming? How are we as companies protecting the privacy of our members and customers? Even if the data is protected, can private information leak out through the models themselves? Vincent Pham and Nahid Ghalaty from Capital One discuss various attacks (and defenses) against machine learning models that infer the training data and company secrets with access only to the model or its hyperparameters. Yan Yan from Facebook will discuss their approach to standardized, company-wide ownership of ML assets, which is critical to enforcing data privacy.
Join the discussion at the Session 8 Slack channel.
"SECRETS ARE LIES, SHARING IS CARING, PRIVACY IS THEFT." - A Dive into Privacy Preserving Machine Learning
Nahid Farhady Ghalaty and Vincent Pham, Capital One
Machine learning, coupled with new cloud and serverless technologies, has enabled organizations to leverage big data analytics to create predictive and recommendation platforms at a larger scale for different applications. However, an often overlooked danger with all this exciting technology is the privacy of the data and privacy attacks on machine learning models.
In this talk, we identify and explore different points of leakage in machine learning models that can be exploited for privacy attacks, such as attacks on training data or model inversion attacks. Using this information about the attacks, we also propose methods for model developers to protect their data and models. These methods camouflage the main operations and computations of machine learning algorithms by injecting noise and dummy instructions in between. The key takeaway for the audience of this talk is to be able to learn and identify threats to the models they develop, along with a demo of an attack and defense on a financial application.
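To make the threat concrete, here is a toy confidence-thresholding membership-inference sketch (not the speakers' method): an overfit model tends to be more confident on records it was trained on, and that gap leaks information about the training data.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
X_out = rng.normal(size=(200, 10))  # records never seen in training

model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

def looks_like_member(x, threshold=0.95):
    # If the model is over-confident on x, guess that x was in the training set.
    return model.predict_proba(x.reshape(1, -1)).max() >= threshold

member_rate = np.mean([looks_like_member(x) for x in X_train])
nonmember_rate = np.mean([looks_like_member(x) for x in X_out])
print(member_rate, nonmember_rate)  # a large gap indicates training-data leakage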
Nahid Farhady Ghalaty, Capital One
Nahid Farhady is a Machine Learning Engineer on the CyberML team within Capital One. She obtained her PhD in Electrical and Computer Engineering from Virginia Tech in 2016. Her research interests are embedded systems security, fault attacks and side-channel attacks, and cryptography. Her research has been published in several peer-reviewed conferences and journals such as DATE, FDTC, and IEEE ESL.
Vincent Pham, Capital One
Vincent Pham is also a Machine Learning Engineer on the CyberML team. He obtained his Master's in Analytics at the University of San Francisco. Over the past four years at Capital One, he has also worked in other machine learning domains such as generative adversarial networks, fraudulent merchant detection via label propagation, and reinforcement learning with AWS DeepRacer.
ML Artifacts Ownership Enforcement
Yan Yan, Facebook
This talk is about machine learning artifact ownership enforcement. Privacy is the first priority for machine learning, and establishing ownership of ML artifacts is the first step to ensuring it. The talk covers the challenges we faced and the solutions we built to enforce ML artifact ownership.
Yan Yan, Facebook
Yan Yan has been a production engineer at Facebook for 2+ years, focusing on solving Ads machine learning operational challenges with tooling and services. Before Facebook, Yan graduated from UCLA with a master's degree in computer science.