Compute Engine Testing with Synthetic Data Generation

Monday, June 03, 2024 - 4:20 pm–4:40 pm

Jiangnan Cheng and Eric Liu, Meta

Abstract: 

At Meta, we have developed a new testing framework that uses privacy-safe, production-like synthetic data to detect regressions in compute engines, such as Presto, within the Meta Data Warehouse. In this talk, we will discuss the challenges we encountered and the solutions we implemented to operate this framework at scale. We will also highlight key features of our synthetic data generation process, including the addition of differential privacy, expanded column schema support, and improved scalability. Finally, we will discuss how Meta leverages this testing framework to increase test coverage, shorten the Presto release cycle, and prevent production regressions.

Jiangnan Cheng, Meta

Jiangnan Cheng is a research scientist on the Applied Privacy Technology Team at Meta. He is interested in developing privacy-enhancing technologies such as synthetic data generation and differential privacy. Before joining Meta, he received his PhD in Electrical and Computer Engineering from Cornell University.

Eric Liu, Meta

Eric Liu is a software engineer at Meta. He works on the Presto team, with a strong interest in improving and consolidating testing solutions across compute engines in the Data Warehouse. Before joining Meta, he worked as a Chief Engineer and TLM for ADP.

BibTeX
@conference {296341,
author = {Jiangnan Cheng and Eric Liu},
title = {Compute Engine Testing with Synthetic Data Generation},
year = {2024},
address = {Santa Clara, CA},
publisher = {USENIX Association},
month = jun
}