Five Pitfalls for Benchmarking Big Data Systems

Yanpei Chen; Gwen Shapira

Invited Talk

Wednesday, November 12, 2014 - 4:45pm-5:30pm

Yanpei Chen and Gwen Shapira, Cloudera, Inc.

Abstract:

Performance is an increasingly important attribute of Big Data systems as focus shifts from batch processing to real-time analysis and to consolidated multi-tenant systems. One of the little-understood challenges in scaling data systems is properly defining and measuring performance. The complexity, diversity, and scale of Big Data systems make this a difficult task, and we frequently encounter haphazard benchmarks that lead to bad technology choices, poor purchasing decisions, and suboptimal cluster operations. This talk draws on performance engineering and field services experience from a leading Big Data vendor. We will talk about the most common performance benchmarking pitfalls and share practical advice on how to avoid them with rigorous metrics and measurement methods.

Yanpei Chen, Cloudera Inc.

Yanpei Chen is a member of the Performance Engineering Team at Cloudera, where he works on internal and competitive performance measurement and optimization. His work touches upon multiple interconnected computation frameworks, including Cloudera Search, Cloudera Impala, Apache Hadoop, Apache HBase, and Apache Hive. He is the lead author of the Statistical Workload Injector for MapReduce (SWIM), an open source tool that allows someone to synthesize and replay MapReduce production workloads. SWIM has become a standard MapReduce performance measurement tool used to certify many Cloudera partners. He received his doctorate at the UC Berkeley AMP Lab, where he worked on performance-driven, large-scale system design and evaluation.

Gwen Shapira, Cloudera Inc.

Gwen Shapira is a Solutions Architect at Cloudera. She has 15 years of experience working with customers to design scalable data architectures, having worked as a data warehouse DBA, ETL developer, and senior consultant. She specializes in migrating data warehouses to Hadoop, integrating Hadoop with relational databases, building scalable data processing pipelines, and scaling complex data analysis algorithms.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX

@conference {209077,
author = {Yanpei Chen and Gwen Shapira},
title = {Five Pitfalls for Benchmarking Big Data Systems},
year = {2014},
address = {Seattle, WA},
publisher = {USENIX Association},
month = nov
}

