Reliability at Massive Scale: Lessons Learned at Facebook

Robert Johnson; Sanjeev Kumar

Abstract:

As the Facebook Web site and platform grow to an ever larger scale, one of the most difficult challenges is running reliably while constantly changing our product. Over the years we have developed a number of principles for avoiding large failures while making frequent, small changes to our system. These principles have allowed us to run with a low rate of serious incidents, but such incidents still occur. I'll walk through the details of a recent site outage to illustrate how these principles work and how things can go wrong when they aren't followed.

Robert Johnson, Director of Engineering, Facebook, Inc.

Sanjeev Kumar, Engineering Manager, Facebook, Inc.

BibTeX

@conference {267059,
author = {Robert Johnson and Sanjeev Kumar},
title = {Reliability at Massive Scale: Lessons Learned at Facebook},
year = {2010},
address = {San Jose, CA},
publisher = {USENIX Association},
month = nov
}
