Early Bird Registration Deadline: March 16, 2016
SREcon16 is SOLD OUT.
No walkup registrations will be accepted.
Venue:
Hyatt Regency Santa Clara
5101 Great America Pkwy
Santa Clara, CA 95054
Beyond Repair: Proactive Maintenance Work at Scale
Romain Komorn, Facebook
FBAR, Facebook's Auto Remediation tool, has enabled Facebook's production engineering teams to automate break/fix responses to many of the single-host events that occur daily, such as hardware failures or application daemon crashes, leaving engineers free to focus on larger, more complex, and more interesting problems.
To give engineers more time and opportunity for meaningful work, the team developed automated responses for the less frequent but larger-scale proactive work needed to keep the infrastructure healthy. This work takes different forms, including top-of-rack switch replacements, disruptive BIOS/firmware updates, and work on power supplies (primary or backup).
This talk uses a fictitious example of a set of racks undergoing maintenance to show how the automation FBAR provides for single-host repairs was expanded to cover proactive maintenance work. It explains the automated maintenance process, the API engineers implement to automate the work, and the way the system hands off to humans when automation does not apply or fails. It also covers a few small lessons learned along the way and explains why a simple approach covers most use cases with few drawbacks.
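The abstract does not describe FBAR's actual API, so the following is only an illustrative sketch of the general pattern it mentions: service owners implement a small interface, an automated process drives it across a rack, and anything automation cannot handle is handed to a human. Every name here (`MaintenanceHandler`, `drain`, `restore`, `run_rack_maintenance`, `LoggingHandler`) is hypothetical and not taken from FBAR.

```python
from abc import ABC, abstractmethod


class MaintenanceHandler(ABC):
    """Hypothetical interface a service owner might implement (not FBAR's API)."""

    @abstractmethod
    def drain(self, host: str) -> bool:
        """Prepare a host for disruptive work; return True on success."""

    @abstractmethod
    def restore(self, host: str) -> bool:
        """Return a host to service; return True on success."""


def _safe(step, target) -> bool:
    """Run one step, treating any exception as a failure."""
    try:
        return bool(step(target))
    except Exception:
        return False


def run_rack_maintenance(rack, hosts, handler, do_work):
    """Drain all hosts, do the work, restore them.

    Returns a list of (target, reason) pairs that need human attention.
    An empty list means the whole run was handled automatically.
    """
    escalations = []
    drained = []
    for host in hosts:
        if _safe(handler.drain, host):
            drained.append(host)
        else:
            escalations.append((host, "drain failed"))

    if escalations:
        # Assumed policy: never start disruptive work on a partially
        # drained rack. Undo what was drained and hand off to a human.
        for host in drained:
            if not _safe(handler.restore, host):
                escalations.append((host, "restore failed"))
        return escalations

    if not _safe(do_work, rack):
        # Assumed policy: leave hosts drained so a human can inspect the rack.
        escalations.append((rack, "maintenance work failed"))
        return escalations

    for host in hosts:
        if not _safe(handler.restore, host):
            escalations.append((host, "restore failed"))
    return escalations


class LoggingHandler(MaintenanceHandler):
    """Toy handler that records calls; hosts in `failing` refuse to drain."""

    def __init__(self, failing=()):
        self.failing = set(failing)
        self.log = []

    def drain(self, host):
        self.log.append(("drain", host))
        return host not in self.failing

    def restore(self, host):
        self.log.append(("restore", host))
        return True
```

Keeping the interface to two methods and treating every failure as "escalate to a human" is one way to read the abstract's point that a simple approach can cover most cases; the real system may divide responsibilities differently.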
Romain is a manager in Facebook's Production Engineering organization, working on the team that maintains Facebook's Auto Remediation tool (FBAR). He originally joined Facebook as part of the Site Reliability Operations team and has spent the last five years focused on safely automating operational tasks. Most recently, this has meant creating new tooling that allows the team to automate maintenance affecting multiple machines (and multiple racks) at a time. The team has spent the last two years refining the process and keeping automation simple while covering the majority of use cases.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.
author = {Romain Komorn},
title = {Beyond Repair: Proactive Maintenance Work at Scale},
year = {2016},
address = {Santa Clara, CA},
publisher = {USENIX Association},
month = apr
}