How to Not Get Paged: Managing On-Call to Reduce Outages

Fairfax Room

Half Day Afternoon
1:30 pm-5:00 pm
LISA16: Culture
Description: 

People think of “on call” as responding to a pager that beeps because of an outage. In this class, you will learn how to run an on-call system that improves uptime and reduces how often you are paged. We will start with a monitoring philosophy that prevents outages. Then we will discuss how to construct an on-call schedule, possibly in more detail than you've cared about before, but as a result it will be more fair and less stressful. We'll discuss how to conduct “fire drills” and “game day exercises” that create antifragile systems. Lastly, we'll discuss how to conduct a postmortem exercise that promotes better communication and prevents future problems.

Who should attend: 

Managers or sysadmins with on-call responsibility

Take back to work: 
  • Knowledge that makes being on call more fair and less stressful
  • Strategies for using monitoring to improve uptime and reliability
  • Team-training techniques such as "fire drills" and "game day exercises"
  • How to conduct better postmortems/learning retrospectives
Topics include: 
  • Why your monitoring strategy is broken and how to fix it
  • Building a more fair on-call schedule
  • Monitoring to detect outages vs. monitoring to improve reliability
  • Alert review strategies
  • Conducting “fire drills” and “game day exercises”
  • "Blameless postmortem documents"
Presentation Type: 
Training