while (true) do; How hard can it be to keep running?

Caskey L. Dickson


LISA: Where systems engineering and operations professionals share real-world knowledge about designing, building, and maintaining the critical systems of our interconnected world.

The LISA conference has long served as the annual vendor-neutral meeting place for the wider system administration community. The LISA14 program recognized the overlap and differences between traditional and modern IT operations and engineering, and developed a highly curated program around five key topics: Systems Engineering, Security, Culture, DevOps, and Monitoring/Metrics. The program included 22 half- and full-day training sessions; 10 workshops; and a conference program consisting of 50 invited talks, panels, refereed paper presentations, and mini-tutorials.

Mini Tutorial

Wednesday, November 12, 2014 - 2:00pm-3:30pm

Caskey L. Dickson, Google, Inc.

Caskey Dickson is a Site Reliability Engineer/Software Engineer at Google, where he works on writing and maintaining monitoring services that operate at "Google scale." Before coming to Google he was a senior developer at Symantec, wrote software for various internet startups such as CitySearch and CarsDirect, ran a consulting company, and taught undergraduate and graduate computer science at Loyola Marymount University. He has an undergraduate degree in Computer Science, a Master's in Systems Engineering, and an MBA from Loyola Marymount.

BibTeX

@conference {209057,
author = {Caskey L. Dickson},
title = {while (true) do; How hard can it be to keep running?},
year = {2014},
address = {Seattle, WA},
publisher = {USENIX Association},
month = nov
}


Description:

At Google we have more than a handful of servers and must leverage our administration time as effectively as possible. Between custom in-house software and off-the-shelf daemons, there are many parts to running a reliable, distributed, redundant service. Most fundamental is running the software and keeping it running. Through reboots, crashes, upgrades, downgrades, bugs, canaries and outages, myriad forces conspire to end your process and keep it stopped or, worse, keep it alive but not functioning.

Tools such as init, upstart, rc scripts, cron, and at provide mechanisms to run programs unattended, but each of them can fail in different ways. When you have dozens or hundreds of servers, they will fail in many different ways. I will discuss the obvious and not-so-obvious failure modes of popular packages like upstart and cron, as well as how we’ve worked with and around them to ensure that when we run a daemon it stays running. Special emphasis will be given to how virtual hosts create new challenges that can trip up launch strategies and services written for bare metal.

Who should attend:

Administrators who use automated tools to manage essential daemons across fleets of virtual or physical machines will benefit from the simple and reliable technique described.

Take back to work:

A simple technique for reliably running daemons (services) on large fleets of machines, in a way that lets them be upgraded and rolled back automatically.

Topics include:

Packaging configurations for distribution
Process management
Recovery from failure
Init script design
Pitfalls of pidfiles
Why daemonization is bad for you
Roll forward vs. roll back
Canaries and monitoring
