Embrace Fleet Reboots and Make Them Boring

Thursday, 31 October, 2024 - 11:00–11:40 GMT

Everton Didone Foscarini, Cloudflare

Abstract: 

Server reboots evoke mixed feelings. Some like to say, “My kernel is stable; it has not crashed in a thousand days of uptime.” Others understand that such a system is running with a thousand days of accumulated vulnerabilities.

At Cloudflare we believe that high uptimes are bad. While our reboot automation was still being developed, we were hit by a kernel+BIOS bug that caused a high rate of node crashes. This encouraged the quick adoption of reboot automation and prompted us to build better tooling for deploying fleet changes over reboots, including multiple reboot queues for different workloads, load-based maintenance windows, and more.

We achieved monthly reboots for our edge fleet while keeping the clusters online and serving customer-facing traffic. This unlocked our ability to iterate fast on Linux kernel versions and OS releases, and ensures we are not running outdated library versions on hosts that have not been rebooted for a thousand days.

Everton Didone Foscarini, Cloudflare

Everton has worked on Internet-based services using Linux since 2003. After joining Cloudflare in 2017, Everton helped scale edge location operations from 102 to 320 cities, creating tooling to manage service lifecycles and server reboots.

BibTeX
@conference {302243,
author = {Everton Didone Foscarini},
title = {Embrace Fleet Reboots and Make Them Boring},
year = {2024},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}
