Finding the Capacity to Grieve Once More

Wednesday, 30 October, 2024 - 11:00–11:40 GMT

Alexandros Kosiaris, Wikimedia Foundation

Abstract: 

At Wikipedia, we handle unpredictable traffic spikes, especially following notable deaths, which can cause severe outages. Although we believed we had mitigated this issue years ago, a major outage occurred in 2020 due to a notable death and a DDoS attack, and we realized that our platform needed further improvements. Over the years, we conducted investigations and implemented numerous fixes while educating new SREs about our platform's unique constraints. Two years ago, following the death of Elizabeth II, our system successfully handled unprecedented traffic without outages, demonstrating our platform's resilience. This story highlights the infrastructure improvements that allowed us to manage traffic surges and the emotional journey of regaining the capacity to properly grieve significant losses.

We heavily rely on open source, and our code is public, making our solutions accessible to everyone.

Alexandros Kosiaris, Wikimedia Foundation

A Linux sysadmin, turned FreeBSD sysadmin, turned Linux sysadmin, turned systems engineer (somewhere along that path there’s a DevOps hat as well), turned SRE, Alexandros has been in the space since 1999, first as a hobbyist, then as a professional. Currently working with the Wikimedia Foundation, he has pushed for more virtualization, better orchestrated microservices, and platform developments for their execution.

BibTeX
@conference {302257,
author = {Alexandros Kosiaris},
title = {Finding the Capacity to Grieve Once More},
year = {2024},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}