Yoann Fouquet, Booking.com, and Paola Martinucci, Mollie.com
Large disasters can be caused by equipment failures, user errors, natural disasters, malware, and other unexpected events. At Booking.com, we have established a program to test the impact of these disasters and the recovery mechanisms that we have in place. Those tests started years ago with simple region evacuations under normal conditions, and later expanded to latency injection at the network level, packet dropping, cutting inter-datacenter connections, cutting power feeds, and even region-wide shutdowns.
In this talk, we are not giving a master class on disaster recovery testing, but rather sharing four years of knowledge acquired during this program: the improvements to the reliability of our platform, the organisational impact, the automation created for or after the tests, and also a few things that actually went wrong. Finally, we will discuss how this knowledge was applied to mitigate real incidents that have happened since the start of the project.
Yoann Fouquet, Booking.com
Yoann Fouquet is a Senior Site Reliability Engineering Manager with experience in building and operating resilient applications at high scale. He joined Booking.com in 2018, where he supports the company's core services on performance, reliability, disaster recovery, and security topics, with a continuous focus on making SRE practices scale through efficient tooling and processes.
Paola Martinucci, Mollie.com
Paola recently joined the Engineering Team at Mollie.com in Amsterdam as a Technical Program Manager.
Paola is an enthusiastic and passionate woman in tech, and above all a happy family woman and mother of three wonderful children. A Venezuelan-Italian Industrial Engineer, she has gathered long and varied experience as a manufacturing planner and controller, as a project manager, and in administration, and she has a strong customer-oriented and teamwork mindset. Over the past two and a half years, she had the opportunity to be part of the transformation of the AZ Failover Project at Booking.com into the established, large-company-scale Disaster Recovery Testing Program that it is today.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone.
author = {Yoann Fouquet and Paola Martinucci},
title = {Disaster Recovery Testing at Booking.com},
year = {2022},
address = {Amsterdam},
publisher = {USENIX Association},
month = oct
}