Joseph Lynch, Netflix
Netflix runs a complex architecture that supports hundreds of device types connecting from all over the world at all times. Load on these systems can shift significantly in both pattern and magnitude, sometimes by multiple orders of magnitude in just a few minutes. When demand shifts, dozens of edge gateways, thousands of microservices, and tens of thousands of caches and databases have to weather the change while maintaining a high quality of service for our users.
In this talk, we will start by looking at how the four-region, full-active architecture of Netflix's streaming control plane gives us the levers to shape and prioritize traffic. Techniques such as balancing load, deliberately unbalancing it at key times, and using partial or complete failover and traffic shifting help us mitigate demand shifts.
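As a rough illustration of partial versus complete failover, the sketch below moves a fraction of one region's traffic share onto the remaining regions in proportion to their current shares. It is a minimal sketch under stated assumptions, not a description of Netflix's actual traffic steering: the function `shift_traffic`, the region names, and the equal starting weights are all hypothetical.

```python
def shift_traffic(weights, source, fraction):
    """Move `fraction` of `source`'s traffic share to the other regions.

    `weights` maps region name -> share of global traffic (hypothetical model).
    The moved share is split among the other regions in proportion to the
    share each of them already carries.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    others = {r: w for r, w in weights.items() if r != source}
    total_others = sum(others.values())
    if total_others <= 0:
        raise ValueError("no other region can absorb the traffic")
    moved = weights[source] * fraction
    shifted = {r: w + moved * w / total_others for r, w in others.items()}
    shifted[source] = weights[source] - moved
    return shifted


# Hypothetical region names; four regions mirrors the architecture in the abstract.
balanced = {"region-a": 0.25, "region-b": 0.25, "region-c": 0.25, "region-d": 0.25}
partial = shift_traffic(balanced, "region-a", 0.5)  # partial failover
full = shift_traffic(balanced, "region-a", 1.0)     # complete failover
```

In this model, a fraction of 1.0 corresponds to a complete failover of the source region, and anything smaller is a partial shift that deliberately unbalances load.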
Next, once load has entered one of our regions, we will see how intelligent pre-scaling and automated service buffer management combine with reactive measures such as load shedding and rapid autoscaling to bring available capacity to bear. For some types of demand shifts, we have to make hard tradeoffs between system stability and our ideal user experience, and choose to smartly degrade the service while maintaining the highest quality of experience we can. We will dive deep into these techniques with examples and tradeoffs.
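To make the idea of load shedding with graceful degradation concrete, here is a minimal sketch of priority-based shedding: as utilization climbs past a threshold, progressively more of the lower-priority tiers are rejected, while the most critical tier is never shed. The tier names, the `should_shed` function, and the 0.8 threshold are assumptions chosen for illustration, not the scheme presented in the talk.

```python
from enum import IntEnum


class Priority(IntEnum):
    # Hypothetical tiers; a lower value means more important.
    CRITICAL = 0
    DEGRADED = 1
    BEST_EFFORT = 2
    BULK = 3


def should_shed(priority, utilization, shed_start=0.8):
    """Return True if a request of this priority should be rejected.

    Below `shed_start` nothing is shed. Above it, the number of shed tiers
    grows with utilization, starting from the least important tier.
    CRITICAL requests are never shed.
    """
    if utilization < shed_start:
        return False
    # How far utilization is between shed_start and saturation, clamped to [0, 1].
    pressure = min(1.0, (utilization - shed_start) / (1.0 - shed_start))
    sheddable = len(Priority) - 1  # every tier except CRITICAL
    tiers_to_shed = min(sheddable, int(pressure * sheddable) + 1)
    # Shed the `tiers_to_shed` least important tiers.
    return priority > sheddable - tiers_to_shed
```

For example, under these assumptions `should_shed(Priority.BULK, 0.85)` is true while `should_shed(Priority.BEST_EFFORT, 0.85)` is false, so only the least important tier is dropped at moderate pressure.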
Finally, we will touch on how the underlying data architecture makes all of this possible, and briefly cover the resilience techniques we use to keep our stateful systems available during load increases. For example, we will cover the use of data gateways with built-in resilience techniques, capacity planning, sharding, and thoughtful use of caching.

Joseph Lynch is a Principal Software Engineer at Netflix who focuses on building highly reliable and high-leverage infrastructure across our stateless and stateful services. He led the shift of the Netflix data tier to abstraction, driving resilience through a Data Gateway architecture. He loves building distributed systems and learning the fun and exciting ways that they scale, operate, and break. Having wrangled many large-scale distributed systems over the years, he currently spends much of his time building resilience features and automated capacity management into the Netflix fleet.

author = {Joseph Lynch},
title = {Techniques Netflix Uses to Weather Significant Demand Shifts},
year = {2025},
address = {Santa Clara, CA},
publisher = {USENIX Association},
month = mar
}