What could be simpler than a health check endpoint that just returns an HTTP 200 when called? But health checks aren’t simple at all. Health checks are a critical signal in orchestration systems, and when things go wrong, they can cause havoc. Certain ways of using health checks give rise to systemic failure patterns, and systems thinking can help us understand some of the pitfalls of health check-driven orchestration.
Consider the humble sidewalk (or, for my fellow European readers, the footpath). Streets that feature local businesses see foot traffic throughout the day — and well into the evening, if some of those businesses are restaurants or bars. This creates a critical mass of ‘eyes on the street’, which means that those streets are likely to be safer. Residential windows overlooking the street also help to promote safer streets. Conversely, residential-only areas, particularly those that feature sidewalks and spaces that are publicly-accessible but not well-trafficked or overlooked, tend to become unsafe. These are reinforcing phenomena (vicious and virtuous cycles): people are happy to walk around what they feel are safe streets, or to sit on their porches, but not to walk through areas that feel dangerous. The perception of danger thus reduces ‘eyes on the street’, which further increases danger.
Another reinforcing phenomenon: in areas with streets that feel safe to walk on, with a good density of residences mixed with other uses that bring people to the area, you tend to get a lot of small businesses catering to the people using the sidewalks. In areas with local businesses, residents can form a loose web of casual acquaintanceships: talking to the shopkeeper, greeting their neighbor at the coffee kiosk in the local park. In areas that are residential-only, without the intricate web of local businesses, acquaintanceships are harder to form and people are far less likely to know other people who live nearby. As a consequence, older urban areas tend to have robust local politics — and thus, to be able to effect positive change in their neighborhoods, which can reinforce their status as vibrant districts — while modern planned districts lack engagement with local politics.
The physical layout of streets and sidewalks underpins both public safety and functioning local political organizing. Lose your overlooked sidewalks and your local small businesses and you seriously impair the ability of residents to organize effectively. Nobody planned traditional urban areas to work this way: effective local politics is an emergent property of the system that arises from all these overlapping interactions, as are safe streets. Change the street layout to remove the “eyes on the street”, lose the small businesses — or replace them with a busy, anonymous centralized store — and the system won’t work the same way.
On the surface, it may seem that people can still get from place to place and buy goods; these primary purposes will still be fulfilled, but the secondary effects of street surveillance and forming human networks will be gone. It won’t be immediately obvious that something has been lost. The city is a complex system, and these emergent properties depend on the structure of that system.
These examples of systems thinking — understanding how the structure of the city affects outcomes — come from Jane Jacobs’ The Death and Life of Great American Cities [1]. Jacobs’ book is about urban planning — in particular, critiquing the tendency of mid-twentieth-century high modernist urban planners to raze functional urban areas and replace them with inward-looking closed ‘garden city’ areas. Jacobs relentlessly focuses on showing how functional neighborhoods in a city work as systems, with each component of the neighborhood’s built environment and its people interacting with other components to create areas that work well — or that do not. Jacobs then contrasts these functional urban systems with recently-built dysfunctional neighborhoods to identify what these newer zones lack. The overall message is that neighborhoods don’t work because of any single element: they work because of the interactions between the elements.
Like an urban neighborhood, our distributed software systems are complex systems: they consist of many subsystems, each with state, and each able to affect other subsystems in various ways. In distributed systems — as in a city — the structure of the system influences outcomes, in ways that can be difficult to predict.
Distributed computing systems are all unique, just as street layouts are, but they have certain common properties that influence outcomes in ways we do understand, because we have seen similar dynamics in other systems that share some of those characteristics. So, while it is a tenet of systems theory that we can’t fully predict the behavior of a complex system, we can transfer some experiences from system to system, just as planners can from district to district.
The production systems equivalent of Jacobs’ sidewalk — the foundation of a well-functioning urban system — may be the health check: sending a request to verify whether a particular target instance is functioning and capable of serving traffic (this might be as simple as an endpoint that always returns an HTTP 200 status code to indicate that the process is running, or a more sophisticated check that takes into account the status of a variety of dependencies or performs an expensive end-to-end computation). There are multiple kinds of health check; Kubernetes, for example, defines three: startup probes indicate that a container has started successfully; liveness probes indicate that a container is running and not in an irrecoverable state (such as a deadlock); and readiness probes indicate that a container can receive traffic.
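To make the distinction concrete, here is a minimal sketch in Go of what liveness and readiness endpoints might look like for a single service instance. The endpoint paths, the checkDependency helper, and the ready flag are illustrative assumptions rather than any particular framework’s API: the point is that liveness only says ‘the process is up’, while readiness also reflects whether the instance is actually prepared to serve.

package main

import (
	"net/http"
	"sync/atomic"
)

// ready is flipped to true once startup work (config fetch, cache warm-up,
// connection pools) has completed.
var ready atomic.Bool

// checkDependency is a placeholder for verifying a critical dependency,
// such as a database connection.
func checkDependency() bool {
	return true
}

func main() {
	// Liveness: the process is running and able to respond at all.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: the instance is willing and able to serve real traffic.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() && checkDependency() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	ready.Store(true) // in real code, set only after warm-up completes
	http.ListenAndServe(":8080", nil)
}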
Just as the sidewalk is the linchpin of a web of uses that have a major influence on safety and political outcomes, the health check operates as the nexus of a set of processes that have a major bearing on system reliability. Health checking is a fundamental part of a broader orchestration problem that includes load balancing, service discovery, alerting, and change management: a very broad swathe of what system operators manage on a daily basis.
In load balancing, we use health checks to determine which of a set of potential targets are capable of serving a request (we might do this directly, or indirectly via a service discovery tool). Service discovery tools (such as Consul or AWS Cloud Map) use health checks to keep their endpoint lists current. It is common (and good practice) to alert if a significant proportion of a service’s instances become unhealthy. Finally, rollout and scale-up processes often use health checks to ensure that newly created (or modified) instances of a service are capable of serving: often only a few instances will be modified or replaced concurrently, limiting the damage to the service if new (or modified) instances are not able to serve traffic. One of the things that makes Kubernetes popular is that it can do most of this kind of orchestration for you out of the box.
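Here is a sketch, in Go, of how a load balancer or orchestrator might consume that signal. The Backend type, the healthyTargets function, and the alert threshold are all illustrative assumptions: routing considers only instances that pass health checks, and the same information feeds an alert when too large a fraction of the fleet is unhealthy.

// Backend is a single instance of a service, as seen by a load balancer.
type Backend struct {
	Addr    string
	Healthy bool // result of the most recent health check
}

// healthyTargets returns the instances eligible to receive traffic, and
// whether the unhealthy fraction is large enough to warrant an alert
// (for example, alertThreshold = 0.3 for 30% unhealthy).
func healthyTargets(backends []Backend, alertThreshold float64) (healthy []Backend, alert bool) {
	for _, b := range backends {
		if b.Healthy {
			healthy = append(healthy, b)
		}
	}
	unhealthyFraction := 1 - float64(len(healthy))/float64(len(backends))
	return healthy, unhealthyFraction >= alertThreshold
}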
Health checking has become much more important than it used to be because, in modern distributed systems, failure is normal. In the good old days of the 2000s and before, it wasn’t the norm for backends to become unhealthy or to disappear. Things tended to stay stable, unless you were doing a rollout (which didn’t happen very often) or some other kind of maintenance. These days, we do frequent rollouts, and we constantly create and destroy infrastructure. This makes health checking vital: without it, we would constantly send requests to instances that no longer exist.
A sidewalk is not just a slab of concrete, and health checking is not just an endpoint that tells you whether an instance of your service is able to serve or not. From a systems point of view, health checks are a mechanism that transmits signals to different parts of the system — signals which may influence how requests are distributed to clusters and backends, whether rollouts continue or are paused, and perhaps whether instances should be replaced. A useful systems analysis technique is to think about what happens when these sorts of signals in distributed systems become unavailable, or give incorrect information, or when all of them change state unexpectedly. The nature of health checks as a critical orchestration signal, used in all kinds of ways in our automation, makes them a particularly important kind of signal to analyze.
Think about what would happen if we rolled out a change that broke our health checks for a critical service. This might be the result of a code or configuration change to the service — in which case, we would hopefully halt the rollout automatically — or it could happen as a result of a configuration change applied to the service that performs the health checking, such as a load balancer. If such a change were rolled out to a significant proportion of our systems, it could well cause an outage — even if the service was actually capable of serving. Health checking thus creates new ways for our systems to fail — and to fail catastrophically. Let’s examine some more examples.
The Laser of Death
In load balanced systems, we often use health checks to determine which backends can serve requests. This works well most of the time — we avoid sending requests to instances that are temporarily overloaded or have just been turned down. However, health checks can sometimes make an overload situation worse by causing a phenomenon affectionately called the ‘Laser of Death’.
Imagine a system that is under significant but manageable load, in which a subset of instances becomes unhealthy for some reason. Load balancers stop sending requests to those instances and instead send slightly more load to the others. Load balancing is never quite perfect, and a couple of those instances now become overloaded and fail their health checks, sending yet more load to the remaining healthy instances, some of which in turn become overloaded. The load balancers start to act as a ‘laser of death’, overloading whichever subset of hosts is currently passing health checks. This is a case where the health check signal has become unhelpful: the load balancing driven by health checks is itself causing the health check signals to change state, because of the overload it creates.
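A toy model makes the cliff edge visible. Assume, purely for illustration, a fleet of 100 instances sharing 10,000 units of load, with each instance able to handle 120 units before it becomes overloaded and fails its health checks:

// survivors returns how many instances remain in rotation once the
// laser-of-death feedback loop has run its course, given how many were
// healthy to begin with. All numbers are illustrative.
func survivors(initiallyHealthy int) int {
	const totalLoad, perInstanceCapacity = 10000.0, 120.0
	healthy := initiallyHealthy
	for healthy > 0 && totalLoad/float64(healthy) > perInstanceCapacity {
		healthy-- // another overloaded instance fails its health checks
	}
	return healthy
}

With 90 instances initially healthy, per-instance load is about 111 units and the fleet absorbs the loss: survivors(90) returns 90. With 80 initially healthy, per-instance load is 125 units, above capacity, and the loop ejects every instance: survivors(80) returns 0. Losing a tenth of the fleet was harmless; losing a fifth destroyed all of it. In reality, because load balancing is imperfect, the cascade can begin even before the average load crosses capacity.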
There are mechanisms to avoid this kind of failure: various kinds of circuit-breaking or load-shedding approaches are possible. Envoy Proxy has a mechanism called ‘panic routing’, which sets a threshold percentage of healthy hosts. If the percentage of healthy hosts in a particular cluster drops below that threshold, Envoy begins to ignore health check status and load balances equally across all hosts in the cluster.
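The logic behind panic routing is simple to sketch. This is the idea only, not Envoy’s implementation or configuration, and it reuses the illustrative Backend type from the earlier sketch:

// eligibleHosts implements the panic-routing idea: below a threshold
// fraction of healthy hosts, the health signal is assumed to be
// untrustworthy and load is spread across every host, healthy or not.
func eligibleHosts(hosts []Backend, panicThreshold float64) []Backend {
	var healthy []Backend
	for _, h := range hosts {
		if h.Healthy {
			healthy = append(healthy, h)
		}
	}
	if float64(len(healthy))/float64(len(hosts)) < panicThreshold {
		return hosts // panic mode: ignore health checks entirely
	}
	return healthy
}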
The Killer Health Check
It is possible to set up automation based on a health check that causes unhealthy instances of your service to be replaced. This makes perfect sense under normal conditions — where only an occasional instance becomes unhealthy because of some problem with the underlying hardware, perhaps. If an entire service is unhealthy, then replacing instances with new broken instances is not productive. The issue here is not necessarily that the health check signal is wrong — it’s that the signal does not unequivocally mean that the issue is an instance-specific fault and that replacement is an appropriate action.
In the chaos of a major outage, this kind of automation can increase confusion. Operators may not remember that the ‘killer health check’ automation exists — and then figuring out why the instances they are trying to troubleshoot keep being turned down creates another problem for them.
Another unfortunate effect of this kind of automation is that it can replace warmed-up hosts, which have successfully fetched dependencies and configuration and cached useful data, created connection pools, and so on, with brand-new hosts. These new hosts may take some time to be capable of serving at the same rate as the previous hosts. Bringing up new instances can create load on other services, which might have effects elsewhere (for example, in systems that serve configurations to new hosts or are otherwise involved in provisioning).
If the destroyed instances were irrecoverably unhealthy, then this may be reasonable. However, if the destroyed instances were reporting themselves unhealthy because one of their dependencies was unavailable, then it is probably not useful, and it could cause problems elsewhere, particularly if the work involved in constantly recreating instances causes saturation in systems that might be required in order to upsize other services. This is an example of why it is important to think holistically about health checking and orchestration, and why what makes sense when a small number of instances become unhealthy can be harmful when many or most instances are unhealthy.
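One way to keep this kind of automation from making a bad day worse is to give it a notion of blast radius. The sketch below uses assumed names (again building on the illustrative Backend type): it only nominates instances for replacement when unhealthiness looks like an instance-specific fault, and if too much of the fleet is unhealthy, it does nothing and leaves the decision to a human.

// instancesToReplace returns the unhealthy instances that replacement
// automation should act on. If more than maxUnhealthyFraction of the fleet
// is unhealthy, the problem is unlikely to be instance-specific and the
// function declines to replace anything.
func instancesToReplace(backends []Backend, maxUnhealthyFraction float64) []Backend {
	var unhealthy []Backend
	for _, b := range backends {
		if !b.Healthy {
			unhealthy = append(unhealthy, b)
		}
	}
	if float64(len(unhealthy)) > maxUnhealthyFraction*float64(len(backends)) {
		return nil // widespread unhealthiness: replacement will not help
	}
	return unhealthy
}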
Recovery from Metastable Failures
Sometimes systems can get into a state where overload becomes self-sustaining (for example, if database load is excessive because the cache that fronts the database is empty, a high rate of database lookup failures can prevent the cache from filling up enough to sustain a useful hit rate). This kind of failure is known as a metastable failure (or cascading failure) [2].
In these cases, we usually need to significantly reduce traffic to the system somehow, and gradually increase the load until the system returns to a normal state. Health checking can make recovery harder in two ways: firstly, by concentrating load on small subsets of hosts that are responding to readiness checks, causing them to become unhealthy; and secondly, if automation is in place to replace unhealthy instances, by causing overloaded hosts to be replaced with new hosts that aren’t ‘warmed up’ and thus can likely serve less traffic. This is a combination of the Laser of Death and the Killer Health Check failure modes.
If health checking is contributing to a metastable failure situation, disabling health checking entirely is often useful during recovery. It can be re-enabled when the system becomes stable again.
Staleness, Correctness, and Speed
Rollouts are an example of the need to think about orchestration, service discovery and load balancing in a holistic way. If we have a service with N instances, and we turn down and replace all of them with N other instances in a very short period of time — of the order of seconds, rather than minutes — then in many cases, our systems will have problems, even if the new instances are ready and able to serve traffic.
Each new instance of the service needs to be registered with whatever system manages service discovery. Service discovery is best treated as an eventually consistent problem: slightly stale results (a few seconds old) should normally be usable. But if an entire service can be replaced in seconds, then slightly stale results may become entirely incorrect, and clients may not be able to locate any healthy instance of the service (here is another example of how signals can break: by simply becoming stale). Rollout speed cannot be considered in isolation from the rest of the orchestration, including configuration such as DNS TTLs and any other caching layers.
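A back-of-the-envelope bound shows why. Suppose clients may act on an endpoint list that is up to one staleness window old (DNS TTL plus propagation and cache delays, perhaps 30 to 60 seconds). If the rollout proceeds in batches and waits at least one staleness window between batches, then even a maximally stale client still sees live instances from every batch except the most recent one. The function and parameter names below are assumptions for illustration, not any particular tool’s API, using Go’s standard time package:

// minRolloutDuration is a rough lower bound on how long a full fleet
// replacement should take: one staleness window per batch, so that clients
// holding a stale endpoint list can always find instances that still exist.
func minRolloutDuration(numBatches int, stalenessWindow time.Duration) time.Duration {
	return time.Duration(numBatches) * stalenessWindow
}

With 10 batches and a 30-second staleness window, for example, the whole fleet should not be replaced in less than about five minutes, however quickly the new instances themselves become ready.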
Another staleness problem that can arise is that health checks are typically done periodically. If an instance becomes unhealthy between health checks, then it may still receive requests for some period of time, potentially creating user impact. Implicit health checking (such as Envoy Proxy’s outlier detection) can help to create a higher-fidelity signal.
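Here is a sketch of the outlier-detection idea in general form (not Envoy’s actual configuration or API; the type and field names are assumptions): real request outcomes become the health signal, so a backend that starts failing is ejected after a handful of consecutive errors rather than after the next scheduled probe.

// outlierTracker ejects backends based on observed request outcomes rather
// than periodic probes.
type outlierTracker struct {
	consecutiveFailures map[string]int
	ejectAfter          int // e.g. eject after 5 consecutive failures
}

func newOutlierTracker(ejectAfter int) *outlierTracker {
	return &outlierTracker{
		consecutiveFailures: make(map[string]int),
		ejectAfter:          ejectAfter,
	}
}

// record notes the outcome of a real request to addr and reports whether
// the backend should be ejected from the load balancing pool.
func (t *outlierTracker) record(addr string, success bool) (eject bool) {
	if success {
		t.consecutiveFailures[addr] = 0
		return false
	}
	t.consecutiveFailures[addr]++
	return t.consecutiveFailures[addr] >= t.ejectAfter
}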
Scaling to Serve Health Checks
Health checks can often become very expensive at scale [4]. This is particularly true as the number of clients and servers grows over time: health checking, in its simplest form, scales as an M clients × N servers problem. It is not unheard of for a double-digit percentage of a service’s resource usage to be related to health checking. There are a number of techniques that can be used to mitigate this effect, such as centralizing health checking, caching health check results for a short period, or breaking large systems into smaller subsets.
This means that the resource cost of health checking can sometimes distort the signals used to determine the utilization of a system and thus lead to an incorrect decision about whether scaling out is required. This is particularly dangerous because scaling out a serving system will not reduce the per-instance load imposed by health checking, which is determined solely by the number of clients performing health checks. In pathological cases, where a majority of load is the result of health checking, scaling out will barely reduce utilization at all. It is essential to track the percentage of load that results from health checking and to keep it at a reasonable level.
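A rough cost model makes the trap clear (all names and numbers here are assumptions for illustration). Each of M clients probes each server once per probe interval, so every server answers M divided by the interval checks per second, regardless of how much real traffic it serves; adding servers changes nothing about that per-server cost.

// healthCheckShare estimates the fraction of a single server's work that is
// spent answering health checks. mClients is the number of health-checking
// clients, probeInterval is how often each client probes, checkCost is the
// CPU time per check, and servingCPU is the CPU (in CPU-seconds per second)
// spent on real traffic.
func healthCheckShare(mClients int, probeInterval, checkCost time.Duration, servingCPU float64) float64 {
	checksPerSecond := float64(mClients) / probeInterval.Seconds()
	checkCPU := checksPerSecond * checkCost.Seconds()
	return checkCPU / (checkCPU + servingCPU)
}

With 5,000 clients probing every 10 seconds at 1 millisecond of CPU per check, each server spends 0.5 CPU-seconds per second on health checks; if real traffic needs only 1 CPU-second per second, a third of the server’s capacity goes to health checking, and doubling the fleet does nothing to change that ratio.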
Health checks seem simple, but the systems that make decisions based on health checks have all sorts of subtle properties. Just as town planners need to think about the second-order consequences of street layout in order to build safe and vital urban areas, distributed systems operators need to think about the second-order consequences of health check-driven orchestration behaviors.
A common theme with orchestration systems that use health checks is that they work well when things are generally stable and you are dealing only with the normal background noise of occasional failures and instance replacement. In these contexts, health checks are very helpful and increase system reliability. However, in larger-scale disturbances, where many instances are unreachable or unhealthy, health checks — and the behaviors that they trigger in our orchestration systems — can make things worse.
So remember: your orchestration isn’t just for dealing with errant unhealthy instances. It also needs to work for you in the worst outage you can imagine. At a minimum, you should consider having ready-to-go processes for quickly disabling your health checking, as counterintuitive as that seems.
Just as safety and vibrant local politics are emergent properties of vital city districts, arising from how the components of those districts interact with each other, reliability in a distributed system is an emergent property that is a function of how the entire system works together. When it comes to health checks and orchestration, think about the whole system, not just disconnected parts of it.