During the early stages of software development, developers may add lots of alerts, collecting and reporting on a great deal of information. Some of that information stops being useful as the system moves from development into production. Applying principles from continuous integration (CI) can guide developers toward producing helpful, rather than superfluous, alerts.
The DevOps movement and recent trends in software development have talked a lot about "shift left" as shorthand for doing things earlier in the developer workflow. Shift left on security means not putting off thinking about security until the last minute. Shift left on testing means getting test results to developers earlier. The underlying idea is that the earlier in the developer workflow we identify problems, the cheaper it is for the organization to fix them. That is, finding a security bug before you've even committed your change is a lot cheaper than debugging it during a rollout, which is itself cheaper than discovering it after it's been exploited in production. The same applies to all forms of critical defects, be they logic problems, program crashes, efficiency issues, or security flaws. This isn't to say that every nit has to be discovered in-editor and fixed immediately, merely that the issues that affect the fitness of a release should be identified and resolved as early as possible. (For more on this topic, watch my recent ACM tech talk on Tradeoffs in the Software Workflow.)
With that in mind, we'd like to make the following bold claim: CI (for instance, GitHub Actions or Jenkins) is what you get when you shift left on production alerting, such as the alerting built on Google's planet-scale timeseries database (https://research.google/pubs/pub50652/).
CI systems provide basic build-and-test automation: build the code and run the tests as often as is reasonable. Presubmit tests are typically triggered when a changelist (CL) is sent for review, asking, "Would the tests still pass if this CL were submitted?" Post-submit tests ensure that, no matter what happened during the submission of a CL, the tests are run against what was actually committed. In the post-submit case, we rely on CI to tell us, "Are the tests still passing at head?"
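As a rough sketch (not the API of any real CI system; run_tests.sh and the checkout/patch callables are hypothetical placeholders), the two modes boil down to the same check run against two different states of the code:

```python
import subprocess

def tests_pass() -> bool:
    """Builds the code and runs the test suite in the current checkout; True if green."""
    # "run_tests.sh" is a stand-in for whatever build-and-test entry point a project uses.
    return subprocess.run(["./run_tests.sh"]).returncode == 0

def presubmit(apply_pending_changelist) -> bool:
    """Answers: would the tests still pass if this CL were submitted?"""
    apply_pending_changelist()   # patch the proposed change onto a clean checkout
    return tests_pass()

def postsubmit(sync_to_head) -> bool:
    """Answers: are the tests still passing at head?"""
    sync_to_head()               # check out exactly what was committed
    return tests_pass()
```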
Both CI and alerting serve the same overall purpose in the developer workflow: identify problems as quickly as reasonably possible. CI emphasizes the left (early) side of the developer workflow; it enables faster feedback loops and ideally uses high-fidelity tests to prevent production problems. Alerting lives at the late (right) end of that same workflow and catches problems by monitoring metrics and reporting when they meet some condition. Alerting happens most obviously in production, but it can also be used in earlier release phases, such as your canary or staging releases. Some teams at Google, such as Google Drive, have multiple pre-production environments with mature test data, and use alerts in those environments the same way as in prod.
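To make "monitoring metrics and reporting when they meet some condition" concrete, here's a toy sketch of the core of an alert rule. The metric (request latency), the window size, the 500 ms threshold, and the notify function are all illustrative assumptions, not any real monitoring system's API:

```python
from collections import deque
import time

RECENT_LATENCIES_MS = deque(maxlen=1000)  # rolling window of recent request latencies

def record_latency(latency_ms: float) -> None:
    RECENT_LATENCIES_MS.append(latency_ms)

def p99(samples) -> float:
    ordered = sorted(samples)
    return ordered[int(0.99 * (len(ordered) - 1))]

def evaluate(threshold_ms: float = 500.0) -> None:
    # The essence of an alert rule: watch a metric, report when it meets a condition.
    if RECENT_LATENCIES_MS and p99(RECENT_LATENCIES_MS) > threshold_ms:
        notify(f"p99 latency {p99(RECENT_LATENCIES_MS):.0f} ms exceeds {threshold_ms:.0f} ms")

def notify(message: str) -> None:
    print(f"[ALERT {time.ctime()}] {message}")  # stand-in for a real pager or dashboard

for ms in [120, 130, 145, 900, 950]:  # a small, made-up sample of latencies
    record_latency(ms)
evaluate()
```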
This thought exercise leads us to start drawing parallels between the alerting and CI domains. For instance, are integration tests and canary deployments simply two names for the same thing? In a world with more continuous deployment, what's the difference between running your large-scale integration tests continuously vs. deploying a canary? In practice, the major difference is in the setup: are we using test backends or production versions? Similarly, with high-fidelity test data in your staging instance, what's the difference between reporting large-scale integration test failures in staging vs. real alerts when those same failures manifest in production? Experts in both CI and alerting seem to acknowledge that these systems are effectively the same technology serving the same purpose.
The CI-as-alerting comparison goes deeper than that. At a more fine-grained level, we see parallels in thinking and best practices between the two domains. In the scholarship of both, there's a dichotomy between localized signals (unit tests, monitoring of isolated statistics, cause-based alerting) and cross-dependency signals (integration and release tests, black-box probing). The most valuable and highest-fidelity indicators of whether the system is working are the larger, aggregate signals, but we pay for that in flakiness, resource cost, and difficulty in digging down to root causes for debugging. Anything we can reasonably diagnose earlier will be cheaper to resolve.
In discussions of alerting, a prominent idea is to focus on black-box and end-to-end probing to generate alerts, because in practice many cause-based alerts are hard to tune and perhaps non-actionable in production environments. Imagine an SRE conversation: "We got a 2% bump in retries in the past hour, which put us over the alerting threshold for retries per day." "Is the system suffering as a result? Are users noticing increased latency or more failed requests?" "No." "Then … ignore the alert, I guess. Or update the failure threshold." In this scenario, the threshold for this cause-based alert is brittle: someone arbitrarily drew a line in the sand, declaring "this value should never exceed X" without any fundamental truth behind that assertion. The threshold is only weakly correlated with the thing that matters: whether users are going to notice any degradation of service.
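Here's a toy contrast between that brittle, cause-based alert and a symptom-based check. The metric names, the 50,000 limit, and the 0.1% error objective are made-up illustrative values:

```python
RETRIES_PER_DAY_LIMIT = 50_000  # an arbitrary line in the sand: "never exceed X"

def retry_alert_fires(retries_today: int) -> bool:
    # Fires even when retries are absorbed transparently and users see no extra
    # latency or failed requests; only weakly correlated with user-visible health.
    return retries_today > RETRIES_PER_DAY_LIMIT

def symptom_alert_fires(failed_requests: int, total_requests: int,
                        slo_error_ratio: float = 0.001) -> bool:
    # Asks the question that matters directly: are users seeing more failures
    # than the service's objective allows?
    return total_requests > 0 and failed_requests / total_requests > slo_error_ratio

# A 2% bump in retries trips the cause-based alert...
print(retry_alert_fires(retries_today=51_000))                             # True
# ...while users remain well within the error objective.
print(symptom_alert_fires(failed_requests=40, total_requests=1_000_000))   # False
```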
There's a direct connection between that common antipattern in cause-based alerting and brittle failures in unit tests. Imagine a software engineer conversation: "We got a test failure from our CI system. The image renderer test is failing after someone upgraded the JPEG compressor library." "How is the test failing?" "Looks like we get a different sequence of bytes out of the compressor than we did previously." "Do they render the same?" "Basically." "Then … ignore the alert, I guess. Or update the test." In this scenario, the test failure and CI alert are caused by a brittle dependence on irrelevant surface features of the underlying system (the image compressor). The team doesn't actually care about the specific sequence of bytes the compressor outputs: all that matters is that decoding that output as a JPEG produces a well-formed image that renders visually the same as before.
In both the alerting and CI cases, when there isn't enough high-level, expressive infrastructure to easily assert the thing that matters, teams will naturally tend toward the easy-to-express-but-brittle thing. If you don't have an easy end-to-end probe, but you do make it easy to collect some aggregate statistics, teams will write threshold alerts based on arbitrary statistics. If you don't have a high-level way to say, "Fail the test if the decoded image isn't roughly the same as this decoded image," teams will instead build tests that assert that the byte streams are identical. Brittleness reduces the value of our testing and alerting systems by triggering false positives, but it also serves as a clear indication of where it may be valuable to invest in higher-level design.
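Returning to the image-compressor example, here's a minimal, self-contained sketch of the two kinds of check side by side. It uses Pillow; the test image, the two quality settings (standing in for "before" and "after" the library upgrade), and the 1% pixel-difference tolerance are all illustrative assumptions:

```python
import io
from PIL import Image, ImageChops

def encode_jpeg(image: Image.Image, quality: int) -> bytes:
    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()

def mean_pixel_difference(a: Image.Image, b: Image.Image) -> float:
    """Average per-channel difference between two images, normalized to [0, 1]."""
    diff = ImageChops.difference(a.convert("RGB"), b.convert("RGB"))
    pixels = list(diff.getdata())
    return sum(sum(p) for p in pixels) / (len(pixels) * 3 * 255)

# Stand-in for "before" and "after" the compressor-library upgrade: the same
# image encoded with slightly different settings.
source = Image.new("RGB", (64, 64), color=(200, 30, 30))
old_bytes = encode_jpeg(source, quality=85)
new_bytes = encode_jpeg(source, quality=86)

# The brittle assertion: byte-for-byte equality. It breaks on any benign change
# to the encoder, even though nobody cares about the exact byte stream.
print("byte streams identical:", old_bytes == new_bytes)            # False

# The assertion the team actually cares about: the decoded images are roughly
# the same.
old_img = Image.open(io.BytesIO(old_bytes))
new_img = Image.open(io.BytesIO(new_bytes))
print("renders roughly the same:",
      mean_pixel_difference(old_img, new_img) < 0.01)                # True
```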
While monitoring and alerting are considered part of the SRE / production management domain, where the insight of Error Budgets is well understood, CI comes from a perspective that still tends to be focused on absolutes. We often encounter teams with stated goals of a "100% passing rate on tests," or with policies under which, when tests start failing, nobody can submit until the tests are fixed.
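For contrast, a back-of-the-envelope sketch of the error-budget idea: a 99.9% objective leaves a small but explicit budget for failure, whereas demanding 100% leaves no room at all. The analogous move for CI would be budgeting for some tolerable rate of red or flaky runs rather than insisting on a permanently green build:

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed unavailability over the period for a given availability SLO."""
    return (1.0 - slo) * days * 24 * 60

print(error_budget_minutes(0.999))   # ~43.2 minutes per 30 days
print(error_budget_minutes(1.0))     # 0.0: an absolute target leaves no budget at all
```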