During the early stages of software development, developers may add lots of alerts, collecting and reporting on a great deal of information. Some of that information stops being useful as the system moves from development into production. Applying principles from continuous integration (CI) can guide developers toward producing helpful, rather than superfluous, alerts.
The DevOps movement and recent trends in software development have talked a lot about "shift left" as shorthand for doing things earlier in the developer workflow. Shift left on security means not putting off thinking about security until the last minute. Shift left on testing means getting test results to developers earlier. The underlying idea is that the earlier in the developer workflow we identify problems, the cheaper it is for the organization to fix them. That is, finding a security bug before you've even committed your change is a lot cheaper than debugging it during a rollout, which is itself cheaper than discovering it after it's been exploited in production. The same applies to all forms of critical defects, be they logic problems, program crashes, efficiency issues, or security flaws. This isn't to say that every nit has to be discovered in-editor and fixed immediately, merely that the issues that affect the fitness of a release should be identified and resolved as early as possible. (For more on this topic, watch my recent ACM tech talk on Tradeoffs in the Software Workflow.)
With that in mind, we'd like to make the following bold claim: CI (for instance, GitHub Actions or Jenkins) is what you get when you shift left on production alerting, such as the alerting built on Google's planet-scale timeseries database (https://research.google/pubs/pub50652/).
CI systems provide basic build-and-test automation: build the code and run the tests as often as is reasonable. Presubmit tests are typically triggered when a changelist (CL) is sent for review, asking, "Would the tests still pass if this CL were submitted?" Post-submit tests ensure that, no matter what happened during the submission of a CL, the tests are run against what was actually committed. In the post-submit case, we rely on CI to tell us, "Are the tests still passing at head?"
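As a rough sketch (not the API of any real CI system; run_tests.sh and the checkout/patch callables are hypothetical placeholders), the two modes boil down to the same check run against two different states of the code:

```python
import subprocess

def tests_pass() -> bool:
    """Builds the code and runs the test suite in the current checkout; True if green."""
    # "run_tests.sh" is a stand-in for whatever build-and-test entry point a project uses.
    return subprocess.run(["./run_tests.sh"]).returncode == 0

def presubmit(apply_pending_changelist) -> bool:
    """Answers: would the tests still pass if this CL were submitted?"""
    apply_pending_changelist()   # patch the proposed change onto a clean checkout
    return tests_pass()

def postsubmit(sync_to_head) -> bool:
    """Answers: are the tests still passing at head?"""
    sync_to_head()               # check out exactly what was committed
    return tests_pass()
```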
Both CI and alerting serve the same overall purpose in the developer workflow: identify problems as quickly as reasonably possible. CI emphasizes the left (early) side of the developer workflow; it enables faster feedback loops and ideally uses high-fidelity tests to prevent production problems. Alerting lives at the late (right) end of that same workflow and catches problems by monitoring metrics and reporting when they meet some condition. Alerting happens most obviously in production, but it can also be used in earlier release phases, such as your canary or staging releases. Some teams at Google, such as Google Drive, have multiple pre-production environments with mature test data, and use alerts in those environments the same way as in prod.
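To make "monitoring metrics and reporting when they meet some condition" concrete, here's a toy sketch of the core of an alert rule. The metric (request latency), the window size, the 500 ms threshold, and the notify function are all illustrative assumptions, not any real monitoring system's API:

```python
from collections import deque
import time

RECENT_LATENCIES_MS = deque(maxlen=1000)  # rolling window of recent request latencies

def record_latency(latency_ms: float) -> None:
    RECENT_LATENCIES_MS.append(latency_ms)

def p99(samples) -> float:
    ordered = sorted(samples)
    return ordered[int(0.99 * (len(ordered) - 1))]

def evaluate(threshold_ms: float = 500.0) -> None:
    # The essence of an alert rule: watch a metric, report when it meets a condition.
    if RECENT_LATENCIES_MS and p99(RECENT_LATENCIES_MS) > threshold_ms:
        notify(f"p99 latency {p99(RECENT_LATENCIES_MS):.0f} ms exceeds {threshold_ms:.0f} ms")

def notify(message: str) -> None:
    print(f"[ALERT {time.ctime()}] {message}")  # stand-in for a real pager or dashboard

for ms in [120, 130, 145, 900, 950]:  # a small, made-up sample of latencies
    record_latency(ms)
evaluate()
```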
This thought exercise leads us to start drawing parallels between the alerting and CI domains. For instance, are integration tests and canary deployments simply two names for the same thing? In a world with more continuous deployment, what's the difference between running your large-scale integration tests continuously vs. deploying a canary? In practice, the major difference is in the setup: are we using test backends or production versions? Similarly, with high-fidelity test data in your staging instance, what's the difference between reporting large-scale integration test failures in staging vs. real alerts when those same failures manifest in production? Experts in both CI and alerting seem to acknowledge that these systems are effectively the same technology serving the same purpose.
The CI-as-alerting comparison goes deeper than that. At a more fine-grained level, we see parallels in thinking and best practices between the two domains. In the scholarship of both, there's a dichotomy between localized signals (unit tests, monitoring of isolated statistics, cause-based alerting) and cross-dependency signals (integration and release tests, black-box probing). The most valuable and highest-fidelity indicators of whether the system is working are the larger, aggregate signals, but we pay for that in flakiness, resource cost, and difficulty in digging down to root causes for debugging. Anything we can reasonably diagnose earlier will be cheaper to resolve.
In discussions of alerting, a prominent idea is to focus on black-box and end-to-end probing to generate alerts, because in practice many cause-based alerts are hard to tune and perhaps non-actionable in production environments. Imagine an SRE conversation: "We got a 2% bump in retries in the past hour, which put us over the alerting threshold for retries per day." "Is the system suffering as a result? Are users noticing increased latency or more failed requests?" "No." "Then … ignore the alert, I guess. Or update the failure threshold." In this scenario, the threshold for this cause-based alert is brittle: someone arbitrarily drew a line in the sand, declaring "this value should never exceed X" without any fundamental truth behind that assertion. The threshold is only weakly correlated with the thing that matters: whether users are going to notice any degradation of service.
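Here's a toy contrast between that brittle, cause-based alert and a symptom-based check. The metric names, the 50,000 limit, and the 0.1% error objective are made-up illustrative values:

```python
RETRIES_PER_DAY_LIMIT = 50_000  # an arbitrary line in the sand: "never exceed X"

def retry_alert_fires(retries_today: int) -> bool:
    # Fires even when retries are absorbed transparently and users see no extra
    # latency or failed requests; only weakly correlated with user-visible health.
    return retries_today > RETRIES_PER_DAY_LIMIT

def symptom_alert_fires(failed_requests: int, total_requests: int,
                        slo_error_ratio: float = 0.001) -> bool:
    # Asks the question that matters directly: are users seeing more failures
    # than the service's objective allows?
    return total_requests > 0 and failed_requests / total_requests > slo_error_ratio

# A 2% bump in retries trips the cause-based alert...
print(retry_alert_fires(retries_today=51_000))                             # True
# ...while users remain well within the error objective.
print(symptom_alert_fires(failed_requests=40, total_requests=1_000_000))   # False
```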
There's a direct connection between that common antipattern in cause-based alerting and brittle failures in unit tests. Imagine a software engineer conversation: "We got a test failure from our CI system. The image renderer test is failing after someone upgraded the JPEG compressor library." "How is the test failing?" "Looks like we get a different sequence of bytes out of the compressor than we did previously." "Do they render the same?" "Basically." "Then … ignore the alert, I guess. Or update the test." In this scenario, the test failure and CI alert are caused by a brittle dependence on irrelevant surface features of the underlying system (the image compressor). The team doesn't actually care about the specific sequence of bytes the compressor outputs: all that matters is that decoding that output as a JPEG produces a well-formed image that renders visually the same as before.
In both the alerting and CI cases, when there isn't enough high-level, expressive infrastructure to easily assert the thing that matters, teams will naturally tend toward the easy-to-express-but-brittle thing. If you don't have an easy end-to-end probe, but you do make it easy to collect some aggregate statistics, teams will write threshold alerts based on arbitrary statistics. If you don't have a high-level way to say, "Fail the test if the decoded image isn't roughly the same as this decoded image," teams will instead build tests that assert that the byte streams are identical. Brittleness reduces the value of our testing and alerting systems by triggering false positives, but it also serves as a clear indication of where it may be valuable to invest in higher-level design.
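Returning to the image-compressor example, here's a minimal, self-contained sketch of the two kinds of check side by side. It uses Pillow; the test image, the two quality settings (standing in for "before" and "after" the library upgrade), and the 1% pixel-difference tolerance are all illustrative assumptions:

```python
import io
from PIL import Image, ImageChops

def encode_jpeg(image: Image.Image, quality: int) -> bytes:
    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()

def mean_pixel_difference(a: Image.Image, b: Image.Image) -> float:
    """Average per-channel difference between two images, normalized to [0, 1]."""
    diff = ImageChops.difference(a.convert("RGB"), b.convert("RGB"))
    pixels = list(diff.getdata())
    return sum(sum(p) for p in pixels) / (len(pixels) * 3 * 255)

# Stand-in for "before" and "after" the compressor-library upgrade: the same
# image encoded with slightly different settings.
source = Image.new("RGB", (64, 64), color=(200, 30, 30))
old_bytes = encode_jpeg(source, quality=85)
new_bytes = encode_jpeg(source, quality=86)

# The brittle assertion: byte-for-byte equality. It breaks on any benign change
# to the encoder, even though nobody cares about the exact byte stream.
print("byte streams identical:", old_bytes == new_bytes)            # False

# The assertion the team actually cares about: the decoded images are roughly
# the same.
old_img = Image.open(io.BytesIO(old_bytes))
new_img = Image.open(io.BytesIO(new_bytes))
print("renders roughly the same:",
      mean_pixel_difference(old_img, new_img) < 0.01)                # True
```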
While monitoring and alerting are considered part of the SRE / production management domain, where the insight of Error Budgets is well understood, CI comes from a perspective that still tends to be focused on absolutes. We often encounter teams with stated goals of a "100% passing rate on tests," or with policies under which, when tests start failing, nobody can submit until the tests are fixed.
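For contrast, a back-of-the-envelope sketch of the error-budget idea: a 99.9% objective leaves a small but explicit budget for failure, whereas demanding 100% leaves no room at all. The analogous move for CI would be budgeting for some tolerable rate of red or flaky runs rather than insisting on a permanently green build:

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed unavailability over the period for a given availability SLO."""
    return (1.0 - slo) * days * 24 * 60

print(error_budget_minutes(0.999))   # ~43.2 minutes per 30 days
print(error_budget_minutes(1.0))     # 0.0: an absolute target leaves no budget at all
```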