Sieve: Chaos Testing for Kubernetes Controllers

November 14, 2024

Research

Authors:

Xudong Sun, Wenqing Luo, Jiawei Tyler Gu, Aishwarya Ganesan, Ramnatthan Alagappan, Michael Gasch, Lalith Suresh, Tianyin Xu

Article shepherded by:

Laura Nolan

Modern cluster managers such as Kubernetes are architected as a cluster of loosely-coupled controllers, each running as a microservice. In Kubernetes, all the cluster management logic is encoded in different controllers. These controllers include built-in controllers for managing cluster resources and providing management services (e.g., the Kubernetes StatefulSet controller and Pod autoscaler) and custom controllers for managing specific applications (e.g., a Cassandra controller). Today, thousands of controllers are implemented by commercial vendors and open-source communities to extend Kubernetes with new capabilities [8, 13, 15]. All these controllers perform critical operations, such as resource provisioning, software upgrades, configuration updates, and autoscaling, making their correctness paramount.

Achieving controller correctness is fundamentally challenging. Modern cluster managers follow the state-reconciliation principle that each controller continuously monitors a subset of the cluster state and reconciles the current state of the cluster to match a desired state. A reliable controller should reach the desired state starting from any potential cluster state while tolerating unexpected failures, networking interruptions, concurrency and asynchrony issues. Buggy controllers may cause severe failures, including application outage, data loss, and security issues.

For example, Figure 1 shows a bug in a Kubernetes controller for managing Cassandra [5]. The bug prevents the Cassandra cluster from auto-scaling and leaks storage resources (decommissioned volumes in gray are never deleted). This is because the controller lacks crash safety: it fails to recover from the intermediate state left by a crash between deleting a Cassandra pod and updating the phase to Finalizing.

Figure 1: A bug in a Cassandra controller detected by Sieve [5]. The controller cannot recover from an intermediate state introduced by Sieve using a crash. As a consequence, the controller cannot auto-scale the Cassandra cluster and leaks storage resources. The bug has been fixed. The code snippet is significantly simplified for clarity; the real code spans 70+ functions and 2,000+ lines of Go.

The above crash-safety bug is only one of the myriad kinds of reliability issues that affect controllers. We find that controllers also experience bugs caused by state inconsistencies due to effects of asynchronous operations or uncoordinated concurrent interactions between controllers. For example, a controller might not always observe the latest version of the cluster state and might miss some versions of the cluster state entirely [19].

Sieve is a chaos testing tool for cluster management controllers. Sieve is powered by a fundamental insight that a controller’s actions are strictly a function of its view of the current cluster state—a controller constructs its internal state and takes actions to achieve the desired state based on the cluster state it observes. Sieve drives unmodified controllers to their potentially buggy corners by systematically and extensively perturbing the controller’s view of the cluster state. Sieve’s perturbations are realized by injecting faults (e.g., crashes) that controllers should tolerate.

Different from many existing chaos testing tools, Sieve performs exhaustive testing without depending on hypotheses about vulnerable regions in the code where bugs may lie. Sieve tests a controller by exhaustively introducing state perturbations through failures, delays, and reconfigurations. To detect diverse bugs with different causes, Sieve supports three perturbation patterns that expose controllers to 1) intermediate states (Figure 1), 2) stale states (or past cluster states), and 3) unobserved states due to missing some cluster state transitions. Sieve detects both safety and liveness bugs using automatic differential oracles that compare the cluster-state transitions with and without perturbations. Sieve also deterministically reproduces the detected bugs to help developers localize bugs in the source code and continuously iterate on bug fixes.

Sieve has detected 46 new bugs with serious consequences in ten popular Kubernetes controllers. These controllers manage critical cloud applications, including Cassandra, MongoDB and ZooKeeper. For each tested controller, Sieve’s testing finishes within seven hours (a nightly run) on a cluster of 11 machines.

Sieve is publicly available at https://github.com/sieve-project/sieve.

1 The State-reconciliation Principle

All Kubernetes controllers follow the state-reconciliation principle. Concretely, in Kubernetes, the cluster state is represented as a collection of objects stored in a distributed datastore, i.e., etcd in most cases. The datastore is logically centralized as it uses a consensus protocol to achieve consistency. Every entity in the cluster has a corresponding object in the cluster state, including pods, volumes, nodes, and groups of applications. All controllers interact with the cluster state via an ensemble of API servers using a REST API. The controllers continuously monitor a part of the cluster state and perform state reconciliation whenever the current state does not match the desired state. The controllers perform reconciliation by querying and manipulating the state objects via an API server. When querying an object, a controller might issue a quorum read on etcd for consistency, or directly read from the API server’s local cache for performance. Figure 2 illustrates how a controller typically interacts with the cluster state.

Figure 2: How a controller interacts with the cluster state. The controller reads the objects from its local cache (which is populated by notifications from the API server) and updates the objects stored in etcd.
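The reconciliation step described in this section can be illustrated with a toy model. The following is a minimal sketch, not Kubernetes code: the cluster state is a plain dictionary, and the names `reconcile`, `apply_updates`, and the `pod-<n>` naming scheme are assumptions introduced only for illustration.

```python
def reconcile(cluster_state, desired_replicas):
    """One reconciliation pass: compute the updates needed to reach the desired state.

    cluster_state maps object names to dicts with a "kind" field (toy model).
    """
    pods = sorted(name for name, obj in cluster_state.items() if obj["kind"] == "Pod")
    updates = []
    # Scale up: create the pods that are missing.
    for i in range(len(pods), desired_replicas):
        updates.append(("Create", f"pod-{i}"))
    # Scale down: delete surplus pods, highest ordinal first.
    for name in reversed(pods[desired_replicas:]):
        updates.append(("Delete", name))
    return updates


def apply_updates(cluster_state, updates):
    """Apply the updates to the toy cluster state in place."""
    for op, name in updates:
        if op == "Create":
            cluster_state[name] = {"kind": "Pod"}
        elif op == "Delete":
            cluster_state.pop(name, None)
```

Calling `reconcile` again after `apply_updates` returns an empty list, which is the fixed point a controller is expected to reach from any starting state.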

2 Sieve’s Approach

The key idea of Sieve is to automatically and extensively perturb an unmodified controller’s view of the cluster states in ways it is expected to tolerate. Sieve leverages the fundamental nature of state-reconciliation systems – these systems often have a simple and highly introspectable state-centric interface with which controllers interact with the cluster state. Such interfaces essentially do no more than reads and writes, or receive notifications regarding state-object changes. All objects share a common schema, which makes any arbitrary object highly introspectable. In Kubernetes, the state-centric interface is the REST API (in client-go [2]) used by controllers to Get, List, Create, Update and Delete state objects, and all state objects have an identical set of fields representing their metadata (ObjectMeta). This enables a degree of automation that is hard to achieve otherwise.

Sieve performs exhaustive reliability testing. For each test workload, Sieve first generates a reference run by running the workload without any perturbation. Sieve then analyzes the reference run to generate test plans. A test plan describes a concrete perturbation, including what faults to inject and when to inject them to effectively drive the controller to see the target cluster state. For example, to test controllers against intermediate cluster states, Sieve generates test plans that encode each potential point to inject a controller crash. When testing the Cassandra controller in Figure 1, Sieve covers crash points that include the points after deleting the pod, after updating the phase, and after deleting the volume.
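As a rough illustration of how crash points can be enumerated from a reference run, consider the sketch below. It is not Sieve's actual test-plan format: `CrashPlan`, `crash_plans`, and the object names in `reference_run` are hypothetical, loosely modeled on the Cassandra example in Figure 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrashPlan:
    """Hypothetical test plan: crash the controller right after one update."""
    after_index: int  # position of the update in the reference run
    op: str           # the operation that triggers the crash
    obj: str          # the object the operation applies to


def crash_plans(reference_run):
    """Emit one crash plan per controller update seen in the reference run."""
    return [CrashPlan(i, op, obj) for i, (op, obj) in enumerate(reference_run)]


# Illustrative trace of controller updates (object names are made up).
reference_run = [
    ("Delete", "pod/cassandra-2"),
    ("Update", "cassandracluster/phase"),
    ("Delete", "pvc/cassandra-2"),
]
plans = crash_plans(reference_run)
```

Each plan corresponds to one test run in which the crash is injected at exactly that point.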

To achieve high test efficiency, Sieve prunes redundant or futile test plans. Sieve avoids a test plan if it is clear that it cannot causally lead to a new target cluster state. As an example, when introducing intermediate states, Sieve crashes the controller only after effective state updates – ineffective updates, such as deleting a non-existing object, do not introduce any new cluster state. Sieve’s test pruning technique reduces test plans by 46.7%–99.6% in our experience.
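The pruning rule for intermediate states can be sketched by replaying the reference trace against a toy state and keeping only the updates that change it. This is a simplified assumption about how effectiveness could be decided; `effective_updates` and the trace format are hypothetical, and treating a write of an unchanged value as ineffective is our extension of the deletion example in the text.

```python
def effective_updates(initial_objects, trace):
    """Return the indices of updates that actually change the toy cluster state.

    trace is a list of (op, key, value) tuples; value is ignored for "Delete".
    """
    state = dict(initial_objects)
    effective = []
    for i, (op, key, value) in enumerate(trace):
        if op == "Delete":
            changed = key in state          # deleting a missing object is a no-op
            state.pop(key, None)
        else:                               # "Create" and "Update" both write a value here
            changed = key not in state or state[key] != value
            state[key] = value
        if changed:
            effective.append(i)
    return effective
```

Only the returned indices would be worth turning into crash points, since a crash after an ineffective update leaves the cluster in a state that was already covered.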

To help developers debug test failures, Sieve deterministically reproduces each bug triggered by its perturbation by precisely replaying the bug-triggering fault injection. To reproduce the bug in Figure 1, Sieve injects a crash right after the controller deletes the pod and before it updates the phase to Finalizing in each repeated test run with the same test plan. Sieve’s reproducibility helps us localize the bug in the source code and develop a patch that fixes this bug. To precisely control the timing of fault injection, Sieve automatically instruments the client-go library and recompiles the controller with the instrumented library (for this reason, the controller source code must be available).

The key techniques that power Sieve’s bug finding ability are its 1) perturbation patterns for triggering diverse bugs with different causes, and 2) differential test oracles for catching bugs that cause safety and liveness violations. We now present how Sieve’s perturbation patterns and differential test oracles work.

2.1 Perturbation Patterns

Sieve’s perturbations are produced by injecting targeted faults (e.g., crashes, delays, and connection changes) when specific cluster-state changes (triggering conditions) happen. Notably, the perturbation strategy allows Sieve to decouple policy from mechanism. The decoupling makes it easy to extend existing policies or add new policies by orchestrating the underlying perturbation mechanisms. Specifically, a policy defines a view Sieve exposes to the controller at a particular condition, while the mechanism specifies how to inject faults to create the view. Sieve automatically generates test plans for each policy; each test plan introduces a concrete perturbation based on a specification of a triggering condition and a fault to inject when that condition happens.

Sieve currently supports three patterns (or policies) to perturb a controller’s view:

intermediate states,
stale states, and
unobserved states.

They represent valid inconsistencies in the view that a controller could see due to common faults as well as the inherent asynchrony of the overall distributed system. Note that these are not the only patterns in which faults can occur, but cover a broad range of faults that a component in a distributed system is expected to handle gracefully. Sieve can be extended to incorporate other patterns in the future.

Intermediate states. Intermediate states occur when controllers fail in the middle of a reconciliation before finishing all the state updates they would have otherwise issued. After recovery (e.g., Kubernetes automatically starts a new instance of a crashed controller), the controller needs to resume reconciliation from the intermediate state left behind.

Figure 3: An intermediate-state bug in a RabbitMQ controller detected by Sieve [14]. The controller fails to recover from the intermediate state introduced by Sieve; the controller does not successfully resize the storage volume.

Figure 3 illustrates how Sieve tests the official RabbitMQ controller with intermediate-state perturbations and reveals a new bug. The test workload attempts to resize the storage volume from 10GB to 15GB. The resizing is implemented with two updates: 1) updating VolCur to 15GB; 2) updating VolReq to 15GB which triggers Kubernetes to resize the volume. The controller issues updates when VolCur is smaller than the desired volume size. During testing, Sieve crashes the controller between the two updates, which creates an intermediate state where VolCur is updated, but VolReq is not. The controller cannot recover from the intermediate state and the resizing never succeeds. The bug has been fixed with 700+ lines of Go code to revamp the volume resizing logic. In addition, the developers added eight new tests along with the fix to exercise how the controller handles different intermediate states, which is what Sieve performs automatically.
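The failure mode in Figure 3 can be reproduced in a few lines. The sketch below is a toy model based only on the description above: `buggy_reconcile` mirrors the guard on VolCur, while `tolerant_reconcile` is a hypothetical alternative that checks each field independently and is not the actual 700+ line fix. The `Crash` exception and the `crash_after` parameter are illustrative stand-ins for fault injection.

```python
class Crash(Exception):
    """Stands in for the controller process being killed."""


def buggy_reconcile(state, desired, crash_after=None):
    # Both updates are guarded only by VolCur, as described in the text.
    if state["VolCur"] < desired:
        state["VolCur"] = desired
        if crash_after == "VolCur":
            raise Crash()
        state["VolReq"] = desired


def tolerant_reconcile(state, desired, crash_after=None):
    # Hypothetical alternative: each field is reconciled on its own.
    if state["VolCur"] < desired:
        state["VolCur"] = desired
        if crash_after == "VolCur":
            raise Crash()
    if state["VolReq"] < desired:
        state["VolReq"] = desired


def run_with_crash(reconcile, desired=15):
    """Crash between the two updates, then let a restarted controller reconcile."""
    state = {"VolCur": 10, "VolReq": 10}
    try:
        reconcile(state, desired, crash_after="VolCur")
    except Crash:
        pass
    reconcile(state, desired)  # the restarted controller instance
    return state
```

With `buggy_reconcile`, the restarted controller sees VolCur already at 15 and issues no further updates, so VolReq stays at 10 and the resize never happens.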

Stale states. Controllers often operate on stale states, due to asynchrony and the extensive uses of caches for performance and scalability. As shown in Figure 2, controllers do not directly interact with the strongly consistent data stores, but are connected with API servers. The states cached at API servers could be stale due to delayed notifications. Controllers are expected to tolerate stale views that lag behind the latest states maintained in the data store.

Tolerating stale views correctly is nontrivial. For example, a Kubernetes controller’s view may “time travel” to a state it observed in the past. Time traveling occurs in a high-availability setup with multiple API servers, when the controller reconnects to a stale API server that has not yet seen some updates to the cluster state. The reconnection can be triggered by failover, load balancing, or reconfigurations. Controllers are expected to avoid wrongly updating the cluster state based on their stale view of it. For example, when sending a deletion request, the controller can piggyback the resource version of its most recently observed cluster state (in the preconditions), and ask etcd to check the freshness of the resource version before the deletion takes effect.
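The precondition check on deletion is a form of optimistic concurrency control, which the following toy datastore illustrates. It is a sketch under stated assumptions and does not mirror the real Kubernetes or etcd API: `ToyStore`, `Conflict`, and the integer version counter are all hypothetical.

```python
class Conflict(Exception):
    """Raised when the caller's observed version is stale."""


class ToyStore:
    """Toy stand-in for a versioned datastore (not the real etcd API)."""

    def __init__(self):
        self.objects = {}        # key -> (resource_version, value)
        self._next_version = 1

    def put(self, key, value):
        """Write a value and return the new resource version."""
        version = self._next_version
        self._next_version += 1
        self.objects[key] = (version, value)
        return version

    def delete(self, key, expected_version):
        """Delete only if the caller observed the latest version of the object."""
        current_version, _ = self.objects[key]
        if current_version != expected_version:
            raise Conflict(f"{key}: observed {expected_version}, current {current_version}")
        del self.objects[key]
```

A controller holding a stale view sends an old version number, so the store rejects the deletion instead of acting on outdated information.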

Figure 4: A stale-state bug in a MongoDB controller detected by Sieve [10]. The controller experiences a “time travel” and observes a stale state. It takes a wrong reconciliation action based on the stale state (deleting all the pods and volumes), which leads to application outages and data loss.

Figure 4 illustrates how Sieve tests Percona’s MongoDB controller with stale-state perturbation and reveals a new bug that leads to both application outages and data loss. To support graceful MongoDB cluster shutdowns, the controller waits to see a non-nil deletion timestamp (DeletionTimestamp) field attached to the state object representing the MongoDB cluster (a common practice to give systems time to react to an impending deletion [3]). When the controller sees this change, it deletes all the pods and volumes of the MongoDB cluster.

Sieve drives the controller to mistakenly delete a live MongoDB cluster by introducing a time-travel perturbation. With a workload that first shuts down a MongoDB cluster and then recreates a new instance of the same cluster, Sieve waits until the cluster is recreated and then introduces a time-travel perturbation. The perturbation causes the controller to see the deletion timestamp being applied to the already-deleted cluster. Consequently, the controller mistakenly shuts down the newly created cluster. This revealed that the controller should check the UIDs of clusters, not just their names.
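The difference between matching on names and matching on UIDs can be shown with two small predicates. This is an illustrative sketch, not the actual fix in the MongoDB controller: the function names, the dictionary layout, and the UID and timestamp values are all made up.

```python
def should_delete_by_name(observed, live):
    """Buggy check: trusts a deletion timestamp if only the name matches."""
    return observed["deletionTimestamp"] is not None and observed["name"] == live["name"]


def should_delete_by_uid(observed, live):
    """Safer check: the observed object must be the same instance as the live one."""
    return observed["deletionTimestamp"] is not None and observed["uid"] == live["uid"]


# Stale view of the old cluster (already deleted) versus the recreated cluster.
stale_view = {"name": "mongo", "uid": "uid-old", "deletionTimestamp": "2021-01-01T00:00:00Z"}
live_cluster = {"name": "mongo", "uid": "uid-new", "deletionTimestamp": None}
```

Because the recreated cluster reuses the name but gets a fresh UID, only the UID-based check recognizes that the deletion timestamp belongs to a different instance.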

Unobserved states. By design, controllers may not observe every cluster-state change in the system. The full history of changes made to the cluster state is prohibitively expensive to maintain and expose to clients [19]. Controllers are hence expected to be designed as level-triggered systems (as opposed to edge-triggered), i.e., a controller’s decision must be based on the currently observable cluster state (level) [9], not on seeing every single change to the cluster state (edge).

Figure 5: An unobserved-state bug in a Cassandra controller detected by Sieve [4]. The controller misses a transient state where the pod has a non-nil deletion timestamp. It thus fails to delete the volumes, leaking storage resources. The bug also prevents new Cassandra pods from rejoining.

Figure 5 illustrates how Sieve tests Instaclustr’s Cassandra controller using unobserved-state perturbations and reveals a new bug that leads to resource leaks and service failures. The test workload first scales down and then scales up storage volumes of the Cassandra cluster. During scale-down, the controller removes volumes when it learns that the corresponding pods were marked for deletion (a non-nil deletion timestamp field is set on the pod object, similar to the previous example). The pods’ lifecycles (including deletions) are managed by a built-in controller called a StatefulSet controller. Sieve pauses notifications to the Cassandra controller for a window such that it does not see these deletion marking events by the StatefulSet controller. This causes the Cassandra controller to not delete the corresponding volumes even though it has the right information to make that call (i.e., its view has volumes created by it that do not have pods attached to them).

Hence, the volume never gets deleted, leaking the storage resource. The bug also prevents the controller from scaling the Cassandra cluster – newly-created pods try to reuse the dangling volumes and cannot rejoin using the cluster metadata already in them (as it represents a node that was decommissioned). The bug has been fixed by adding finalizers – a coordination mechanism in Kubernetes that allows the Cassandra controller to complete the required cleanup operations before the pods can be deleted.
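To make the edge-triggered versus level-triggered distinction concrete, the sketch below contrasts two cleanup strategies on a toy state. It illustrates the general principle only and is not the finalizer-based fix adopted by the developers: the function names, the event name `PodMarkedForDeletion`, and the volume-to-pod mapping are assumptions.

```python
def edge_triggered_cleanup(events, volumes):
    """Delete a volume only when a pod-deletion event is observed.

    volumes maps volume name -> owning pod name; returns the volumes left over.
    """
    doomed_pods = {pod for kind, pod in events if kind == "PodMarkedForDeletion"}
    return {vol for vol, pod in volumes.items() if pod not in doomed_pods}


def level_triggered_cleanup(current_pods, volumes):
    """Delete every volume whose owning pod is absent from the current state."""
    return {vol for vol, pod in volumes.items() if pod in current_pods}


volumes = {"vol-0": "pod-0", "vol-1": "pod-1"}
missed_events = []            # notifications were paused, so no event arrives
current_pods = {"pod-0"}      # pod-1 has already been removed
```

When the deletion event is missed, the edge-triggered variant keeps both volumes, whereas the level-triggered variant still finds the orphaned one by inspecting the current state.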

2.2 Differential Test Oracles

Sieve has generic, effective oracles to automatically detect safety and liveness issues. The oracles detect buggy controller behavior based on the cluster states during and at the end of the test run.

In our experience, many buggy controller behaviors do not show immediate or obvious symptoms (e.g., crashes, hangs, and error messages). Instead, they lead to data loss, security issues, resource leaks, and unexpected application behavior, all of which are hard to check. We therefore develop differential test oracles that compare cluster states in a reference run against those in test runs; inconsistencies typically indicate buggy behavior.

We found that Sieve’s differential oracles vastly outperform developer-written assertions in the test suites of the controllers we evaluated, because Sieve’s oracles systematically examine all the state objects and their evolution during testing. It is challenging for developers to manually codify oracles that comprehensively consider the large number of relevant states.

Note that Sieve also implements regular error checks for obvious anomalies, including exceptions, error codes and timeouts. Developers can also add domain-specific oracles.

2.2.1 Checking End States

Sieve systematically checks the end state after running a workload. Specifically, its oracles check the count of state objects by type and the field values of all the objects, comparing the end state of the test run against that of the reference run. Sieve fails the test if it finds inconsistencies between the end states and prints human-readable messages to pinpoint them.

For example, in a MongoDB controller bug [11], the controller fails to create an SSL certificate used for securing communications inside the MongoDB cluster. This causes the controller to fall back to insecure communications. Such security issues do not manifest in the form of crashes or error messages. Sieve however automatically catches the bug, because the certificate object in the faulty run does not exist in the cluster state, which is different from a normal run.
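An end-state check of this kind can be approximated as follows. This is a minimal sketch, not Sieve's oracle implementation: `end_state_diff`, the `(kind, name)` keying, and the message wording are assumptions, and the example objects are invented to resemble the missing-certificate case.

```python
from collections import Counter


def end_state_diff(reference, test):
    """Compare two end states; each maps (kind, name) -> dict of field values.

    Returns human-readable messages, one per inconsistency.
    """
    messages = []
    # 1) Compare the count of objects per type.
    ref_counts = Counter(kind for kind, _ in reference)
    test_counts = Counter(kind for kind, _ in test)
    for kind in sorted(set(ref_counts) | set(test_counts)):
        if ref_counts[kind] != test_counts[kind]:
            messages.append(f"{kind}: {ref_counts[kind]} in reference, {test_counts[kind]} in test")
    # 2) Compare field values of objects present in both runs.
    for key in sorted(set(reference) & set(test)):
        for field in sorted(set(reference[key]) | set(test[key])):
            ref_val, test_val = reference[key].get(field), test[key].get(field)
            if ref_val != test_val:
                messages.append(f"{key[0]}/{key[1]}.{field}: {ref_val!r} != {test_val!r}")
    return messages
```

An empty result means the two end states agree; any message marks the test run as failed.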

2.2.2 Checking State-Update Summaries

Besides the end state, Sieve also checks how the controller updates the cluster state over time. It does so by comparing summaries of constructive and destructive state updates for each object (e.g., Create and Delete operations). Such checks are complementary to the end-state checks, because a correct end state does not imply that the controller behavior is always correct during the test. We find that buggy behavior can end in correct states (same as in the reference runs).

For example, a NiFi controller bug [12] causes the controller to fail to reload configuration files, but the end state is the same as a normal run. Sieve flags this by noting the NiFi pod receives a Create and a Delete operation (to reload the configuration) in the normal run, but neither appears in the faulty run.
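A state-update summary can be modeled as a per-object count of Create and Delete operations. The sketch below assumes a trace of `(operation, object)` pairs; `update_summary`, `summary_diff`, and the trace contents are hypothetical and only loosely follow the NiFi example.

```python
from collections import Counter


def update_summary(trace):
    """Count constructive and destructive updates per object."""
    return Counter((op, obj) for op, obj in trace if op in ("Create", "Delete"))


def summary_diff(reference_trace, test_trace):
    """Return {(op, obj): (count_in_reference, count_in_test)} for mismatches."""
    ref, test = update_summary(reference_trace), update_summary(test_trace)
    return {
        key: (ref[key], test[key])
        for key in sorted(set(ref) | set(test))
        if ref[key] != test[key]
    }


# In the reference run the pod is deleted and recreated; in the test run it is not.
reference_trace = [("Delete", "pod/nifi-0"), ("Create", "pod/nifi-0"), ("Update", "config/nifi")]
test_trace = [("Update", "config/nifi")]
```

Here the end states could be identical, yet the diff still reports the missing Create and Delete.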

2.2.3 Dealing with Nondeterminism

Sieve’s differential oracles can introduce false alarms because the shape of a state object (the set of fields and their values) might be nondeterministic. Sieve identifies nondeterministic field values by running the test workloads without perturbation multiple times, and then comparing the values of each field in each state object. If a field has nondeterministic values (typically IP addresses, timestamps, or even random port numbers), Sieve masks the field values when comparing the states. Note that Sieve can still spot unexpected changes to the set of fields on the object (e.g., missing deletion timestamp fields).
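The masking step can be sketched as two passes: find fields whose values differ across repeated unperturbed runs, then replace those values before comparison. The names `nondeterministic_fields` and `mask`, the placeholder string, and the sample objects are assumptions for illustration.

```python
def nondeterministic_fields(runs):
    """Find (object, field) pairs whose values differ across unperturbed runs.

    runs is a list of end states; each state maps object name -> dict of fields.
    """
    unstable = set()
    first = runs[0]
    for run in runs[1:]:
        for obj in set(first) & set(run):
            for field in set(first[obj]) & set(run[obj]):
                if first[obj][field] != run[obj][field]:
                    unstable.add((obj, field))
    return unstable


def mask(state, unstable, placeholder="<masked>"):
    """Replace the values of unstable fields, keeping the set of fields intact."""
    return {
        obj: {f: (placeholder if (obj, f) in unstable else v) for f, v in fields.items()}
        for obj, fields in state.items()
    }
```

Because only values are masked and field names are kept, a field that is missing in a test run still shows up as a difference.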

3 Our Experience with Sieve

We have applied Sieve to ten popular controllers from the Kubernetes ecosystem for managing widely-used cloud systems, including Cassandra, MongoDB and ZooKeeper. The controllers are either developed by the official development team of the corresponding system, or by companies that have production-grade offerings around said systems. To test each controller using Sieve, we provide 2–5 basic, representative end-to-end test workloads. Each workload exercises a feature of the controller (e.g., deployment, scaling, reconfiguration).

Sieve finds a total of 46 new bugs in the evaluated controllers. Those bugs include 11 intermediate-state bugs, 19 stale-state bugs, 7 unobserved-state bugs, and 9 bugs indirectly detected by Sieve during testing. Sieve finds new bugs in all the evaluated controllers. We have reported all these bugs. So far, 35 of them have been confirmed and 22 have been fixed. No bug report was rejected. Many bugs have severe consequences, such as application outages, security issues, service failures, and data loss. The Sieve project maintains the list of found bugs [1].

4 After Sieve

The original Sieve paper [17] was published in 2022, and we have continued working on testing Kubernetes controllers since then. After Sieve, we built Acto [6, 7], an end-to-end functional testing tool for controllers. Acto does not perform fault injection testing, but it complements Sieve by automatically generating high-coverage, representative test workloads that can be used by Sieve.

After seeing many bugs found by Sieve and Acto, we started to explore a new approach to guarantee controller correctness and reliability. Anvil [16, 18] is a framework that allows developers to use formal verification to build clean-slate controllers that are proved to be free of many types of bugs found by Sieve and Acto. We used Sieve and Acto to empirically evaluate the verified controllers built using Anvil.

5 Conclusion

Ensuring the reliability of Kubernetes controllers is a pressing and challenging problem. We present Sieve, a chaos testing technique for Kubernetes controllers. Sieve performs exhaustive and deterministic testing and is effective in finding bugs. Our goal is to make Sieve part and parcel of every controller developer’s toolkit, and to harden the growing number of controllers that power today’s data centers. Sieve is publicly available at https://github.com/sieve-project/sieve. We refer readers interested in more technical details and evaluation results to the original paper, which is available at https://github.com/sieve-project/sieve/blob/main/docs/paper-osdi.pdf.

Appendix

References:

[1] Automatic Reliability Testing for Kubernetes Controllers. https://github.com/sieve-project/sieve, 2024.

[2] kubernetes/client-go. https://github.com/kubernetes/client-go, 2024.

[3] ALPAR, A. Using Finalizers to Control Deletion. https://kubernetes.io/blog/2021/05/14/using-finalizers-to-control-deletion/, May 2021.

[4] CASSANDRA-OPERATOR-398. Reconcile() fails to delete the corresponding pvc if missing deletionTimestamp of Cassandra pod. https://github.com/instaclustr/cassandra-operator/issues/398, Jan. 2021.

[5] CASSKOP-370. [BUG] Casskop fails to clean up PVCs and refuses to handle user requests after crash and restart. https://github.com/Orange-OpenSource/casskop/issues/370, 2021.

[6] GU, J. T., SUN, X., TANG, Z., WANG, C., VAZIRI, M., LEGUNSEN, O., AND XU, T. Acto: Push-Button End-to-End Testing for Operation Correctness of Kubernetes Operators. In USENIX ;login: (Aug. 2024). https://www.usenix.org/publications/loginonline/acto-push-button-end-end-testing-operation-correctness-kubernetes-operators

[7] GU, J. T., SUN, X., ZHANG, W., JIANG, Y., WANG, C., VAZIRI, M., LEGUNSEN, O., AND XU, T. Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management. In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP’23) (Oct. 2023).

[8] HALL, C. AWS, Google, Microsoft, Red Hat’s New Registry to Act as Clearing House for Kubernetes Operators. https://www.datacenterknowledge.com/open-source/aws-google-microsoft-red-hats-new-registry-act-clearing-house-kubernetes-operators, Mar. 2019.

[9] HOCKIN, T. Kubernetes: Edge vs. Level Triggered Logic. https://speakerdeck.com/thockin/edge-vs-level-triggered-logic, June 2017.

[10] K8SPSMDB-430. [BUG] Stale deletion timestamps lead to undesired statefulset and PVC deletion. https://jira.percona.com/browse/K8SPSMDB-430, 2021.

[11] K8SPSMDB-578. [BUG] Failure of creating SSL-internal certificates when the controller crashes and restarts at some particular point. https://jira.percona.com/browse/K8SPSMDB-578, 2021.

[12] NIFIKOP-49. [BUG] NiFi configuration cannot be reloaded if the controller crashes and restarts in the middle of a reconciliation. https://github.com/konpyutaika/nifikop/issues/49, 2021.

[13] PIPES, J., HAUSENBLAS, M., AND TABER, N. Introducing the AWS Controllers for Kubernetes (ACK). https://aws.amazon.com/cn/blogs/containers/aws-controllers-for-kubernetes-ack/, Aug. 2020.

[14] RABBITMQ-OPERATOR-782. [BUG] PVC expansion fails if the controller crashes in the middle of a reconciliation. https://github.com/rabbitmq/cluster-operator/issues/782, 2021.

[15] SOSA, C., AND BHATIA, P. Application management made easier with Kubernetes Operators on GCP Marketplace. https://cloud.google.com/blog/products/containers-kubernetes/application-management-made-easier-with-kubernete-operators-on-gcp-marketplace, May 2019.

[16] SUN, X., GU, J. T., RIVERA, C., CHAJED, T., HOWELL, J., LATTUADA, A., PADON, O., SURESH, L., SZEKERES, A., AND XU, T. Anvil: Building Kubernetes Controllers That Do Not Break. In USENIX ;login: (June 2024). https://www.usenix.org/publications/loginonline/anvil-building-formally-verified-kubernetes-controllers

[17] SUN, X., LUO, W., GU, J. T., GANESAN, A., ALAGAPPAN, R., GASCH, M., SURESH, L., AND XU, T. Automatic Reliability Testing for Cluster Management Controllers. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI’22) (July 2022). https://www.usenix.org/system/files/osdi22-sun.pdf

[18] SUN, X., MA, W., GU, J. T., MA, Z., CHAJED, T., HOWELL, J., LATTUADA, A., PADON, O., SURESH, L., SZEKERES, A., AND XU, T. Anvil: Verifying Liveness of Cluster Management Controllers. In Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI’24) (July 2024). https://www.usenix.org/system/files/osdi24-sun-xudong.pdf

[19] SUN, X., SURESH, L., GANESAN, A., ALAGAPPAN, R., GASCH, M., TANG, L., AND XU, T. Reasoning about Modern Datacenter Infrastructures Using Partial Histories. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS’21) (June 2021).

Last updated November 14, 2024

Authors:

Xudong Sun is a final year Ph.D. student in the Computer Science department at the University of Illinois Urbana-Champaign (UIUC). His research focuses on improving the reliability of modern cloud systems using systematic testing, model checking, and formal verification.

xudongs3@illinois.edu

Wenqing Luo is currently a software engineer at Apple, focused on designing and developing reliable, efficient, and secure cloud infrastructure. He holds a master's degree from the University of Illinois Urbana-Champaign, where his research focused on cloud systems reliability. During his graduate studies, Wenqing worked on enhancing the reliability of cluster management and disaggregated storage systems through innovative automated testing methodologies.

wenqing4@illinois.edu

Jiawei Tyler Gu is a PhD candidate in the Computer Science Department of the University of Illinois Urbana-Champaign. His research focuses on improving the reliability of cloud system management.

jiaweig3@illinois.edu

Aishwarya Ganesan is an Assistant Professor in the Siebel School of Computing and Data Science at the University of Illinois at Urbana-Champaign. Her research interests are in distributed systems, storage and file systems. Her work has been recognized with best-paper awards (FAST '18 and FAST '20) and an NSF CAREER award.

aganesn2@illinois.edu

Ram Alagappan is an Assistant Professor in the Siebel School of Computing and Data Science at the University of Illinois Urbana-Champaign. His research interests include storage, distributed systems, and operating systems.

ramn@illinois.edu

Michael Gasch is a Senior Product Manager at AWS for EventBridge. His areas of interest are distributed and event-driven systems, open source, and container technology. He's a passionate Go developer and contributes to open-source projects, such as CloudEvents, in his spare time.

Lalith Suresh is the CEO & Co-Founder at Feldera. He was previously a Senior Researcher at VMware Research. His research has spanned topics like distributed systems, networking, databases and operating systems. He holds a PhD in computer science from TU-Berlin.

lalith@feldera.com

Tianyin Xu is an Assistant Professor of Computer Science at the University of Illinois at Urbana-Champaign (UIUC). His research focuses on building reliable computer systems that empower next-generation cloud and datacenter computing.

tyxu@illinois.edu