Acto is a push-button end-to-end testing technique for Kubernetes operators, which are custom controllers for managing deployed systems atop Kubernetes. Acto uses a state-centric approach to test an operator together with its managed system. It checks if an operator satisfies three operation correctness properties: 1) always reconciling the managed system to the desired states, 2) always recovering the system from undesired or error states, and 3) always being resilient to misoperations. Acto has helped find more than 80 new bugs in popular Kubernetes operators and is maintained as an open-source project.
Cloud systems are growing in scale and demand beyond what human-based operation can reliably, continuously, and efficiently manage. Today, cloud systems deployed on platforms such as Kubernetes are increasingly being managed by mechanical “operators” [1, 3, 12, 22] that automate labor-intensive operations. Kubernetes operators implement declarative interfaces which define the managed system resources and their properties [2]. An operation declares the desired system state through the interface, and the operator automatically reconciles the system from its current state to the declared state. This “cloud-native” operator pattern effectively simplifies operations and improves efficiency.
Today, there is a thriving ecosystem of high-quality, reusable operators on Kubernetes: almost all cloud-native systems have operators to manage them atop Kubernetes. These operators automate important management tasks like software upgrades, configuration updates, and autoscaling. Even for the same cloud system, multiple different operators are developed by commercial vendors and open-source communities to support different operation practices and deployment environments.
The rapid development and deployment of operators make their quality assurance a pressing need, because operation correctness is critical to system reliability [6]. A buggy operator can impair correctly implemented systems in production. Compared with human operator mistakes, which are major causes of system failures [9, 13, 20, 21, 27], bugs in operators have magnified impacts due to the nature of automation and widespread software reuse. In fact, buggy operators caused many recent production incidents [11, 15, 17–19].
Figure 1 depicts a safety bug that our tool detects in a Kubernetes operator for ZooKeeper [16]. When scaling down a ZooKeeper cluster, the operator only removes pods, but not the data volumes attached to the pods. If the operator later scales up the ZooKeeper cluster, the newly created pods will try to reuse the old volumes. Due to membership inconsistencies between the new pods and old volumes, the new ZooKeeper nodes fail to start. Moreover, all subsequent scaling operations hang inside the operator.
Compared with Kubernetes and the managed systems (e.g., ZooKeeper), operator code is often much less tested. For example, our study [14] shows that existing Kubernetes operators rely mostly on unit tests which cannot check operation correctness end to end, i.e., if an operator reconciles the managed system to desired states. Some operators have a few end-to-end (e2e) tests but only cover small parts of the enormous system state space and the complex operations exposed by declarative interfaces.
We present Acto, the first automatic testing technique and a push-button tool for Kubernetes operators. Acto is fully automatic: it tests unmodified operators and requires no manual annotation, instrumentation, or assertion. Acto uses a state-centric approach to test a given operator together with its managed system. Acto continuously instructs an operator to reconcile the system to different states and checks if the system successfully reaches those desired states during a test campaign. To do so, Acto models operations as state transitions and systematically realizes state-transition sequences to exercise supported operations in different scenarios. Acto checks three operation correctness properties:
- always reconciling the managed system to the desired states,
- always recovering the system from undesired or error states, and
- always being resilient to misoperations where the desired states are invalid, such as misconfigurations [23, 26].
Acto has helped find more than 80 new bugs (at least 62 were confirmed and 41 have been fixed) with few false alarms (less than 0.19%). Acto also found six bugs in Kubernetes and in the Go runtime that affected multiple operators (all have been confirmed or fixed). The detected bugs lead to severe safety and liveness issues, affecting not only the operators, but also the reliability and security of the managed systems. We also find that existing operators have poor resilience to misoperations which would render the system into unrecoverable states. For a given Kubernetes operator, Acto’s testing finishes within eight hours (a nightly run) on a cluster of eight machines; the majority of operators only need one machine.
The Acto project is open sourced at https://github.com/xlab-uiuc/acto.
Kubernetes operators use a declarative, state-reconciliation design pattern [1, 3, 12, 22]. An operation declares a desired system state and the operator automatically reconciles the system to the declared state. This design pattern simplifies system management operations by removing the need to write ad hoc, imperative scripts for one-off tasks. The pattern also makes system management declarative and intent-driven.
In Kubernetes, operators expose a declarative interface in the form of custom resources (CRs) [2]. A CR defines a system resource and its properties that can be modified to manage that resource. A state declaration specifies property values in a CR. Figure 2 shows an example of desired-state declarations for ZooKeeper; it specifies primitive properties like replicas and image, and composite properties like persistence, which has sub-properties. A ZooKeeper operator reconciles a managed ZooKeeper cluster to satisfy the declared state. Management operations are expressed by changing one or more property values in a CR.
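As a rough illustration of such a declaration, the sketch below models a CR as a nested Python dictionary. The property names `replicas`, `image`, and `persistence` come from the text; the sub-property names, all values, and the `declare` helper are hypothetical and are not taken from Figure 2 or from any real operator.

```python
import copy

# Hypothetical desired-state declaration for a ZooKeeper CR.
# Property names follow the text; sub-properties and values are made up.
zookeeper_cr = {
    "replicas": 3,                  # primitive property
    "image": "zookeeper:3.8",       # primitive property
    "persistence": {                # composite property with sub-properties
        "storage": "10Gi",
        "reclaimPolicy": "Delete",
    },
}


def declare(cr, path, value):
    """Return a new declaration with the property at `path` set to `value`.

    The input declaration is left untouched, so the old and new desired
    states can both be kept around.
    """
    new_cr = copy.deepcopy(cr)
    node = new_cr
    for key in path[:-1]:
        node = node[key]
    node[path[-1]] = value
    return new_cr


# A management operation is expressed by changing one or more property values.
scaled = declare(zookeeper_cr, ["replicas"], 5)
resized = declare(zookeeper_cr, ["persistence", "storage"], "20Gi")
print(scaled["replicas"], resized["persistence"]["storage"])
```

In a real cluster the declaration would be submitted to the Kubernetes API server, and the operator would then reconcile the system toward it.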
Every Kubernetes operator continuously reconciles the managed system from its current state to a newly declared desired state, if the current state does not match the declared state. Kubernetes manages the current system states in a collection of state objects in etcd, a strongly consistent datastore. Every entity in the cluster, such as a pod, a volume, and a stateful application, has a corresponding state object. State objects have uniform APIs and consistent data schema, making them highly interpretable and extensible [10].
Acto is a state-centric testing technique. It tests operation correctness by performing end-to-end (e2e) testing of Kubernetes operators together with the managed systems. To do so, Acto continuously generates new operations during a test campaign and checks if the operator always correctly reconciles the system from each current state to the desired state, raising an alarm otherwise.
Acto detects bugs when operation correctness is violated. Such bugs include those that 1) cause an operator not to reconcile the system to desired states, 2) crash the operator or the system, and 3) prevent the managed system from recovering from an error state. Acto also detects vulnerabilities to misoperations that can drive the systems into explicit error states.
Acto generates minimized e2e test code for every alarm that it raises. These generated tests can help developers reliably reproduce a bug or a vulnerability, without rerunning the entire test campaign. That is, generated e2e tests only run operations that are necessary to set up the state for reproducing a bug or a vulnerability. Developers can include the generated e2e test in their regression test suite.
Acto models an operation as a pair, (Sc, D), where Sc denotes a current system state and D is a declaration of a valid desired state. D is constrained by the operation interface specification (CRD [2] in Kubernetes). If successful, an operation triggers a state transition, Sc → SD, where SD satisfies D. D often specifies only a (small) part of the system state. So, there are multiple possible system states that can satisfy D, and, in practice, only a small part of the system state needs to be examined to check if SD satisfies D.
If an operation fails (e.g., due to bugs in operator code), the system enters an error state, Se, which does not satisfy the desired state D. In that case, the operator should be able to recover the system from Se back to the previous healthy state by means of a state transition using the desired-state declaration Di−1 that previously triggered the transition to Sc.
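One way to read "SD satisfies D" is as a partial match: every property named in D must have the declared value in the state, while properties that D does not mention are ignored. The sketch below implements that reading over nested dictionaries. The function name `satisfies` and the sample states are illustrative assumptions, not Acto's implementation.

```python
def satisfies(state, declaration):
    """Return True if every property in `declaration` has the same value in `state`.

    Properties present only in `state` are ignored, because D usually
    specifies just a small part of the system state.
    """
    for key, want in declaration.items():
        if key not in state:
            return False
        have = state[key]
        if isinstance(want, dict):
            if not isinstance(have, dict) or not satisfies(have, want):
                return False
        elif have != want:
            return False
    return True


# Hypothetical declaration and states (values are illustrative only).
D = {"replicas": 3, "persistence": {"storage": "10Gi"}}
S_D = {"replicas": 3, "image": "zookeeper:3.8",
       "persistence": {"storage": "10Gi", "reclaimPolicy": "Delete"}}
S_e = {"replicas": 2, "image": "zookeeper:3.8",
       "persistence": {"storage": "10Gi", "reclaimPolicy": "Delete"}}

print(satisfies(S_D, D))  # True: S_D carries extra properties but matches D
print(satisfies(S_e, D))  # False: replicas differs, so S_e is an error state
```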
The fundamental challenge in testing operators is the prohibitive cost of testing all elements in the Cartesian product of S = SC ∪ SE and Ď, where SC is the set of all possible valid system states (Sc ∈ SC ), SE is the set of all possible error states (Se ∈ SE), and Ď is the set of all possible declarations of desired state (D ∈ Ď). There can be a large number of values for different properties that constitute the system state. Exhaustive testing could be prohibitively expensive, and any practical testing approach can only exercise a part of the state space, i.e., S × Ď.
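A back-of-the-envelope calculation shows how quickly this product grows. All numbers below are hypothetical: four properties with a handful of values each, one reachable state per declaration, no error states, and one minute per end-to-end operation.

```python
from math import prod

# Hypothetical number of distinct values per property.
values_per_property = {"replicas": 10, "image": 5, "storage": 8, "affinity": 20}

# Size of the declaration set: every combination of values is one declaration.
num_declarations = prod(values_per_property.values())

# Simplifying assumption: one valid system state per declaration, no error states.
num_states = num_declarations

# Size of the Cartesian product: each (state, declaration) pair is one operation.
num_operations = num_states * num_declarations

# Assumed cost of one end-to-end operation, in minutes.
minutes_per_operation = 1
years = num_operations * minutes_per_operation / (60 * 24 * 365)

print(num_declarations, num_operations, round(years))
```

Even this four-property toy yields 8,000 declarations and 64 million operations, which is more than a century of sequential testing under the assumed cost. Real interfaces expose far more properties.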
Acto systematically explores the state space using the following three test strategies (Figures 3a–c).
Single operation. Acto generates a declaration of a desired state D, triggers the operator to reconcile the current system state Sc to the desired system state SD, and checks whether SD |= D. The single operation is applied to the initial system state Sc = S0 (starting from a non-initial state requires more operations). The key challenge is how to explore an effective and representative subset of Ď.
Operation sequence. Acto extends single operations into a test campaign, which consists of a sequence of operations. Test campaigns overcome the limitation of the single-operation strategy, which must always start from the initial state Sc = S0. It is important to test whether an operator can reconcile the system to desired states from different, non-initial start states. Reaching an end state from different start states increases the chance of invoking different procedures in the operator code. In a test campaign, earlier operations take the system to new states which become the start states for subsequent operations.
Acto generates a test campaign by chaining the expected end states {Si} from the single-operation strategy, and generating a new Di after each successful reconciliation, as shown in Figure 3b. The result is a sequence of state transitions; after each transition Acto checks whether the expected end state Si satisfies the desired state Di.
Error-state recovery. The operation-sequence strategy does not test whether an operator correctly restores a system from implicit or explicit error states. If the system is in an error state Se, the operator is responsible for recovering from Se by reconciling the system from Se back to the prior healthy state Si−1. The subsequent operations start from Si−1, as in the transition Si−1 → Si+1 in Figure 3c. Error states can be reached because of operator bugs that reconcile the system to a state Se which does not satisfy desired state D, or because of misoperations, i.e., semantic errors in D that escape syntactic validation against the interface specification.
Acto combines these three test exploration strategies (Figures 3a–c) to realize the state transition sequences in one test campaign, as shown in Figure 3d.
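The combined strategy can be sketched as a small driver loop. This is a simplified illustration, not Acto's code: `run_campaign`, `toy_reconcile`, and `toy_satisfies` are hypothetical names, the operator is replaced by a plain function, and states are flat dictionaries.

```python
def run_campaign(s0, d0, declarations, reconcile, satisfies):
    """Apply declarations in sequence, rolling back after each failed operation.

    `reconcile(state, declaration)` stands in for the operator and returns
    the resulting state. Returns (final_state, alarms).
    """
    state, last_good = s0, d0
    alarms = []
    for i, d in enumerate(declarations):
        end = reconcile(state, d)
        if satisfies(end, d):
            # Healthy end state: it becomes the start state of the next operation.
            state, last_good = end, d
            continue
        alarms.append((i, "not reconciled"))
        # Error-state recovery: re-apply the last declaration that succeeded.
        recovered = reconcile(end, last_good)
        if satisfies(recovered, last_good):
            state = recovered
        else:
            alarms.append((i, "not recovered"))
            break
    return state, alarms


def toy_reconcile(state, d):
    """Toy operator: merges the declaration but silently caps replicas at 5."""
    new = {**state, **d}
    new["replicas"] = min(new["replicas"], 5)  # hypothetical cluster capacity
    return new


def toy_satisfies(state, d):
    return all(state.get(k) == v for k, v in d.items())


final, alarms = run_campaign(
    {"replicas": 3}, {"replicas": 3},
    [{"replicas": 5}, {"replicas": 100}, {"replicas": 2}],
    toy_reconcile, toy_satisfies,
)
print(final, alarms)
```

The second declaration asks for more replicas than the toy cluster can hold, so the driver records an alarm, rolls back to the last declaration that succeeded, and continues with the third declaration from that healthy state.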
We use the bug in Figure 1 as an example to illustrate Acto’s test strategy. When testing the ZooKeeper operator, as part of the operation sequence, Acto applies Dk (a ZooKeeper CR) that desires five ZooKeeper replicas, triggering the operator to set up a ZooKeeper cluster with five replicas (pods) running. Acto then applies Dk+1 by reducing the desired replica number to three. The operator then scales down ZooKeeper by deleting two pods, but does not delete their volumes due to the bug. Finally, Acto applies Dk+2 that raises the replica number back to five. The operator creates two pods directly reusing the old volumes. Due to the bug, ZooKeeper gets stuck in an error state: the membership configurations on the old volumes are not updated, and the newly created pods keep crashing. Acto flags this bug using its test oracles.
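The sequence above can be replayed against a toy model of the operator. The class below is a deliberately crude stand-in, not the real ZooKeeper operator: it assumes that a pod created on top of a leftover volume always crashes, and it uses a flag to switch between the buggy behavior and a fixed one.

```python
class ToyZooKeeperOperator:
    """Toy model of the scale-down bug; not the real operator's code."""

    def __init__(self, delete_volumes_on_scale_down):
        self.delete_volumes = delete_volumes_on_scale_down
        self.pods = {}         # pod index -> "Running" or "CrashLoop"
        self.volumes = set()   # indices of volumes that currently exist

    def reconcile(self, replicas):
        # Scale down: remove surplus pods and, only if fixed, their volumes.
        for index in [i for i in self.pods if i >= replicas]:
            del self.pods[index]
            if self.delete_volumes:
                self.volumes.discard(index)
        # Scale up: create missing pods, reusing any volume left behind.
        for index in range(replicas):
            if index in self.pods:
                continue
            reuses_old_volume = index in self.volumes
            self.volumes.add(index)
            # Assumption: stale membership data on an old volume crashes the pod.
            self.pods[index] = "CrashLoop" if reuses_old_volume else "Running"


def crashing_pods(delete_volumes_on_scale_down):
    """Replay the three declarations (5, 3, 5 replicas) and list unhealthy pods."""
    operator = ToyZooKeeperOperator(delete_volumes_on_scale_down)
    for replicas in (5, 3, 5):
        operator.reconcile(replicas)
    return sorted(i for i, status in operator.pods.items() if status != "Running")


print(crashing_pods(False))  # [3, 4]: the two re-created pods crash
print(crashing_pods(True))   # []: deleting volumes on scale-down avoids the bug
```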
To reproduce this bug without going through all the operations, Acto generates a minimized operation sequence that deterministically triggers the bug.
We describe the main components of Acto and how we implement them. These components embody Acto’s state-centric testing technique; they generate declarations of desired system states, execute test campaigns, and check reconciled states using automated test oracles.
The Acto tool takes the following inputs: 1) a manifest for deploying the operator, 2) the specification of state declaration, i.e., the operator’s CRD [2], and 3) optionally the operator’s source code. Acto outputs test results, debugging information, and minimized test code that reproduces detected failures. Acto runs tests on virtualized Kubernetes clusters. It supports three backends: Kind, Minikube, and K3d.
During a test campaign (Figure 3d), Acto automatically generates a new state declaration Di+1 based on the current system state Si to realize a state transition from Si → Si+1. Test campaigns start from the initial state S0. Acto triggers state transitions with the goals to 1) cover all properties exposed by the operation interface, and 2) exercise representative operation scenarios based on property semantics.
Acto systematically exercises all the properties that are defined in the operation interface. Each new Di+1 changes one property in the current state Si and any other properties that are needed to satisfy predicates on property relationships. Specifically, Acto selects a previously untested property and uses it to declare a new desired state. The end state after one transition becomes the start state for the next transition (Figure 3b). All state declarations collectively change every property at least once during a test campaign.
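A minimal sketch of this coverage loop, assuming the interface is given as an OpenAPI-style schema with nested `properties`. The helper names `property_paths` and `next_untested` and the sample schema are hypothetical.

```python
def property_paths(schema, prefix=()):
    """Yield the path of every property (composite and primitive) in a schema."""
    for name, sub in schema.get("properties", {}).items():
        path = prefix + (name,)
        yield path
        if sub.get("type") == "object":
            yield from property_paths(sub, path)


def next_untested(schema, tested):
    """Return the first property path not yet changed in this campaign, or None."""
    for path in property_paths(schema):
        if path not in tested:
            return path
    return None


# Hypothetical, heavily simplified interface schema.
schema = {
    "type": "object",
    "properties": {
        "replicas": {"type": "integer"},
        "image": {"type": "string"},
        "persistence": {
            "type": "object",
            "properties": {"storage": {"type": "string"}},
        },
    },
}

tested = set()
order = []
while (path := next_untested(schema, tested)) is not None:
    tested.add(path)                 # a real campaign would apply a new D here
    order.append(".".join(path))

print(order)
```

The loop ends once every property has been changed at least once, which mirrors the coverage goal described above.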
Acto tests different scenarios based on the semantics of the changed properties (Acto automatically infers these semantics). Table 1 gives a few such scenarios. For example, Acto tests the scale-up-and-scale-down and the scale-down-and-scale-up sequences if a property represents the number of replicas. Acto also tests different pod assignments that trigger the operator to re-configure or re-deploy managed systems differently. This scenario-driven approach allows Acto to focus on a small number of representative states, instead of the very large set of all possible property values. We implement the scenarios as plugins that can be extended or customized; users of Acto can implement more scenarios and support system-specific properties such as system configurations.
| Property | Scenarios |
|---|---|
| Replicas | Scale up and then down; scale down and then up; upscale over system resource limit. |
| Affinity | Place all pods on one node; spread pods to different nodes; set unsatisfiable affinity rules. |
| Storage | Expand storage volumes; shrink storage volumes; request more storage than is available in a cluster. |
| Access | Switch between normal and privileged roles. |
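The plugin idea can be illustrated with a small registry keyed by inferred semantics. The decorator, the function names, and the step size of two replicas are assumptions made for this sketch; they do not reflect Acto's actual plugin interface.

```python
SCENARIOS = {}


def scenario(semantics):
    """Register a value-sequence generator for properties with the given semantics."""
    def register(fn):
        SCENARIOS.setdefault(semantics, []).append(fn)
        return fn
    return register


@scenario("replicas")
def scale_up_then_down(current, limit):
    return [current + 2, current]


@scenario("replicas")
def scale_down_then_up(current, limit):
    return [max(1, current - 2), current]


@scenario("replicas")
def upscale_over_limit(current, limit):
    return [limit + 1]  # misoperation: more replicas than the cluster can host


def sequences_for(semantics, current, limit):
    """Return the value sequence of every registered scenario, keyed by name."""
    return {fn.__name__: fn(current, limit) for fn in SCENARIOS.get(semantics, [])}


print(sequences_for("replicas", current=3, limit=10))
```

Each returned list is a sequence of values to declare one after another, so a single scenario turns into several chained operations. Supporting a new property type then amounts to registering more functions under a new key.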
Acto also generates misoperations, each of which triggers a state transition to an error state, Se. For example, Acto generates misoperations that 1) scale the replicas beyond the total number of available physical resources, and 2) set unsatisfiable affinity rules (Table 1). Acto uses misoperations to check if an operator 1) is resilient to operation errors and 2) can recover from undesired or error states. Acto’s oracles check the former (is the system in a state Se?). Acto checks the latter by rolling back Se to the most recent healthy state. Misoperations that declare semantically erroneous states could escape constraint validation. A correct operator should not carry out an erroneous operation or at least should recover from operation failures.
Acto generates desired-state declarations, D ∈ Ď, that are syntactically valid, resemble real-world scenarios, and satisfy predicates on property relationships. Such desired states improve the effectiveness and efficiency of Acto’s state space exploration. End-to-end tests are expensive, so a D that does not satisfy these conditions has a low chance of finding bugs.
Acto ensures that all property values in declared desired states are syntactically valid using the operation interface specification. (Invalid declarations would likely be directly rejected by the Kubernetes API servers before reaching the operator.) Kubernetes’ OpenAPISchema specification defines constraints on all supported properties. For composite properties, Acto uses composite constraints like required properties and also derives constraints from the sub-properties. For primitive properties, Acto uses constraints like the type, min/max values (for numeric types), length (for string type), regular-expression patterns, etc.
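The sketch below shows how such constraints can drive value generation for primitive properties. It assumes OpenAPI-style keywords (`type`, `enum`, `minimum`, `maximum`, `minLength`, `maxLength`), picks a few boundary values, and does not handle `pattern`. The function names and default ranges are hypothetical.

```python
def candidate_values(schema):
    """Return a few boundary values that satisfy the schema's constraints."""
    if "enum" in schema:
        return list(schema["enum"])
    kind = schema.get("type")
    if kind == "boolean":
        return [False, True]
    if kind == "integer":
        lo, hi = schema.get("minimum"), schema.get("maximum")
        if lo is None and hi is None:
            lo, hi = 0, 10          # arbitrary default range (assumption)
        elif lo is None:
            lo = hi - 10
        elif hi is None:
            hi = lo + 10
        return sorted({lo, (lo + hi) // 2, hi})
    if kind == "string":
        lo = schema.get("minLength", 0)
        hi = schema.get("maxLength", lo + 8)
        return sorted({"a" * lo, "a" * hi})
    return []                        # composite types are out of scope here


def is_valid(value, schema):
    """Check a primitive value against the same subset of constraints."""
    if "enum" in schema:
        return value in schema["enum"]
    kind = schema.get("type")
    if kind == "boolean":
        return isinstance(value, bool)
    if kind == "integer":
        return (isinstance(value, int) and not isinstance(value, bool)
                and schema.get("minimum", value) <= value <= schema.get("maximum", value))
    if kind == "string":
        return (isinstance(value, str)
                and schema.get("minLength", 0) <= len(value)
                <= schema.get("maxLength", len(value)))
    return False


replicas_schema = {"type": "integer", "minimum": 1, "maximum": 9}
print(candidate_values(replicas_schema))  # [1, 5, 9]
print(is_valid(0, replicas_schema))       # False: below the declared minimum
```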
To exercise various operation scenarios, Acto changes properties based on their semantics. Acto infers the semantics of a property in the interface specification by mapping it to a set of resource types in the Kubernetes core APIs. Such mapping is feasible because many operations for property changes are eventually delegated to Kubernetes core services. Acto exploits the insight that property structure is effective for mapping to properties in the Kubernetes core resource specification. Specifically, all Kubernetes core resource types have unique structures. Figure 4 exemplifies how Acto infers semantics from the property structure: CassOp has a cassandraDataVolumeClaimSpec property with the same structure as the VolumeClaimTemplates property in Kubernetes’ StatefulSet resource. Therefore, Acto infers the semantics of cassandraDataVolumeClaimSpec using a structural mapping. When provided with operator source code, Acto can obtain a more complete mapping via static program analysis that tracks how the property value is used in the operator code via its data flows.
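A minimal sketch of structural matching, assuming schemas are OpenAPI-style dictionaries. The `shape` function reduces a schema to its nested property names and primitive types, and two properties match when their shapes are equal. The core-type schemas below are heavily simplified stand-ins, not the real Kubernetes definitions, and exact equality is a stricter criterion than a real tool would likely need.

```python
def shape(schema):
    """Reduce a schema to its structure: nested property names and primitive types."""
    kind = schema.get("type")
    if kind == "object":
        props = schema.get("properties", {})
        return tuple(sorted((name, shape(sub)) for name, sub in props.items()))
    if kind == "array":
        return ("array", shape(schema.get("items", {})))
    return kind


def infer_semantics(prop_schema, core_types):
    """Return the name of the first core type with an identical shape, else None."""
    target = shape(prop_schema)
    for name, core_schema in core_types.items():
        if shape(core_schema) == target:
            return name
    return None


# Hypothetical, heavily simplified stand-ins for Kubernetes core schemas.
CORE_TYPES = {
    "volumeClaimTemplates": {
        "type": "object",
        "properties": {
            "accessModes": {"type": "array", "items": {"type": "string"}},
            "storageClassName": {"type": "string"},
            "resources": {"type": "object", "properties": {
                "requests": {"type": "object", "properties": {
                    "storage": {"type": "string"}}}}},
        },
    },
    "affinity": {
        "type": "object",
        "properties": {
            "nodeAffinity": {"type": "object"},
            "podAntiAffinity": {"type": "object"},
        },
    },
}

# Operator property with the same structure, listed in a different order.
cassandra_data_volume_claim_spec = {
    "type": "object",
    "properties": {
        "resources": {"type": "object", "properties": {
            "requests": {"type": "object", "properties": {
                "storage": {"type": "string"}}}}},
        "storageClassName": {"type": "string"},
        "accessModes": {"type": "array", "items": {"type": "string"}},
    },
}

print(infer_semantics(cassandra_data_volume_claim_spec, CORE_TYPES))
```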
To generate values for properties with inferred semantics, Acto currently implements 57 property-specific generators based on Kubernetes resource semantics. Most of these properties are composite. The generators focus on high-level semantics to exercise different scenarios (Table 1). Each generator creates property values to realize a scenario. We find that most properties exposed by operation interfaces (83% on average in our evaluated operators) can be mapped to Kubernetes resources. For properties whose semantics Acto cannot infer, Acto mutates current values based on their data types while satisfying syntactic constraints. Acto only mutates primitive sub-properties of composite properties.
Lastly, the values Acto generates should satisfy predicates, in the form of property dependencies, for changed property values to trigger state transitions. Acto automatically infers property dependencies from naming conventions. In Kubernetes, dependencies can often be identified by feature toggles: a composite property has a Boolean sub-property named “enabled.” For example, operations that change PCN/MongoOp’s backup policy must also set Backup.Enabled to True. With operator source code, Acto can also detect dependencies among property values by analyzing control-flow relationships among program variables.
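The naming-convention heuristic can be sketched as follows. The code assumes an OpenAPI-style schema and treats any Boolean sub-property named "enabled" (case-insensitive) as a toggle for its direct siblings. The helper names and the sample schema are hypothetical.

```python
def toggle_dependencies(schema, prefix=()):
    """Map each direct sub-property to the sibling Boolean toggle it depends on."""
    props = schema.get("properties", {})
    toggle = next((name for name, sub in props.items()
                   if name.lower() == "enabled" and sub.get("type") == "boolean"),
                  None)
    deps = {}
    for name, sub in props.items():
        path = prefix + (name,)
        if toggle is not None and name != toggle:
            deps[path] = prefix + (toggle,)
        if sub.get("type") == "object":
            deps.update(toggle_dependencies(sub, path))
    return deps


def with_dependencies(changes, deps):
    """Add `toggle = True` for every changed property that depends on a toggle."""
    result = dict(changes)
    for path in changes:
        if path in deps:
            result.setdefault(deps[path], True)
    return result


# Hypothetical interface schema with a backup feature toggle.
schema = {
    "type": "object",
    "properties": {
        "replicas": {"type": "integer"},
        "backup": {
            "type": "object",
            "properties": {
                "enabled": {"type": "boolean"},
                "schedule": {"type": "string"},
            },
        },
    },
}

deps = toggle_dependencies(schema)
change = with_dependencies({("backup", "schedule"): "0 3 * * *"}, deps)
print(change)
```

Changing the backup schedule alone would have no effect while the feature is off, so the generated declaration also enables the toggle.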
Acto’s test oracles check if the system state after an operation matches the desired state. If there is a match, Acto reports the operation as successful. Otherwise, Acto signals an alarm that the user can inspect to find bugs. The complexity of Acto’s oracles depends on whether mismatches between reconciled and desired states manifest explicitly or implicitly. Acto implements oracles to check for state mismatches that manifest as explicit error states, such as exceptions, error codes, and timeouts.
Acto also implements oracles to check if Si satisfies Di for each state transition, as many operator bugs manifest as implicit-state mismatches with no explicit symptoms. Checking whether Si satisfies Di is challenging. First, Si and Di are represented differently: Di is a specification [2] and Si is embodied in state objects [4]. Second, satisfiability is domain-specific; its semantics may not be obvious. To address these challenges, Acto devises the consistency oracle and differential oracle.
Acto also provides an interface that lets users add custom oracles with domain-specific knowledge, e.g., a probe that tries to set and get some path in ZooKeeper.
The consistency oracle targets bugs in which an operator stops reconciliation because the system is in a state Si that satisfies D in the operator’s view but not in Kubernetes’ view. To detect such bugs, Acto additionally checks whether Kubernetes’ view matches D; Kubernetes’ view is encoded in the spec sections of state objects, which are jointly maintained by all running controllers and operators. For each transition Si−1 → Si, Acto attempts to match each property p (specified in Di) to the corresponding spec fields in the state objects. If a match is found, it indicates that Kubernetes agrees with the operator. Otherwise, Acto raises an alarm.
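A simplified sketch of this check is shown below. It matches properties to spec fields purely by name, and it raises an alarm only when a same-named spec field exists but none carries the declared value. Both choices are assumptions made for illustration; the text does not specify how Acto performs the matching.

```python
def find_fields(obj, name):
    """Yield every value stored under key `name` anywhere in a nested object."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == name:
                yield value
            yield from find_fields(value, name)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_fields(item, name)


def consistency_alarms(declaration, state_objects):
    """Return declared properties whose same-named spec fields all disagree."""
    alarms = []
    for prop, want in declaration.items():
        found = [value
                 for obj in state_objects
                 for value in find_fields(obj.get("spec", {}), prop)]
        # Properties with no same-named spec field are skipped (assumption).
        if found and want not in found:
            alarms.append(prop)
    return alarms


# Hypothetical state objects after a transition.
D_i = {"replicas": 3, "image": "zookeeper:3.8"}
state_objects = [
    {"kind": "StatefulSet",
     "spec": {"replicas": 2,
              "template": {"spec": {"containers": [{"image": "zookeeper:3.8"}]}}}},
]

print(consistency_alarms(D_i, state_objects))  # ['replicas']
```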
The differential oracle does not check against Di; it checks that an operator 1) reaches matching end states when reconciling to the same desired state from different start states Si−1 and S0, and 2) recovers the system from an (implicit or explicit) error state Se to state Si−1. Acto rolls back to Si−1 to continue exploration from a known good state. Figure 5 shows a bug detected by the differential oracle. There, the Boolean KnativeOp property contour.enabled enables or disables Contour (an ingress controller). But a KnativeOp bug makes it fail to disable Contour once it is enabled. The consistency oracle does not detect this bug: it is hard to automatically map the Boolean property to the existence of a Contour pod. The differential oracle detects the bug because a Contour pod appears in Si, but not in S'i, the end state reached from S0.
Note that reporting alarms for any difference in the state objects of Si and S'i would be brittle and lead to false positives, because execution-specific values like timestamps, IP addresses, and ports may change nondeterministically. Acto excludes execution-specific fields when comparing state objects. Acto automatically labels those fields by 1) running the transition S0 → S1 multiple times as a calibration and labeling fields with values varying across runs, and 2) running S0 → S1 multiple times, iff the differential oracle fires an alarm on Si, to ensure relevant fields are deterministic.
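The calibration and comparison steps can be sketched as follows. States are modeled as nested dictionaries, a field is labeled nondeterministic if its value differs across calibration runs, and the comparison ignores labeled fields. The function names and the sample states (loosely modeled on the Contour example) are hypothetical.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into {dotted.path: leaf value}."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    flat = {}
    for key, value in items:
        flat.update(flatten(value, f"{prefix}.{key}" if prefix else str(key)))
    return flat


def nondeterministic_fields(runs):
    """Label fields whose values vary across repeated runs of the same transition."""
    flats = [flatten(run) for run in runs]
    keys = set().union(*flats)
    return {k for k in keys if len({repr(f.get(k)) for f in flats}) > 1}


def state_diff(a, b, ignore):
    """Return the fields that differ between two states, skipping ignored ones."""
    fa, fb = flatten(a), flatten(b)
    return sorted(k for k in set(fa) | set(fb)
                  if k not in ignore and fa.get(k) != fb.get(k))


# Calibration: the same transition run twice; only the IP address varies.
run1 = {"pods": {"net-controller": {"ip": "10.0.0.4", "image": "v1"}}}
run2 = {"pods": {"net-controller": {"ip": "10.0.0.9", "image": "v1"}}}
noisy = nondeterministic_fields([run1, run2])

# S_i still contains a Contour pod; S'_i (reached from S0) does not.
S_i = {"pods": {"net-controller": {"ip": "10.0.0.4", "image": "v1"},
                "contour": {"ip": "10.0.0.7", "image": "c1"}}}
S_i_prime = {"pods": {"net-controller": {"ip": "10.0.0.9", "image": "v1"}}}

print(noisy)
print(state_diff(S_i, S_i_prime, noisy))
```

The differing IP address of the shared pod is filtered out, while the fields of the leftover Contour pod remain in the diff and would trigger an alarm.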
In our original SOSP paper [14], we rigorously evaluated Acto with eleven popular open-source Kubernetes operators which manage nine cloud systems. All evaluated operators were developed by the official teams of the managed systems, or by companies that sell services built around the managed systems. Acto found new bugs in every evaluated Kubernetes operator, 56 previously unknown bugs in total. We reported all of these bugs; at that time, 42 had been confirmed and 30 had been fixed. No bug report was rejected. Acto also found six bugs in Kubernetes and in the Go runtime that affected multiple operators; all were confirmed or fixed.
Since then, we have been continuously developing the Acto project, and Acto has been used to test more Kubernetes operators. Recently, we designed an assignment on Kubernetes controller reliability based on the Acto project for CS 523 (Advanced Operating Systems) at the University of Illinois Urbana-Champaign and used it in the Spring 2024 semester, with the purpose of teaching cloud computing concepts and cloud-native technologies. Many students in the course successfully applied Acto to more than 40 open-source Kubernetes operators. Most students found Acto easy (and fun) to use and effective in finding defects in existing operators. Students are encouraged to report the bugs they find back to the developers. So far, Acto has helped find more than 80 new bugs (at least 62 were confirmed and 41 have been fixed). The project maintains the list of bugs found by Acto [8].
In the process, students have continuously improved Acto and added new features. For example, Acto now supports Kubernetes operators written in Java and Rust, in addition to Go. Acto has also begun to support simple crash testing [24]. Acto has also been used in other research projects on Kubernetes reliability. For example, Acto was used to empirically evaluate formally verified Kubernetes controllers [25].
The original Acto paper is available at https://github.com/xlab-uiuc/acto/blob/main/docs/acto-paper_sosp2023.pdf.