Acto: Push-Button End-to-End Testing for Operation Correctness of Kubernetes Operators

August 2, 2024

Research

Authors:

Jiawei Tyler Gu, Xudong Sun, Zhen Tang, Chen Wang, Mandana Vaziri, Owolabi Legunsen, Tianyin Xu

Article shepherded by:

Laura Nolan

Acto is a push-button end-to-end testing technique for Kubernetes operators, which are custom controllers for managing deployed systems atop Kubernetes. Acto uses a state-centric approach to test an operator together with its managed system. It checks if an operator satisfies three operation correctness properties: 1) always reconciling the managed system to the desired states, 2) always recovering the system from undesired or error states, and 3) always being resilient to misoperations. Acto has helped find more than 80 new bugs in popular Kubernetes operators and is maintained as an open-source project.

1 Introduction

Cloud systems are growing in scale and demand beyond what human-based operation can reliably, continuously, and efficiently manage. Today, cloud systems deployed on platforms such as Kubernetes are increasingly being managed by mechanical “operators” [1, 3, 12, 22] that automate labor-intensive operations. Kubernetes operators implement declarative interfaces which define the managed system resources and their properties [2]. An operation declares the desired system state through the interface and the operator automatically reconciles the system from its current state to the declared state. This “cloud-native” operator pattern effectively simplifies operations and improves efficiency.

Today, there is a thriving ecosystem of high-quality, reusable operators on Kubernetes: almost all cloud-native systems have operators to manage them atop Kubernetes. These operators automate important management tasks like software upgrades, configuration updates, and autoscaling. Even for the same cloud system, multiple different operators are developed by commercial vendors and open-source communities, to support different operation practices and deployment environments.

The rapid development and deployment of operators make their quality assurance a pressing need, since operation correctness is critical to system reliability [6]. A buggy operator can impair correctly implemented systems in production. Compared with human operator mistakes, which are major causes of system failures [9, 13, 20, 21, 27], bugs in operators have more magnified impacts due to the nature of automation and widespread software reuse. In fact, buggy operators caused many recent production incidents [11, 15, 17–19].

Figure 1 depicts a safety bug that our tool detects in a Kubernetes operator for ZooKeeper [16]. When scaling down a ZooKeeper cluster, the operator only removes pods, but not the data volumes attached to the pods. If the operator later scales up the ZooKeeper cluster, the newly created pods will try to reuse the old volumes. Due to membership inconsistencies between the new pods and old volumes, the new ZooKeeper nodes fail to start. Moreover, all subsequent scaling operations hang inside the operator.

Figure 1: A safety bug [7] in a ZooKeeper operator, detected by Acto. The bug manifests when the operator scales down and then scales up ZooKeeper. Newly created pods fall into crash loops; all subsequent scaling operations hang.

Compared with Kubernetes and the managed systems (e.g., ZooKeeper), operator code is often much less tested. For example, our study [14] shows that existing Kubernetes operators rely mostly on unit tests which cannot check operation correctness end to end, i.e., if an operator reconciles the managed system to desired states. Some operators have a few end-to-end (e2e) tests but only cover small parts of the enormous system state space and the complex operations exposed by declarative interfaces.

We present Acto, the first automatic testing technique and a push-button tool for Kubernetes operators. Acto is fully automatic: it tests unmodified operators and requires no manual annotation, instrumentation, or assertion. Acto uses a state-centric approach to test a given operator together with its managed system. Acto continuously instructs an operator to reconcile the system to different states and checks if the system successfully reaches those desired states during a test campaign. To do so, Acto models operations as state transitions and systematically realizes state-transition sequences to exercise supported operations in different scenarios. Acto checks three operation correctness properties:

1) always reconciling the managed system to the desired states,
2) always recovering the system from undesired or error states, and
3) always being resilient to misoperations where the desired states are invalid, such as misconfigurations [23, 26].

Acto has helped find more than 80 new bugs (at least 62 were confirmed and 41 have been fixed) with few false alarms (less than 0.19%). Acto also found six bugs in Kubernetes and in the Go runtime that affected multiple operators (all have been confirmed or fixed). The detected bugs lead to severe safety and liveness issues, affecting not only the operators, but also the reliability and security of the managed systems. We also find that existing operators have poor resilience to misoperations which would render the system into unrecoverable states. For a given Kubernetes operator, Acto’s testing finishes within eight hours (a nightly run) on a cluster of eight machines; the majority of operators only need one machine.

The Acto project is open sourced at https://github.com/xlab-uiuc/acto.

2 The Operator Pattern

Kubernetes operators use a declarative, state-reconciliation design pattern [1, 3, 12, 22]. An operation declares a desired system state and the operator automatically reconciles the system to the declared state. This design pattern simplifies system management operations by removing the need to write ad hoc, imperative scripts for one-off tasks. The pattern also makes system management declarative and intent-driven.

In Kubernetes, operators expose a declarative interface in the form of custom resources (CRs) [2]. A CR defines a system resource and its properties that can be modified to manage that resource. A state declaration specifies property values in a CR. Figure 2 shows an example of desired-state declarations for ZooKeeper; it specifies primitive properties like replicas and image, and composite properties like persistence which has sub-properties. A ZooKeeper operator reconciles a managed ZooKeeper cluster to satisfy the declared state. Management operations are expressed by changing one or more property values in a CR.

Figure 2: Scaling up a ZooKeeper system (from 2 to 3 replicas) with a new desired-state declaration (CR).
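As a rough illustration of the declaration in Figure 2, the sketch below writes a desired-state declaration as a Python dictionary and expresses the scale-up as a change to one property. The property names replicas, image, and persistence come from the text; the apiVersion, kind, metadata, concrete values, and the persistence sub-property are placeholders assumed for illustration and are not the operator's actual schema.

```python
import copy

# Desired-state declaration (CR) for a ZooKeeper cluster with 2 replicas.
# apiVersion, kind, metadata, and all concrete values are illustrative placeholders.
cr_before = {
    "apiVersion": "example.com/v1",      # hypothetical API group/version
    "kind": "ZookeeperCluster",          # hypothetical kind name
    "metadata": {"name": "zk"},
    "spec": {
        "replicas": 2,                   # primitive property
        "image": "zookeeper:example",    # primitive property (placeholder value)
        "persistence": {                 # composite property with sub-properties
            "storage": "1Gi",            # hypothetical sub-property
        },
    },
}

# A management operation is expressed by changing property values in the CR.
cr_after = copy.deepcopy(cr_before)
cr_after["spec"]["replicas"] = 3

def changed_properties(old, new, prefix=""):
    """Return dotted paths of properties whose values differ between two declarations."""
    paths = []
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}.{key}" if prefix else key
        a, b = old.get(key), new.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            paths.extend(changed_properties(a, b, path))
        elif a != b:
            paths.append(path)
    return paths

print(changed_properties(cr_before, cr_after))  # ['spec.replicas']
```

Everything outside spec.replicas is identical in the two declarations, so the operator only has to act on the one changed property.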

Every Kubernetes operator continuously reconciles the managed system from its current state to a newly declared desired state, if the current state does not match the declared state. Kubernetes manages the current system states in a collection of state objects in etcd, a strongly consistent datastore. Every entity in the cluster, such as a pod, a volume, and a stateful application, has a corresponding state object. State objects have uniform APIs and consistent data schema, making them highly interpretable and extensible [10].
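The reconciliation behavior described above can be pictured with a toy control loop. This is a minimal sketch under strong simplifying assumptions: states and declarations are flat dictionaries, and the function names reconcile and run_until_converged are made up for this example rather than taken from any Kubernetes or Acto API.

```python
def reconcile(current, desired):
    """One reconciliation pass: list the changes needed to move `current` toward `desired`.

    Both arguments are flat dicts of property -> value (a toy model).
    """
    actions = []
    for prop, want in desired.items():
        if current.get(prop) != want:
            actions.append((prop, current.get(prop), want))
    return actions

def run_until_converged(current, desired, max_rounds=10):
    """Apply reconcile() repeatedly until the current state matches the declaration."""
    state = dict(current)
    for _ in range(max_rounds):
        actions = reconcile(state, desired)
        if not actions:              # current state already matches: nothing to do
            return state
        for prop, _old, new in actions:
            state[prop] = new        # a real operator would create or delete pods, volumes, etc.
    raise RuntimeError("did not converge")

print(run_until_converged({"replicas": 2, "image": "a"}, {"replicas": 3}))
```

Properties that the declaration does not mention (image here) are left untouched, which mirrors the fact that a declaration usually covers only part of the system state.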

3 Technique

Acto is a state-centric testing technique. It tests operation correctness by performing end-to-end (e2e) testing of Kubernetes operators together with the managed systems. To do so, Acto continuously generates new operations during a test campaign, and checks if the operator always correctly reconciles the system from each current state to the desired state, or raises an alarm otherwise.

Acto detects bugs when operation correctness is violated. Such bugs include those that 1) cause an operator not to reconcile the system to desired states, 2) crash the operator or the system, and 3) prevent the managed system from recovering from an error state. Acto also detects vulnerabilities to misoperations that can drive the systems into explicit error states.

Acto generates minimized e2e test code for every alarm that it raises. These generated tests can help developers reliably reproduce a bug or a vulnerability, without rerunning the entire test campaign. That is, generated e2e tests only run operations that are necessary to set up the state for reproducing a bug or a vulnerability. Developers can include the generated e2e test in their regression test suite.

3.1 Operation Model

Acto models an operation as a pair, (S^c, D), where S^c denotes a current system state and D is a declaration of a valid desired state. D is constrained by the operation interface specification (CRD [2] in Kubernetes). If successful, an operation triggers a state transition, S^c to S^D, where S^D satisfies D. D often only specifies a (small) part of the system state. So, there are multiple possible system states that can satisfy D, and, in practice, only a small part of S needs to be examined to check if S^D satisfies D.
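The relation "S^D satisfies D" can be sketched as a partial match: every property that D declares must hold in the state, and everything else in the state is ignored. The function below is an illustrative assumption about how such a check could look on nested dictionaries; it is not Acto's oracle, which is described in Section 4.3.

```python
def satisfies(state, declaration):
    """Return True if every property in `declaration` has the same value in `state`.

    Properties that appear only in `state` are ignored, because D usually
    specifies just a part of the system state.
    """
    for key, want in declaration.items():
        if key not in state:
            return False
        have = state[key]
        if isinstance(want, dict):
            if not isinstance(have, dict) or not satisfies(have, want):
                return False
        elif have != want:
            return False
    return True

D = {"replicas": 3}
# Two different system states that both satisfy the same declaration D.
state_a = {"replicas": 3, "image": "v1"}
state_b = {"replicas": 3, "image": "v2", "persistence": {"storage": "1Gi"}}
print(satisfies(state_a, D), satisfies(state_b, D), satisfies({"replicas": 2}, D))
```

The two example states differ in properties that D does not mention, which is why several system states can satisfy one declaration.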

If an operation fails (e.g., due to bugs in operator code), the system enters an error state, S^e, which does not satisfy the desired state D. In that case, the operator should be able to recover the system from S^e back to the previous healthy state by means of a state transition using the desired-state declaration D_i-1 that previously triggered the transition to S^c.

The fundamental challenge in testing operators is the prohibitive cost of testing all elements in the Cartesian product of S = S^C ∪ S^E and Ď, where S^C is the set of all possible valid system states (S^c ∈ S^C), S^E is the set of all possible error states (S^e ∈ S^E), and Ď is the set of all possible declarations of desired state (D ∈ Ď). There can be a large number of values for different properties that constitute the system state. Exhaustive testing could be prohibitively expensive, and any practical testing approach can only exercise a part of the state space, i.e., S × Ď.
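A back-of-the-envelope calculation shows how quickly the declaration space alone grows. The numbers below (20 properties with 5 candidate values each) are invented for illustration and do not describe any real operator; the sketch counts only Ď, so the full product S × Ď is larger still.

```python
import math

# Hypothetical interface: 20 properties, each with 5 candidate values.
domain_sizes = [5] * 20

# Declarations that choose a value for every property: the full Cartesian product.
all_declarations = math.prod(domain_sizes)

# Declarations that change exactly one property from a fixed start state.
one_at_a_time = sum(size - 1 for size in domain_sizes)

print(all_declarations)  # 95367431640625
print(one_at_a_time)     # 80
```

The gap between the two counts is the reason a practical approach has to pick a small, representative part of the space instead of enumerating it.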

3.2 Test Strategy

Acto systematically explores the state space using the following three test strategies (Figures 3a–c).

Figure 3: State transitions of different test strategies.

Single operation. Acto generates a declaration of a desired state D, triggers the operator to reconcile the current system state S^c to the desired system state S^D, and checks whether S^D |= D. The single operation is applied to the initial system state S^c = S⁰ (starting from a non-initial state requires more operations). The key challenge is how to explore an effective and representative subset of Ď.

Operation sequence. Acto extends single operations into a test campaign, which consists of a sequence of operations. Test campaigns overcome the limitation of the single-operation strategy, which must always start from the initial state S^c = S⁰. It is important to test whether an operator can reconcile the system to desired states from different, non-initial start states. Reaching an end state from different start states increases the chance of invoking different procedures in the operator code. In a test campaign, earlier operations take the system to new states which become the start states for subsequent operations.

Acto generates a test campaign by chaining the expected end states {S_i} from the single-operation strategy, and generating a new D_i after each successful reconciliation, as shown in Figure 3b. The result is a sequence of state transitions; after each transition Acto checks whether the expected end state S_i satisfies the desired state D_i.

Error-state recovery. The operation-sequence strategy does not test whether an operator correctly restores a system from implicit or explicit error states. If the system is in an error state S^e, the operator is responsible for recovering from S^e by reconciling the system from S^e back to the prior healthy state S_i-1. The subsequent operations start from S_i-1, such as in the transition from S_i-1 → S_i+1, in Figure 3c. Error states can be reached because of operator bugs that reconcile the system to a state S^e which does not satisfy desired state D, or misoperations—semantic errors in D that escape syntactic validation against the interface specification.

Acto combines these three test exploration strategies (Figures 3a–c) to realize the state transition sequences in one test campaign, as shown in Figure 3d.
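The way the three strategies fit together in one campaign can be sketched with a toy model. Everything here is an assumption made for illustration: the operator is a plain function, the oracle is a simple check, and the names toy_operator, check, and run_campaign do not come from Acto.

```python
def toy_operator(state, declaration):
    """Toy operator: copies declared values into the state; more than 5 replicas is unhealthy."""
    new_state = {**state, **declaration}
    new_state["healthy"] = new_state.get("replicas", 0) <= 5
    return new_state

def check(state, declaration):
    """Toy oracle: the state is healthy and contains every declared value."""
    return state["healthy"] and all(state.get(k) == v for k, v in declaration.items())

def run_campaign(operator, initial_declaration, declarations):
    """Chain operations; after a failed check, try to recover with the last good declaration."""
    state = operator({}, initial_declaration)     # S_0
    last_good = initial_declaration
    alarms = []
    for declaration in declarations:
        new_state = operator(state, declaration)  # single operation from the current state
        if check(new_state, declaration):
            # operation sequence: the end state becomes the next start state
            state, last_good = new_state, declaration
            continue
        alarms.append(("not reconciled", declaration))
        # error-state recovery: re-apply the previous declaration from the error state
        recovered = operator(new_state, last_good)
        if recovered != state:
            alarms.append(("not recovered", last_good))
        # exploration continues from the last healthy state (`state` is unchanged)
    return state, alarms

final, alarms = run_campaign(
    toy_operator,
    {"replicas": 3},
    [{"replicas": 5}, {"replicas": 9}, {"replicas": 4}],
)
print(final)   # {'replicas': 4, 'healthy': True}
print(alarms)  # [('not reconciled', {'replicas': 9})]
```

In this run the second declaration is a misoperation in the toy model, the rollback succeeds, and the third operation starts from the last healthy state.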

3.3 Example

We use the bug in Figure 1 as an example to illustrate Acto’s test strategy. When testing the ZooKeeper operator, as part of the operation sequence, Acto applies D_k (a ZooKeeper CR) that desires five ZooKeeper replicas, triggering the operator to set up a ZooKeeper cluster with five replicas (pods) running. Acto then applies D_k+1 by reducing the desired replica number to three. The operator then scales down ZooKeeper by deleting two pods, but does not delete their volumes due to the bug. Finally, Acto applies D_k+2 that raises the replica number back to five. The operator creates two pods directly reusing the old volumes. Due to the bug, ZooKeeper gets stuck in an error state: the membership configurations on the old volumes are not updated, and the newly created pods keep crashing. Acto flags this bug using its test oracles.

To reproduce this bug without going through all the operations, Acto generates a minimized operation sequence that deterministically triggers the bug.
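The scenario above can be replayed on a toy model of the buggy behavior. This is a simplified sketch: the class BuggyZkOperator, the way a stale volume is modeled, and the greedy minimization loop are all assumptions for illustration. The text does not describe how Acto minimizes sequences, so the minimize function shows only one simple way such a reduction could work.

```python
class BuggyZkOperator:
    """Toy model: scale-down deletes pods but leaves their volumes behind."""

    def __init__(self):
        self.pods = {}        # pod index -> "Running" or "CrashLoop"
        self.volumes = set()  # indices of existing data volumes

    def apply(self, replicas):
        current = len(self.pods)
        for i in range(replicas, current):   # scale down
            del self.pods[i]                 # BUG: volume i is not deleted
        for i in range(current, replicas):   # scale up
            if i in self.volumes:
                self.pods[i] = "CrashLoop"   # stale volume with old membership is reused
            else:
                self.volumes.add(i)
                self.pods[i] = "Running"

def reproduces(sequence):
    """Replay replica counts on a fresh operator; True if any pod ever crash-loops."""
    operator = BuggyZkOperator()
    for replicas in sequence:
        operator.apply(replicas)
        if "CrashLoop" in operator.pods.values():
            return True
    return False

def minimize(sequence):
    """Greedily drop operations that are not needed to reproduce the failure."""
    seq = list(sequence)
    i = 0
    while i < len(seq):
        candidate = seq[:i] + seq[i + 1:]
        if reproduces(candidate):
            seq = candidate
        else:
            i += 1
    return seq

print(reproduces([5, 3, 5]))        # True
print(minimize([2, 5, 4, 3, 5]))    # [4, 3, 5]
```

The minimized sequence keeps the essential shape of the bug: create some pods, scale down, then scale back up onto the leftover volumes.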

4 Design and Implementation

We describe the main components of Acto and how we implement them. These components embody Acto’s state-centric testing technique; they generate declarations of desired system states, execute test campaigns, and check reconciled states using automated test oracles.

The Acto tool takes the following inputs: 1) a manifest for deploying the operator, 2) the specification of state declaration, i.e., the operator’s CRD [2], and 3) optionally the operator’s source code. Acto outputs test results, debugging information, and minimized test code that reproduces detected failures. Acto runs tests on virtualized Kubernetes clusters. It supports three backends: Kind, Minikube, and K3d.

4.1 Realizing State Transitions

During a test campaign (Figure 3d), Acto automatically generates a new state declaration D_i+1 based on the current system state S_i to realize a state transition from S_i → S_i+1. Test campaigns start from the initial state S₀. Acto triggers state transitions with the goals to 1) cover all properties exposed by the operation interface, and 2) exercise representative operation scenarios based on property semantics.

Acto systematically exercises all the properties that are defined in the operation interface. Each new D_i+1 changes one property in the current state S_i and any other properties that are needed to satisfy predicates on property relationships. Specifically, Acto selects a previously untested property and uses it to declare a new desired state. The end state after one transition becomes the start state for the next transition (Figure 3b). All state declarations collectively change every property at least once during a test campaign.
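The property-coverage goal can be illustrated by walking an OpenAPI-style schema and picking the next property that no earlier declaration has changed. The schema fragment and the helper names property_paths and next_untested are hypothetical; a real CRD schema is much larger.

```python
def property_paths(schema, prefix=""):
    """Yield dotted paths of all properties in an OpenAPI-style object schema."""
    for name, sub in schema.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        yield path
        if sub.get("type") == "object":
            yield from property_paths(sub, path)

def next_untested(schema, tested):
    """Return the first property path not yet changed by an earlier declaration, or None."""
    for path in property_paths(schema):
        if path not in tested:
            return path
    return None

# Hypothetical fragment of an operator's interface schema.
schema = {
    "type": "object",
    "properties": {
        "replicas": {"type": "integer"},
        "image": {"type": "string"},
        "persistence": {
            "type": "object",
            "properties": {"storage": {"type": "string"}},
        },
    },
}

print(list(property_paths(schema)))
print(next_untested(schema, tested={"replicas", "image"}))  # persistence
```

Repeating next_untested until it returns None gives a campaign in which every property is changed at least once.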

Acto tests different scenarios based on the semantics of the changed properties (Acto automatically infers these semantics). Table 1 gives a few such scenarios. For example, Acto tests the scale-up-and-scale-down and the scale-down-and-scale-up sequences if a property represents the number of replicas. Acto also tests different pod assignments that trigger the operator to re-configure or re-deploy managed systems differently. This scenario-driven approach allows Acto to focus on a small number of representative states, instead of the very large set of all possible property values. We implement the scenarios as plugins that can be extended or customized; users of Acto can implement more scenarios and support system-specific properties such as system configurations.

Property	Scenarios
Replicas	Scale up and then down; scale down and then up; upscale over system resource limit.
Affinity	Place all pods on one node; spread pods to different nodes; set unsatisfiable affinity rules.
Storage	Expand storage volumes; shrink storage volumes; request more storage than is available in a cluster.
Access	Switch between normal and privileged roles.

Table 1: Examples of built-in scenarios of Acto to generate new state declarations and trigger state transitions. Scenarios are created based on property semantics inferred by Acto and they can be extended or customized.
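A scenario for the Replicas row of Table 1 could be written along the following lines. The registry SCENARIOS, the function replica_scenarios, and its parameters are invented for this sketch and are not Acto's plugin interface.

```python
def replica_scenarios(current, capacity):
    """Return named sequences of replica counts to apply, starting from `current`.

    `capacity` stands for the largest replica count the test cluster can host.
    """
    up = min(current + 1, capacity)
    down = max(current - 1, 1)
    return {
        "scale-up-then-down": [up, current],
        "scale-down-then-up": [down, current],
        "over-resource-limit": [capacity + 1],   # a misoperation
    }

# Hypothetical registry that maps an inferred property semantic to its scenarios.
SCENARIOS = {"replicas": replica_scenarios}

for name, sequence in SCENARIOS["replicas"](current=3, capacity=4).items():
    print(name, sequence)
```

Each scenario yields only a handful of values, in line with the idea of focusing on a few representative states per property.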

Acto also generates misoperations, each of which triggers a state transition to an error state, S^e. For example, Acto generates misoperations that 1) scale the replicas beyond the total number of available physical resources, and 2) set unsatisfiable affinity rules (Table 1). Acto uses misoperations to check if an operator 1) is resilient to operation errors and 2) can recover from undesired or error states. Acto’s oracles check the former (is the system in a state S^e?). Acto checks the latter by rolling back S^e to the most recent healthy state. Misoperations that declare semantically erroneous states could escape constraint validation. A correct operator should not carry out an erroneous operation or at least should recover from operation failures.

4.2 Generating State Declarations

Acto generates desired-state declarations, D ∈ Ď, that are syntactically valid, resemble real-world scenarios, and satisfy predicates on property relationships. Such desired states improve the effectiveness and efficiency of Acto’s state space exploration. End-to-end tests are expensive, and a D that does not satisfy these conditions has a low chance of finding bugs.

Acto ensures that all property values in declared desired states are syntactically valid using the operation interface specification. (Invalid declarations would likely be directly rejected by the Kubernetes API servers before reaching the operator.) Kubernetes’ OpenAPISchema specification defines constraints on all supported properties. For composite properties, Acto uses composite constraints like required properties and also derives constraints from the sub-properties. For primitive properties, Acto uses constraints like the type, min/max values (for numeric types), length (for string type), regular-expression patterns, etc.
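The constraint kinds listed above can be checked with a few lines of code. The sketch below handles a small subset of OpenAPI keywords (type, minimum, maximum, maxLength, pattern, enum) and proposes boundary and neighbor values for an integer property; the helper names is_valid and integer_candidates and the choice of candidates are assumptions, not Acto's generators.

```python
import re

def is_valid(value, schema):
    """Check a primitive value against a small subset of OpenAPI schema constraints."""
    kind = schema.get("type")
    if kind == "integer":
        if not isinstance(value, int) or isinstance(value, bool):
            return False
        if "minimum" in schema and value < schema["minimum"]:
            return False
        if "maximum" in schema and value > schema["maximum"]:
            return False
    elif kind == "string":
        if not isinstance(value, str):
            return False
        if "maxLength" in schema and len(value) > schema["maxLength"]:
            return False
        # schema patterns are unanchored, hence re.search rather than re.fullmatch
        if "pattern" in schema and re.search(schema["pattern"], value) is None:
            return False
    elif kind == "boolean":
        if not isinstance(value, bool):
            return False
    if "enum" in schema and value not in schema["enum"]:
        return False
    return True

def integer_candidates(schema, current):
    """Propose new valid values: the neighbors of `current` and the declared bounds."""
    raw = {
        current - 1,
        current + 1,
        schema.get("minimum", current),
        schema.get("maximum", current),
    }
    return sorted(v for v in raw if v != current and is_valid(v, schema))

replicas_schema = {"type": "integer", "minimum": 1, "maximum": 7}
print(integer_candidates(replicas_schema, current=3))  # [1, 2, 4, 7]
print(is_valid("zk-0", {"type": "string", "pattern": "^[a-z0-9-]+$", "maxLength": 8}))  # True
```

Values that fail is_valid are filtered out before use, since the API server would likely reject them before they reach the operator.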

To exercise various operation scenarios, Acto changes properties based on their semantics. Acto infers the semantics of a property in the interface specification by mapping it to a set of resource types in the Kubernetes core APIs. Such mapping is feasible because many operations for property changes are eventually delegated to Kubernetes core services. Acto exploits the insight that property structure is effective for mapping to properties in the Kubernetes core resource specification. Specifically, all Kubernetes core resource types have unique structures. Figure 4 exemplifies how Acto infers semantics from the property structure: CassOp has a cassandraDataVolumeClaimSpec property with the same structure as the VolumeClaimTemplates property in Kubernetes’ StatefulSet resource. Therefore, Acto infers the semantics of cassandraDataVolumeClaimSpec using a structural mapping. When provided with operator source code, Acto can obtain more complete mapping via static program analysis that tracks how the property value is used in the operator code via its data flows.
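Structural mapping can be pictured as comparing the "shape" of a property's schema with the shapes of known core resource types. The sketch below uses exact shape equality on heavily simplified, made-up schema fragments; the function names, the core_types table, and the field sets are assumptions for illustration, and a practical matcher would likely need to tolerate partial matches.

```python
def shape(schema):
    """Reduce a schema to a hashable structure of property names and types."""
    kind = schema.get("type")
    if kind == "object":
        props = schema.get("properties", {})
        return ("object", tuple(sorted((name, shape(sub)) for name, sub in props.items())))
    if kind == "array":
        return ("array", shape(schema.get("items", {})))
    return (kind,)

def infer_semantics(prop_schema, core_types):
    """Return the names of core resource types whose structure equals the property's."""
    target = shape(prop_schema)
    return [name for name, core in core_types.items() if shape(core) == target]

# Simplified stand-ins for core resource schemas (not the full Kubernetes definitions).
core_types = {
    "VolumeClaim": {
        "type": "object",
        "properties": {
            "accessModes": {"type": "array", "items": {"type": "string"}},
            "storageClassName": {"type": "string"},
        },
    },
    "Toleration": {
        "type": "object",
        "properties": {"key": {"type": "string"}, "effect": {"type": "string"}},
    },
}

# A custom property with a different name but the same structure as VolumeClaim.
custom_property = {
    "type": "object",
    "properties": {
        "storageClassName": {"type": "string"},
        "accessModes": {"type": "array", "items": {"type": "string"}},
    },
}

print(infer_semantics(custom_property, core_types))  # ['VolumeClaim']
```

The match does not depend on the property's own name, only on its structure, which is the point of the cassandraDataVolumeClaimSpec example.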

To generate values for properties with inferred semantics, Acto currently implements 57 property-specific generators based on Kubernetes resource semantics. Most of these properties are composite. The generators focus on high-level semantics to exercise different scenarios (Table 1). Each generator creates property values to realize a scenario. We find that most properties exposed by operation interfaces (83% on average in our evaluated operators) can be mapped to Kubernetes resources. For properties whose semantics Acto cannot infer, Acto mutates current values based on their data types while satisfying syntactic constraints. Acto only mutates primitive sub-properties of composite properties.

Lastly, the values Acto generates should satisfy predicates, in the form of property dependencies, for changed property values to trigger state transitions. Acto automatically infers property dependencies from naming convention. In Kubernetes, dependencies can be identified by feature toggles—each composite property has a Boolean sub-property named “enabled.” For example, operations that change PCN/MongoOp’s backup policy must also set Backup.Enabled to True. With operator source code, Acto can also detect dependencies among property values by analyzing control-flow relationships among program variables.
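The feature-toggle convention can be honored when building a new declaration: whenever a sub-property of a composite property changes, the composite's Boolean enabled flag is switched on as well. The helper below is a sketch of that idea on nested dictionaries; the function name and the backup field names are hypothetical.

```python
import copy

def set_with_toggles(declaration, path, value):
    """Return a copy of `declaration` with `path` set to `value`.

    Every enclosing composite property that has a Boolean `enabled`
    sub-property is switched on, so the change can take effect.
    """
    new_declaration = copy.deepcopy(declaration)
    node = new_declaration
    for key in path[:-1]:
        node = node[key]
        if isinstance(node.get("enabled"), bool):
            node["enabled"] = True
    node[path[-1]] = value
    return new_declaration

# Hypothetical declaration with a disabled backup policy.
declaration = {"backup": {"enabled": False, "schedule": "daily"}}

print(set_with_toggles(declaration, ["backup", "schedule"], "hourly"))
# {'backup': {'enabled': True, 'schedule': 'hourly'}}
```

Without the toggle, changing the schedule alone would likely not trigger any state transition, because the operator would ignore a disabled feature.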

Figure 4: Semantic analysis maps the properties in the CRD interface to the properties of a Kubernetes core resource.

4.3 Test Oracles

Acto’s test oracles check if the system state after an operation matches the desired state. If there is a match, Acto reports the operation as successful. Otherwise, Acto signals an alarm that the user can inspect to find bugs. The complexity of Acto’s oracles depends on whether mismatches between reconciled and desired states manifest explicitly or implicitly. Acto implements oracles to check for state mismatches that manifest as explicit error states, such as exceptions, error codes, and timeouts.

Acto also implements oracles to check if S_i satisfies D_i for each state transition, as many operator bugs manifest as implicit-state mismatches with no explicit symptoms. Checking whether S_i satisfies D_i is challenging. First, S_i and D_i are represented differently: D_i is a specification [2] and S_i is embodied in state objects [4]. Second, satisfiability is domain-specific; its semantics may not be obvious. To address these challenges, Acto devises the consistency oracle and differential oracle.

Acto also has an interface that allows users to add custom oracles with domain-specific knowledge, e.g., a probe that tries to set and get some path in ZooKeeper.

4.3.1 Consistency Oracle

Some bugs occur if an operator stops reconciliation because the system is in state S_i which satisfies D in the operator’s view, but which does not satisfy D in Kubernetes’ view. To detect such bugs, Acto additionally checks whether Kubernetes’ view matches D; Kubernetes’ view is encoded in the spec sections of state objects, which are jointly maintained by all running controllers and operators. For each transition from S_i−1 → S_i, Acto attempts to match each property p (specified in D_i) to the corresponding spec fields in the state objects. If a match is found, it indicates that Kubernetes agrees with the operator. Otherwise, Acto raises an alarm.
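A very rough version of this check is sketched below. It pairs a declared property with spec fields of the same name and raises an alarm when such fields exist but none carries the declared value. Matching by field name, skipping properties with no same-named field, and the names leaves and consistency_alarms are all simplifying assumptions; the text does not spell out how Acto finds the corresponding fields.

```python
def leaves(obj, path=()):
    """Yield (path, value) for every non-dict value in a nested dict."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from leaves(value, path + (key,))
    else:
        yield path, obj

def consistency_alarms(declaration, state_objects):
    """Compare declared properties with same-named fields in the objects' spec sections."""
    spec_fields = [
        (path, value)
        for obj in state_objects.values()
        for path, value in leaves(obj.get("spec", {}))
    ]
    alarms = []
    for dpath, want in leaves(declaration):
        candidates = [value for path, value in spec_fields if path[-1] == dpath[-1]]
        # Alarm only when corresponding fields exist and none has the declared value.
        if candidates and want not in candidates:
            alarms.append((".".join(dpath), want))
    return alarms

declaration = {"replicas": 3}
state_objects = {
    # The operator believes it is done, but the StatefulSet still declares 2 replicas.
    "statefulset/zk": {"spec": {"replicas": 2}, "status": {"readyReplicas": 2}},
}
print(consistency_alarms(declaration, state_objects))  # [('replicas', 3)]
```

Only the spec sections are consulted, since those encode the view that Kubernetes and the other controllers act on.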

4.3.2 Differential Oracle

The differential oracle does not check against D_i; it checks that an operator 1) reconciles to the matching desired states from different existing states S_i−1 and S₀, and 2) recovers the system from (implicit or explicit) error state S^e to state S_i−1. Acto rolls back to S_i−1 to continue exploration from a known good state. Here, S_i denotes the state reached by applying D_i from S_i−1, and S'_i denotes the state reached by applying the same D_i from S₀. Figure 5 shows a bug detected by the differential oracle. There, the Boolean KnativeOp property contour.enabled enables or disables Contour (an ingress controller). However, a KnativeOp bug makes it fail to disable Contour once it is enabled. The consistency oracle does not detect this bug: it is hard to automatically map the Boolean property to the existence of a Contour pod. The differential oracle detects the bug because a Contour pod appears in S_i, but not in S'_i.

Note that reporting alarms for any difference in the state objects of S_i and S'_i would be brittle and lead to false positives, because execution-specific values like timestamps, IP addresses, and ports may change nondeterministically. Acto excludes execution-specific fields when comparing state objects. Acto automatically labels those fields by 1) running the transition S₀ → S₁ multiple times as a calibration and labeling fields with values varying across runs, and 2) running S₀ → S₁ multiple times, iff the differential oracle fires an alarm on S_i, to ensure relevant fields are deterministic.
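The comparison with calibration can be sketched as follows: fields whose values differ across repeated runs of the same transition are labeled nondeterministic and then ignored when two end states are compared. The object names, field names, and helper functions are invented for this example and only loosely follow the Contour case above.

```python
def flatten(obj, path=()):
    """Map every leaf of a nested dict to its path."""
    if not isinstance(obj, dict):
        return {path: obj}
    flat = {}
    for key, value in obj.items():
        flat.update(flatten(value, path + (key,)))
    return flat

MISSING = object()

def nondeterministic_fields(runs):
    """Label fields whose value (or presence) varies across calibration runs."""
    flats = [flatten(run) for run in runs]
    paths = set().union(*flats)
    return {p for p in paths if any(f.get(p, MISSING) != flats[0].get(p, MISSING) for f in flats)}

def state_diff(state_a, state_b, ignore):
    """Return the paths that differ between two states, excluding ignored fields."""
    a, b = flatten(state_a), flatten(state_b)
    return sorted(p for p in set(a) | set(b)
                  if p not in ignore and a.get(p, MISSING) != b.get(p, MISSING))

# Calibration: the same transition run twice; only the pod IP varies.
run1 = {"pod/zk-0": {"status": {"podIP": "10.0.0.1", "phase": "Running"}}}
run2 = {"pod/zk-0": {"status": {"podIP": "10.0.0.7", "phase": "Running"}}}
ignore = nondeterministic_fields([run1, run2])

# S_i still contains a Contour pod; S'_i (reached from the initial state) does not.
s_i = {"pod/zk-0": {"status": {"podIP": "10.0.0.1", "phase": "Running"}},
       "pod/contour": {"status": {"phase": "Running"}}}
s_i_prime = {"pod/zk-0": {"status": {"podIP": "10.0.0.9", "phase": "Running"}}}

print(state_diff(s_i, s_i_prime, ignore))  # [('pod/contour', 'status', 'phase')]
```

The differing pod IP is filtered out by calibration, so the only reported difference is the leftover Contour pod.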

Figure 5: A KnativeOp bug that is detected by Acto’s differential oracle [5]. Contour continues to manage ingress after an operation explicitly disables it.

5 Evaluation and Experience

In our original SOSP paper [14], we rigorously evaluated Acto with eleven popular open-source Kubernetes operators which manage nine cloud systems. All evaluated operators were developed by the official teams of the managed systems, or by companies that sell services built around the managed systems. Acto found new bugs in every evaluated Kubernetes operator, and in total found 56 unknown bugs in all the evaluated operators. We reported all of these bugs; at that time, 42 had been confirmed and 30 had been fixed. No bug report was rejected. Acto also found six bugs in Kubernetes and in the Go runtime that affected multiple operators; all were confirmed or fixed.

Since then, we have been continuously developing the Acto project, and Acto has been used to test more Kubernetes operators. Recently, we designed an assignment on Kubernetes controller reliability based on the Acto project for CS 523 (Advanced Operating Systems) at the University of Illinois Urbana-Champaign and used it in the semester of Spring 2024, with the purpose of teaching cloud computing concepts and cloud-native technologies. Many students in the course have successfully applied Acto to more than 40 open-source Kubernetes operators. Most students find Acto easy (and fun) to use and effective in finding defects in existing operators. Students are encouraged to report the bugs they find back to the developers. So far, Acto has helped find more than 80 new bugs (at least 62 were confirmed and 41 have been fixed). The project maintains the list of bugs found by Acto [8].

During the process, students have continuously improved Acto and added new features. For example, Acto now has support for Kubernetes operators written in Java and Rust, in addition to Go. Acto has also started to support simple crash testing [24]. Acto has also been used in other research projects on Kubernetes reliability. For example, Acto was used to empirically evaluate formally verified Kubernetes controllers [25].

The original Acto paper is available at https://github.com/xlab-uiuc/acto/blob/main/docs/acto-paper_sosp2023.pdf.

Appendix

References:

[1] Cloud Native Computing Foundation Operator White Paper. https://www.cncf.io/wp-content/uploads/2021/07/CNCF_Operator_WhitePaper.pdf.

[2] Custom Resources. https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/.

[3] Operator Pattern. https://kubernetes.io/docs/concepts/extend-kubernetes/operator/.

[4] Understanding Kubernetes Objects. https://kubernetes.io/docs/concepts/overview/working-with-objects/kubernetes-objects/.

[5] Contour pod is not deleted when disabled by user. https://github.com/knative/operator/pull/1176, 2022.

[6] Kubernetes and cloud native operations report 2022. https://juju.is/cloud-native-kubernetes-usage-report-2022#kubernetes-operators, 2022.

[7] Zookeeper pod keeps crashing when scaling down and up. https://github.com/pravega/zookeeper-operator/pull/526, 2022.

[8] Bugs found by Acto. https://github.com/xlab-uiuc/acto/blob/main/bugs.md, Apr. 2024.

[9] BROWN, A. B., AND PATTERSON, D. A. Undo for Operators: Building an Undoable E-mail Store. In Proceedings of the 2003 USENIX Annual Technical Conference (ATC’03) (June 2003).

[10] BURNS, B., GRANT, B., OPPENHEIMER, D., BREWER, E., AND WILKES, J. Borg, Omega, and Kubernetes. Communications of the ACM 59, 5 (May 2016), 50–57.

[11] CEBULA, M., AND SHERROD, B. 10 Weird Ways to Blow Up Your Kubernetes. In KubeCon North America (Nov. 2019).

[12] DOBIES, J., AND WOOD, J. Kubernetes Operators: Automating the Container Orchestration Platform. O’Reilly Media, Inc., 2020.

[13] GRAY, J. Why Do Computers Stop and What Can Be Done About It? Tandem Technical Report 85.7 (June 1985).

[14] GU, J. T., SUN, X., ZHANG, W., JIANG, Y., WANG, C., VAZIRI, M., LEGUNSEN, O., AND XU, T. Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management. In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP’23) (Oct. 2023).

[15] GUILLOUX, S. Writing a Kubernetes Operator: the Hard Parts. In KubeCon North America (Nov. 2019).

[16] HUNT, P., KONAR, M., JUNQUEIRA, F. P., AND REED, B. ZooKeeper: Wait-free coordination for Internet-scale systems. In Proceedings of the 2010 USENIX Annual Technical Conference (ATC’10) (June 2010).

[17] KUMAR, H., AND ŠAFRÁNEK, J. Storage on Kubernetes - Learning From Failures. In KubeCon North America (Nov. 2019).

[18] LAGRESLE, M. Moving to Kubernetes: the Bad and the Ugly. In ContainerDays (June 2019).

[19] MADHU, C. Preventing Controller Sprawl From Taking Down Your Cluster. In KubeCon North America (Oct. 2022).

[20] NAGARAJA, K., OLIVEIRA, F., BIANCHINI, R., MARTIN, R. P., AND NGUYEN, T. D. Understanding and Dealing with Operator Mistakes in Internet Services. In Proceedings of the 6th USENIX Conference on Operating Systems Design and Implementation (OSDI’04) (Dec. 2004).

[21] OPPENHEIMER, D., GANAPATHI, A., AND PATTERSON, D. A. Why Do Internet Services Fail, and What Can Be Done About It? In Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems (USITS’03) (Mar. 2003).

[22] RATIS, P. Lessons Learned Using the Operator Pattern to Build a Kubernetes Platform. In SREcon21 (Oct. 2021).

[23] SUN, X., CHENG, R., CHEN, J., ANG, E., LEGUNSEN, O., AND XU, T. Testing Configuration Changes in Context to Prevent Production Failures. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI’20) (Nov. 2020).

[24] SUN, X., LUO, W., GU, J. T., GANESAN, A., ALAGAPPAN, R., GASCH, M., SURESH, L., AND XU, T. Automatic Reliability Testing for Cluster Management Controllers. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI’22) (July 2022).

[25] SUN, X., MA, W., GU, J. T., MA, Z., CHAJED, T., HOWELL, J., LATTUADA, A., PADON, O., SURESH, L., SZEKERES, A., AND XU, T. Anvil: Verifying Liveness of Cluster Management Controllers. In Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI’24) (July 2024).

[26] XU, T., JIN, X., HUANG, P., ZHOU, Y., LU, S., JIN, L., AND PASUPATHY, S. Early Detection of Configuration Errors to Reduce Failure Damage. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16) (Nov. 2016).

[27] XU, T., ZHANG, J., HUANG, P., ZHENG, J., SHENG, T., YUAN, D., ZHOU, Y., AND PASUPATHY, S. Do Not Blame Users for Misconfigurations. In Proceedings of the 24th Symposium on Operating System Principles (SOSP’13) (Nov. 2013).

Article Categories:

SRE

Distributed systems

Sysadmin

Last updated August 6, 2024

Authors:

Jiawei Tyler Gu is a PhD candidate in the Computer Science Department of the University of Illinois Urbana-Champaign. His research focuses on improving the reliability of cloud system management.

jiaweig3@illinois.edu

Xudong Sun is a fifth-year Ph.D. student in the Computer Science department at the University of Illinois Urbana-Champaign (UIUC). His research focuses on improving the reliability of modern distributed systems using systematic testing and formal verification.

xudongs3@illinois.edu

Zhen Tang is a Master’s student in the Department of Computer Science at the University of Illinois at Urbana-Champaign. He has a broad interest in computer systems, particularly distributed systems. Currently, he is focusing on the reliability of cloud system management.

zhent6@illinois.edu

Chen Wang is a Senior Research Scientist at the IBM T.J. Watson Research Center. Her interests lie in Kubernetes, Container Cloud Resource Management, Cloud Native AI & LLM systems, and applying AI in Cloud system management. She is an open-source advocate, a Kubernetes & CNCF contributor, and a KubeCon speaker. She obtained an MS and a Ph.D. in Electrical & Computer Engineering from Carnegie Mellon University (CMU).

Chen.Wang1@ibm.com

Mandana Vaziri is a Principal Research Scientist at IBM T.J. Watson Research Center. She received her PhD from MIT in 2004 working in the area of formal methods. Her thesis entitled "Finding bugs in software with a constraint solver" won the 2016 ACM SigSoft Impact Paper Award. At IBM, she has worked on a variety of projects in Programming Languages and Software Engineering: X10DT (the Eclipse-based IDE of the X10 programming language), ActiveSheets (a spreadsheet streaming programming language), SwaggerBot (a tool to obtain chatbots from API specifications), Kubernetes Operators (IBM Cloud Operator, Composable Operator), operator testing, LLM-powered software assistants, among others. Recently, she has been working in the area of AI for code, and on a prompt programming language for leveraging LLMs in software engineering.

Owolabi Legunsen is an assistant professor in the Department of Computer Science at Cornell University, where he is a member of the software engineering research group. His research is on improving software testing and runtime verification, and on unifying both approaches.

legunsen@cornell.edu

Tianyin Xu is an Assistant Professor of Computer Science at the University of Illinois at Urbana-Champaign (UIUC). His research focuses on building reliable computer systems that empower next-generation cloud and datacenter computing.

tyxu@illinois.edu