Today’s cluster managers, like Kubernetes, keep the beating heart of datacenters running. These managers are architected as a collection of loosely-coupled controllers and follow the state-reconciliation principle: each controller’s job is to monitor the actual cluster state and its input desired state, then modify the cluster state until it is reconciled with the desired state; a controller builds on top of underlying controllers by modifying their input desired states.
The Kubernetes ecosystem has thousands of such controllers available. Each Kubernetes deployment comprises an appropriate subset of controllers, a flexible modularity enabled by the loosely-coupled architecture.
Cluster management is challenging because many independently authored controllers interact asynchronously and concurrently. Even more challenging is that failures may interrupt a controller’s operations. In principle, each controller should wake up from failure, observe that the current state does not match the desired state, and proceed to reconcile. In practice, a very common bug pattern is for a controller’s correct operation to depend on its internal state (stack frame or program counter); if a failure occurs, that state is lost, and the controller logic is unable to recognize that it has left its task incomplete.
This kind of failure is called a “liveness” bug: the controller does not do anything immediately apparent as wrong (which would be a safety violation), but it never gets around to doing the expected right thing. We studied controller failures and found that this form of failure is dominant in the controller ecosystem. Recent testing techniques [3, 10] can reveal such bugs using automatically generated tests and fault injections, but testing cannot confirm a controller’s correctness.
Our goal is to confirm statically, at compile time, that a controller always gets around to doing the right thing, even in the presence of asynchrony, concurrency and failures. To this end, we present Anvil, a framework for developing controllers and proving their correctness. We have used Anvil to implement and verify three practical Kubernetes controllers that can readily be deployed on real-world Kubernetes platforms. Anvil and the verified controllers are available at https://github.com/vmware-research/verifiable-controllers.
Modern cluster managers like Kubernetes follow the state-reconciliation pattern, where controllers repeatedly attempt to reconcile the current cluster state with some desired state [2]. We use Kubernetes as a representative example to present the basics of state reconciliation. In Kubernetes, the cluster state is represented as a collection of data objects, and all cluster management logic is implemented in a fleet of controllers running as microservices. Figure 1 illustrates Kubernetes’ architecture. Kubernetes’ core comprises an ensemble of API servers and a highly available, strongly consistent data store (e.g., etcd) that houses the cluster state. Every entity in the cluster has a corresponding object in the cluster state, including pods, volumes, nodes, and groups of applications. All controllers in Kubernetes interact with the cluster state via the API servers.
Each controller implements its cluster management logic in a reconcile() function, which is invoked whenever the desired state description (or the cluster state relevant to the controller) changes. This design makes Kubernetes highly extensible: supporting a new application or feature is a matter of adding a new controller and a corresponding custom object type to describe the desired state; it does not require changes to API servers or existing state objects. Kubernetes’ extensibility has led to the emergence of a thriving ecosystem of thousands of custom controllers developed by commercial vendors and open-source communities.
Figure 2 exemplifies the reconciliation process of a Kubernetes controller for managing ZooKeeper. To create a ZooKeeper cluster, the controller takes three steps, creating: (1) a networked service (a Kubernetes service object), (2) a ZooKeeper configuration (a config map object), and (3) a stateful application (a stateful set object) with three replicas. Each step is performed by creating a new state object for the corresponding resource via the API server, which in turn triggers Kubernetes’ built-in controllers; for example, the StatefulSet controller creates three sets of pods and volumes to run containerized ZooKeeper nodes. In the end, the cluster state matches the desired state. Later, if the desired state changes (e.g., the number of replicas is increased), the ZooKeeper controller will start a new iteration of reconciliation that updates the stateful set object to scale up the ZooKeeper cluster.
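The sketch below illustrates these three creation steps using the kube-rs client library [1]. It is only a happy-path illustration: object construction is elided, and the function and parameter names are hypothetical rather than the controller’s actual code.

```rust
// Illustrative only: the three creation steps from Figure 2, expressed as
// kube-rs API calls (requires the kube and k8s-openapi crates).
use k8s_openapi::api::apps::v1::StatefulSet;
use k8s_openapi::api::core::v1::{ConfigMap, Service};
use kube::api::{Api, PostParams};
use kube::Client;

async fn create_zookeeper_objects(
    client: Client,
    ns: &str,
    svc: Service,     // (1) networked service
    cm: ConfigMap,    // (2) ZooKeeper configuration
    sts: StatefulSet, // (3) stateful application with three replicas
) -> kube::Result<()> {
    let pp = PostParams::default();
    // Each create call adds a new state object via the API server, which in
    // turn triggers Kubernetes built-in controllers (e.g., StatefulSet).
    Api::<Service>::namespaced(client.clone(), ns).create(&pp, &svc).await?;
    Api::<ConfigMap>::namespaced(client.clone(), ns).create(&pp, &cm).await?;
    Api::<StatefulSet>::namespaced(client, ns).create(&pp, &sts).await?;
    Ok(())
}
```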
A bug in a controller’s reconciliation can result in the controller never being able to match the desired state, even when reconcile() is called repeatedly—a liveness violation. Controllers are expected to be level-triggered [5]: reconcile() can be called from any current cluster state to match any given desired state, with no guarantee that the controller has seen the entire history of cluster state changes [12]. In addition, controllers run within dynamic and complex environments and thus must tolerate unexpected failures, network interruptions, and concurrency and asynchrony issues while running reconcile().
Figure 2 shows one of many bug patterns of controllers [3, 8, 9, 10]. If the controller crashes between steps (1) and (2) during an execution of reconcile(), Kubernetes will reboot the controller. The freshly invoked reconcile() call now faces the intermediate state created by the previous failed execution (step (1)). However, in this case, the controller would never perform step (2) due to a buggy predicate, which only checks whether the networked service exists, but not whether the config map also exists.
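The following is a distilled sketch of that bug pattern, assuming hypothetical function and parameter names; a real controller would query the API server rather than take booleans.

```rust
// Hypothetical sketch of the buggy resume logic described above.
fn plan_resumed_reconcile(service_exists: bool, _config_map_exists: bool) -> &'static str {
    // BUG: the predicate only checks the service. After a crash between steps
    // (1) and (2), the service exists but the config map does not, yet the
    // controller concludes the early steps are done and never creates the
    // config map. A correct, level-triggered check would also examine
    // `_config_map_exists` (and the stateful set) independently.
    if service_exists {
        "skip ahead to step (3): create the stateful set"
    } else {
        "run steps (1)-(3) from the beginning"
    }
}
```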
The enormity of the cluster state space and the complexity of failure events lead to a state-space explosion, making it prohibitively expensive (if not impossible) to test a controller’s behavior under all possible scenarios. In particular, liveness violations are notoriously hard to detect because a testing tool, which only observes a controller’s external behavior, cannot decide whether the controller will never realize the desired state or is simply slow.
Anvil is a framework that allows developers to confirm statically, at compile time, that a controller performs state reconciliation correctly. Anvil uses formal verification—an approach that establishes a program’s correctness by proving that the program always adheres to a correctness specification.
With Anvil, developers implement the controller’s state-reconciliation logic in Rust and write a specification and a proof showing that the controller correctly implements the specification; the proof is mechanically checked to ensure that the proof itself is correct.
Implementing Controllers
In Anvil, developers implement a controller using a state machine to enable formal verification. Figure 3 shows a snippet of the Anvil Controller API specified using a Rust trait: it involves defining the initial state and the transitions of a state machine.
Anvil’s reconcile() uses the state machine: it starts from the initial state and invokes step() iteratively until all steps are done or any step encounters an error. Each iteration of step() returns the next state in the state machine, together with an external request. The external request is typically a REST call to the Kubernetes API servers, but can also be extended to non-Kubernetes APIs. The response to the external request is passed as an argument to the next iteration of step(). Note that the API enforces no more than one external request per step(), making each state-machine transition atomic with respect to cluster-state changes. Anvil’s reconcile() interfaces with a trusted Kubernetes client library (kube-rs [1]), which invokes reconcile() upon changes, handles its output, and requeues the next invocation.
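As a rough, hypothetical sketch of such a state-machine API (the names below are illustrative stand-ins, not Anvil’s actual trait), a controller interface might look like this:

```rust
// Hypothetical sketch of a controller state-machine API in this style; none
// of these names are Anvil's actual trait or types.

/// A request the framework issues on the controller's behalf, typically a
/// REST call to the Kubernetes API servers (extensible to other APIs).
pub enum ExternalRequest {
    GetObject { kind: String, name: String },
    CreateObject { kind: String, manifest: String },
    UpdateObject { kind: String, manifest: String },
}

/// The response to the previous external request, fed into the next step().
pub struct ExternalResponse {
    pub success: bool,
    pub body: String,
}

pub trait Reconciler {
    /// The controller's local state-machine state.
    type State;

    /// The state each invocation of reconcile() starts from.
    fn initial_state() -> Self::State;

    /// One transition: given the current state and the response to the
    /// previous request (if any), return the next state and at most one new
    /// external request, so each transition stays atomic with respect to
    /// cluster-state changes.
    fn step(
        state: Self::State,
        resp: Option<ExternalResponse>,
    ) -> (Self::State, Option<ExternalRequest>);

    /// Whether reconciliation has finished or hit an error.
    fn done_or_error(state: &Self::State) -> bool;
}
```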
Figure 4 shows the step() implementation of a ZooKeeper controller (Figure 2). The state machine starts from the CheckService state, where it returns a request to read the service object from the API server (service_get_req) and the next state to transition to, ReconcileService. The reconcile() method fetches the service object by calling the API server and moves on to the next iteration of step(), bringing the state machine to the ReconcileService branch. The controller proceeds to create or update the service based on the response to service_get_req. In this way, the controller progressively reconciles each cluster-state object and eventually matches the desired state declared by ZKD.
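A condensed, hypothetical sketch of this kind of step() logic is shown below; the state names mirror the prose (CheckService, ReconcileService), but the types and helpers are illustrative rather than the verified controller’s actual code, and the later states for the config map and stateful set are elided.

```rust
// Illustrative step() for a ZooKeeper-style controller state machine.
enum ZkState {
    CheckService,
    ReconcileService,
    Done, // stands in for the elided config map / stateful set states
}

enum Request {
    GetService,
    CreateService,
    UpdateService,
}

struct Response {
    service_found: bool,
}

fn step(state: ZkState, resp: Option<Response>) -> (ZkState, Option<Request>) {
    match state {
        // Ask the API server to read the service object (service_get_req),
        // then transition to ReconcileService.
        ZkState::CheckService => (ZkState::ReconcileService, Some(Request::GetService)),
        // Create or update the service based on the response, then move on to
        // the next cluster-state object (config map, stateful set, ...).
        ZkState::ReconcileService => match resp {
            Some(r) if r.service_found => (ZkState::Done, Some(Request::UpdateService)),
            _ => (ZkState::Done, Some(Request::CreateService)),
        },
        ZkState::Done => (ZkState::Done, None),
    }
}
```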
To verify the controller, developers need to first write a correctness specification that they want the controller to implement. However, controllers, unlike some other systems (e.g., consensus systems or storage systems), do not have well-known protocols or properties that formally define their correctness. How do we formally specify controller correctness?
Specifying Controller Correctness
We formalize controller correctness as eventually stable reconciliation (ESR), a specification that captures key properties of state reconciliation. We follow the principle that the specification must be: (1) powerful enough to preclude a broad range of bugs, (2) generally applicable to diverse controllers, and (3) concise enough for manual inspection.
ESR captures two key properties of any controller’s state reconciliation behavior: (1) progress: given a desired state description, the controller must eventually make the cluster state match that desired state (unless the desired state changes), and (2) stability: if the controller successfully brought the cluster to the desired state, it must keep the cluster in that state (unless the desired state changes).
We formalize ESR as a TLA (temporal logic of actions [6]) formula. In TLA, the behavior of the state machine is captured by its set of traces, infinite sequences of system states. The ESR formula should hold for all traces of the controller’s execution under all possibilities for asynchrony, concurrency, and failures. We use d to denote a state description, desire(d) to denote whether d is the current description of the desired state (the controller’s input), and match(d) to denote whether the current cluster state matches the description d. Our definition of ESR is given by the following formula:
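∀d. ☐desire(d) ↝ ☐match(d)     (1)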
Informally, ESR asserts that if at some point the desired state stops changing, then the cluster will eventually reach a state that matches it, and stay that way forever. The universal quantifier ∀ (for-all) is used to assert on all possible state descriptions. For example, ∀d.P(d) holds if and only if P(d) holds for all possible d. The temporal operators ↝ (leads-to) and ☐(always) are used in temporal logics to reason about the future of an execution trace. If predicates P and Q talk about the current state, then ☐P says that P holds in the current state and all future states, while P ↝ Q says that if P holds at some state then Q holds in the same state or some future state. Temporal logics such as TLA also allow nesting of temporal operators; for example, P ↝ ☐Q means that if P holds then eventually we get to a point such that from that point onwards, Q always holds.
The formalization of ESR captures the key correctness properties shared by virtually all controllers: progress and stability. We elaborate on this with a detailed dissection of Equation 1. The innermost conclusion of the formula is ☐match(d) after ↝, which states that eventually the controller matches the desired state (progress), and from then on, it always keeps the cluster state at the desired state (stability). Before ↝, ☐desire(d) is a realistic and necessary premise for the controller to match the desired state—if the desired state description keeps changing forever, the controller will keep chasing a moving target forever, and nothing can be guaranteed, as we do not wish to assume a bound on how long state reconciliation takes. Finally, the ∀d states that the controller reconciles all desired state descriptions.
ESR precludes any bug that prevents the controller from eventually making the cluster state match the desired state stably. Figure 5 illustrates the ESR definition with examples that violate or satisfy ESR: (a) violates progress because the cluster state never matches d, (b) violates stability because the cluster state first matches d but then deviates from d, (c) satisfies ESR because the cluster state eventually matches and thereafter always matches d2, and (d) vacuously satisfies ESR because the desired state never stops changing, so ☐desire(d) does not hold for any fixed d.
The verification goal for each controller is to prove that the controller satisfies ESR—that is, all possible executions of the controller satisfy ESR. We use model to describe all possible executions of the controller running in an environment with asynchrony, concurrency, and faults. Then the statement that the controller satisfies ESR is formalized as:
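model ⊨ ESR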
The ⊨ symbol in this context means that any execution that the controller might produce satisfies ESR.
Strictly speaking, ESR only guarantees one successful state reconciliation—the one that happens after the desired state stops changing forever. However, in practice the controller has no way of knowing if the desired state will change in the future or not. Therefore, we can expect that a controller that satisfies ESR will bring the cluster to match the desired state (and keep it like that) for any desired state that remains unchanged for long enough.
Anvil provides a TLA library that makes writing TLA formulas similar to writing code, inspired by previous work [4]. Figure 6 presents the ESR theorem written using Anvil. The definition of match varies across controllers; e.g., the match(d) for the ZooKeeper controller in Figure 4 checks if the service, config map, and stateful set exist in the data store and match the desired state description d. Note that Anvil is not limited to ESR; developers can specify and prove other controller properties using Anvil.
With the controller implementation and specification ready, developers need to write a proof to show that the controller correctly implements the specification. To ensure the correctness of the proof itself, the proof is mechanically checked by the Verus verifier [7].
Proving Controller Correctness
Proving that a controller implements ESR requires reasoning about how the controller makes progress towards the desired state and then keeps the cluster state at the desired state stably.
To reason about the controller’s progress, most of the proof involves demonstrating that if condition P holds then eventually Q holds (i.e., P ↝ Q). For example, if the controller sends a request, then eventually the request is received and handled by the API server. Proving P ↝ Q is typically done by applying the WF1 proof rule [6] (WF stands for “weak fairness”) that informally states that “Action A makes P lead to Q” with the following requirements: (1) running A in a state satisfying P makes Q hold in the next state, (2) P holds until A runs, and (3) if A remains enabled (A can possibly occur), then A eventually occurs; this condition is referred to as weak fairness.
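For reference, WF1 can be stated in Lamport’s notation [6] roughly as follows (dropping the stuttering subscripts for brevity), where N is the system’s next-state action and WF(A) is the weak-fairness assumption on A: if P ∧ [N] ⇒ (P′ ∨ Q′), P ∧ ⟨N ∧ A⟩ ⇒ Q′, and P ⇒ Enabled ⟨A⟩, then ☐[N] ∧ WF(A) ⇒ (P ↝ Q).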
Proving that the controller keeps the cluster state at the desired state stably is done by induction: if the cluster state currently matches the desired state, then no matter what action happens next, the cluster state will still match the desired state.
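In TLA terms, this is the standard invariance argument: roughly, showing that match(d) ∧ [N] ⇒ match(d)′ holds for the next-state action N lets one conclude match(d) ∧ ☐[N] ⇒ ☐match(d).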
To reduce developers’ proof burden, Anvil provides many proof utilities, including a model of the controller environment, a set of reasonable assumptions about the environment (e.g., weak fairness), and reusable lemmas encoding temporal proof rules (e.g., WF1) that can be directly assembled into developers’ ESR proofs. In addition, Anvil comes with a general ESR proof strategy that structures the proof by dividing the overall goal into several more approachable lemmas. We refer readers interested in more details about writing proofs to our OSDI’24 paper [11].
We have used Anvil to build three verified Kubernetes controllers for managing different applications and services (ZooKeeper, RabbitMQ, and FluentBit). For each controller, we use a mature, widely used controller as a reference (either the application’s official Kubernetes controller or one from a company that offers related products).
We aim to implement verified controllers that are feature-rich and of production quality. For the ZooKeeper and RabbitMQ controllers, we implement key features offered by the reference controllers, including scaling, version upgrading, resource allocation, pod placement, and configuration, as well as network and storage management. For the FluentBit controller, we implement all the features offered by the reference controller. All the verified controllers can readily be deployed on real-world Kubernetes platforms and manage their respective applications.
For verification, we spent around two person-months on verifying ESR for the ZooKeeper controller, during which we developed the proof strategy. We took much less time (around two person-weeks) to verify the other two controllers using the same proof strategy and similar invariants. We find Anvil’s ability to formally verify a controller’s implementation invaluable; we discovered four deep bugs via verification. Some of the bugs also exist in the reference controllers but were not detected by testing [3, 10].
Table 1 shows the effort to implement and verify each controller using Anvil. We also measured the controllers’ state-reconciliation time; the verified controllers achieve performance competitive with their unverified references.
| Controller | Trusted (LoC) | Verified Impl (LoC) | Proof (LoC) |
|---|---|---|---|
| ZooKeeper | 950 | 1134 | 8352 |
| RabbitMQ | 548 | 1598 | 7228 |
| FluentBit | 828 | 1208 | 8395 |
We are continuously improving Anvil and building more verified controllers using Anvil. Currently, we are building verified Kubernetes built-in controllers, including ReplicaSet, Deployment and StatefulSet controllers. In future work, we aim to gradually replace existing (unverified) controllers with verified controllers using Anvil, including both custom and built-in ones.
Kubernetes controllers are increasingly important, but ensuring their correctness is challenging. Anvil is a framework for developing Kubernetes controllers and confirming their correctness using formal verification. Our work shows that it is not only feasible but also pragmatic to implement, verify, and maintain practical Kubernetes controllers. We hope that Anvil offers a practical path towards provably correct cloud infrastructures.
More details are available in the full Anvil paper and Anvil's documentation.