You don’t have a production environment. You really don’t. In fact, you don’t have a staging environment either! At least, not in the sense you are used to thinking about these concepts. But before I explain what I mean by those inflammatory comments, we should ask ourselves a more fundamental question: what are “environments” and why do we need several types of them (e.g. staging and production)?
The familiar model of software development starts with the idea of a “production environment” — where the code runs and interacts with real world clients to “produce” value. This model has been inherited from the world of machinery and factories where physical things are produced — hence the name “production”. The idea is that we can separate the system — the thing we are building and maintaining — from the rest of the world: the environment.
The environment is everything that isn’t controlled by the developer, which by definition includes runtimes, operating systems, servers, networks, and even clients. Note that this definition is very contextual. For example, are third-party libraries part of the system or the environment? Indeed, the question of who is responsible for such components is a constant source of friction between teams. We often settle on the natural boundaries provided by build artifacts.
But no matter where we put the border, once it is defined we can rip the system out of the environment where it will eventually run and place it in a different environment that mimics production beyond that border. This is done for various reasons: to isolate developers’ work from each other, to facilitate testing under controlled conditions, and to lower costs by using components that are cheaper to run. Organizations typically end up with a progression of environments of increasing fidelity to production and increasing cost: dev, testing, staging, pre-prod, production.
If this progression seems familiar to many outside the software and manufacturing industries, it is because it is based on a fundamental idea that has served scientists and engineers for centuries: the laboratory (or lab for short). The lab is a space carefully isolated from the external world allowing us to control the direct environment of the system we wish to study. The isolation the lab provides also protects the rest of the world from “leakage” of unwanted effects which may disturb other experiments or wreak havoc on the planet (Wuhan speculations aside, this is a very real danger). Simply put, the lab provides realistic feedback in isolation.
The lab concept has been so successful throughout the industrial revolution that we now take it — and the assumptions behind it — for granted. But as science has progressed to explore more complex systems, we have discovered that some systems cannot be studied in a lab. Similarly, in the software industry we have observed a growing number of such failures — often surfacing with the now notorious phrase “but it works on my machine!”. To understand why the lab concept sometimes fails, we need to explore the assumptions it makes:
- There is a clear boundary which separates system and environment
- The boundary can be accurately described and replicated
- The system is deterministic
- System state can be recorded and replicated
While these assumptions hold true in simple cases, they become cost-prohibitive at a certain level of complexity and scale, and some of them simply do not hold no matter how much we are willing to invest.
Consider for a moment what replicating the boundary entails: we need to record and recreate all APIs which may be in use (even those that are implicitly used), all the static data, the exact versions of operating system packages, the configuration files, the network topology, etc. Tools like Docker help to some degree with operating system configuration, but to my knowledge no good solution exists for APIs, static data and network topology — and I know of no company that actually succeeded in recording and replicating them accurately. And even if we could do that, would we actually want to? Remember that APIs often have side effects — like billing a user’s credit card — which is the point of using them! By necessity we settle on a modified boundary — which only superficially resembles the real boundary of the environment — hoping it will be “good enough”.
Well, how about system state? Surely the state of the system which we have created can be easily recorded. Unfortunately, system state is a treacherous beast, scattered all over the place and elusive. The state is composed of the data in the database, but also caches, transient session data, client-side storage and the ever-changing network itself; it is impossible to capture accurately, and the problem only grows with scale.
Worse, the assumption that a clear boundary exists does not hold. There is always some leakage between the system and the environment, and that leakage only grows with scale and complexity. Any system that has an interesting impact on the world must interact with the world, and that means input and output. As much as we want that interaction to be supervised and controlled, realistically we end up with some poorly understood and poorly managed interactions: this is what we term leakage. Third-party libraries and services are a common source of leakage, but so are telemetry services such as metrics and graphing. In theory such monitoring and observability services should never affect the system, but in practice they do take down systems.
What about determinism? Surely we can trust computers to be deterministic! Well, not quite. Sources of nondeterminism are abundant, such as clocks, random number generators and arbitrary decisions of routers — not to mention the caprices of cloud infrastructure. Even for relatively contained systems like individual servers this can be quite challenging. And of course our systems are never composed of individual servers but rather huge fleets of machines working in clusters, each with its own unique state and sources of nondeterminism.
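To make this concrete, here is a minimal Go sketch (my own illustration, not taken from any particular system): even a trivial HTTP handler folds the wall clock, a pseudo-random number and the identity of the machine that happened to serve the request into its response, so two “identical” requests rarely produce identical results.

```go
package main

import (
	"fmt"
	"log"
	"math/rand"
	"net/http"
	"os"
	"time"
)

// handler answers a trivial request, yet its output already depends on the
// wall clock, a pseudo-random number and whichever machine served the call.
func handler(w http.ResponseWriter, r *http.Request) {
	host, _ := os.Hostname() // which machine in the fleet answered?
	now := time.Now()        // wall clocks drift, jump and disagree across servers
	jitter := rand.Intn(100) // pseudo-randomness (auto-seeded in recent Go versions)
	fmt.Fprintf(w, "served by %s at %s (jitter=%d)\n", host, now.Format(time.RFC3339Nano), jitter)
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```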
Let’s assume for a moment we can somehow capture the state of the system (as challenging as that might be) and ask ourselves: would that be enough to recreate a specific client interaction in order to test a feature or reproduce a bug with fidelity? Naively, we assume it is enough to have the same system, the same dataset and the client request. But what does it mean to have “the same system”? Real-world systems change all the time! With server clusters, servers come and go, resetting their local state. It turns out that a client experiences a different system depending on the exact point in time it interacts with the system and the specific set of servers it happened to talk to.
In addition, we are voluntarily changing the system by deploying new code. In many cases deployments are not atomic: for some time multiple versions can co-exist! Which version did a specific client interact with? And what about A/B tests or feature flags? A client can experience different code paths and different logic even with the same software in place!
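As a hedged illustration (the flag name and the hash-based bucketing below are assumptions of mine, not a description of any specific flag system), here is how the very same binary, deployed everywhere, can still pick a different code path per user:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// inExperiment deterministically buckets a user into an experiment by hashing
// the flag name and the user ID; roughly half of all users get the new path.
func inExperiment(flag, userID string) bool {
	h := fnv.New32a()
	h.Write([]byte(flag + ":" + userID))
	return h.Sum32()%100 < 50
}

func checkout(userID string) string {
	if inExperiment("new-checkout-flow", userID) {
		return "new checkout flow" // code path A
	}
	return "legacy checkout flow" // code path B
}

func main() {
	for _, u := range []string{"alice", "bob", "carol"} {
		fmt.Printf("%s -> %s\n", u, checkout(u))
	}
}
```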
In a sense, every client experiences a slightly different system configuration, but, at least in principle, we can recreate it. We would need to log a lengthy configuration vector for each client interaction, which means tracing every transaction in the system and is quite costly. But at least it can be done!
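A minimal sketch of what such a configuration vector might contain (the field names are illustrative assumptions on my part): every client interaction gets stamped with enough information to reconstruct which build, which server and which flag assignments the client actually saw.

```go
package main

import (
	"encoding/json"
	"os"
	"time"
)

// ConfigVector records which system a specific client interaction actually saw.
type ConfigVector struct {
	RequestID    string            `json:"request_id"`
	Timestamp    time.Time         `json:"timestamp"`
	ServerHost   string            `json:"server_host"`   // which machine in the fleet
	BuildVersion string            `json:"build_version"` // which deployed artifact
	FeatureFlags map[string]bool   `json:"feature_flags"` // per-user flag assignments
	ABBuckets    map[string]string `json:"ab_buckets"`    // experiment variants
}

func main() {
	host, _ := os.Hostname()
	vec := ConfigVector{
		RequestID:    "req-123",
		Timestamp:    time.Now().UTC(),
		ServerHost:   host,
		BuildVersion: "2024.06.01-abcdef",
		FeatureFlags: map[string]bool{"new-checkout-flow": true},
		ABBuckets:    map[string]string{"ranking-model": "variant-b"},
	}
	// In practice this record would be attached to your tracing/logging pipeline
	// for every single transaction, which is exactly why it is costly.
	_ = json.NewEncoder(os.Stdout).Encode(vec)
}
```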
That is, it can be done unless your system includes any form of personalization. Personalization algorithms depend on local state and are statistical in nature, making the exact system behavior impossible to capture. Every one of us has a different Twitter experience, even if we follow exactly the same people, and this is inherent in the design.
Given the extreme difficulty of creating and maintaining staging environments, let’s revisit our original goal: to safely get high-fidelity feedback on new features and bug fixes. Since simulating a production environment accurately is impossible, the value of testing in staging environments quickly declines as the system becomes more complex. How shall we test under realistic conditions?
Let’s look at the problem from a different angle: instead of starting with safety and building towards fidelity, let’s start with fidelity and work towards safety. Clearly, testing in production is the most realistic, but it is not safe: it does not isolate changes under test from the rest of the system and environment. If we could make it safe somehow, the benefits would be huge: not only would our tests be as realistic as possible, the cost of testing would drop significantly. The need to maintain multiple expensive environments would disappear, and production environments are already, of necessity, built to handle load in a cost-effective way.
How can we gain the safety we need for testing in production? As it turns out, there is an architectural pattern frequently used in modern SaaS systems that can help: multi-tenancy.
Usually implemented to isolate SaaS clients from one another, multi-tenancy is the ability to isolate aspects of the system between logical domains known as “tenants”. The isolation usually encompasses data, state, and system object configurations (e.g. user permissions), and sometimes performance as well (to some degree). With multi-tenancy in place, every developer or automated test run can open its own independent tenant, test whatever needs to be tested in that tenant in isolation, and finally remove the tenant. As tenants cost a fraction of a full environment and can be easily created and destroyed, they can be used as virtual environments, much as virtual machines are. This solves many problems, such as the scarcity of testing environments for developers and the difficulty of test isolation and cleanup, but the primary benefit is that it raises the fidelity of tests: they run against the same environment your clients use.
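To give a feel for the lifecycle, here is a hedged sketch in Go test form; the tenant provisioning API (CreateTenant/DeleteTenant) is hypothetical and stubbed in memory, since the point is only the pattern: open a tenant for the test run, exercise the system scoped to that tenant, then tear the tenant down together with its data.

```go
package mysvc_test

import (
	"context"
	"testing"
)

// tenantAdmin is an in-memory stand-in for a real tenant provisioning API.
type tenantAdmin struct{ tenants map[string]bool }

func (a *tenantAdmin) CreateTenant(ctx context.Context, name string) (string, error) {
	id := "tenant-" + name
	a.tenants[id] = true
	return id, nil
}

func (a *tenantAdmin) DeleteTenant(ctx context.Context, id string) error {
	delete(a.tenants, id)
	return nil
}

func TestNewFeatureInIsolatedTenant(t *testing.T) {
	ctx := context.Background()
	admin := &tenantAdmin{tenants: map[string]bool{}}

	// 1. open a fresh tenant for this test run
	tenantID, err := admin.CreateTenant(ctx, t.Name())
	if err != nil {
		t.Fatalf("create tenant: %v", err)
	}
	// 3. tear the tenant down afterwards, taking its data and state with it
	t.Cleanup(func() { _ = admin.DeleteTenant(ctx, tenantID) })

	// 2. exercise the feature against production, scoped to this tenant:
	// send requests carrying tenantID and assert on the results.
	if tenantID == "" {
		t.Fatal("expected a tenant ID")
	}
}
```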
What does that look like in a developer’s day-to-day work? Take for example a developer adding a feature to a single microservice. Instead of deploying the modified version to a full-fledged staging environment, or running the service with all of its many dependencies in some dev environment (often the developer’s laptop), the developer can open a tenant for testing and run only the modified service against the production versions of its dependencies. This example is somewhat naive, as features often require modifying many related services. To support system-wide changes, we would need to deploy the features protected by feature flags and tie those feature flags to specific tenants, thus allowing the test tenant to access the new features.
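As a toy sketch of tenant-scoped flags (the flag store below is a stand-in of my own; a real system would more likely use an existing feature-flag service with the tenant ID as an evaluation key), the new code path is switched on only for the tenant under test, so the change can ship to production without touching real tenants:

```go
package main

import "fmt"

// flagStore maps flag name -> set of tenant IDs for which the flag is on.
type flagStore map[string]map[string]bool

func (f flagStore) Enabled(flag, tenantID string) bool {
	return f[flag][tenantID]
}

func main() {
	flags := flagStore{
		"new-pricing-engine": {"tenant-test-42": true}, // on only for the test tenant
	}

	for _, tenant := range []string{"tenant-acme", "tenant-test-42"} {
		if flags.Enabled("new-pricing-engine", tenant) {
			fmt.Println(tenant, "-> new pricing engine")
		} else {
			fmt.Println(tenant, "-> current pricing engine")
		}
	}
}
```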
Of course, feature flags cannot be used everywhere. Sometimes there is a risk of side effects (perhaps the new feature allocates lots of memory?) or maybe we are making changes to infrastructure, possibly the very infrastructure that runs the feature flag system. Fortunately, infrastructure is likely already built from a multitude of separate basic building blocks, separated in various ways for reliability and redundancy. This “cellular architecture” often already hosts multiple tenants (e.g. using Kubernetes), so testing things on top of it is relatively easy using the native “tenant” (often a k8s service). But when we want to test changes to the infrastructure itself we need another kind of tenant, one that relates to the natural architecture of the infra. In this case, our “tenant” would be a set of machines or pods differentiated from the rest of the system not by tagging business data and state but rather by tagging an infrastructure configuration such as the runtime, OS or machine.
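As a hedged Kubernetes example (the label key infra-tenant and its value runtime-canary are illustrative choices of mine), such an infrastructure tenant can be as simple as the set of pods carrying a label that marks them as running the new configuration:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes this runs inside the cluster
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// List only the pods belonging to the infrastructure test tenant, i.e. the
	// slice of production running the new runtime/OS/machine configuration.
	pods, err := client.CoreV1().Pods("production").List(context.Background(),
		metav1.ListOptions{LabelSelector: "infra-tenant=runtime-canary"})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Println("infra-tenant pod:", p.Name, "on node", p.Spec.NodeName)
	}
}
```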
Multi-tenancy is a core capability of SaaS systems, and as such it needs to be baked into the design of the system. Adding multi-tenancy to an existing system is possible, but can be very challenging. However, there are various forms of multi-tenancy which differ in the degree of isolation, the cost, and the feasibility of retrofitting an existing system. Tenants can be separated at different layers of the system: logically in your code, or using virtualization technologies like VMs. The higher the layer in which multi-tenancy is implemented (the closer to application code), the cheaper tenants will be. However, it will also be harder to provide isolation between tenants (for performance and for security), as well as harder to introduce post hoc. A great example of this is a common datacenter network: traffic can be isolated by different physical wires, by VLANs (which are just tags on Ethernet frames), or by IP encapsulation (e.g. IP-in-IP or IPsec). The main differences are where the separation is enforced and how much it costs.
Regardless of the layer in which multi-tenancy is implemented, a transaction context that includes the tenant ID must be transmitted throughout the system. This is the biggest challenge in implementing multi-tenancy, as it requires propagating the transaction context through all protocols, data stores, messaging systems, and every other place in your system where state is stored or interactions occur.
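What this looks like depends entirely on your stack; as a minimal sketch (the X-Tenant-ID header is an assumption, and many systems would carry the tenant ID in tracing baggage instead), an HTTP service can accept the tenant ID on incoming requests, keep it in the request context, and re-attach it to every outgoing call:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
)

type tenantKey struct{}

// TenantMiddleware pulls the tenant ID off the incoming request and stores it
// in the request context so every handler and downstream call can see it.
func TenantMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tenant := r.Header.Get("X-Tenant-ID")
		if tenant == "" {
			http.Error(w, "missing tenant", http.StatusBadRequest)
			return
		}
		next.ServeHTTP(w, r.WithContext(context.WithValue(r.Context(), tenantKey{}, tenant)))
	})
}

// TenantFrom recovers the tenant ID from a context.
func TenantFrom(ctx context.Context) string {
	t, _ := ctx.Value(tenantKey{}).(string)
	return t
}

// callDownstream shows the other half of propagation: the tenant ID must be
// re-attached to every outgoing request, message, and stored record.
func callDownstream(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Tenant-ID", TenantFrom(ctx))
	return http.DefaultClient.Do(req)
}

func main() {
	http.Handle("/", TenantMiddleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "handling request for tenant %s\n", TenantFrom(r.Context()))
	})))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```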
It is worth noting that multi-tenancy isn’t without costs and challenges, some inherent to multi-tenancy itself and some to the layer in which it is implemented. For example, sharing machines between tenants makes both performance isolation and performance monitoring harder, which defeats the purpose of testing performance in production.
Testing on a multi-tenant system can also be more expensive, as execution will trigger all the operations and side effects of normal production actions, regardless of their relevance to the test being run. Multi-tenancy isolates side effects to the test tenant, but you still have to carry the cost. Consider a test which bills a credit card: you may end up paying yourself, but payment processor fees still apply. This also means that unless multi-tenancy extends to every part of your system (which is unrealistic, as your system will of necessity include or interact with third-party components), you will have side effects that cannot easily be discarded with the deletion of the tenant. For example, BI and audit logs which are stored long term in your data lake may not be deletable for technical and regulatory reasons (but you can and should filter on tenant IDs in your reports), forcing you to pay for storing useless data of defunct tenants.
Authorization is another pain point, in particular for cross-tenant operations. This often comes up when your support personnel need to interact with users’ accounts. Impersonation is a problematic solution, but violating tenant isolation by letting external users into a tenant can be even worse.
In practice, implementing some form of multi-tenancy at every layer may be required, as each layer has its natural tradeoffs. A large enough system will need to support very different types of testing and client behaviors. Although the majority of developers will be satisfied by a single layer, realistically different development teams have different concerns, notably the teams dealing with the infrastructure itself.
Notably, the biggest challenge of multi-tenancy stems from its inherent strength: it essentially makes your system a flexible platform. This makes the system more complex both internally and externally by promoting varied use cases. Jevons paradox pretty much guarantees that once your production environment is multi-tenant it will be used for an ever increasing number of things, precisely because it is easy and safe to do so. This is good news for your (paying) users, but bad news for developers fighting to keep internal coupling and complexity at bay.
Multi-tenancy is a large and complicated topic and it would be impossible to cover all the aspects here; hopefully we will see it covered more and more as time goes by.
For many years we have used a straightforward but flawed approach of maintaining separate environments for development, testing and serving customers. As our systems become larger and more complex, many problems arise which force us to reconsider this methodology and the principles behind it. SaaS systems are fundamentally different from old-school run-at-customer-site systems and require multi-tenancy as a core feature. The isolation between tenants and the ease with which tenants can be created and destroyed mean that tenants are suitable as virtual environments for testing, providing environments that are cheap to maintain and have perfect fidelity to production.