What Could Possibly Go Wrong?
Being aware of all the things that may break in a complex system and what they will drag down with them is hard. Thinking in terms of failure domains can help.
Let’s set a scene: It's Monday morning, and things are going smoothly. A new top-of-rack switch you ordered two months ago finally came in and will soon double your rack bandwidth and help you reduce latency.
As you prepare to deploy it, there’s something else that’s prudent to consider but often gets overlooked: If this new switch fails, what will it take down with it? Which components of your infrastructure, within and beyond its future server rack, would be disrupted? All these components constitute the switch’s “failure domain.” Armed with a correctly defined failure domain for your switch, you can proactively and strategically take measures to minimize the risk of a widespread outage caused by any problem with the switch.
At first glance, determining the switch’s failure domain may seem fairly straightforward, and it is if you’re operating a simple computing environment with little or no resource abstraction or redundancy. If, for instance, all of the servers in the rack are directly and physically connected to the switch, without any network virtualization, load balancers, firewalls and so on, then you can determine without much analysis that if the switch fails, all servers in the rack will go down with it, as far as your application is concerned. Assuming no other resources outside the rack depend on the switch or on the servers in it, you know that the failure domain is limited to whatever is inside the rack.
More often than not, however, identifying failure domains is not so simple, because modern computing architectures are not so simple. They consist of distributed components and multiple layers of abstraction, which in many cases makes it quite difficult to determine exactly what the impact of a single component’s failure may be, let alone how severe that impact will be.
Imagine, for instance, that the rack where your new switch is going hosts servers that are part of a Kubernetes cluster, but that the cluster also includes other servers hosted in other racks. If the new switch in the one rack were to fail, what would happen to your cluster as a whole, and what would happen to the workloads running on it? The answer could range from "Nothing, because the nodes hosted in other racks would be unaffected and would provide enough spare capacity to keep pods running without issue" to "There would be a catastrophic clusterwide failure, because a critical service passes traffic through the switch, and there is no redundancy for that service built into other racks."
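To make that concrete, here’s a rough back-of-the-envelope check, with entirely hypothetical node counts and capacities, of whether a cluster could absorb the loss of a full rack:

```python
# Hypothetical figures: would the cluster keep its workload running if a
# failed top-of-rack switch took down every node in one rack?
rack_node_counts = {"rack-a": 4, "rack-b": 4, "rack-c": 2}  # nodes per rack
pods_per_node = 30    # schedulable pod capacity per node
required_pods = 220   # pods the workload needs to keep running

for failed_rack in rack_node_counts:
    surviving_nodes = sum(
        count for rack, count in rack_node_counts.items() if rack != failed_rack
    )
    spare_capacity = surviving_nodes * pods_per_node
    verdict = "OK" if spare_capacity >= required_pods else "outage risk"
    print(f"lose {failed_rack}: {spare_capacity} pod slots remain -> {verdict}")
```

A real scheduler weighs CPU, memory and affinity rules rather than raw pod counts, but the underlying question is the same: does the capacity outside the failure domain cover the workload?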
If the switch is inside a complex distributed computing environment, the extent of its failure domain depends on how the environment is configured. Calculating the failure domain requires accounting for a variety of components: physical hardware, virtual devices, software, redundancy configurations, potential external resources (like cloud services that integrate with your environment in some way) and more.
While defining failure domains isn’t always easy, it’s a necessary step in creating complex systems that are resilient. This article explains how to make the most of the “failure domain” concept by detailing how it works and offering tips and best practices for applying it effectively as part of a reliability engineering strategy.
What’s a Failure Domain?
Put simply, a failure domain is the set of resources that will be negatively impacted in the event of a particular failure.
A home WiFi router’s failure domain consists of all the devices in the home that rely on it for connectivity. In other words, if the router fails, all those devices will lose their network connections. The failure domain of the modem the router is physically connected to includes all those devices plus the router, plus whatever else is physically connected to it.
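One way to reason about this systematically is to model components and their dependencies as a graph; a component’s failure domain is then everything that transitively depends on it. Below is a minimal sketch using the home network example, with purely illustrative component names:

```python
from collections import defaultdict

# depends_on[x] lists the components x relies on to function.
depends_on = {
    "laptop": ["wifi_router"],
    "phone": ["wifi_router"],
    "smart_tv": ["wifi_router"],
    "wifi_router": ["modem"],
    "modem": [],
}

# Invert the edges: dependents[y] lists the components that rely on y.
dependents = defaultdict(list)
for component, deps in depends_on.items():
    for dep in deps:
        dependents[dep].append(component)

def failure_domain(component: str) -> set:
    """Everything that transitively depends on `component`."""
    domain, stack = set(), [component]
    while stack:
        for dependent in dependents[stack.pop()]:
            if dependent not in domain:
                domain.add(dependent)
                stack.append(dependent)
    return domain

print(failure_domain("wifi_router"))  # {'laptop', 'phone', 'smart_tv'}
print(failure_domain("modem"))        # the above plus 'wifi_router'
```

The traversal itself is trivial; the hard part in a real environment is enumerating the edges accurately, which is exactly what the rest of this article is about.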
Failure Domains and Complex Environments
The failure domains in the above example are simple because the router and the modem are single points of failure in the home network’s architecture, which makes them easy to identify. You rarely have this level of clarity in a modern data center, of course. It’s an extremely complex environment, and the more complex the environment, the harder it is to determine the failure domains within it.
Here are some major factors that affect failure domain complexity:
- Abstraction: Environments where physical resources (like CPU or storage) are abstracted and pooled into virtual resources are typically more resilient, because the failure of one physical device is less likely to be critical. However, abstraction makes it harder to calculate failure domains, because it’s hard to know how many physical devices can fail before your abstracted resource becomes insufficient for your workload. (The sketch after this list makes this concrete.)
- Built-in redundancy: Some systems (for example, distributed databases) have some degree of redundancy built in by default. Built-in redundancy narrows the scope of failure domains, but assuming that a system is failure-proof just because it's inherently redundant would be a mistake.
- Distributed architectures: In distributed computing environments, where workloads are spread across multiple servers, networks and so on, it’s rarely obvious how one component’s failure will impact other components. The only way to distinguish tolerable failures from those that can lead to a critical outage is through a deep understanding of how all the components are mapped together and which dependencies exist between them.
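To illustrate the abstraction point from the first bullet, here’s a simplified sketch, with hypothetical capacities, of how many disk failures a pooled storage volume could absorb before it can no longer hold a workload’s data:

```python
# Hypothetical pool: physical disks abstracted into one replicated volume.
disk_capacity_tb = 4
disks_in_pool = 12
replication_factor = 2   # each block is stored twice
workload_data_tb = 18

def usable_tb(disks: int) -> float:
    """Usable capacity once replication overhead is paid."""
    return disks * disk_capacity_tb / replication_factor

failed = 0
while usable_tb(disks_in_pool - (failed + 1)) >= workload_data_tb:
    failed += 1
print(f"The pool tolerates {failed} disk failure(s) before running out of space.")
```

In this toy setup the answer is three failed disks; without doing the arithmetic, the abstraction hides that limit from you.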
Beyond the IT equipment and the software that runs on it, a computing environment’s reliability also depends on data center infrastructure and operational practices. These, too, introduce a swath of failure domains. The failure domains within a typical environment hosted by Equinix Metal, for example, include:
- Servers
- Top-of-rack switches
- Clos networks
- Edge routers
- Racks
- Rack-level cooling systems
- Rack-level power supplies
- Rack power buses
- Data center cooling system
- Data center UPS system
- Automatic transfer switches
- Generators
- Generator fuel storage
- Fuel supplier contracts
- Physical data center building
- …
The list goes on, but you get the point. The sheer number of components that make modern computing possible and the complexity of their interdependencies make determining failure domains difficult—but it’s certainly not impossible, and it’s well worth the effort!
Service Degradation Is Not Failure
The good news—even if it does complicate things further—is that a single failure in a complex system doesn’t necessarily lead to an outage.
In his paper “How Complex Systems Fail,” Richard Cook writes that virtually all complex systems perpetually run in a state of partial failure. Those partial failures cause a degree of performance degradation that in most cases is tolerable. The art of preventing an outage is about not letting the degradation cross into intolerable territory; a totally failed state usually requires multiple failures occurring simultaneously.
Part of identifying failure domains is differentiating between degradation and complete failure. You must factor in just how much service degradation your system can tolerate before an actual outage occurs. If a component can fail without causing unacceptable degradation in other components, its failure domain is narrower than that of a component whose failure would instantly render other components inoperable.
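Here’s a deliberately simplified sketch of that distinction; the replica counts and the tolerance threshold are hypothetical stand-ins for whatever your service-level objectives define:

```python
def service_status(healthy_replicas: int, total_replicas: int,
                   min_tolerable_fraction: float = 0.5) -> str:
    """Classify a partially failed service as healthy, degraded or down."""
    if healthy_replicas == total_replicas:
        return "healthy"
    if healthy_replicas / total_replicas >= min_tolerable_fraction:
        return "degraded (tolerable)"  # partial failure, still serving
    return "outage"                    # degradation crossed the line

print(service_status(10, 10))  # healthy
print(service_status(7, 10))   # degraded (tolerable)
print(service_status(3, 10))   # outage
```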
External Services as Failure Domains
Another important dimension is your workloads’ reliance on external services—say, cloud providers—and the service providers they in turn rely on. These are called third-party failure domains and fourth-party failure domains, respectively.
If, for example, you host applications in a single region of a cloud provider and don’t mirror them elsewhere, you have a third-party failure domain associated with that region. If you also rely on a SaaS tool whose provider hasn’t bothered to replicate it in more than one region, the single region they’re using is a fourth-party failure domain as far as your applications are concerned.
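The dependency-graph sketch from earlier extends naturally to external services; all provider and component names below are hypothetical:

```python
# Hypothetical external dependencies of two in-house applications.
depends_on = {
    "checkout_app": ["cloud_region_us_east"],  # third-party failure domain
    "invoicing_app": ["billing_saas"],
    "billing_saas": ["saas_vendor_region"],    # fourth-party failure domain
    "cloud_region_us_east": [],
    "saas_vendor_region": [],
}

def failure_domain(component: str) -> set:
    """Everything that transitively depends on `component`."""
    domain = set()
    changed = True
    while changed:
        changed = False
        for c, deps in depends_on.items():
            if c not in domain and (component in deps or domain & set(deps)):
                domain.add(c)
                changed = True
    return domain

# The SaaS vendor's single region drags down the tool and, in turn, your app:
print(failure_domain("saas_vendor_region"))  # {'billing_saas', 'invoicing_app'}
```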
Why Bother With Failure Domains?
Organizations often home in on a specific failure domain—in other words, they start looking at a component from the perspective of the effects of its failure—in the aftermath of a serious incident. Putting in the effort to define failure domains in a system proactively is meant to limit those effects, so that a single component’s failure doesn’t lead to a serious incident.
Outages stress and distract engineering teams. They are costly for the business, which on top of losing revenue might have to pay for things like overnight shipping, emergency consultants and so on. And outages are disruptive for customers, of course, with consequences ranging from a few hours of frustration to lasting damage to your company’s reputation.
Most engineers know this, of course. There’s nothing new to the idea of trying to avoid outages through preemptive planning. The problem is that teams’ resources are finite, and planning for failures that might happen in the future often takes a back seat to more immediate business concerns.
Talking about failure mitigation strategy is difficult. Making every component of every system redundant would cost too much and add too much management overhead. If you somehow magically knew which systems were going to fail and how, you could invest your limited resources in redundancy only where it would really matter. But no one has that luxury, and every mitigation measure carries the risk of being wasted on a component that never fails. There’s also always the risk that the measures won’t work, that there will be an outage despite them.
Thinking in terms of failure domains provides a measure of guidance to teams that want to make informed decisions about failure mitigation in complex systems, where so many things are uncertain. You can’t predict where or when a failure will happen, but you can assess just how broad the scope of a potential failure would be. In turn, you can determine which mitigations to prioritize.
A Few Practical Failure Domain Tips
Assuming we’ve convinced you that failure domains are a useful concept and shown how complex figuring them out can get, let’s talk about ways to apply the concept successfully as part of resilience planning. Here’s a handful of guiding principles to rely on if you choose to embark on this journey.
Know System Boundaries
Every element in a system—a feature, a new capability, a Kubernetes pod, a container, a VM, a piece of hardware—is in itself a system. Identifying that subsystem’s boundaries, which delineate it from other components of the broader system, is an important part of defining failure domains.
If you know a subsystem’s boundaries, you know what else in the broader system may be affected by its failure. A component in a complex system normally has multiple sets of system boundaries, so it’s important to identify the ones that are relevant in the context of failure domains. The system boundaries of a container or a VM, for example, may be the resources available to it (CPU, memory), the application it’s running, or its network connection.
Weigh Failure Domain Complexity
When thinking about adding a new element to a system, consider how the addition will affect the overall system’s failure domain complexity. More complex failure domains make it more difficult to recover from major, wide-reaching failures, so it’s important to weigh the addition’s value against its impact on complexity.
That analysis starts with defining the system boundaries of the addition under consideration. Its failure domain—and the amount of additional complexity it would create—would depend on the components it would interact with and their own failure domains.
Go for the Simplest Option
Whenever possible, opt for simplicity in system design. The more complex your system, the more complex its failure domains will be.
When deploying a database, for example, Cassandra might seem like a smart choice from a reliability perspective. It’s distributed, so a single node’s failure wouldn’t render the entire database unavailable. But Cassandra is also super complex. It uses mechanisms like the Paxos consensus protocol to manage distributed state and requires deep expertise to configure in a way that really takes advantage of its fault-tolerance capabilities.
If your application requires something as complex as Cassandra, deploy it with the understanding that your failure domain complexity will increase significantly as a result. If a simpler option, like MySQL, will suffice, it’s the right choice. (A single-server MySQL deployment is one failure domain that’s easy to understand and mitigate.)
Leverage Physical and Logical Failure Domains
Logical and physical components of a system have logical and physical failure domains, respectively. A hard disk is a physical failure domain, while a database is a logical one.
If you delineate between the two, you can combine them to increase system resilience. Distributing a database’s data (a logical failure domain) across multiple disks (physical failure domains) in a storage server makes the database resilient to the failure of a portion of the server’s physical disks.
The idea is to avoid one-for-one overlap between a physical failure domain and a logical one, which creates a single point of failure.
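As a toy illustration of avoiding that overlap, here’s a hypothetical placement scheme that spreads each logical shard’s replicas across distinct physical disks:

```python
from itertools import cycle

disks = ["disk-0", "disk-1", "disk-2", "disk-3"]  # physical failure domains
shards = ["shard-a", "shard-b", "shard-c"]        # logical failure domains
replicas_per_shard = 2

# Round-robin placement keeps a shard's replicas on different disks as
# long as replicas_per_shard <= len(disks).
disk_ring = cycle(disks)
placement = {
    shard: [next(disk_ring) for _ in range(replicas_per_shard)]
    for shard in shards
}

for shard, where in placement.items():
    assert len(set(where)) == replicas_per_shard, "replicas share a disk!"
    print(shard, "->", where)
```

Losing any single disk now degrades at most one replica of each shard it held, instead of wiping out a whole shard.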
Statefulness Complicates Failure Domains
Stateful applications are especially vulnerable to failure; they have more failure domains than stateless applications do. If a server hosting a stateless microservice fails, a copy of the microservice on another server can easily step in. It doesn’t matter that the copy doesn’t know what state the original was in before it went down.
If, however, the primary data store backing a stateful application fails, a backup can only take over seamlessly if the latency between it and the primary is very low. If both are in the same cloud availability zone, and the failure is limited to a single physical host rather than the entire zone, the backup can pick up right where the primary left off. But if the backup is in a different region, potentially hundreds of miles away, network latency is likely too high for the backup to be in the same state the primary was last in. Other factors, like inefficient routing, can also push latency too high to keep the state of redundant copies of an application in sync.
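To see why distance matters for state, consider a crude upper bound on synchronous replication: each commit must wait at least one network round trip for the replica’s acknowledgment. The latency figures below are illustrative, and real replication protocols pipeline writes rather than strictly serializing them:

```python
# Illustrative round-trip times between a primary and its backup.
scenarios = {
    "same availability zone": 0.0005,   # ~0.5 ms RTT
    "nearby zone, same region": 0.002,  # ~2 ms RTT
    "distant region": 0.070,            # ~70 ms RTT
}

for location, rtt_seconds in scenarios.items():
    # A strictly sequential writer can commit at most once per round trip.
    max_commits_per_second = 1 / rtt_seconds
    print(f"{location}: at most ~{max_commits_per_second:,.0f} commits/s")
```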
Dependencies within a system heavily influence its complexity—and therefore the complexity of its failure domains. So, limiting dependencies goes a long way in making the job of defining and mitigating failure domains easier. State is already a difficult problem in distributed systems, and more complexity can lead to a more rigid and less resilient system design.
Tie Failure Domains to Error Budgets
Your work with failure domains is shaped to a great extent by your error budget. Since any complex system always runs in a partial state of failure, an error budget is a way to quantify the acceptable degree of failure within the system. Therefore, the lower your error budget, the more aggressive your failure mitigation measures should be.
If you’ve promised “five nines” availability (99.999%), which allows only about five minutes of downtime per year (roughly 80 seconds per quarter), your failure domain mitigation strategy needs to keep your system within that error budget.
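Converting an availability target into its downtime allowance is simple arithmetic, and it’s worth doing before committing to a target:

```python
def downtime_allowance_minutes(availability: float) -> dict:
    """Translate an availability target into allowed downtime."""
    per_year = (1 - availability) * 365.25 * 24 * 60
    return {
        "per year": round(per_year, 1),
        "per quarter": round(per_year / 4, 1),
        "per month": round(per_year / 12, 2),
    }

for target in (0.999, 0.9999, 0.99999):
    print(f"{target}: {downtime_allowance_minutes(target)} minutes")
```

Five nines works out to about 5.3 minutes per year, which is why the mitigation strategy has to be correspondingly aggressive.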
It's critical to set realistic reliability targets. Overly tight error budgets can make it impossible to manage failure domains effectively.
Quantifying Failure
Whether or not it’s possible in your specific situation to implement all of these best practices, it’s important to at least start thinking about failure systematically. Teams often avoid or postpone these discussions—because redundancy is expensive; because they deal in hypotheticals rather than concrete, present issues; and because there’s so much complexity and uncertainty to deal with. But defining failure domains can make these conversations easier, helping businesses decide what level of reliability they require and are able and willing to invest in.