Planning for Resiliency

Redundancy of services, combined with redundancy of data and backups for recovery of data, protects your valuable business.

Everything always "just works", right? Technology never fails; just deploy a single server, and everything will run forever, right?


We all know that problems happen. Sometimes they are caused by people. These can include human error (also known as, "we make mistakes") and malicious actors (shudder! we don't want those!). Sometimes they just happen due to natural changes and deterioration. Systems malfunction, unforeseen circumstances cause problems, things just happen.

In order to protect against these problems, we need to have a plan. Unfortunately, we cannot quite give you an entire business continuity and recovery plan in a single post. We can, however, lay out two of the principles and some good examples to help get you on the road to a more resilient technology business.

This article will lay out some of those principles and two areas of concern. Follow-on articles will provide practical getting-started guides for at least one way to mitigate the concerns in each area.

Areas of Concern

When looking at the problem of resiliency, you should consider two questions:

  1. How can you operate in the face of failure of servers?
  2. How can you operate in the face of failure of data?

Servers are your processing units; they do the heavy lifting so that your app runs, your Website serves, your e-commerce site sells, your streaming streams, and your customers stay happy.

Data is the critical information that is the lifeblood of your business: customer accounts, e-commerce orders, payment history, even your own internal information. If you lose it, you are at risk of losing your business.

Each of these can be approached in two ways, but because the impact of loss differs between servers and data, so does the necessity of protecting each from risk.


We discuss two approaches to protecting your operations from risk: redundancy and backups.

  • Redundancy tries to avoid any downtime or loss at all.
  • Backups give you copies that you can use to recover after a loss.

Wikipedia kindly defines redundancy for us:

In engineering, redundancy is the intentional duplication of critical components or functions of a system with the goal of increasing reliability of the system, usually in the form of a backup or fail-safe, or to improve actual system performance, such as in the case of GNSS receivers, or multi-threaded computer processing.

In simple terms, it means, "don't rely on just one". Whatever the important thing you have, whether servers or databases, data centers or applications, you should have more than one. If one fails, you still have the other, or better yet, others, to keep things running.

Backups, on the other hand, are the next part of your defense-in-depth. If you have redundancy but it isn't enough, or something or someone manages to corrupt what you have, how do you get back to a good state?

Each of these has its place to play in your overall resiliency plan. We will look at each in turn.


Resiliency for Servers

As mentioned above, servers are your processing units. They do the heavy lifting so that your app runs, your Website serves, your e-commerce site sells, your streaming streams, and your customers stay happy.

Hopefully, your servers are not lovingly raised pets, but rather replaceable cattle. This makes the question of how to operate when a server fails a question of redundancy.

In simple terms, it means, "don't rely on just one". Metal has some great servers. They are powerful and affordable, and, in many cases, also have redundancy built in, for example, redundant power supplies or disks. If you build your entire critical application on one of these servers, you are hoping it doesn't go down. What will you do when it does?

Rather than one really powerful system, distribute your critical application across several smaller ones. Each handles only part of the workload, but if one goes down, the others can pick up the slack.

How Many Servers Do I Need?

There are lots of ways to get redundancy. A simple approach is active-passive, or active-standby, also known as active-failover. Let's say that Metal's m3.large.x86 can handle all of your compute requirements. In order to be redundant and safe, you deploy two of them, ready to run, with all of your software installed and configured. If the primary goes down, the backup server is ready and picks up the slack.

Of course, you now are paying for two whole servers, when you only need one. This is quite an expensive form of redundancy. It may be worth it, particularly if your application has no other way to get this redundancy, but it would be better if you could get the same level of redundancy, or perhaps even more, for less cost.

If your application can run in both places at once, you can have even faster recovery and better use of your resources. Instead of active-passive, run them active-active. Set up a load balancer in front of both instances, and direct traffic to both of them. Have them both active at the same time.

To be clear, while this makes the installation and management easier - you no longer need to track which is the currently active server, and figure out how to recover from a failure - it does not reduce your costs. Let's say that each server can handle 100% of your workload. If you run both of them active-active, you could double the amount of work!

Not exactly.

What happens when one of them fails? The combined workload, which used to be split across two servers, now falls entirely on the remaining one, which, obviously, cannot handle 200% of its capacity! It is stuck serving only half of what comes in; that's not exactly great redundancy. The only way to make this redundancy work with two servers is to load each of them to a maximum of 50%. Sure, you have an easier time setting up and managing your servers, but your cost efficiency went down by half: two servers handling a maximum of 50% load each. Twice the cost, zero increase in workload capacity.

If you are really lucky, or have designed your application really well, then you are able to run not just 1, not just 2, but almost any number of instances of your application at once, in parallel. In this case, you can get a fully active "N+1" design. You run N servers, each handling 1/N of the workload, and need just one spare. As long as only one fails, the rest can pick up the load.

Let's continue our above example. Instead of just one - or one plus a backup - large server that can handle 100% of your application workload, we are going to deploy 5 servers. With 5=N+1, it means N=4, so each server handles 1/4 or 25% of the workload. Now you are paying for N+1 or 5 servers, when you only need 4. However, your "extra" server is 1/4 as powerful, and therefore much less expensive, than when you had just 2. Further, because you have so many servers, you can handle any one of them failing much more easily.

What if you are concerned about the failure of 2 servers? In the basic active-standby or active-active cases, you would need 3 servers, each able to handle 100% of your load. Now you are paying 3 times as much! With N+2, you need 6 servers, each able to handle only 25% of the workload, for a total of 150% of capacity. That is going to cost you about half as much as the 3 full-size servers, with a lot more flexibility. Not bad.
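To make the arithmetic above concrete, here is a small Python sketch. The helper name and the unit convention (everything measured as a percentage of one large server's capacity) are ours, purely for illustration:

```python
import math

def servers_needed(total_load, per_server_capacity, spares):
    """How many servers to deploy so that losing up to `spares` servers
    still leaves enough capacity for `total_load`.

    Loads and capacities share the same unit, e.g. percent of one
    large server's capacity.
    """
    base = math.ceil(total_load / per_server_capacity)  # enough to carry the load
    return base + spares                                # plus the failures we tolerate

# Active-passive: one full-size server plus one standby -> 2 servers.
# N+1 with quarter-size servers: ceil(100/25) + 1 -> 5 servers.
# N+2 with quarter-size servers: ceil(100/25) + 2 -> 6 servers (150% of capacity).
```

The interesting lever is `per_server_capacity`: the smaller each server is relative to the total load, the cheaper each spare becomes.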

What About Load Balancers?

Now that you have those multiple servers running, and are much more comfortable handling a failure here or there (even if you really prefer not to), you need to think about how to load the traffic between them. The most common way of doing so is with a load balancer.

Metal is intentionally un-opinionated about what load balancer you use. There are lots of great software load balancer options, and we want you to be able to upgrade between them as the state of the art improves. That also lets you pick the best ones for your needs.

Metal also has a good, deep, technical dive into implementing a load balancer with HAProxy across two servers providing active-passive redundancy, including redundant Web servers and redundant load balancers, here. We invite you to check that one out and even try it out.
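As a taste of what that looks like, here is a minimal sketch of an haproxy.cfg fragment that spreads traffic across two backend servers and health-checks each one. The names and IP addresses are placeholders for your own servers, and a production configuration would add timeouts, logging and TLS:

```
# Minimal HAProxy sketch: round-robin across two health-checked backends.
# Server names and addresses are placeholders.
frontend web_front
    bind *:80
    default_backend web_servers

backend web_servers
    balance roundrobin
    server web1 10.0.0.11:80 check   # removed from rotation if health check fails
    server web2 10.0.0.12:80 check
```

With `check` enabled, HAProxy stops sending traffic to a backend that fails its health check and resumes when it recovers, which is exactly the failover behavior discussed above.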

Finally, if your workloads are deployed in a Kubernetes cluster running on Metal, you can take advantage of all of the Kubernetes load-balancing goodness built into Services and the more advanced service meshes.


Backups for Servers

Do you need backups for your servers?

The answer depends on two questions:

  1. Are your servers pets or cattle?
  2. Do your servers store irreplaceable data?

Cattle vs Pets

The phrase "cattle, not pets" refers to the idea that you should treat your servers as completely replaceable, like cattle. There is nothing special or unique about server A vs server N vs server 125,627. All of them are pretty much identical, and any one can be thrown out and swapped for an identical replacement at any time.

They should not be treated like pets, which are unique and special, each with its own characteristics and personality, not to mention a special relationship with you, the owner.

Good modern design insists on creating and treating your servers as completely interchangeable cattle. If you do, then the question of backups for the server per se is irrelevant. You don't need backups, because the server is replaceable immediately. If it fails, you just throw it out and get a new one.

On the other hand, not every server out there is (yet) cattle. You may be dealing with servers that actually are pets. If they go away permanently, you will need to spend a large amount of time crafting the replacement.

If your servers are pets - and we strongly encourage you to do everything to avoid it - then you very much should be backing them up. When one fails, as all technology inevitably does, you will be able to recover much more quickly if you can simply restore the server's setup and configuration from a backup, rather than handcrafting it again from scratch over days or longer.

Irreplaceable Data

Even if your servers are cattle, some of those servers may actually have data on them. After all, the critical files, 3D designs, customer purchase records, balance sheet information and "crown jewels" software source code have to live somewhere, and that "somewhere" is going to be a server somewhere.

When planning resiliency for the servers that have critical data, we still recommend that you treat the servers as cattle, but provide a backup and recovery strategy for the data itself separately.

Do not treat the server as "the server with irreplaceable data," but rather treat it as "just another replaceable server, that happens to have data on it that is irreplaceable."


Resiliency for Data

Data has the same problems as servers, and then one step more.

Some servers, like database servers or file servers, maintain and manage data. You do not want to lose the servers dishing out your data. Beyond the basic reasons discussed in our article on data - to wit, that you cannot serve customers - you also have to worry that your data is sitting on a disk in a server that is gone.

Replacing one server with another gives you equal processing power, but it does not necessarily give you all of the live data that was sitting on the disk. Maybe it was the logs, maybe it was uploaded video or text files, or maybe it was the storage behind your critical business database.

Even if the database server is replaceable cattle, the data on that server is not.

This means we need to look at data at not just one level, but at two levels:

  1. Redundancy: continuing to operate when the data storage system fails, temporarily or permanently.
  2. Recovery: ensuring you have restorable copies of your data at defined moments in time.

What happens if you lose servers? None of us wants our software processes or the servers they run on to fail. If they do, we suffer downtime, which affects our ability to operate, leading to lost revenue and increased costs.

However, as long as you still have all of your data, you can recover.

What happens if you lose your data? Business data is critical to your business. Your files or database live on a filesystem somewhere; you had better make sure you have regular backups, as part of a holistic backup strategy.

Equinix Metal offers a bare-metal service, with disks connected directly to the servers, known as direct-attached storage (DAS). This provides great performance, without any questions of network latency or throughput. It does mean, however, that your data lives on the same physical hardware as your compute. If (or when) you lose that server, you lose your data.

This makes a dedicated strategy to resiliency and recovery of data, and not just servers, critical.

How can you go about getting those in place on Metal?


Redundancy for Data

At the simplest level, you want to ensure that your data does not sit on just one server, but is duplicated, in some fashion, across multiple servers. The same way you have more than one key to your house, and, preferably, more than one server running your critical processes, you want more than one copy of your data. If you have just one key to your house, and it gets lost, you can get back in, but it will take a locksmith breaking the door and replacing the entire lock. That is going to take a lot longer, and cost a lot more, than you expected.

On the other hand, if you have a spare key, you can just go get it, and get back in. You even can use the spare key to make another copy, and you are back to where you started.

There are two primary ways for you to get multiple copies of your data: application and infrastructure.


Application-Level Redundancy

Certain applications that process critical data make redundant copies available for you as part of the software. These applications are built from the ground up to treat their data as critical and assume it must live in multiple places.

For example, you might be using a message bus, such as Kafka or RabbitMQ. You can configure these buses to keep multiple copies of each message, on multiple systems. With the loss of one, you just keep moving.
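With Kafka, for instance, that duplication is mostly configuration. The setting names below are real Kafka broker settings; the values are illustrative only, not a recommendation:

```
# Broker defaults (server.properties): keep each partition on 3 brokers,
# and only count a write as committed once at least 2 replicas have it.
default.replication.factor=3
min.insync.replicas=2
```

Producers that send with `acks=all` then get writes confirmed only after the in-sync replicas hold the message, so the loss of one broker loses no data.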

Similarly, many databases have built-in replication features. You can run multiple copies of the database across multiple servers, and configure your database to duplicate the data to all of them. With the loss of one, you can keep operating. Depending on the database technology, you may need to "switch" which one now is the primary, but your critical data remains safe and sound.

Some databases, primarily nosql-style databases, even have support for multi-master. Not only can you have multiple copies running in parallel, but any one of them can receive a write at any time. This is a great way to ensure that you have zero work to do, not even automated, let alone manual, when a server fails. Everything just keeps running.


Infrastructure-Level Redundancy

If your application does not have built-in redundancy, you can add it yourself. There are multiple storage technologies that create automatic and ongoing duplication of data between servers. These can operate at the level of the block device or the storage system. For example, DRBD provides distributed block storage; Ceph can operate at both the block and filesystem levels; GlusterFS is a distributed filesystem; and MinIO provides distributed object storage. There are many more options.

Whether you do replication at the application level or the infrastructure level, you protect yourself from sudden downtime due to a single server's failure taking storage with it.

Fortunately, Metal has several guides to using these storage technologies.


Backups for Data

As good as redundancy is, it isn't perfect. You still need backups. There are two reasons to have them.

First, you always run the risk of multiple data storage systems failing at once, or the whole replication system failing.

When it comes just to processing, if you lose all your servers, you lose the ability to operate for as long as they are down. When they come back, however, or when you deploy new ones, you pick up the moment they are back.

With data, if you lose all the data and deploy new servers, they are blank. You do not have the data that was on the original ones.

Second, backups provide you snapshots in time. While some storage systems and data software keep copies of all changes, for example version control systems and some copy-on-write filesystems, the majority do not. Whether due to data corruption, bad actors or just random changes - not to mention compliance - you want to be able to know the exact state of your data at regular intervals.

Coming back to our previous scenario, if you lose all of your data replicas, when your fresh, new systems come on line - missing the data - you can restore from your backups.

To see how important it is to protect against bad actors, see our own blog post on ransomware survivors.

Backups provide this protection and comfort for you.

Let's lay out a few important principles of backups.

First, they always should be snapshots in time. You should be able to pick any given moment in time from your backups and restore it. Last week? Check. 13 days ago? Check. 3 months ago? Check. 2 years ago? Probably; it depends on how long your backup strategy has you keeping backups. 100 years ago? We are pretty sure modern digital computers didn't exist back then.

Which brings us to the second principle: operate according to a well-designed backup strategy. It should lay out how often you back up; which backups are full and which are incremental; where you store the backups; how often you practice recoveries; and many other elements. Perhaps most importantly, it lays out who owns the process. It is risky to assume that each group will handle it on their own. A good backup strategy designates the true owner of the data and its protection.
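To make "snapshots at defined moments in time" concrete, here is a hypothetical Python sketch of a simple daily/weekly/monthly retention rule. The function name and policy values are ours, for illustration; real backup tools implement their own retention policies:

```python
from datetime import date

def snapshots_to_keep(snapshots, keep_daily, keep_weekly, keep_monthly):
    """Given snapshot dates, return the set to retain:
    the newest `keep_daily` snapshots, the newest snapshot of each of
    the most recent `keep_weekly` ISO weeks, and the newest snapshot
    of each of the most recent `keep_monthly` months."""
    snapshots = sorted(snapshots, reverse=True)   # newest first
    keep = set(snapshots[:keep_daily])            # the most recent dailies
    weeks, months = set(), set()
    for d in snapshots:
        week = d.isocalendar()[:2]                # (ISO year, ISO week)
        if len(weeks) < keep_weekly and week not in weeks:
            weeks.add(week)
            keep.add(d)                           # newest snapshot of that week
        month = (d.year, d.month)
        if len(months) < keep_monthly and month not in months:
            months.add(month)
            keep.add(d)                           # newest snapshot of that month
    return keep
```

Anything not in the returned set can be pruned, which is how daily backups stay affordable while still letting you reach back weeks or months.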

Third, "you only are as good as your last backup, but your backup is only as good as your last restore." It isn't enough to do backups. You have to test them. If the first time you try a recovery is when you desperately need it, chances are pretty good it won't work. You need to test your backups regularly, seeing if the restores work, and adjusting where necessary. Recoveries aren't what comes after; they are part of your backup strategy.
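Part of that restore testing can be automated. Here is a hedged Python sketch, with names of our own invention, that compares a restored directory tree against the original by checksum; in practice you would run it against a scratch restore, never production:

```python
import hashlib
from pathlib import Path

def file_digest(path):
    """SHA-256 of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def verify_restore(original_dir, restored_dir):
    """Compare every file under original_dir against the restored copy.
    Returns the relative paths that are missing or differ."""
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    problems = []
    for src in original_dir.rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(original_dir)
        dst = restored_dir / rel
        if not dst.is_file() or file_digest(src) != file_digest(dst):
            problems.append(str(rel))
    return problems
```

An empty result means the restore reproduced every file; anything else is a list of files your backup failed to bring back, which is exactly what you want to discover during a drill rather than an outage.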

How can you do backups on Metal?

You need two things: backup location(s) and backup software.

Backup Locations

Backup locations are one or (preferably) more locations to which you send the backed-up data. These locations should sit behind distinct access credentials, with their own controls. You don't want any bad situation, whether accidental or malicious, spreading from your primary data location to your backup location, or that backup may become as useless as the original data.

In general, we recommend using one of the following:

  • A separate account with separate servers, ideally in a different region.
  • A file or object storage service, whether managed, like AWS S3, or one you run yourself, in a different account and a different region.

Backup Software

Backup software is the software that actually does the backups, and may perform recoveries as well.

You have many choices for backup software, both free open source and commercial.

Equinix Metal partners with several companies to offer storage solutions, including backups. The partner list is available here. For example, read this use case with our partner Cohesity for backing up data from your Equinix Metal infrastructure to other cloud providers.


Getting your service up and running is only the first part. Ensuring that your service continues to run, handling both loss of processing servers and loss of data, whether short-term or long-term, is the second part. Redundancy of services, combined with redundancy of data and backups for recovery of data, protects your valuable business.

In our next articles, we provide practical getting-started guides to each of these areas: resiliency for servers and for data.

Last updated

15 May, 2024

