In the Trenches: Disaster Recovery
What happens when disaster strikes and your data and infrastructure are compromised? Having a disaster recovery plan helps minimize damage and keep your business running.
Picture this.
Your SaaS business is running smoothly, sales are up, and growth is on the horizon.
Then, one day, a tornado hits the data center where you’ve placed all your assets and data. The store goes down and your business can’t function until things are back up and running.
This. Is. A. Disaster.
How do you come back from a blow like this? How could you have known a disaster would strike?
Essentially, anything that can halt business operations at the normal level of functionality should be considered a disaster. Examples include a cyber attack, theft of intellectual property, hardware failure, and data encrypted or stolen via ransomware. While these all sound preventable, hurricanes, tornadoes, freezes, and other natural disasters are outside of human control.
What you can control, however, are the decisions about how your business will function in the event of a disaster. You need a disaster recovery plan.
What Exactly is a Disaster Recovery Plan?
Thanks for asking! While a disaster recovery plan can be many things, a few industry terms will help you understand the scope of the solution you are aiming for. Putting it plainly, disaster recovery (or DR) is the ability to recover your systems back to a baseline point so that the company won’t go under. Dramatic, yes, but true!
Let’s begin by figuring out your Recovery Point Objective (RPO), Maximum Tolerable Downtime (MTD), and Recovery Time Objective (RTO).
- RPO defines the allowed age of your most recent data backup. When was the last time you backed up your data? At the end of each business day? Each week? If your data is a casualty during a disaster, you’ll be able to recover back to the last time you performed a backup. You can then determine how much of an expense it will be to recreate all of the data since the last backup (and if it can’t be recreated, how much data you are willing to lose permanently).
- MTD is the maximum amount of time you can be down before you fail to meet your SLA (service level agreement) and ultimately before your business operations are in jeopardy. This includes the amount of time it takes to fully recover from a disaster scenario.
- RTO is the target for how long it should take you to recover and restore following a disaster, and it has to fit within your MTD. While MTD is an operational determination, RTO is a technical one. Your IT department should know all of the components in your system and determine how long it will take to get everything back up to speed.
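To make the relationship between the three terms concrete, here is a minimal sketch in Python. The function name `check_objectives` and all of the numbers are hypothetical, chosen only for illustration. It checks two things: that the newest backup is no older than the RPO allows, and that the RTO fits inside the MTD.

```python
from datetime import datetime, timedelta


def check_objectives(last_backup, now, rpo, rto, mtd):
    """Return a list of problems; an empty list means the targets hold."""
    problems = []
    # RPO: the newest backup must not be older than the allowed age.
    backup_age = now - last_backup
    if backup_age > rpo:
        problems.append(f"backup is {backup_age} old, RPO allows {rpo}")
    # RTO is the technical recovery target; it has to fit inside MTD.
    if rto > mtd:
        problems.append(f"RTO {rto} exceeds MTD {mtd}")
    return problems


# Hypothetical scenario: nightly backup at 6 p.m., checked the next morning.
now = datetime(2024, 1, 2, 9, 0)
last_backup = datetime(2024, 1, 1, 18, 0)
problems = check_objectives(
    last_backup,
    now,
    rpo=timedelta(hours=24),
    rto=timedelta(hours=4),
    mtd=timedelta(hours=8),
)
print(problems or "all objectives met")
```

If the RPO in this example were tightened to 12 hours, the 15-hour-old backup would show up as a problem, which is a signal to back up more often.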
Now What?
There are a few key resources to consider when designing and implementing a DR plan. These include data redundancy, moving data, and forming an executive contingency plan.
Sometimes data is so nice, you should save it twice (seriously). Keeping multiple copies of your data in different locations can increase your chances of a fast recovery in the event of a disaster. If you’d rather move your data from one spot to another, there’s an option called high availability.
High availability (HA) is an immediate redundancy strategy: in the event of a disaster, your systems and data are relocated to a standby environment so your services keep running for customers.
Depending on the size of the company’s infrastructure, there might be some hang-ups in the event of a disaster. Let’s say your environment contains 1000 servers, and you want to be able to have all 1000 servers running in the event of a disaster. The problem is, there is no guarantee that 1000 servers will be available on demand at any given point. Looking back to the baseline requirements, you have to decide in advance which critical pieces go into your minimum viable environment in case that happens.
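One way to think about a minimum viable environment is as a priority list: restore the most critical services first until you run out of servers. The sketch below assumes a made-up set of services whose server counts add up to 1000 and a made-up capacity of 500 servers; the function name and the priority scheme are illustrative, not a standard method.

```python
def minimum_viable_environment(services, available_servers):
    """Pick services in priority order until capacity runs out.

    `services` is a list of (name, priority, servers_needed) tuples,
    where a lower priority number means more critical.
    """
    restored, deferred = [], []
    remaining = available_servers
    for name, _priority, needed in sorted(services, key=lambda s: s[1]):
        if needed <= remaining:
            restored.append(name)
            remaining -= needed
        else:
            # Does not fit right now; it waits until more capacity exists.
            deferred.append(name)
    return restored, deferred


# Hypothetical environment: 1000 servers in total, only 500 available.
services = [
    ("checkout", 1, 150),
    ("auth", 1, 50),
    ("reporting", 3, 400),
    ("search", 2, 250),
    ("batch-jobs", 4, 150),
]
restored, deferred = minimum_viable_environment(services, available_servers=500)
print("restore now:", restored)
print("restore later:", deferred)
```

The point of the exercise is the list itself: deciding ahead of time which services land in "restore now" is the baseline decision the plan needs.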
Another resource is a business continuity plan. How will you operate during a disaster? If a high-level decision-maker in the company is incapacitated or can’t communicate with key team members for a myriad of reasons, who will take over business operations? This is a contingency plan in the event systems or high-level positions need to be replaced at a moment’s notice.
Getting Down to Brass Tacks
Why do some companies drag their feet on disaster recovery planning? Simply put, it’s expensive!
Unfortunately, the reality is that most companies respond to a problem rather than planning for it. As we mentioned, DR can mean maintaining a separate environment, which includes replicating data and making sure it’s operational, and that in turn requires additional resources to maintain.
There are plenty of wishful thinkers out there who think they can get by without investing in a plan. Companies born in the cloud often have blind faith that the cloud itself is resilient. But remember, in our scenario, you own a SaaS application, and if that platform goes down you can’t sell anything. You’ve stopped making money and don’t have a backup plan. We tried to warn you!
When it comes down to it, implementing a disaster recovery plan (DRP) is a risk decision. Companies might choose to run an additional $1 million of infrastructure to have the constant ability to recover in case of a disaster. Making these preemptive decisions about your data and infrastructure might hurt the wallet in the short term, but can potentially save your company, so pony up for the backup plan.
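A rough way to frame the risk decision is to compare what the plan costs with the loss you would expect without it. The sketch below is back-of-the-envelope arithmetic only. Apart from the $1 million infrastructure figure mentioned above, every number (yearly disaster probability, outage length, revenue lost per hour) is a hypothetical placeholder you would replace with your own estimates.

```python
def expected_annual_loss(yearly_probability, outage_hours, loss_per_hour):
    """Expected yearly loss = chance of a disaster x outage length x hourly loss."""
    return yearly_probability * outage_hours * loss_per_hour


# Hypothetical inputs: substitute your own estimates.
dr_cost_per_year = 1_000_000      # the extra infrastructure from the example
yearly_probability = 0.10         # assumed chance of a disaster in a year
loss_per_hour = 50_000            # assumed revenue lost per hour of downtime

without_plan = expected_annual_loss(yearly_probability, 336, loss_per_hour)  # two weeks down
with_plan = expected_annual_loss(yearly_probability, 8, loss_per_hour)       # eight hours down

risk_reduction = without_plan - with_plan
print(f"expected loss without a plan: ${without_plan:,.0f}")
print(f"expected loss with a plan:    ${with_plan:,.0f}")
print("plan pays for itself on paper:", risk_reduction > dr_cost_per_year)
```

Keep in mind that an expected value understates the worst case: a single long outage can end a business outright, which is why this stays a risk decision rather than a pure spreadsheet exercise.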
Considering Chaos Engineering
One of the best ways to prevent disasters is to preemptively find problems and fix them. Enter DevOps.
One of the best examples of this approach to resiliency is Netflix. Netflix wanted its developers and IT operations teams to work hand in hand to build a more resilient and reliable system. They became an early adopter of DevOps and ended up being a model case study for other companies to build their own DevOps teams.
Their plan was to discover holes in their system before an outsider could, and then patch those holes. To run trials against their own system, Netflix built a tool called Chaos Monkey. It was essentially internal malware: it randomly shut down servers in production to expose weak spots in the infrastructure, so the team could fix them and automate recovery before a real failure or attack hit.
Soon, they made more tools, such as Chaos Gorilla and Chaos Kong, to go after larger parts of the system, even up to taking down an entire operating region. This practice is known as “chaos engineering” and has made Netflix one of the most resilient and reliable systems out there.
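The idea behind this kind of tool can be sketched in a few lines. To be clear, this is not Netflix’s code: it is a toy simulation that runs entirely in memory, with a made-up fleet of four servers and a made-up rule that the service needs at least two of them to stay up. Each round removes one server at random and checks whether the service would survive.

```python
import random


def chaos_round(fleet, min_healthy, rng):
    """Terminate one random instance and report whether the service survives."""
    victim = rng.choice(sorted(fleet))  # sort so a seeded run is repeatable
    fleet.discard(victim)
    return victim, len(fleet) >= min_healthy


# Hypothetical fleet; a real experiment would target real infrastructure.
rng = random.Random(42)
fleet = {"web-1", "web-2", "web-3", "web-4"}

for round_number in range(1, 4):
    victim, healthy = chaos_round(fleet, min_healthy=2, rng=rng)
    status = "still serving" if healthy else "OUTAGE: add capacity or automation"
    print(f"round {round_number}: terminated {victim}, {status}")
```

In this run nothing replaces the lost servers, so the third round drops the fleet below the minimum. That failure is the useful result: it tells you where automated replacement needs to exist before a real disaster finds the same gap.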
The trick is, you have to build a culture around your DevOps team that says it’s okay to fail (and to give them an environment where they can fail openly and often). Netflix’s team was celebrated for failing early and failing often because they blazed the trail for DevOps to become a staple in other companies.
Failure gives us the opportunity to become more resilient and more reliable than before. In fact, when I used to teach DevOps, I would tell my students to “fail awesome”. Don’t fail in the same way over and over again; fail in new and unique ways so you can learn from each one.
Go forth and fail awesome.
The best way to prepare for a disaster is to create a workplace where folks are supported when they “fail awesome.” After that, it comes down to what sort of DR plan you want to create to ensure your company can continue in case of a disaster.
The bottom line is that having a disaster recovery plan is in everyone’s best interest. Have at least a small solution mapped out to prevent you from having to start from square one, and over time you can migrate systems and data into the priority bucket for a more robust DRP. Then, you can take on any disaster that comes your way!