What if your primary data center were completely destroyed? Or what if an event like Hurricane Sandy caused your data center to be without power for a day, a week, or a month?
What if someone vandalized your data center or managed to steal equipment? What if your systems were hacked?
What if a key infrastructure component like your SAN died?
What happens when an employee loses a laptop or their iPad gets stolen with company data on it?
What about scheduled downtime – how do you handle “patch Tuesday” and similar maintenance windows?
These are the types of questions you ask when improving your disaster recovery plan (if you can’t answer these questions, then you don’t have a disaster recovery plan … you should fix that right away … feel free to call).
There are actually three related disciplines involved with these kinds of questions, oversimplified as:
- Disaster Recovery (DR) – how to recover your systems after a disaster (large or small)
- Business Continuity (BC) – how to do business after a disaster, even before your systems are recovered
- High Availability (HA) – how to prevent your systems from going down at all, even when there’s a disaster
Business continuity is more of a business function than an IT function – BC is ultimately all the “human stuff” that you have to address after a disaster. DR and HA, on the other hand, are core functions of IT. It used to be that high availability was reserved only for key systems, but as DR tools improve and HA costs come down, the lines between DR and HA have become quite blurred.
Rather than splitting hairs between DR and HA, it’s more useful to talk about the underlying goals – how fast can we recover from an incident, and how much data can we stand to lose? The fancy names for these are:
- Recovery Time Objective (RTO) – how fast you can recover
- Recovery Point Objective (RPO) – how much data you can afford to lose
If your company makes offsite backups of all your systems and data every night at midnight, then your RPO is about 24 hours. The worst-case scenario is that your production systems get completely fried at 11:59pm, right before the backup runs, and you lose a whole day’s worth of transactions – when you go to recover, all you have is the previous day’s backup. If your systems got fried right after a backup completed and before the day’s work began, you wouldn’t lose any data at all, but RPO is calculated on the worst case, not the best.
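To make that arithmetic concrete, here’s a minimal Python sketch of the worst-case calculation. The timestamps and backup interval are hypothetical, chosen purely for illustration:

```python
from datetime import datetime, timedelta

BACKUP_INTERVAL = timedelta(hours=24)      # offsite backup every midnight
LAST_BACKUP = datetime(2013, 1, 15, 0, 0)  # hypothetical timestamp, illustration only

def data_lost(failure_time: datetime, last_backup: datetime) -> timedelta:
    """Everything written after the last backup is gone."""
    return failure_time - last_backup

# Best case: the systems fry moments after the backup completes -> almost nothing lost.
print(data_lost(datetime(2013, 1, 15, 0, 5), LAST_BACKUP))    # 0:05:00

# Worst case: the systems fry at 11:59pm, just before the next backup -> a full day lost.
print(data_lost(datetime(2013, 1, 15, 23, 59), LAST_BACKUP))  # 23:59:00

# RPO is stated against the worst case, i.e. roughly the full backup interval.
print(f"RPO ~ {BACKUP_INTERVAL}")
```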
If you make hourly backups in your data center but never make offsite backups, then you’ve protected yourself against internal disasters like a disk crash (with a 1-hour RPO), but you haven’t protected yourself at all from a big disaster like a fire. That’s why you have to constantly ask yourself “what if” when dealing with DR.
Improving RPO
One of the simplest ways to improve RPO is to use “cloud-backed storage” in your data center. The idea is that data files are stored locally but duplicated and kept up to date in commodity storage in the cloud. This is an extremely cost-effective form of backup, and it gives you an RPO of nearly zero: if your data center were completely lost, you’d lose only a few moments of data transactions. Companies like Nasuni have taken this basic architecture and delivered an enormous amount of functionality around it for shockingly low cost. I’ll share more detail on cloud-backed storage products in a future post.
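To show the shape of the idea (and only the shape – this is not any vendor’s actual product or API), here’s a toy Python sketch where every local write is immediately mirrored to a stand-in “cloud” location; the paths and function name are hypothetical:

```python
import shutil
from pathlib import Path

LOCAL_STORE = Path("/srv/files")         # hypothetical local cache in your data center
CLOUD_STORE = Path("/mnt/cloud-bucket")  # stand-in for commodity cloud object storage

def write_file(name: str, data: bytes) -> None:
    """Write locally first, then mirror the change to the cloud copy.

    In a real cloud-backed storage gateway, the second step is an
    object-storage upload running continuously in the background,
    so the offsite copy lags the local disk by only moments.
    """
    local_path = LOCAL_STORE / name
    local_path.write_bytes(data)                  # the application sees a fast local write
    shutil.copy2(local_path, CLOUD_STORE / name)  # the near-real-time offsite copy
```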
Improving RTO
Improving RTO is a bit trickier – and costlier. There are a wide variety of techniques to help you recover systems quickly. One approach is maintaining multiple active systems in different locations that can absorb system load if one of the sites goes down. I recently spoke to the CIO of a large web application company that uses this approach extensively. If their website goes down, they lose millions of dollars per minute. Needless to say, the CIO is not going to let the site go down. That said, he still wants to maximize the value of his infrastructure investments, so he chooses not to have any passive data centers or nodes. Everything his team designs favors an active-active approach. They use Akamai to cache the front end, so even if an entire data center went offline, users wouldn’t perceive an interruption. This is the very definition of High Availability – an RTO of zero.
On the other end of the RTO spectrum, you can manually recover backups if your primary data center goes down. The problem with this approach is that it takes quite a while – and to make things worse, because you don’t know how long recovery will take, it’s tough to make the call to fail over. The power company doesn’t usually tell you, “we’re going to be down for 93.4 hours,” because they don’t know either. Obviously, if recovery will take longer than you expect the outage to last, you might as well not start the failover. And if your failover is manual, your failback probably is too, which means you’ll likely take another disruption in service once the original problem is fixed.
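Here’s a back-of-the-envelope sketch of that failover decision, just to make the reasoning explicit. The function and its inputs are hypothetical, and in real life the expected outage duration is a guess:

```python
def should_fail_over(expected_outage_hours: float,
                     manual_failover_hours: float,
                     failback_disruption_hours: float) -> bool:
    """Failing over only pays off if the outage is expected to outlast the
    disruption the failover itself causes: the slow manual recovery plus
    the second hit when you fail back."""
    disruption_if_we_fail_over = manual_failover_hours + failback_disruption_hours
    return expected_outage_hours > disruption_if_we_fail_over

# The power company rarely tells you how long they'll be down, so the first
# input is a guess -- which is exactly what makes this call so hard.
print(should_fail_over(expected_outage_hours=12, manual_failover_hours=36, failback_disruption_hours=4))  # False
print(should_fail_over(expected_outage_hours=96, manual_failover_hours=36, failback_disruption_hours=4))  # True
```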
There’s a lot of benefit in moving to the middle of the RTO spectrum, and you can do so for reasonable cost. The key is replication. Synchronous replication ensures that a transaction on one node doesn’t complete unless it also completes on the synchronized node(s). True synchronous technology tends to be practical only within a single physical data center, so it’s used for things like database clusters (and, at a lower level, SANs themselves).
With asynchronous replication, a transaction (perhaps simply a disk write) on the primary node completes without waiting for the replica. The replication happens just after the transaction occurs, so there is a slight possibility of data loss (i.e., a slightly higher RPO). The advantage of async, however, is that the replicated data can be shipped halfway around the globe in a practical and cost-effective fashion. This makes asynchronous replication a great technique for maintaining passive nodes at DR data centers.
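Here’s a toy Python sketch contrasting the two write paths. The Replica class and its apply() call are hypothetical stand-ins for a second node, not any real replication product:

```python
import queue
import threading

class Replica:
    """Stand-in for a second node; apply() is a hypothetical call."""
    def __init__(self):
        self.log = []

    def apply(self, record):
        self.log.append(record)

primary_log = []
remote = Replica()

# Synchronous: don't acknowledge the write until the replica has it too.
# RPO is zero, but every write pays the round trip -- practical only
# within a single data center.
def synchronous_write(record):
    primary_log.append(record)
    remote.apply(record)           # block until the synchronized node confirms
    return "acknowledged"

# Asynchronous: acknowledge immediately, ship the record afterwards.
# A failure in that window loses a little data (slightly higher RPO),
# but the replica can be halfway around the globe.
replication_queue = queue.Queue()

def asynchronous_write(record):
    primary_log.append(record)
    replication_queue.put(record)  # hand off and return right away
    return "acknowledged"

def replication_worker():
    while True:
        remote.apply(replication_queue.get())

threading.Thread(target=replication_worker, daemon=True).start()

synchronous_write(b"txn-1")
asynchronous_write(b"txn-2")
```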
VMware and Microsoft have been building more and more async capabilities directly into their hypervisors. Products like DoubleTake add management tools that make failing over and failing back surprisingly painless. These types of tools can help you restore a given system in 15 minutes or less. Realistically, asynchronous replication tools can reduce your overall data center RTO from the days or weeks of a manual failover down to hours or even minutes. Think about the value of restoring your operations significantly faster than your competitors after a shared natural disaster (or vice versa – they might be thinking the same thing). The value is likely many times the cost. For that reason, I’ll definitely share more on async replication in a future post.
Moving to the Cloud
If you run your data center entirely “in the cloud,” then you’ve outsourced your DR/HA to your cloud provider. This is usually a good strategy. I’ve visited one of Microsoft’s cloud data centers, and trust me, you don’t have the same level of resources and capabilities for keeping hardware up and running that Microsoft does. That said, NO data center is perfect – even the best cloud providers experience occasional downtime.
So over the next three years, it’s highly likely that Microsoft will experience much less downtime than you will … but here’s the rub – when they go down, there’s absolutely nothing you can do about it. At least when your internal SAN crashes, you can explain to your stakeholders what a black swan event it was while you lose sleep fixing the problem. When Windows Azure is down, all you can do is say, “it’s down” and maybe cite an estimate of when it will be back up.
Again, in most cases, public cloud providers are going to provide much better uptime than you can (it is, after all, the entirety of their business), so it’s usually a better bargain than hosting your own hardware. But it’s the loss of control – along with all the emotional effects that go with it – that keeps companies hosting their own hardware. Many people are still not ready for that loss of control … but the economics will eventually push most into the cloud anyway.
There are always ways to mitigate risk, especially if you’re willing to spend more money. To mitigate the loss of control, you can replicate across data centers of a single cloud provider, reducing the dependency on a single node (for some types of data, cloud providers do this automatically). You can even replicate across cloud providers. You can continue to self-host but use the public cloud just for DR – or make the public cloud your primary site and leverage your existing data center investments for DR. Or you can double down and use more SaaS products like Office365 and salesforce.com – and ultimately outsource even more of the technology/service stack.
Helping companies move to the cloud is turning into one of the main things I do, so I’ll talk much more about these various options in future posts. From a sales perspective, it’s been quite interesting to learn that companies don’t move to the cloud for its own sake – they move for a reason, and improving DR/HA is often that reason. Some of the cloud messaging I shared at Microsoft missed this point and sometimes came off as circular reasoning (“Why move to Windows Azure? Because it’s the cloud!”).
There are many, many ways to improve DR and increase overall uptime. As the sophistication and value of these techniques improve, the cost of implementing them is going down. So if you haven’t revisited your DR architecture in the last 24 months, you owe it to your company to take another look. The cost of inaction could be huge.