Monthly Archives: June 2013

Make Your First Dollar

What If

What if your primary data center were completely destroyed? Or what if an event like Hurricane Sandy caused your data center to be without power for a day, a week, or a month?

What if someone vandalized your data center or managed to steal equipment? What if your systems were hacked?

What if a key infrastructure component like your SAN died?

What happens when an employee loses a laptop or their iPad gets stolen with company data on it?

What about scheduled downtime – how do you handle “patch Tuesday” and similar maintenance windows?

These are the types of questions you ask when improving your disaster recovery plan (if you don’t have answers to any of these questions, then you don’t have a disaster recovery plan … you should fix that right away … feel free to call).

There are actually three related disciplines involved with these kinds of questions, oversimplified as:

Disaster Recovery (DR) – how to recover your systems after a disaster (large or small)
Business Continuity (BC) – how to do business after a disaster, even before your systems are recovered
High Availability (HA) – how to prevent your systems from going down at all, even when there’s a disaster

Business continuity is more of a business function than an IT function – BC is ultimately all the “human stuff” that you have to address after a disaster. DR and HA, on the other hand, are core functions of IT. It used to be that high availability was reserved only for key systems, but as DR tools improve and HA costs come down, the lines between DR and HA have become quite blurred.

Instead, it’s useful to talk about the underlying goals of DR – how fast can we recover from an incident and how much data can we stand to lose? The fancy names for these are:

Recovery Time Objective (RTO) – how fast you can recover
Recovery Point Objective (RPO) – how much data you can afford to lose

If your company makes offsite backups of all your systems and data every night at midnight, then your RPO is about 24 hours. The worst case scenario for you is that your production systems get completely fried at 11:59pm, right before your backup, and you lose a whole day’s worth of transactions. When you go to recover, you’ll have to use the previous day’s backup. If your systems got fried right after a backup and before you start work for the day, then you wouldn’t lose any data, but RPO is calculated on the worst case, not the best.

If you make hourly backups in your data center but never make offsite backups, then you’ve protected yourself against internal disasters like a disk crash (with a 1-hour RPO), but you haven’t protected yourself at all from a big disaster like a fire. That’s why you have to constantly ask yourself “what if” when dealing with DR.
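The two backup scenarios above can be sketched in a few lines of Python. This is only an illustration: the `BackupCopy` class and the `worst_case_rpo` function are hypothetical names invented here, and the model assumes that the worst-case loss for any surviving backup copy equals its backup interval.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass(frozen=True)
class BackupCopy:
    """One backup stream (hypothetical model, not a real product's API)."""
    offsite: bool          # True if the copy lives outside the data center
    interval: timedelta    # how often this copy is refreshed


def worst_case_rpo(copies: List[BackupCopy], site_lost: bool) -> Optional[timedelta]:
    """Return the worst-case data loss, or None if no backup survives.

    If the whole site is lost, only offsite copies survive. The worst case
    for a surviving copy is a failure just before its next backup, so the
    loss equals the interval of the most frequently refreshed survivor.
    """
    surviving = [c for c in copies if c.offsite or not site_lost]
    if not surviving:
        return None
    return min(c.interval for c in surviving)


# Scenario 1: nightly offsite backups only.
nightly_offsite = [BackupCopy(offsite=True, interval=timedelta(hours=24))]

# Scenario 2: hourly backups kept inside the data center, nothing offsite.
hourly_onsite = [BackupCopy(offsite=False, interval=timedelta(hours=1))]

print(worst_case_rpo(nightly_offsite, site_lost=True))   # 1 day, 0:00:00
print(worst_case_rpo(hourly_onsite, site_lost=False))    # 1:00:00
print(worst_case_rpo(hourly_onsite, site_lost=True))     # None
```

The `None` result in the last line is the “what if” that hourly onsite backups never answer: after a fire, there is nothing left to restore from.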

Improving RPO

One of the simplest ways to improve RPO is to use “cloud-backed storage” in your data center. The idea is that data files are stored locally but duplicated and kept up-to-date in commodity storage in the cloud. This is an extremely cost-effective form of backup, and it gives you an RPO of nearly zero. If your data center were completely lost, you’d lose only a few moments of data transactions. Companies like Nasuni have taken this basic architecture and delivered an enormous amount of functionality around it for shockingly low cost. I’ll share more detail on cloud-backed storage products in a future post.

Improving RTO

Improving RTO is a bit trickier – and costlier. There is a wide variety of techniques to help you recover systems quickly. One approach is maintaining multiple active systems in different locations that can absorb system load if one of the sites goes down. I recently spoke to the CIO of a large web application company that uses this approach extensively. If their website goes down, they lose millions of dollars per minute. Needless to say, the CIO is not going to let the site go down. That said, he still wants to maximize the value of his infrastructure investments, so he chooses not to have any passive data centers or nodes. Everything his team designs favors an active-active approach. They use Akamai to cache the front end, so even if an entire data center went offline, users wouldn’t perceive an interruption. This is the very definition of High Availability – an RTO of zero.

On the other end of the RTO spectrum, you can manually recover backups if your primary data center goes down. The problem with this approach is that it takes quite a while – and to make things worse, because you don’t know how long the outage will last, it’s tough to make the call to fail over. The power company doesn’t usually tell you, “we’re going to be down for 93.4 hours” because they don’t know either. Obviously, if it takes longer to recover than you expect the outage to be, you might as well not start the failover … and if you have a manual failover, you probably have a manual failback, which means that you’ll probably encounter another disruption in service when the original problem is eliminated.
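The failover call described above boils down to a comparison of durations. Here is a minimal sketch, assuming you can put rough numbers on each one; the function name `should_fail_over` and the choice to count the failback disruption as extra downtime are assumptions made for illustration, not an established formula.

```python
from datetime import timedelta


def should_fail_over(
    expected_outage: timedelta,
    failover_time: timedelta,
    failback_disruption: timedelta = timedelta(0),
) -> bool:
    """Return True if failing over is expected to reduce total downtime.

    Waiting out the outage costs `expected_outage` of downtime. Failing over
    costs the time it takes to recover at the DR site, plus (assumption) the
    second disruption caused by failing back afterwards.
    """
    return failover_time + failback_disruption < expected_outage


# A manual recovery that takes 12 hours, with a 2-hour failback disruption.
manual_failover = timedelta(hours=12)
failback = timedelta(hours=2)

# Power is expected back in 4 hours: not worth starting the failover.
print(should_fail_over(timedelta(hours=4), manual_failover, failback))     # False

# The outage is expected to last 93.4 hours: fail over.
print(should_fail_over(timedelta(hours=93.4), manual_failover, failback))  # True
```

The hard part in practice is the first argument: nobody hands you a reliable `expected_outage`, which is exactly why a slow manual failover is so difficult to commit to.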

There’s a lot of benefit in moving to the middle of the RTO spectrum, and you can do so for reasonable cost. The key is replication. Synchronous replication techniques ensure that a transaction on one node doesn’t complete unless it also completes on the synchronized node(s). True synchronous technology tends to be practical only within a single physical data center, so it’s used for things like database clusters (and at a lower level, SANs themselves).

With asynchronous replication, a transaction (perhaps simply a disk write) on the primary node is complete without waiting. The replication happens just after the transaction occurs, so there is a slight possibility of data loss (i.e., a slightly higher RPO). The advantage of async, however, is that the replication can be transmitted halfway around the globe in a practical and cost-effective fashion. This makes asynchronous replication a great technique for maintaining passive nodes at DR data centers.
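The difference between the two replication modes can be shown with a toy in-memory model. This is a simplified sketch, not how any particular product works: the classes `Replica`, `SyncPrimary`, and `AsyncPrimary` are hypothetical, and network latency and failures are reduced to a queue of records that have not been shipped yet.

```python
from collections import deque


class Replica:
    """A passive node that simply stores whatever it receives."""

    def __init__(self):
        self.log = []

    def apply(self, record):
        self.log.append(record)


class SyncPrimary:
    """Acknowledges a write only after the replica has it too."""

    def __init__(self, replica):
        self.log = []
        self.replica = replica

    def write(self, record):
        self.log.append(record)
        self.replica.apply(record)  # the caller waits for this step
        return "ack"


class AsyncPrimary:
    """Acknowledges a write immediately and ships it to the replica later."""

    def __init__(self, replica):
        self.log = []
        self.replica = replica
        self.pending = deque()  # written locally, not yet replicated

    def write(self, record):
        self.log.append(record)
        self.pending.append(record)
        return "ack"  # no waiting on the remote site

    def ship(self, max_records):
        """Send up to max_records queued records (a background task in real systems)."""
        for _ in range(min(max_records, len(self.pending))):
            self.replica.apply(self.pending.popleft())


# Synchronous: the replica never lags behind an acknowledged write.
sync_replica = Replica()
sync_primary = SyncPrimary(sync_replica)
for tx in ["tx1", "tx2", "tx3"]:
    sync_primary.write(tx)

# Asynchronous: only two of three writes are shipped before the primary is lost.
async_replica = Replica()
async_primary = AsyncPrimary(async_replica)
for tx in ["tx1", "tx2", "tx3"]:
    async_primary.write(tx)
async_primary.ship(max_records=2)

print(sync_replica.log)             # ['tx1', 'tx2', 'tx3']
print(async_replica.log)            # ['tx1', 'tx2']
print(list(async_primary.pending))  # ['tx3'] would be lost in a disaster
```

Whatever is still sitting in `pending` when the primary site disappears is the data loss that gives async its slightly higher RPO. In exchange, `write` never waits on a round trip to a distant data center.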

VMware and Microsoft have been building more and more async capabilities directly into their hypervisors. Products like Double-Take add management tools that make failing over and failing back surprisingly painless. These types of tools can help you restore a given system in 15 minutes or less. Realistically, you can reduce your overall data center RTO from days or weeks for manual failover to hours or even minutes using asynchronous replication tools. Think about the value of restoring your operations after a shared natural disaster significantly faster than your competitors (or vice versa – they might be thinking of it, too). The value is likely many times the cost. For that reason, I’ll definitely share more on async replication in a future post.

Moving to the Cloud

If you run your data center entirely “in the cloud,” then you’ve outsourced your DR/HA to your cloud provider. This is usually a good strategy. I’ve visited one of Microsoft’s cloud data centers, and trust me, you don’t have the same level of resources and capabilities for keeping hardware up and running that Microsoft does. That said, NO data center is perfect – even the best cloud providers experience occasional downtime.

So over the next 3 years, it’s highly likely that Microsoft will experience much less downtime than you will … but here’s the rub – when they go down, there’s absolutely nothing you can do about it. At least when your internal SAN crashes, you can explain to your stakeholders what a black swan event it was while you lose sleep to fix the problem. When Windows Azure is down, all you can do is say, “it’s down” and maybe cite an estimate of when it will be back up.

Again, in most cases, public cloud providers are going to provide much better uptime than you can (it is, after all, the entirety of their business), so it’s usually a better bargain than hosting your own hardware. But it’s the loss of control – along with all the emotional effects that go with it – that keeps companies hosting their own hardware. Many people are still not ready for that loss of control … but the economics will eventually push most into the cloud anyway.

There are always ways to mitigate risk, especially if you’re willing to spend more money. To mitigate the loss of control, you can replicate across data centers of a single cloud provider, reducing the dependency on a single node (for some types of data, cloud providers do this automatically). You can even replicate across cloud providers. You can continue to self-host but use the public cloud just for DR – or make the public cloud your primary site and leverage your existing data center investments for DR. Or you can double down and use more SaaS products like Office 365 and salesforce.com – and ultimately outsource even more of the technology/service stack.

Helping companies move to the cloud is turning into one of the main things I do, so I’ll talk much more about these various options in future posts. From a sales perspective, it’s been quite interesting to learn that companies don’t move to the cloud for its own sake – they move for a reason, and improving DR/HA is often that reason. Some of the cloud messaging I shared at Microsoft missed this point and sometimes came off as circular reasoning (“Why move to Windows Azure? Because it’s the cloud!”).

There are many, many ways to improve DR and increase overall uptime. As the sophistication and value of these techniques improve, the cost of implementing them is going down. So if you haven’t revisited your DR architecture in the last 24 months, you owe it to your company to take another look. The cost of inaction could be huge.

What I’ve learned so far

Last week I finished up my first consulting engagement. Or at least I presented my wrap-up summary and final invoice. I probably shouldn’t say I’m “done” until I get that last check.

It’s been an amazing ride so far, and I’m having a blast. I’m learning again. In fact, I’m learning so much that I have that “drinking from the firehose” feeling … there’s a big difference when drinking from the firehose as a consultant, though, because if you mess up you don’t get paid. As an employee, you get a grace period. As a consultant, you don’t.

This is new to me. Before joining Microsoft, I called myself a consultant for more than 14 years, but I really wasn’t one. I was a high-end temporary employee. Good work, but it’s not consulting. I should have called it contracting.

You see, the way you get paid matters. For 14 years, I was paid hourly … actually, for my highest paying gigs, I was paid daily, but I never liked that, because I have a harder time putting in a “full billable day” than most people I know who are similarly useful. I remember being very proud (a bit cocky even) when I crossed over the $1,000/day barrier in the mid-’90s. I used to whisper to myself, “another day, another grand” – that’s pretty obnoxious now that I think about it. But I only felt good about charging for a full day maybe one day out of five, so I ended up essentially translating into hours anyway by charging for half a day here, three-fourths of a day there. My standard “billable day” was about 6 hours long, so I preferred billing hourly to avoid any sense of impropriety.

When you charge by the hour, your incentive is to work more hours, to stretch out your value for the customer. I know a little about a lot, so I’m generally handy to have around. I would poke my nose into lots of different situations for a customer, and I would genuinely make myself useful. People would invariably say, “Patrick, could you join us for this meeting? We’d like your perspective on this” or “Patrick, could you investigate xyz while Joe’s out? We need an answer right away.”

In the late ’90s, my friend Linda brought me into a bank in San Francisco, ostensibly to implement column-based security in a custom database app. I was there for more than 2 years doing a little bit of everything, from coding web apps to teaching Java to performing Y2K remediation. Linda called me her “pinch hitter” because she’d bring me into tough situations or put me on projects where she simply didn’t have anybody else who could do the work. She knew I’d figure it out, whatever it was. It was fun and paid well (although San Francisco during the dot-com boom was crazy expensive). The only reason I left was that I finally convinced my wife to move to my childhood home of Grand Rapids, MI (you try convincing a Texas girl of that – once you get an opening, you take it). Even then, Linda called me a couple of months later and asked me to come back. After all, I was handy – but ultimately, I was just an expensive employee who paid his own benefits and could be fired easily.

Funny detail about that gig – I never actually addressed the security issue that I was brought on to fix! It was kind of a hard problem and there was a lot of low-hanging fruit elsewhere. As a contractor, I was happy to bill hours for anything. Another hour for this customer meant I didn’t have to spend time looking for a new customer. And once there was a way to pay me, my “boss” didn’t really care about justifying my expense for any specific objective. I was useful to her. I solved her problems, so she wanted to keep me around. I made her life easier. That’s nothing to sneeze at – it’s good to be useful, whether you’re a full-time employee or a contractor – but how do you put a dollar value on that kind of usefulness? There was clearly a cap that was intuitively based on the fully-loaded cost of a comparable employee plus a premium for the ability to get rid of me at will.

A new friend recently observed, “You like to do things the hard way, don’t you?” You know what – I do! I already know I can do hourly contracting well, since I did it for 14 years. It’s not interesting to me anymore. Leaving Microsoft, I wanted to do things differently, even as I “returned” to consulting. So here’s the big difference: this time, I’m not charging for my time. I’m following the principles of Alan Weiss and billing based on the value I create for clients. In other words, I’m charging a fixed price for a specific result.

Billing based on value is utterly terrifying – but in a wonderful way. At a deep, philosophical level, I want to go where I can create the most value. I think most people do. Well, if you want to create a lot of value, you have to be pretty specific about it. When I was charging by the hour, I wasn’t specific. My value was always assumed … it was always positive, but it was pretty vague, too. Now I’m talking to customers and attempting to quantify the value I can create for them. Whoa. It’s scary, but it’s also focusing. It forces me to work closely with customers before engaging to find out where I can make the biggest impact. This sales process is real work, hard work – it’s an investment in relationships, not a perfunctory search to fill an opening.

During an engagement, I’m no longer incentivized to poke my nose into any and all issues … instead, I’m necessarily focused on delivering the value I said I’d deliver. That’s a huge change for me, and I’m still adjusting to it. Interestingly, I’m no longer rewarded for being the “smartest guy in the room” (the indispensable consultant, the guy you can’t get rid of) … in many ways I have the opposite incentive. If I collaborate well with employees to accomplish the desired result, I’m actually delivering MORE value than if I did all the work myself (teach a man to fish …). If I make employees look good and inspire them to do more of the work required to generate a result, then I’m increasing my own efficiency and margins. That’s a fascinating change of perspective for me, and I suspect it’s going to teach me a lot.

I have not mastered this form of consulting yet. I have a LOT to learn. For example, the engagement I just finished was supposed to take me 8 weeks, and it took 15. I was off by almost 100%. That’s OK – it was my first engagement, and I was going through a massive transition. There were so many details to figure out. Ultimately, I think I missed on my timing estimates because I rushed the sale – I should have slowed down a bit to get a more precise understanding of scope. I should have invested more upfront time in the sales process. To that end, I’m looking to start a more substantial second engagement with the same customer, but I’m spending a couple of weeks up front to make sure we both understand the scope better before finalizing terms. I assume I’ll get better at estimating as I complete the whole cycle a few times. I’m going to have to.

In the past, I’ve always liked getting to the point in a project where I’m not thinking about money at all, where I’m just focused on the job. It turns out that this might not be in the best interest of my customers. Money is a really useful way to keep score, especially if your goal is to create the most value you can. In the end, the goal of consulting is simple – to improve the client’s condition. My goal is to create so much value for the client that assigning some of that value back to me in the form of compensation is easy and obvious and lucrative.

I haven’t come close to figuring this all out yet, but I think I’m on the right path, so I’m going to keep moving forward. It feels like it’s going to take about three years before I’m an expert in this kind of consulting, and I couldn’t be more excited about the journey.

Of course, it’s not really about me – it’s about my customers. This has been such an intense ride for me that I couldn’t resist writing about my experiences so far. But soon I’ll start writing about the actual problems I’m solving, like migrating a data center to the cloud to improve disaster recovery. That’s pretty fun, too.