The Britney Spears approach to facilities operations: “Oops, I did it again.”

Posted on Jul 13, 2012 in Facilities Management, Incidents/Downtime, Leadership | 3 comments

I can only speculate as to what caused Amazon’s latest outage, an apparent “loss of power.”  But this week, I’m going to express my opinions in no uncertain terms; fair warning.  In my experience, most organizations actually CHOOSE to have outages.  I don’t care what their sales slogans promise.  They choose to have outages.  If you don’t believe me, just read their SLAs (Service Level Agreements).  Most offer some sort of guarantee of uptime or service availability.  Amazon guarantees 99.95 percent uptime, which allows about 0.72 minutes of downtime a day, or more than four hours a year.  Beyond that, most will compensate you for the loss of service with either a billing credit or additional services.  So as long as the outages total less than four hours per year, no foul.  You might even get a “We’re sorry.”

Rackspace offers a 100 percent uptime guarantee but will only reimburse 5 percent of your monthly fee for every half hour of outage.  So if you have 10 hours of downtime, you don’t have to pay the monthly fee.  Not much consolation if your business is global and your average revenue is a million dollars/hour.
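For anyone who wants to check the SLA arithmetic above, here is a minimal sketch in Python.  The function names are my own, the credit rule is a simplified reading of the "5 percent per half hour" terms described above (not any provider's actual contract language), and the monthly fee in the usage example is a purely hypothetical number.

```python
MINUTES_PER_DAY = 24 * 60
HOURS_PER_YEAR = 365 * 24


def allowed_downtime(uptime_percent):
    """Return (minutes per day, hours per year) of downtime that an
    uptime guarantee still permits."""
    downtime_fraction = (100 - uptime_percent) / 100
    return (downtime_fraction * MINUTES_PER_DAY,
            downtime_fraction * HOURS_PER_YEAR)


def credit_fraction(outage_hours, credit_per_half_hour=0.05):
    """Fraction of the monthly fee credited back, capped at the full fee.

    Assumption: only completed half hours earn credit.
    """
    half_hours = int(outage_hours * 2)
    return min(1.0, half_hours * credit_per_half_hour)


def net_loss(outage_hours, revenue_per_hour, monthly_fee):
    """Lost revenue minus the SLA credit (both inputs are illustrative)."""
    lost_revenue = outage_hours * revenue_per_hour
    credit = credit_fraction(outage_hours) * monthly_fee
    return lost_revenue - credit


if __name__ == "__main__":
    per_day, per_year = allowed_downtime(99.95)
    print(f"99.95% uptime allows {per_day:.2f} min/day, {per_year:.2f} h/year")

    # Hypothetical customer: $1M/hour revenue, $10,000 monthly hosting fee.
    loss = net_loss(10, 1_000_000, 10_000)
    print(f"10 hours down: fee fully credited, net loss ${loss:,.0f}")
```

Whatever fee you plug in, the credit is capped at one month of hosting, while the lost revenue keeps growing with every hour of downtime.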

We all understand the business factors that drive the Britney Spears approach.  If you think about it, why would you spend millions of dollars on a facilities organization and all the mission critical processes when the average utility loses power maybe once a year?  Why would you invest in very costly mission critical infrastructure – especially when you can point to a “power outage” to explain poor service?

Buyer beware

But I look at how this approach ripples through their facilities organizations and affects how they operate their facilities.  In reality, I see that many of the co-location, cloud, and IT-as-a-service providers have dealt with this equation and elected not to invest in infrastructure or facilities personnel.  Buyer Beware: These companies are trusting in the utilities’ stable infrastructure to meet their SLA.  If something happens and there is an outage, they’re prepared to lose a little of their margins and send out the apologies.

From a business perspective, this may make sense to the service providers; however, it can be fatal to any business depending on the promise of uptime.  Imagine Nordstrom’s network sales infrastructure going down for a couple of hours the Friday after Thanksgiving.  I doubt that an apology from Amazon, even with the prospect of a billing credit, would compensate for the lost revenue and damaged reputation from the outage.  In the current state of affairs, this is exactly what can happen and why it is so important to understand how downtime can affect a company’s bottom line.  But unfortunately, until clients demand revenue replacement as the penalty for not meeting an SLA or until they refuse to sign up for these SLAs and start building their own data centers, business will continue as usual.  It’s a shame.

Amazon’s latest “Oops, we did it again” highlights the susceptibility of cloud services, in their current state, to the same risks as dedicated or co-location facilities.  As I write this, I see that other data centers have had outages too (Equinix and Level3).  Some outages just cannot be avoided (meteors, lightning strikes, earthquakes, et cetera), but the loss of a UPS or the failure of a generator to start should not be one of them.  Isn’t that why we buy and install UPSs and generators to begin with?  The fact is that a simple loss of power should not cause an outage.  I’m betting the problem is the way the facilities are maintained and operated.  Lack of battery maintenance, lack of training, or lack of configuration control are all main causes that can be avoided if management is willing to address the issues and spend the resources.  But as I stated before, the decision by management to provide these resources actually becomes a decision to have an outage or not.

So, channeling Dirty Harry to the best of my ability, if you’re a CIO or CTO responsible for your company’s IT, “You’ve got to ask yourself one question: Do I feel lucky?”  If not, then I recommend that you find a company that actually supports its facilities department and gives it the resources to mitigate the risks.  Don’t simply tour the site and talk to salespeople; talk to the facilities staff who run the facility.  Get the real story about how much risk there is at the facility that you’re considering if you want to protect your computers and, more to the point, your career.

3 Comments

  1. Excellent article, could not agree more.

  2. Terry,

    Fantastic article. I have forwarded it to many of my contacts. You have coined a new phrase to describe 99% of facilities. We see and live this daily.

    Very well written and to the point.

  3. Hey Terry,

    Long time no see! Hope you are doing well.

    Amazon’s recent outage affected some of my clients, and I admit it created a big mess. But the outage was not due to a power outage; the outage was due to software.

    Basically Amazon lost power to one of their data centers, but AWS was engineered to take the hit. So ELB (elastic load balancing) stepped in, and allowed clients who paid an additional fee to ‘fail over’ to another Amazon availability zone.

    Unfortunately, ELB nuked the whole cloud, because it was unable to deal with the flood of failovers which occurred.

    Long story short – the failure was due to poorly tested/designed software.
