Posted on Apr 23, 2012

In the mission-critical environments industry, we often say that “failure is not an option” – and for the most part, we believe that and work toward that goal.  But the stark reality is that failure is inevitable.  At some point in the future, everything will fail.  We do not have unlimited resources, nor do we have perfect engineering or flawless operations.  Whether we look at air travel, nuclear power plants, or even the brakes on our cars, failure occurs.  We build backup systems for those inevitable failures, but even the backup systems will fail.  I have seen quadruple backup processes fail.  The only thing we can do is try to mitigate the results of failure and reduce how often it occurs.

So how do we cope with inherently dangerous or expensive processes and their inevitable failure?  In some cases, we deem failure an “acceptable risk.”  We express the chance of failure as Mean Time Between Failures (MTBF), as a likelihood such as one event in 200,000 years, as 100-year flood plains, et cetera.  The death rate on commercial airlines in the United States between 1995 and 2000 was about three deaths per 10 billion passenger miles.  Not bad, but not perfect either.  Is that an acceptable risk?  Probably not if you’re one of the three.  Even in the highly regulated, inspected and trained world of commercial air travel, failures occur.  What I wonder is, if we doubled the resources that we use in that industry for safety, would we see a reduction to 1.5 deaths per 10 billion passenger miles?  What if we spent ten times what we do now?  Would the statistic be reduced to 0.3 deaths per 10 billion passenger miles?  At what point do we run out of resources, and can we ever get it to zero failures?  The answer is no, there will always be something unforeseen – just ask the management of the Fukushima nuclear power plant.
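To make those risk figures a little more concrete, here is a rough sketch of how a “1 in N years” likelihood or an MTBF turns into a chance of failure over a given period.  It is an illustration only, and it rests on two simplifying assumptions that real systems rarely satisfy exactly: each year is independent of the others, and equipment fails at a constant rate (the exponential model).  The function names and the sample numbers are my own, chosen for the example.

```python
import math


def prob_at_least_one_event(annual_probability: float, years: int) -> float:
    """Chance of at least one event in `years` years.

    Assumes every year is independent and has the same probability.
    """
    return 1.0 - (1.0 - annual_probability) ** years


def prob_failure_within(mtbf_hours: float, operating_hours: float) -> float:
    """Chance of at least one failure within `operating_hours`.

    Assumes a constant failure rate of 1 / MTBF (exponential model).
    """
    return 1.0 - math.exp(-operating_hours / mtbf_hours)


# A "100-year flood" has a 1% chance in any given year.
# Over a 30-year facility life that is roughly a 26% chance of seeing one.
print(f"100-year flood in 30 years: {prob_at_least_one_event(0.01, 30):.2f}")

# A "1 in 200,000 years" event over a 50-year life is small, but never zero.
print(f"1-in-200,000 event in 50 years: {prob_at_least_one_event(1 / 200_000, 50):.6f}")

# Running a component for as long as its MTBF gives roughly a 63% chance
# of at least one failure, not 50% as is often assumed.
print(f"Failure within one MTBF: {prob_failure_within(100_000, 100_000):.2f}")
```

The point of the numbers is the same as the point of the paragraph above: the probability can be pushed down, but under these assumptions it never reaches zero.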

The reality of how our world works is that our customers, clients, and users of our services and processes actually dictate what failure rate and what consequences are “acceptable.”  They determine this through their choices about what they purchase and how much.  As an example, would you pay $10 to fly from Los Angeles to New York if you knew there was a 1 in 100 chance of the plane crashing?  Would you pay $500 for the same flight if there was a 1 in 100,000,000 chance the plane would crash?  (Those are about the real odds, actually.)  Would you be willing to pay more for less of a chance of failure?  At what level are you happy to take a chance?  You do it all the time.  Every time you get behind the wheel of a car, ride your bike, go skiing, or just go outside for a walk, you are taking chances.

So how does this apply to data centers and operations?  At one of the companies I worked for, the “customer” experience was delayed by 250 milliseconds if we had a data center failure.  If customers had to wait a quarter of a second longer for their service, it was disastrous.  For most companies, a quarter of a second of lag is not even a noticeable inconvenience; but to a currency trader such as a bank, 250 milliseconds could mean millions lost.  In many situations, the failure of a data center is an option; but for the traders in a bank, it is not a very good one.  Yet even with a multitude of backups and engineering, there will come a time when even well-designed and well-operated banking systems will fail.  We can engineer the heck out of things and still not account for all the possibilities that will cause a failure.  The real point I want to make here is that failure is inevitable.

Humans…we are sometimes referred to in the failure equations as “primary contributors” and need to be “engineered out.”  I agree that humans do cause a lot of failures, but I have a different opinion as to how they should be treated in the equation.  I believe, and have seen, that properly trained humans taking action in the moment can and do avert failures or significantly reduce their effects.  Remember Captain “Sully” Sullenberger, who successfully landed a plane in the Hudson River and saved all 155 lives?  I’m not sure that any technology or backup systems exist that could have done what he did.  A mechanic once told me that he heard a bearing about to “go out” and said that we should replace it.  We did, and found that the bearing was indeed about to self-destruct.  The repair cost us a new bearing and some labor, but no unplanned downtime or major repair costs.  I’m sure that the story would have been much different had we run that bearing to failure.  It would have caused us to shut down a manufacturing line.

While people can be your biggest liability, you can turn them into your greatest asset.  Training, experience, and process/plant knowledge can combine to prevent and mitigate failures.  Aware personnel are walking “sound and vibration analyzers.”  They smell and see things that are out of place or abnormal and, in doing so, they observe and act on the precursors to failure.  They are, in fact, your best line of defense against failure.  Don’t they deserve that recognition and the resources to promote and enhance that ability?

Failure is inevitable, but people can be the part that saves the operation, if they are properly prepared and given appropriate authority.


  1. I have listened to many webinars and read many articles regarding risk management. It seems that no one really takes the fuel quality of standby generators seriously. At best, they depend on the company they have a preventative maintenance agreement with to do this for them. I would prefer to control my own fate and install an automated, programmable fuel filtration system to ensure optimum fuel quality to my gensets at all times. Do you think this is a sensible action?

    • I would agree. There are many that don’t even think about the quality of their fuel, and I have seen it sit for years without any care…resulting in the failure of the diesels. Thanks for bringing up this topic!!

  2. I agree – proper training and good communication between the IT and Facility teams can prevent failures and reduce failure time. Thanks.

