Incidents/Downtime

Reliability-centered maintenance and data centers

Posted on Jan 27, 2012 in Facilities Management, Incidents/Downtime | 0 comments

Among the popular buzzwords bandied about in the data center industry today is Reliability-Centered Maintenance, or RCM. The term is so prevalent in the industry’s marketing lexicon at this point that it’s hard to tell which companies really understand what it is – or how to do it. In fact, many people in the industry are unaware that there is a standard by which RCM practices are measured, and a governing body that sets the standard.

Reliability-Centered Maintenance (RCM) is an engineering study conducted to determine the best course of action for maintaining a particular system or process. The Society of Automotive Engineers (SAE) defined the RCM process in its technical standard SAE JA1011, Evaluation Criteria for Reliability-Centered Maintenance (RCM) Processes (1998). The SAE standard sets out the minimum criteria that must be met before a process can legitimately be called RCM. The standard consists of seven questions that must be answered – and answered in this order – for the process to be called RCM:...


Emergency Procedures: Help or hindrance

Posted on Dec 16, 2011 in Facilities Management, Incidents/Downtime | 1 comment

Your fire alarm goes off. The sirens blare, the strobe lights flash, and some sort of mechanized voice keeps informing you that there is a fire and you must go to the nearest exit. Most of the people in the facility do exactly that – head for the nearest exit. But what about your facilities staff? What are they doing?

I can see them now, calmly going to the emergency procedure manual, carefully reviewing the index to select the right procedure, then diligently reading and checking off each step in the procedure precisely as they were trained. Never mind that the two-page procedure incorporates a four-page checklist that would take an hour to complete, if it were actually up to date. (Well, maybe you could consider the procedure up to date if you include the five Post-It notes on various pages that add the few minor things that were left out of the original procedure – little reminders like “remember to check the power to the backup system” and the current facility manager’s correct cell phone number....


Understanding human-caused downtime

Posted on Nov 18, 2011 in Incidents/Downtime | 3 comments

In my thirty or so years of working in mission critical facilities, I have studied and investigated many incidents involving human-caused downtime. Most of these incidents fall into five major groupings – all preventable.

Communication Errors

Spoken communication is tough. If you don’t believe it, just ask the people working on Siri, Dragon, or other speech-recognition software. Local slang, vernacular, pronunciations, and meanings can add confusion and misunderstanding. When I reported to my first submarine in the Navy, there were announcements being made over the boat’s PA system that I didn’t understand for a couple of weeks. The use of abbreviations, local designations, and “speed announcing” made them difficult to understand. Another problem I noticed: for those who had been there for some time, the announcements actually faded into the background noise, another very dangerous situation, especially since these were important safety announcements.

Have you ever listened to a song on the radio and then later realized the actual lyrics were something entirely different from what you thought? Our minds can play tricks on us. Oftentimes, we hear what we want to hear or expect to hear (Hearing What We Want to Hear, 4/1997, Chenausky). Add to that communications that are not clear, such as using similar-sounding letters like “C”, “B”, and “D” within spoken operational orders, and you start to appreciate the complexities that we introduce into our communications. How we communicate can add risk to our operations....
