Understanding human-caused downtime

Posted on Nov 18, 2011 in Incidents/Downtime

In my thirty or so years of working in mission-critical facilities, I have studied and investigated many incidents involving human-caused downtime. Most of these incidents fall into five major groupings, all of them preventable.

Communication Errors

Spoken communication is tough. If you don't believe it, just ask the people working on Siri, Dragon, or other speech-recognition software. Local slang, vernacular, pronunciations, and meanings can add confusion and misunderstanding. When I reported to my first submarine in the Navy, announcements were being made over the boat's PA system that I didn't understand for a couple of weeks. Abbreviations, local designations, and "speed announcing" made them difficult to follow. I noticed another problem as well: for those who had been aboard for some time, the announcements actually faded into the background noise, another very dangerous situation, especially since these were important safety announcements.

Have you ever listened to a song on the radio and later realized the actual lyrics were entirely different from what you thought? Our minds can play tricks on us. Often, we hear what we want to hear or expect to hear (Hearing What We Want to Hear, Chenausky, 4/1997). Add to that communications that are not clear, such as spoken operational orders that use similar-sounding letters like "C," "B," and "D," and you start to appreciate the complexity we inject into our communications. How we communicate can add risk to our operations.

How to prevent:  Develop, train on, implement, and enforce a formalized spoken-communication protocol with mandatory repeat-backs. Eliminate confusing designations and use a phonetic alphabet for your site's letter designations. Provide all personnel with a list of abbreviations authorized for use at the site. Leaders must enforce this policy and model the practice themselves.
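To make the phonetic-alphabet recommendation concrete, here is a minimal sketch in Python (purely illustrative; the spoken_form helper and the choice of the NATO alphabet are my assumptions, not a site standard) of rendering a letter/number designation in an unambiguous spoken form:

```python
# NATO phonetic alphabet: each letter maps to a distinct spoken word,
# so "B" and "D" can no longer be confused over a noisy PA circuit.
NATO = {
    "A": "Alpha",   "B": "Bravo",    "C": "Charlie", "D": "Delta",
    "E": "Echo",    "F": "Foxtrot",  "G": "Golf",    "H": "Hotel",
    "I": "India",   "J": "Juliett",  "K": "Kilo",    "L": "Lima",
    "M": "Mike",    "N": "November", "O": "Oscar",   "P": "Papa",
    "Q": "Quebec",  "R": "Romeo",    "S": "Sierra",  "T": "Tango",
    "U": "Uniform", "V": "Victor",   "W": "Whiskey", "X": "X-ray",
    "Y": "Yankee",  "Z": "Zulu",
}

def spoken_form(designation: str) -> str:
    """Render a site designation (e.g. "B-3") in its spoken form."""
    parts = []
    for ch in designation.upper():
        if ch in NATO:
            parts.append(NATO[ch])
        elif ch.isdigit():
            parts.append(ch)
        # punctuation such as "-" is simply dropped from the spoken form
    return "-".join(parts)

print(spoken_form("B-3"))  # Bravo-3
print(spoken_form("D-3"))  # Delta-3 -- no longer confusable with B-3
```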

Inattention

Inattention can be caused by many things: fatigue, emotional state, intoxication, medical condition, or local distractions (sounds, sights, other personnel, etc.). Ever felt your head snap up while driving late at night and realized that you just took a "nap" at 70 mph? Has your mind ever wandered during a conversation to the point where you literally can't remember what was just said and have to ask the speaker to repeat it?

Our minds naturally wander; it's what we do. Our brains process information much faster than it arrives. This gives the brain extra time to access memories, process relationships, and try to make sense of what it has just received. Sometimes this processing activity takes control and we "daydream" or "lose focus." Whatever you call it, it can cause the brain to focus on something other than what is critical for proper operation, communication, or observation. However momentary it may be, it can create real problems.

How to prevent:  Require the supervisor for each shift or period to assess every member of the operational staff for fitness for duty. This doesn't need to be a formal interview; a lot can be discovered by seeing each person and asking a few questions. Having each person provide a turnover status for each of their areas at a pre-shift turnover meeting can satisfy this. Send home anyone who isn't ready for the rigors of critical operations.
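As one illustration of how a shift supervisor might track this, here is a hypothetical sketch; the specific checks, the operator name, and the pass/fail logic are made up for the example:

```python
# Hypothetical pre-shift fitness-for-duty checks; any failure means
# the person does not stand watch that shift.
FITNESS_CHECKS = [
    "Alert and responsive during turnover brief",
    "Gave a coherent status for each assigned area",
    "No apparent fatigue, illness, or impairment",
]

def fit_for_duty(name: str, results: list[bool]) -> bool:
    """Print each check result; all checks must pass."""
    for check, ok in zip(FITNESS_CHECKS, results):
        print(f"{name}: {check} -- {'OK' if ok else 'FAIL'}")
    return all(results)

if not fit_for_duty("Operator A", [True, True, False]):
    print("Operator A: not ready for critical operations -- relieve and send home")
```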

For activities that are critical to the safety and continued operation of the site, make it a practice that those activities be performed by two people, as sketched below. This practice is used in the military, the nuclear industry, the airlines, and other mission-critical environments. Having a second person check each action and read the procedure aloud can prevent mistakes and makes the activity interactive, reducing the risk of inattention.
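Here is a minimal sketch of how the two-person rule might be enforced for a checklist-driven step; the CriticalStep class and the names in it are illustrative assumptions, not an industry standard:

```python
# A critical step that refuses to execute without two distinct sign-offs:
# one performer and one independent verifier.
class CriticalStep:
    def __init__(self, description: str):
        self.description = description
        self.performer = None
        self.verifier = None

    def sign(self, person: str, role: str) -> None:
        if role == "performer":
            self.performer = person
        elif role == "verifier":
            self.verifier = person

    def execute(self) -> None:
        if not self.performer or not self.verifier:
            raise PermissionError("Two sign-offs required before execution")
        if self.performer == self.verifier:
            raise PermissionError("Performer and verifier must be different people")
        print(f"Executing: {self.description} "
              f"(performed by {self.performer}, verified by {self.verifier})")

step = CriticalStep("Open valve CW-104 per procedure OP-7, step 4.2")
step.sign("Jones", "performer")
step.sign("Smith", "verifier")
step.execute()
```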

Documentation Errors

Ever followed your vehicle's GPS to a dead end, or to someplace you cannot reach because the roads have changed? You have been the victim of a documentation error. It's the same for operations staff who use outdated or incorrect documentation for the activity being performed. Working from a drawing that has not been updated since the last system upgrade, or from a procedure that is out of date or simply doesn't work, are examples of documentation errors.

How to prevent:  Develop and implement a formalized document-control program. Do not allow operation of your plant with anything other than controlled documents. Implement a formalized process to validate your procedures. Ensure that sufficient information, presented in the correct sequence, is provided for the operations team to complete the activity successfully. I recommend formal engineering development of all critical-activity procedures and processes.
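To illustrate what a document-control check might look like in practice, here is a hypothetical sketch of verifying a procedure revision against a controlled-document registry; the document IDs and revisions are invented for the example:

```python
# Registry of controlled documents and their current approved revisions.
CONTROLLED_DOCS = {
    "OP-7": "Rev 12",
    "EL-3": "Rev 4",
}

def verify_controlled(doc_id: str, revision_in_hand: str) -> None:
    """Stop work unless the document in hand is the current controlled copy."""
    current = CONTROLLED_DOCS.get(doc_id)
    if current is None:
        raise ValueError(f"{doc_id} is not a controlled document -- stop work")
    if revision_in_hand != current:
        raise ValueError(f"{doc_id} {revision_in_hand} is outdated; "
                         f"current is {current} -- stop work")
    print(f"{doc_id} {revision_in_hand} verified against the registry")

verify_controlled("OP-7", "Rev 12")       # passes
try:
    verify_controlled("OP-7", "Rev 11")   # an operator holding an old copy
except ValueError as err:
    print(err)
```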

Incorrect/Missing Labeling

Imagine being in a large steel tank with pipes and literally hundreds of valves. Then imagine water gushing into that tank. Now imagine you have about two minutes to find the valve that shuts off the flow of water into the tank before you drown (some of you will recognize this scenario from submarine school training). Oh, and by the way, none of the valves are labeled or color-coded. It would have been nice to find the one marked "water shut-off valve." In training, they never make it that simple.

In critical environments, when we are asked to perform an activity, it is vital to know that we are operating the correct valves and switches in accordance with the procedure. Procedures need valve and switch designations that exactly match the labeling in the plant. I have seen a simple cable-labeling error shut down an operating nuclear power plant.

How to prevent:  Implement a plant-standardized labeling and color-coding program. If you have a procedure to operate a component, that component needs to be labeled or coded exactly as it appears in the procedure, and this can be checked mechanically, as sketched below. A great practice is to provide a schematic of the electrical circuit or flow path on the equipment itself to aid operator understanding. Properly done, every switch, valve, or operator will be uniquely labeled and easily understood.
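Here is a simple hypothetical sketch of that mechanical check: every component designation in a procedure is compared against the plant's label registry before work begins (the IDs and steps are made up):

```python
# Labels that physically exist in the plant.
PLANT_LABELS = {"CW-104", "CW-105", "MOV-21A", "MOV-21B"}

# (step number, component designation) pairs taken from a procedure.
procedure_steps = [
    ("4.1", "MOV-21A"),
    ("4.2", "CW-104"),
    ("4.3", "CW-106"),   # typo or unlabeled component -- should be caught
]

mismatches = [(step, comp) for step, comp in procedure_steps
              if comp not in PLANT_LABELS]

if mismatches:
    for step, comp in mismatches:
        print(f"Step {step}: '{comp}' not found in plant labeling -- resolve before work")
else:
    print("All procedure designations match plant labels")
```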

Lack Of System/Process Understanding

While working at a laboratory, I observed a lab technician recount a standard sample several times. She stated that she sometimes had to recount the standard as many as seven to ten times to get an "acceptable" reading from the radiation-measuring device. It was about then that I administratively shut down the lab. The technician was invalidating the statistical process used to verify that the radiation-measuring device was operating correctly! The ramification was that the lab was potentially releasing radioactive materials to the general public. You can imagine the response this caused. Every sample that machine had been used on had to be re-analyzed, the public was notified, and the incident literally made the evening's national news, all because of one technician's lack of understanding of the process.
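To see why "recount until acceptable" defeats the control check, consider a short illustrative calculation (the pass rates here are assumptions for the example, not figures from the actual incident). Even a drifted instrument that passes a single control count only a quarter of the time will almost certainly produce an "acceptable" reading within ten tries:

```python
# If a drifted instrument passes one control count with probability 0.25,
# the chance of at least one "pass" in n independent recounts is
# 1 - (1 - 0.25)**n, which approaches certainty as retries pile up.
p_single = 0.25   # assumed single-count pass rate for a drifted instrument
for n in (1, 3, 7, 10):
    p_eventual = 1 - (1 - p_single) ** n
    print(f"up to {n:2d} counts: P(at least one 'pass') = {p_eventual:.0%}")
# up to  1 counts: 25% ... up to 10 counts: 94%
```

In other words, the retries quietly convert a calibration failure into an apparent pass, which is exactly why the control process forbids them.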

You can have operators following procedures verbatim, but if they don't understand the expected system responses, they can misinterpret what is happening, with downtime or worse as the result. Incidents that fall, to some degree, into this category include Three Mile Island, Bhopal, and the Challenger disaster.

How to prevent:  Training, training, training. This solution is not easy to implement within restrained budgets, limited training resources, and limited time, but there is no other way to fix this. There are methods to stretch your training resources, but one way or another it must be done. The training program needs some form of refresher or re-certification process, along with lessons learned from plant operational experience; a simple way to keep that process from slipping is sketched below.
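One inexpensive way to keep the refresher process on schedule is simply to track re-certification due dates. Here is a hypothetical sketch; the annual interval, names, and dates are invented for the example:

```python
from datetime import date, timedelta

RECERT_INTERVAL = timedelta(days=365)   # assumed annual re-qualification

# Date each operator last completed certification (hypothetical).
qualifications = {
    "Operator A": date(2011, 1, 15),
    "Operator B": date(2010, 9, 3),
}

today = date(2011, 11, 18)
for person, certified in qualifications.items():
    due = certified + RECERT_INTERVAL
    note = " -- OVERDUE, pull from the watchbill" if today > due else ""
    print(f"{person}: re-certification due {due}{note}")
```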

I hope this article provides some insight into human-caused downtime incidents. The prevention methods I have listed have been used for years and proven in many mission-critical environments. I hope you can use some of this in your facilities, and I'm always open to new ideas on how to prevent human-caused incidents.

3 Comments

  1. Great article. Very helpful for root-cause analysis and training.

  2. Great Blog!!!

  3. Great article, Terry. I agree with all you said. I also believe there are even more reasons, both cognitive and behavioural, why we make human errors, and that system design plus effective training and awareness can be the necessary counterbalance to these factors. Please get in contact if you would like to continue this discussion. I look forward to reading your next post. Best regards, Phil
