Ripple effects of lights-out data centers

Posted on Jan 20, 2012 in Facilities Management, Industry Trends

One of the latest developments in the data center world is operating in a “lights-out” fashion. This operational model simply means running the data center with no on-site personnel. The model relies on automated backup systems, redundant data-center assets, and/or software to maintain the appearance of 100 percent uptime. The building itself is protected by its secret, unmarked location, by physical security, and by remote monitoring.

So what do lights-out data centers mean to facilities? How do they affect operations and methods? What do they mean for our staffing models, spare parts inventories, maintenance programs, and training initiatives?

Two Competing Philosophies:
Since the data center operates without human presence (hence “lights out”), one of two strategies will emerge as the method for delivering services: either (1) the data center’s capability must be redundant within the overall content delivery/processing network (spare data centers), or (2) the data center itself must be constructed and engineered so that it is automatically robust enough to maintain services and operations through almost all foreseeable events and situations.

(1)    In the data-center-redundancy philosophy, the controlling software (cloud administration) sees the data center as its components of processing capability, storage capability, and latency to consumers.  Based upon controlling algorithms, the controlling software selects the assets that provide the maximum efficiency and effectiveness for the client requirements.  It also must be prepared to go to the next best available asset should the primary asset become unavailable.
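The selection-and-failover behavior described above can be sketched roughly as follows. This is only an illustration, not how any particular cloud administration product works: the `DataCenter` fields, the site names, the numbers, and the rule "lowest latency among sites with enough spare capacity" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    # All fields are hypothetical; real controlling software tracks far more.
    name: str
    cpu_free: float        # spare processing capacity, arbitrary units
    storage_free: float    # spare storage capacity, arbitrary units
    latency_ms: float      # latency to the consumer
    available: bool = True


def rank_assets(sites, cpu_needed, storage_needed):
    """Return the usable sites, best first (lowest latency wins in this sketch)."""
    usable = [
        s for s in sites
        if s.available
        and s.cpu_free >= cpu_needed
        and s.storage_free >= storage_needed
    ]
    return sorted(usable, key=lambda s: s.latency_ms)


def select_asset(sites, cpu_needed, storage_needed):
    """Pick the best site, or None if no site can serve the request."""
    ranked = rank_assets(sites, cpu_needed, storage_needed)
    return ranked[0] if ranked else None


sites = [
    DataCenter("east", cpu_free=40, storage_free=500, latency_ms=12),
    DataCenter("central", cpu_free=80, storage_free=900, latency_ms=25),
    DataCenter("west", cpu_free=10, storage_free=100, latency_ms=8),
]

primary = select_asset(sites, cpu_needed=20, storage_needed=200)
print(primary.name)  # east: west is closer but lacks the capacity

primary.available = False  # simulate loss of the primary site
fallback = select_asset(sites, cpu_needed=20, storage_needed=200)
print(fallback.name)  # central: the next best available asset
```

Note that the failover only works because a second site with enough spare capacity exists, which is the cost discussed next.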

In this case, the assets in use must be duplicated across at least two geographically diverse locations so that a primary failure does not cause a loss of data or processing capability. It follows that the network must carry at least N+1, and more likely 2N, capacity to absorb the loss of a data center. This reserve or back-up capacity sits at the data-center level as opposed to the equipment or system level.
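To make the N+1 versus 2N difference concrete, here is a small arithmetic sketch. It assumes identical sites and made-up load and capacity figures; the function names are hypothetical and exist only for this illustration.

```python
import math


def sites_required(peak_load, site_capacity, scheme):
    """Number of data centers needed to carry peak_load under a redundancy scheme."""
    n = math.ceil(peak_load / site_capacity)  # N: sites needed with no reserve
    if scheme == "N":
        return n
    if scheme == "N+1":
        return n + 1      # one spare site: survives the loss of any single site
    if scheme == "2N":
        return 2 * n      # a full mirror of every site
    raise ValueError(f"unknown scheme: {scheme}")


def reserve_fraction(peak_load, site_capacity, scheme):
    """Share of installed capacity that sits idle at peak load."""
    installed = sites_required(peak_load, site_capacity, scheme) * site_capacity
    return 1 - peak_load / installed


# Hypothetical figures: a peak load of 30 units, 10 units per site
for scheme in ("N", "N+1", "2N"):
    count = sites_required(peak_load=30, site_capacity=10, scheme=scheme)
    idle = reserve_fraction(peak_load=30, site_capacity=10, scheme=scheme)
    print(f"{scheme}: {count} sites, {idle:.0%} idle at peak")
```

With these assumed numbers, N+1 means building four sites to do the work of three, and 2N means building six. The reserve is whole data centers, not spare chillers or generators.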

In the era of mega data centers, this philosophy would become very capital intensive very quickly. One interesting note: I believe that the economics of this philosophy will tend to drive the construction of smaller and more distributed data centers. (See my earlier blog post, “Future of Data Centers.”) With redundancy occurring at the data-center level, would you even need back-up generators? It makes for an interesting option that offers the prospect of a significant reduction in costs.

(2) When I think of the second philosophy, I think of nuclear power plants and their redundant systems inside the containment areas (the large, concrete domes). The in-containment areas are not accessible due to lethal radiation levels during and shortly after reactor operations. When the reactor is operating, these systems must work or the entire plant must shut down. The in-containment systems have at least full redundancy, and many have triple redundancy. This type of redundancy would have to exist at the lights-out data center. The systems would have to be designed and engineered around how quickly human interaction, mitigation, or intervention can be made available.

The systems and equipment of the lights-out data center must keep the site operational until someone can get to the site to mitigate the situation or repair the primary or redundant system. The worst case is that the data center goes down and the services it provides are unavailable until human intervention can take place. In some cases, this may take hours or even days.

This type of engineering and system redundancy is very expensive. Systems that are engineered to nuclear standards are normally 10 to 20 times more expensive than systems built to less stringent codes and performance criteria. That is the cost of “failure is not an option.”

So for the second philosophy, a more realistic approach might be to build to the Tier IV level as defined by the Uptime Institute, with comprehensive sensors and monitoring. One process that cannot be eliminated, whether it is performed at the site or remotely, is monitoring. All machines inevitably fail, often in an unpredictable manner, and every failure introduces risk to the operation.

Now that we understand the philosophies of lights-out data centers, let’s look at the functional areas that are affected.

Operations
With no one at the site, the site must be remotely monitored by people who are trained to understand what the parameters and alarms mean and what they imply for the operation of the data center. If system line-ups need to be changed or adjusted, they must be controlled remotely or operators must be dispatched. Personnel must come from somewhere – whether they are centrally located or distributed throughout the area – and this takes time. The response time depends, of course, on the location of the responders, weather, time of day/night, traffic, et cetera.

Maintenance
Maintenance is done only when personnel are on site. For the most part, maintenance must be coordinated with the need to deploy these maintenance personnel to other facilities. Pre-positioning strategies need to be developed to ensure that the proper tools and/or spares are available at or brought to the site when needed.

Training
Without full-time personnel at the site, training will include orienting and qualifying the technicians for operations, emergency response, and other policies/procedures. Personnel will need frequent re-training to make sure they stay up to date on changes and refresh their memories. (This may increase costs.) This training would extend to anyone who visits or monitors the site in an operational or maintenance capacity.

Incident Response
Because there are no people on site, response-time assumptions and expectations will have to be revised from minutes to hours. Systems and processes would have to be re-engineered to account for this change. For example, an incident such as a fire or flood may cause the complete loss of a facility because no one can respond in time.

Staffing
The number of man-hours required for maintenance remains the same regardless of whether people are at the site. Some staffing efficiencies could be obtained by proper scheduling for operations, maintenance, and monitoring. Staffing is probably where the largest savings could be realized. As far as facilities staffing is concerned, you probably could save about 40 to 50 percent if staff are leveraged efficiently across several data centers rather than dedicated to a single one. This is just a preliminary look and needs to be analyzed for each situation.
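The staffing comparison can be roughed out with a back-of-the-envelope model like the one below. Every input is a hypothetical figure chosen for illustration, not data from any real site, and the model itself (a roving field crew sized by workload plus a central monitoring desk) is an assumption. The point is the shape of the calculation, not the specific result.

```python
import math


def dedicated_headcount(num_sites, staff_per_site):
    """Facilities staff when every site keeps its own on-site crew."""
    return num_sites * staff_per_site


def pooled_headcount(num_sites, maint_hours_per_site, travel_hours_per_site,
                     productive_hours_per_tech, monitoring_staff):
    """Facilities staff for one roving crew plus a central monitoring desk."""
    # Maintenance hours do not shrink; travel time is added on top of them.
    field_hours = num_sites * (maint_hours_per_site + travel_hours_per_site)
    field_techs = math.ceil(field_hours / productive_hours_per_tech)
    return field_techs + monitoring_staff


def savings_fraction(dedicated, pooled):
    """Fraction of headcount saved by pooling (negative means pooling costs more)."""
    return 1 - pooled / dedicated


# Hypothetical weekly figures for four sites
dedicated = dedicated_headcount(num_sites=4, staff_per_site=5)
pooled = pooled_headcount(
    num_sites=4,
    maint_hours_per_site=50,
    travel_hours_per_site=10,
    productive_hours_per_tech=40,
    monitoring_staff=5,
)
print(dedicated, pooled, f"{savings_fraction(dedicated, pooled):.0%}")
```

Running the same functions with `num_sites=1` gives a pooled headcount larger than the dedicated one under these assumptions, which is why any savings depend on spreading the crew over several data centers.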

Innovation
The loss of the opportunity to innovate could have serious negative consequences. When people are at the data center, their brains are engaged with the site, and this interaction is what spawns innovation. One way to combat this loss is to do the work with in-house staff and to reinforce engagement through cultural and policy requirements. It may help, but you will never have the innovation that could come from constant site staffing.

Organizational Psychology/Culture
Organizational psychology and culture come into play through the concept of effective ownership. When people have too much to “own,” they won’t take effective ownership; they try to manage the workload but never actually achieve true ownership. The converse yields the same result: when people have too little to “own,” they own it for a short time until they master it; then they get bored and look for other things to do. In a lights-out model, individuals who “own” too many data centers may just try to manage the workload, and those who “own” too few can become distracted. In either event, the important organizational culture of ownership is lost.

Conclusion
The choice to go to a lights-out operations model has its pros and cons. On the one hand, it could achieve very attractive short-term financial goals. On the other hand, the choice may come with some pretty severe consequences if things go wrong or are not set up properly. From the perspective of operations, these sites need to be evaluated and engineered properly to address all possible contingencies.

Predictably, the lights-out decision comes down to how much risk you and your company are willing to accept. Looking at the first lights-out philosophy (operating with redundancy at the data-center level), the cost of network infrastructure and software development may mean this option is not viable – well, not yet anyway – or it could open up new markets for software/network control.
