Achieving High Data Center Availability through Fault Avoidance
A tier four data center is often described as fault-tolerant. To be fault-tolerant, a data center must have two parallel power and cooling systems with no single point of failure (also known as 2N). Building or co-locating in a tier four center is hardly cost-effective for most companies, and the jump from tier three to tier four provides a marginal gain in availability. However, infrastructure isn’t the only factor that plays into availability, and all data center owners/operators can see an improvement in uptime by going a step beyond fault tolerance and practicing “fault avoidance.” Without fault avoidance, tier numbers mean little in terms of availability.
This year, the Uptime Institute addressed the concept of fault avoidance at its Executive Symposium in a session titled “Beyond Fault Tolerance.” Rather than reacting to a problem once it has occurred, fault avoidance focuses on preventing problems from occurring in the first place. Many complications that lead to data center downtime can be prevented with equipment and systems monitoring, formalized staff training, thorough procedures, and regular maintenance.
Monitoring and Trend Analysis
To construct a data center with fault avoidance in mind, operators must ensure a Building Management System (BMS) or a Building Automation System (BAS) is installed. A BMS allows operators to monitor systems and draw insights, while a BAS allows operators to both monitor and automate responses based on the data collected. These systems use Programmable Logic Controllers (PLCs) that constantly check the health of data center equipment by reading data gathered from equipment sensors. Using these tools, operators can monitor individual pieces of equipment, or monitor the environment as a whole and analyze trends at the building level. The ability to monitor the whole building from one place is often referred to as a “single pane of glass” monitoring system. There are many different types of monitoring systems to choose from, and they can be purchased as a complete solution or custom-built. By drawing insights from the data these systems deliver, operators can anticipate problems before they occur and quickly troubleshoot issues when they do arise.
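As a rough illustration of how trend analysis can anticipate a problem, the sketch below fits a straight line to recent sensor readings and estimates how long it will take the trend to reach an alarm threshold. It does not represent any particular BMS or BAS product. The function name, the hourly sampling, and the assumption that the threshold is an upper limit are all hypothetical choices made for the example.

```python
from typing import Optional, Sequence


def hours_until_threshold(
    hours: Sequence[float], readings: Sequence[float], threshold: float
) -> Optional[float]:
    """Estimate hours until a rising linear trend reaches an upper threshold.

    Hypothetical helper: fits a least-squares line to (hour, reading) samples.
    Returns 0.0 if the fitted trend is already at or above the threshold,
    and None if the trend is flat or falling.
    """
    n = len(hours)
    if n < 2 or n != len(readings):
        raise ValueError("need at least two paired samples")

    mean_t = sum(hours) / n
    mean_r = sum(readings) / n
    var_t = sum((t - mean_t) ** 2 for t in hours)
    if var_t == 0:
        raise ValueError("timestamps must not all be identical")

    # Least-squares slope (units per hour) and intercept.
    slope = sum((t - mean_t) * (r - mean_r) for t, r in zip(hours, readings)) / var_t
    intercept = mean_r - slope * mean_t

    # Value of the fitted line at the most recent sample time.
    fitted_now = slope * hours[-1] + intercept
    if fitted_now >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - fitted_now) / slope
```

For example, a supply-air temperature rising by half a degree per hour from 22.0 to 23.5 would be projected to reach a 27.0 degree limit in about seven hours, which gives operators time to investigate before an alarm condition exists. Real equipment rarely behaves this linearly, so a projection like this is a prompt for inspection rather than a prediction.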
Formalized Training & MOPs
Because human error is one of the primary causes of downtime, formalized training and procedures are just as important as a data center’s infrastructure. A data center is a complex environment with interdependencies between multiple systems. Thoroughly scripted Methods of Procedure (MOPs) that are strictly followed help prevent unexpected events and keep incidents from escalating when they do occur.
Data Foundry’s Vice President of Operations, Cameron Wynne, explains, “MOPs should contain a thorough, step-by-step description of the steps to be carried out and the expected result of every action to be taken. They should also include safety requirements, prerequisites and back-out procedures, should unexpected events occur.” Many of these procedures are developed during a data center’s commissioning process. Ensuring that everyone on the facilities team is trained to carry out these procedures is critical to minimizing downtime. Any deviation from procedures can lead to errors.
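The structure of a MOP described above (steps, expected results, and back-out procedures) can be sketched as a small data model. This is only an illustration of the idea, not Data Foundry’s actual format or tooling. `MopStep` and `run_mop` are hypothetical names, and comparing results as plain strings is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MopStep:
    """One scripted step of a hypothetical MOP."""
    description: str
    action: Callable[[], str]      # performs the step and returns the observed result
    expected: str                  # result the procedure says should be observed
    back_out: Callable[[], None]   # undoes this step


def run_mop(steps: List[MopStep]) -> bool:
    """Run steps in order; on an unexpected result, back out in reverse order.

    Returns True if every step produced its expected result, False otherwise.
    """
    completed: List[MopStep] = []
    for step in steps:
        observed = step.action()
        completed.append(step)
        if observed != step.expected:
            # Deviation from the expected result: stop and undo everything
            # performed so far, most recent step first.
            for done in reversed(completed):
                done.back_out()
            return False
    return True
```

Recording the expected result next to each action is what makes a deviation detectable at the step where it happens, rather than several steps later.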
24×7 Staff and PAWs
To effectively prevent and minimize incidents in the data center, it is essential to have a 24×7 facilities team and a designated Primary Alert Watcher – commonly referred to as PAW. Our Texas 1 Facility Manager, AJ Klundt, says, “In addition to automatic alerts and alarm notifications, the PAW notifies the rest of the team in the moment any alert is detected, resulting in an immediate operator response. This prevents bigger problems and resolution can happen faster.”
In addition to monitoring infrastructure 24×7, staff overtime should be monitored as well. Ideally, data center facilities teams will be cohesive, look out for one another, and ensure that no one is overworked and fatigued on the job. Staff fatigue leads to an increase in incidents.
Preventative & Predictive Maintenance
To practice fault avoidance, operators should follow the maintenance guidelines recommended by Original Equipment Manufacturers (OEMs) and ensure equipment is properly maintained. Instead of reactively repairing equipment, careful operators use preventative and predictive maintenance to prevent incidents and downtime. Predictive maintenance involves monitoring equipment and interpreting the data to estimate when a machine or component is likely to fail, while preventative maintenance relies less on monitoring and more on planned maintenance at regular intervals. Vigilant operators use a combination of both, as some components cannot be monitored, and following a maintenance schedule does not guarantee that nothing will fail.
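A minimal sketch of how the two approaches might be combined in a maintenance check follows. The `Asset` fields, the warning level, and the service interval are hypothetical values chosen for illustration. They do not come from any OEM guideline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Asset:
    """Hypothetical record for one piece of equipment."""
    name: str
    days_since_service: int
    service_interval_days: int             # assumed OEM-recommended interval
    reading: Optional[float] = None        # None if the component cannot be monitored
    warning_level: Optional[float] = None  # assumed upper limit for the reading


def maintenance_reason(asset: Asset) -> Optional[str]:
    """Return why an asset needs maintenance now, or None if it does not."""
    # Predictive: a monitored reading has reached its warning level.
    if (
        asset.reading is not None
        and asset.warning_level is not None
        and asset.reading >= asset.warning_level
    ):
        return "predictive"
    # Preventative: the scheduled interval has elapsed, whether or not
    # the component can be monitored.
    if asset.days_since_service >= asset.service_interval_days:
        return "preventative"
    return None
```

An unmonitored component is still caught by the schedule, and a monitored one can be flagged before its interval elapses, which is the reason for using both.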
If you are looking for a data center provider, it’s a good idea to ask about their operating procedures and what steps and systems they have in place to prevent downtime. A data center with a “single pane of glass” monitoring system, thorough MOPs, regular SOC audits and 24×7 staff onsite will be significantly more reliable than one without them.