Recent high-profile data center outages have again brought to the fore that, while a lot has been invested in equipment and facilities for redundancy and disaster recovery, there is still a heavy reliance on manual operations. Surveys have indicated that human error ranks as the second-highest causal factor in data center outages. This in turn has been attributed to a failure to adhere to standard operating procedures (SOPs), which are usually well defined but forgotten or, worse, never communicated to operating staff.
There are several pitfalls in keeping the SOPs as a manual and not automating the procedures. The logical home for automated procedures is the DCIM (Data Center Infrastructure Management) system, which is essentially operations, planning and management software for a data center. This set of operational procedures is packaged into a “DCIM Policies” framework that links into the different modules of the DCIM software, so that the DCIM detects any potential violation and sends alerts.
There are 12 key operating procedures that should be part of “DCIM Policies”. These policies broadly fall under three major categories: Risk Management, Governance and Efficiency Management. This blog is in three parts. Part 1 covers the first category, Risk Management, and the DCIM policies that fall under it.
- Risk Management: This aims to mitigate a Data Center Manager’s nightmare: unplanned downtime or, worse, an extended outage that disrupts business application availability, causes massive financial loss and damages the organization’s reputation. The policies that fall under Risk Management are the Alarm Policy, Escalation Policy, Redundancy Policy and Disaster Recovery Policy.
- Alarm Policy: This helps to decide which devices and parameters need to be monitored, how frequently, and what their threshold levels in the system should be. Consider the expected operating temperature and humidity range as an example. Ideally, we should define operating temperature and humidity ranges at the device level, at the rack level, at the row level (for each hot and cold aisle) and at the room level (for the general comfort of operating staff). This is a high-priority decision factor under the DCIM alarm policy, since it helps prevent smoke, fire or damage to devices.
- Escalation Policy: It is important to establish a clear-cut escalation process that defines how and when alerts should be escalated in a data center. The escalation policy needs to be developed and rehearsed to ensure that the chain of command is informed and the appropriate resources are brought to bear as a situation develops. An escalation table can be defined in the DCIM that outlines the protocol, the channels for escalating issues and the contact personnel with the appropriate expertise.
- Redundancy Policy: This should be defined in the DCIM according to the customer’s needs, for example whether to have an N+1 or N+2 configuration. It is not just the redundant components that are important but also the process for testing them and making sure they work reliably, such as scheduled failover drills and research into new methodologies. If we cannot have two of something, we need to figure out how to cobble together a replacement system if the primary equipment becomes unavailable or fails.
- Disaster Recovery Policy: It is crucial to have a disaster recovery plan in place, with the RTO (Recovery Time Objective) and RPO (Recovery Point Objective) metrics well defined in the SLA. A situation is considered a data center disaster when none of the redundancy options are available: a complete power outage, for example, is a disaster. In such a situation, how quickly can we recover to get at least one section of the data center up and running? That target is the RTO. And how far back in time, before the initial power failure that led to the complete outage, is the state we recover to allowed to be? That is the RPO, which sets the maximum window of data loss that is acceptable.
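To make the alarm and escalation policies above more concrete, here is a minimal Python sketch of how thresholds at different levels and an escalation table might be expressed as data and checked automatically. It is not taken from any particular DCIM product: the names `Threshold`, `ALARM_POLICY`, `ESCALATION_TABLE`, `check_reading` and `escalation_targets` are hypothetical, and the ranges, time limits, roles and channels are placeholder values for illustration, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Acceptable range for one parameter at one level (hypothetical structure)."""
    scope: str       # "device", "rack", "row" or "room"
    parameter: str   # e.g. "temperature_c" or "humidity_pct"
    low: float
    high: float

    def violated_by(self, value: float) -> bool:
        # A reading violates the policy when it falls outside [low, high].
        return not (self.low <= value <= self.high)


# Placeholder ranges for illustration only; real values depend on the site.
ALARM_POLICY = [
    Threshold("rack", "temperature_c", 18.0, 27.0),
    Threshold("rack", "humidity_pct", 20.0, 80.0),
    Threshold("room", "temperature_c", 20.0, 25.0),
]

# Assumed escalation table: (minutes unacknowledged, contact role, channel).
ESCALATION_TABLE = [
    (0, "on-duty operator", "dashboard"),
    (15, "shift supervisor", "sms"),
    (30, "data center manager", "phone"),
]


def check_reading(scope: str, parameter: str, value: float) -> list:
    """Return every threshold in the alarm policy that this reading violates."""
    return [
        t for t in ALARM_POLICY
        if t.scope == scope and t.parameter == parameter and t.violated_by(value)
    ]


def escalation_targets(minutes_unacknowledged: int) -> list:
    """Return the (role, channel) pairs that should have been notified by now."""
    return [
        (role, channel)
        for after, role, channel in ESCALATION_TABLE
        if minutes_unacknowledged >= after
    ]


if __name__ == "__main__":
    violations = check_reading("rack", "temperature_c", 30.0)
    if violations:
        # Alert has gone unacknowledged for 20 minutes in this example.
        for role, channel in escalation_targets(20):
            print(f"notify {role} via {channel}")
```

Keeping the policy as plain data, separate from the checking logic, means that thresholds and contacts can be reviewed and updated without touching the code that enforces them.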
Every process or operating procedure in the data center should have a policy behind it to help keep the environment maintained and managed. Deviations from the acceptable range should be detected automatically so that corrective action can be taken immediately and, where possible, a violation can be prevented altogether. Besides helping to avoid data center failures, automated policies support better governance and drive efficiency improvements. In my next post I will cover streamlined governance and best practices that apply to data centers. For more information on how to derive benefits from DCIM policy-based systems, download the Greenfield Software white paper “DCIM Policies: Automating Data Center Standard Operating Procedures”.