DCIM Policies: Automating Data Center Standard Operating Procedures (Part 3)

Our last two blogs on DCIM Policies discussed “Risk Management” and “Governance.” Risk Management covered Alarm, Escalation, Redundancy and Disaster Recovery Policies. Governance covered Security, Data Retention, Approval and SLA Policies.

This last part covers “Efficiency Management”: a set of critical KPIs that form the core of a Data Center Manager’s Handbook. The Green Grid, ASHRAE and the Uptime Institute have defined a number of KPIs for an energy-efficient and operationally efficient data center. Typically, these KPIs appear on the DCIM dashboard. This section covers four policies: PUE Policy, Rack Load Policy, Replacement Policy and Preventive Maintenance Policy.

1. PUE Policy: The power usage effectiveness (PUE) metric is an industry standard for reporting the energy performance of data centers. It is the ratio of total facility energy to IT equipment energy, so a lower value is better. Organizations need to take several measures to achieve a better PUE. PUE policies in DCIM would be as follows:

a) PUE range values: A data center may define a maximum acceptable average annualized PUE, which may vary with external temperature conditions. Alerts would be sent when the average exceeds this limit. Newer data centers (or those where DCIM has been recently implemented) that do not yet have a year’s worth of PUE values can maintain a daily, weekly, monthly or quarterly average instead.
b) UPS load: Matching UPS capacity to the system load improves PUE. If the UPS is loaded to only 30% of its capacity, its efficiency will be much lower. Hence, we may define a lower threshold for UPS load, below which an alert is generated. An upper threshold must also be defined to keep the power load of the connected downstream devices balanced.
c) Carbon Usage Effectiveness (CUE): The Green Grid, the authors of PUE, have also defined another metric, CUE, which depends on PUE. Sustainability-conscious organizations maintain CUE as an additional metric and may ask for it to be included in alert generation as well.
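
As an illustration, the three PUE checks above could be evaluated along the following lines. This is only a minimal sketch: the `PuePolicy` class, the function names and all threshold values are hypothetical and are not taken from any particular DCIM product. CUE is computed here as the carbon emission factor of the energy supply multiplied by PUE, which is how the two metrics are related.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class PuePolicy:
    # Illustrative thresholds only; each site would define its own values.
    max_avg_pue: float = 1.8
    ups_min_load_pct: float = 40.0
    ups_max_load_pct: float = 80.0


def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """PUE = total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def cue(carbon_factor_kg_per_kwh: float, pue_value: float) -> float:
    """CUE (kg CO2e per IT kWh) = carbon emission factor x PUE."""
    return carbon_factor_kg_per_kwh * pue_value


def check_pue_policy(policy: PuePolicy, pue_samples: List[float],
                     ups_load_pct: float) -> List[str]:
    """Return alert messages for the averaged PUE and the UPS load band."""
    alerts = []
    # The samples may be daily, weekly, monthly or annual values,
    # depending on how much history the site has collected.
    avg = mean(pue_samples)
    if avg > policy.max_avg_pue:
        alerts.append(f"Average PUE {avg:.2f} exceeds {policy.max_avg_pue:.2f}")
    if ups_load_pct < policy.ups_min_load_pct:
        alerts.append(f"UPS load {ups_load_pct:.0f}% is below the lower threshold")
    elif ups_load_pct > policy.ups_max_load_pct:
        alerts.append(f"UPS load {ups_load_pct:.0f}% is above the upper threshold")
    return alerts


for message in check_pue_policy(PuePolicy(), [1.9, 2.1], ups_load_pct=30.0):
    print(message)
```

In this example both the averaged PUE and the 30% UPS load fall outside the assumed limits, so two alerts are produced. In practice the messages would be handed to the escalation policy rather than printed.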

2. Rack Load Policy: A data center must have a proper rack load policy in place in terms of power load, temperature, weight, U-space and ownership allocation. Threshold or procedure breaches in rack loads need to generate on-screen warnings or alerts.

a) Rack Power: Racks are allocated power budgets, say 8 kW. If a rack is already loaded with devices drawing 7.5 kW, a workflow approval request that lists this rack as an option for placing a 900 W server should first be rejected. If the operator still attempts to configure this server in DCIM, an on-screen warning would be displayed. If the operator places the server anyway and the rack load rises beyond 8 kW, a critical alert would be sent immediately as per the escalation policy.
b) Rack Temperature: Rack temperatures are defined under alarm settings. If temperatures exceed thresholds, alerts would be sent.
c) Rack Weight: Depending on floor load bearing capacity, a certain weight capacity is allocated for each Rack. Alerts can be configured accordingly.
d) Rack U-space: Typically, some U-spaces in the rack are kept free, and this reserve should be defined. When an operator breaches this procedure by using reserved space, at least an on-screen warning, if not an alert, should appear.
e) Rack Ownership: Racks or even U-spaces may be allocated to a business owner. Placing a device belonging to a different owner in this space should generate a warning or alert.
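
A pre-placement check covering power, weight, U-space and ownership might look like the sketch below. The `Rack` and `Device` classes, their field names and the sample figures (other than the 8 kW, 7.5 kW and 900 W values from the example above) are assumptions made for illustration. Temperature is left out because it is handled under alarm settings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Device:
    name: str
    power_w: float
    weight_kg: float
    u_height: int
    owner: str


@dataclass
class Rack:
    power_budget_w: float
    weight_limit_kg: float
    total_u: int
    reserved_free_u: int          # U-spaces that policy says must stay empty
    owner: Optional[str] = None   # None means the rack is not owner-restricted
    devices: List[Device] = field(default_factory=list)


def placement_violations(rack: Rack, device: Device) -> List[str]:
    """List the policy breaches that placing `device` in `rack` would cause."""
    planned = rack.devices + [device]
    violations = []
    power = sum(d.power_w for d in planned)
    if power > rack.power_budget_w:
        violations.append(
            f"power: {power:.0f} W exceeds budget of {rack.power_budget_w:.0f} W")
    weight = sum(d.weight_kg for d in planned)
    if weight > rack.weight_limit_kg:
        violations.append(
            f"weight: {weight:.0f} kg exceeds limit of {rack.weight_limit_kg:.0f} kg")
    used_u = sum(d.u_height for d in planned)
    if used_u > rack.total_u - rack.reserved_free_u:
        violations.append(f"u-space: {used_u} U would use reserved free space")
    if rack.owner is not None and device.owner != rack.owner:
        violations.append(
            f"ownership: device owner {device.owner} differs from {rack.owner}")
    return violations


# The 8 kW rack already carrying 7.5 kW, and a new 900 W server.
rack = Rack(power_budget_w=8000, weight_limit_kg=900, total_u=42,
            reserved_free_u=4, owner="finance",
            devices=[Device("existing-load", 7500, 400, 30, "finance")])
server = Device("new-server", 900, 20, 2, "finance")
print(placement_violations(rack, server))
```

An empty list would let the approval workflow proceed. A non-empty list could drive the rejection, the on-screen warning or the critical alert, depending on how far the operator has already gone.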

3. Replacement Policy: In this policy, we define a service life for each category of device in the data center.

a) Alerts can be configured for when a device is nearing end of life. This helps in decommission planning.
b) Alerts could also be set up ahead of the actual replacement so that affected users can make contingency plans in case something goes wrong during the transition.
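
The end-of-life alert can be reduced to a simple date comparison, as in the sketch below. The device categories, the service-life figures and the 180-day notice period are placeholder assumptions, as are the function names.

```python
from datetime import date, timedelta

# Hypothetical service life per device category, in years.
SERVICE_LIFE_YEARS = {"server": 5, "ups_battery": 4, "crac_unit": 12}


def end_of_life(category: str, installed_on: date) -> date:
    """Installation date plus the service life defined for the category."""
    target_year = installed_on.year + SERVICE_LIFE_YEARS[category]
    try:
        return installed_on.replace(year=target_year)
    except ValueError:
        # Installed on 29 February and the target year is not a leap year.
        return installed_on.replace(year=target_year, day=28)


def replacement_alert_due(category: str, installed_on: date, today: date,
                          notice_days: int = 180) -> bool:
    """True once `today` falls within the notice period before end of life."""
    notice_start = end_of_life(category, installed_on) - timedelta(days=notice_days)
    return today >= notice_start


print(end_of_life("server", date(2020, 3, 1)))
print(replacement_alert_due("server", date(2020, 3, 1), today=date(2024, 10, 1)))
```

A second, shorter notice period could be evaluated the same way to warn affected users shortly before the replacement itself.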

4. Preventive Maintenance Schedule Policy: As a common practice, most changes in a data center are planned for non-critical periods. Preventive maintenance and upgrade schedules with expected downtimes can be defined in DCIM. The following can then be configured:

a) Switching off non-reachability alerts for the device during this downtime
b) Sending an alert if the actual downtime exceeds the expected downtime by a certain margin
c) Validating from the power and network chains that scheduled preventive maintenance of a device does not have a cascading impact. If it does, an alert would be generated.
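
The sketch below shows one way these three behaviours could fit together. It assumes a hypothetical `MaintenanceWindow` record and a simple dictionary describing which devices feed which. A real DCIM would derive the power and network chains from its own asset model. A device is treated as impacted only when every one of its upstream sources is down, so redundant feeds are respected.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Set


@dataclass
class MaintenanceWindow:
    device: str
    start: datetime
    expected_downtime: timedelta
    overrun_margin: timedelta

    def suppresses_alert(self, device: str, at: datetime) -> bool:
        """Non-reachability alerts are muted only inside the planned window."""
        end = self.start + self.expected_downtime
        return device == self.device and self.start <= at <= end

    def overrun(self, now: datetime, device_back_up: bool) -> bool:
        """True when the device is still down past expected downtime plus margin."""
        deadline = self.start + self.expected_downtime + self.overrun_margin
        return not device_back_up and now > deadline


def cascading_impact(target: str, sources: Dict[str, Set[str]]) -> Set[str]:
    """Devices that lose every upstream source when `target` is taken down.

    `sources` maps each device to the set of devices that feed it.
    """
    down = {target}
    changed = True
    while changed:
        changed = False
        for device, upstream in sources.items():
            # `upstream <= down` means all feeds of this device are down.
            if device not in down and upstream and upstream <= down:
                down.add(device)
                changed = True
    return down - {target}


# Hypothetical power chain: srv-1 is dual-fed, srv-2 is single-fed.
power_chain = {
    "pdu-a": {"ups-1"},
    "pdu-b": {"ups-2"},
    "srv-1": {"pdu-a", "pdu-b"},
    "srv-2": {"pdu-a"},
}
print(sorted(cascading_impact("ups-1", power_chain)))
```

Here, taking `ups-1` down for maintenance would also take down `pdu-a` and the single-fed `srv-2`, which is the kind of result that should raise an alert before the maintenance is approved.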

Summary:

Each operating procedure in the data center should have a policy behind it to help keep the environment maintained and managed. Deviations from the acceptable range should be detected automatically so that corrective action can be taken and, where possible, violations should be prevented before they occur. Besides helping to avoid data center failures, automated policies help improve governance and drive efficiency improvements. With the increased adoption of DCIM as operations, planning and management software for data centers, Standard Operating Procedures (expressed as Policies) must form the core of an effective DCIM.

To learn more about DCIM Policies, please read the whitepaper…

The “DCIM Policies: Automating Data Center Standard Operating Procedures” whitepaper outlines the importance of automating data center standard operating procedures, and how these policies help to avoid data center failures, improve governance and drive efficiency improvements.