Data Center Downtime Feb2011


The Truth and Consequences of Data Center Downtime

© 2011 Emerson Network Power


Emerson Network Power: The global leader in enabling Business-Critical Continuity

• Power: Paralleling Switchgear, Automatic Transfer Switch, Fire Pump Controller, Surge Protection, Uninterruptible Power Supplies (UPS) & Batteries, Power Distribution Units, Rack Power Distribution Unit
• Cooling: Perimeter Precision Cooling, Cold Aisle Containment, Row-Based Precision Cooling, Extreme-Density Precision Cooling
• Rack: Integrated Racks
• Monitoring: KVM Switch, Data Center Infrastructure Management


Emerson Network Power – An organization with established customers



Presentation topics
• Emerson Network Power overview
• National Survey on Data Center Downtime: Frequency, Duration and Cost, Dr. Larry Ponemon, Founder and President, Ponemon Institute
• Preventing the Most Common Causes of Downtime: Root Cause Analysis, Best Practice Prevention and Technology, Peter Panfil, Vice President and General Manager, Liebert North America AC Power, Emerson Network Power
• Question and Answer session



National Survey on Data Center Downtime: Frequency, Duration and Cost
Dr. Larry Ponemon, Founder and President, Ponemon Institute



About the Ponemon Institute
• The Institute is dedicated to advancing responsible information management practices that positively affect privacy, data protection and information security in business and government
• The Institute conducts independent research, educates leaders from the private and public sectors and verifies the privacy and data protection practices of organizations
• The Institute is a member of the Council of American Survey Research Organizations (CASRO), and Dr. Ponemon serves as chairman of CASRO's Government and Public Affairs Committee of the Board
• The Institute has assembled more than 50 leading multinational corporations into the RIM Council, which focuses on the development and execution of ethical principles for the collection and use of personal data about people and households



About the studies
• Purpose: Determine the frequency and cost of unplanned data center outages
• Study 1: 453 individuals in U.S. organizations who have responsibility for data center operations
  – Perceptions about data center criticality, availability and outages
  – Perception differences between executives and associates
• Study 2: Develop an activity-based costing model, derived from meetings or site visits for 41 data centers that experienced a complete or partial unplanned outage, to capture both direct and indirect costs related to:
  – Damage to mission-critical data
  – Impact of downtime on organizational productivity
  – Damage to equipment and other assets
  – Cost to detect and remediate systems and core business processes
  – Legal and regulatory impact, including litigation defense cost
  – Lost confidence and trust among key stakeholders



Perceptions about data center availability

Agree: Combines strongly agree and agree responses
Disagree: Combines strongly disagree, disagree and unsure responses


Perception differences between senior management and operators

Supervisor and below vs. Director and above


Experience with unplanned data center outages
Experienced one or more unplanned data center outages over the past 24 months

Frequency of unplanned data center outages over the past 24 months

Total data center outage: Entire facility is down
Partial outage: Limited to individual rows and racks
Device-level outage: Individual servers and IT units


Extrapolated duration of data center outages in minutes

Total data center outage: Entire facility is down
Partial outage: Limited to individual rows and racks
Device-level outage: Individual servers and IT units



Extrapolated frequency of complete data center outages by square footage



Extrapolated frequency of complete data center outages by industry

Extrapolated frequency of unplanned outages over two years


Study 2: Activity-based cost framework for the cost of data center outages

Interviewed and audited 41 data center managers who experienced an unplanned outage


Cost loadings from ABC framework

Cost activity center         Direct cost   Indirect cost   Opportunity cost   Total
Detection                        52%           48%              0%            100%
Equipment cost                   60%           40%              0%            100%
IT productivity loss             23%           77%              0%            100%
End-user productivity loss       22%           78%              0%            100%
Third parties                    35%           41%             24%            100%
Recovery                         22%           78%              0%            100%
Ex-post response                 53%           47%              0%            100%
Lost revenue                     33%           26%             41%            100%
Business disruption              24%           30%             45%            100%
Average contribution             36%           52%             12%

Interviewed and audited 41 data center managers who experienced an unplanned outage
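The activity-based costing split in the table above can be sketched as a small calculation. This is a minimal illustration of the mechanics only: the percentage loadings come from the study's table, but the dollar figures in the example are hypothetical, not from the research.

```python
# Sketch of the activity-based costing (ABC) split from the table above.
# The loadings are the study's percentages; the example dollar amounts
# are hypothetical.

COST_LOADINGS = {
    # activity: (direct, indirect, opportunity) shares of total cost
    "Detection":                  (0.52, 0.48, 0.00),
    "Equipment cost":             (0.60, 0.40, 0.00),
    "IT productivity loss":       (0.23, 0.77, 0.00),
    "End-user productivity loss": (0.22, 0.78, 0.00),
    "Third parties":              (0.35, 0.41, 0.24),
    "Recovery":                   (0.22, 0.78, 0.00),
    "Ex-post response":           (0.53, 0.47, 0.00),
    "Lost revenue":               (0.33, 0.26, 0.41),
    "Business disruption":        (0.24, 0.30, 0.45),
}

def split_outage_cost(activity_costs):
    """Split each activity's total cost into direct, indirect and
    opportunity components using the ABC loadings above."""
    totals = {"direct": 0.0, "indirect": 0.0, "opportunity": 0.0}
    for activity, total in activity_costs.items():
        direct, indirect, opportunity = COST_LOADINGS[activity]
        totals["direct"] += total * direct
        totals["indirect"] += total * indirect
        totals["opportunity"] += total * opportunity
    return totals

# Hypothetical outage: $100,000 in lost revenue plus $20,000 of recovery work.
breakdown = split_outage_cost({"Lost revenue": 100_000, "Recovery": 20_000})
```

A breakdown like this makes visible how much of an outage's cost is indirect or opportunity cost, which rarely appears in a facilities budget line.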


Average cost by category

Results shown are derived from the analysis of 41 data centers located in the United States


Total cost by industry sector

The average duration of the outage for the 41 data centers was 102 minutes


Total cost for partial and total shutdown

Results shown are derived from the analysis of 41 data centers located in the United States


Preventing the Most Common Causes of Downtime: Root Cause Analysis, Best Practice Prevention and Technology
Peter Panfil, Vice President and General Manager, Liebert North America AC Power, Emerson Network Power



Were the unplanned outages during the past 24 months preventable?



Total cost by industry sector

Data centers experienced multiple outages during the 24-month period surveyed


#1: Battery failure
• 65% of outages caused by battery failure

How?
• A single bad cell among thousands can take down a facility
• Batteries have a limited life expectancy
• False confidence; no indication of problems until needed

• Service life of a battery varies, dependent on:
  – Frequency of usage
  – Ambient temperatures
  – Quality of connections and terminals
• The weakest link in critical power



#1: Battery failure
Best Practice: Preventive Maintenance
• Service contracts for inspections and testing
  – Monthly, quarterly and annual actions need to be taken



#1: Battery failure
Best Practice: Real-Time Monitoring
• Measure the internal DC resistance of all battery cells
• Combination of hardware and software
  – Alarm management via email and SMS
  – Measures the reliability of the entire battery: straps, inter-tier connections, plates, battery connection posts/terminals
• Proactively identify and replace bad batteries

White Paper: Implementing Proactive Battery Management Strategies to Protect Your Critical Power System
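The monitoring logic above amounts to comparing each cell's measured internal resistance against its commissioning baseline. A minimal sketch of that check follows; the 25% rise-over-baseline alarm threshold and the milliohm values are hypothetical illustration numbers, not figures from the presentation.

```python
# Sketch of an internal-resistance check for real-time battery
# monitoring. The 25% rise-over-baseline threshold and the readings
# are hypothetical examples.

def flag_weak_cells(readings, baselines, rise_limit=0.25):
    """Return IDs of cells whose internal DC resistance has risen more
    than rise_limit (fractional) above the commissioning baseline."""
    weak = []
    for cell_id, milliohms in readings.items():
        baseline = baselines[cell_id]
        if (milliohms - baseline) / baseline > rise_limit:
            weak.append(cell_id)
    return weak

baselines = {"cell-01": 4.0, "cell-02": 4.1, "cell-03": 3.9}  # mOhm at install
readings  = {"cell-01": 4.2, "cell-02": 5.6, "cell-03": 4.0}  # latest scan
print(flag_weak_cells(readings, baselines))  # the single bad cell surfaces
```

In a real deployment the flagged list would feed the alarm path (email/SMS) rather than a print statement.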


#2: UPS capacity exceeded
• 53% of outages caused by lack of UPS capacity

How?
• IT gets added without knowledge of the infrastructure impact
• Redundant UPS loaded over 50%: should a UPS or battery failure occur, the remaining UPS cannot carry the full load (it would be loaded above 100%)
• IT usage is variable, not static

• IT growth outpaces AC power infrastructure growth
• Disconnect between Facilities and IT
  – The owner of the UPS might not be IT
• Battery runtime is also dependent on how much load is being supported



#2: UPS capacity exceeded
Best Practice: Additional UPS cores for capacity and redundancy
• Keep redundant UPS loading at 30%-40%
  – IT load must not exceed the total capacity of a single UPS
  – Efficiency of the Liebert NXL is optimized at partial loads
• Size the new UPS system on best-case growth
• Use real-time capacity monitoring to manage load balancing
• Configure UPS in a parallel redundant configuration

Some data centers are willing to trade redundancy for capacity; analyze the costs, risks and benefits.
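The loading rule above reduces to simple arithmetic: in a 1+1 parallel redundant pair, each unit must stay well below 50% load so the survivor can carry everything after a failure. A minimal sketch, with hypothetical capacity and load figures:

```python
# Sketch of the redundant-UPS loading check. In a 1+1 parallel pair,
# one unit must be able to carry the entire load alone, and normal
# per-unit loading should stay at the 30-40% best-practice target.
# The kW figures below are hypothetical.

def redundant_pair_ok(load_kw, unit_capacity_kw, target_fraction=0.40):
    """True if a 1+1 redundant pair can lose one unit and still carry
    the load, with each unit normally at or below target_fraction."""
    per_unit_load = load_kw / 2                 # load shared across both units
    survivor_ok = load_kw <= unit_capacity_kw   # one unit must carry it all
    headroom_ok = per_unit_load <= target_fraction * unit_capacity_kw
    return survivor_ok and headroom_ok

print(redundant_pair_ok(load_kw=320, unit_capacity_kw=500))  # 32% per unit: OK
print(redundant_pair_ok(load_kw=550, unit_capacity_kw=500))  # survivor overloaded
```

The same check generalizes to N+1 by asking whether N units can carry the load after losing one.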


#2: UPS capacity exceeded
• Options for parallel redundant UPS

N+1: Centralized static transfer switch
  – UPS cores feed a system control cabinet and a central STS ahead of the IT load
  – System-level control, fault tolerant
  – Size of the STS determines total capacity

1+N: Distributed static switches
  – Each UPS core has its own static switch (SS) feeding a paralleling cabinet ahead of the IT load
  – Individual cores manage load transfers
  – Cannot parallel different-sized UPS

White Paper: High-Availability Power Systems, Part II: Redundancy Options


#3: Accidental EPO / Human error
• 51% of outages caused by user error

How?
• Pushing the EPO thinking it's a light switch
• Improper equipment operation could drop the entire facility
• Careless installation of servers damages infrastructure

• Many people involved in data center operation
  – Too many cooks...
  – Alarms and control panels everywhere
• 100% preventable
• Most cost-effective root cause to solve



#3: Accidental EPO / Human error
Best Practice: Documentation, Standard Procedures, Training and Remote Monitoring
• Shield the EPO
• Documented maintenance procedures
• Escort visitors
• No food or drink
• Infrastructure monitoring
• Keep it clean
• Personnel training
• Labeling and one-lines
• Follow processes; no shortcuts


#3: Accidental EPO / Human error
• Best practices for EPO
  – A/B EPO in A/B data centers
  – Separate the EPO from the fire alarm
  – Remove local EPO from UPS and PDUs
  – Provide physical protection
  – Provide maintenance and test features
  – Document and label
  – Training

• 2011 code changes: NFPA 70, 645-10, Disconnecting Means



#4: UPS equipment failure
• 49% of outages caused by UPS failure

How?
• The UPS has components with a finite life; some need to be replaced
• UPS repaired with non-OEM parts
• Blaming the UPS when it's really the batteries

• The reliability of a UPS only lasts as long as the shortest component life
  – Liebert design philosophy addresses this issue by reducing the number of parts, thus decreasing the chance of a failure
• A UPS is designed to prevent outages, not cause them



#4: UPS equipment failure
Best Practice: Preventive maintenance by an experienced technician
• At least two PM visits per year
• OEM technician using OEM parts and calibration
• MTBF for units that receive two PM visits per year is 23 times higher than for a machine with no PM service events

White Paper: The Effect of Regular, Skilled Preventive Maintenance on Critical Power System Reliability
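The 23x MTBF figure translates into availability via the standard steady-state formula A = MTBF / (MTBF + MTTR). A minimal sketch follows; only the 23x ratio comes from the slide, while the baseline MTBF and repair-time figures are hypothetical.

```python
# Availability from MTBF and MTTR (standard steady-state formula).
# Only the 23x ratio is from the presentation; the baseline MTBF and
# MTTR values are hypothetical.

def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the unit is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

base_mtbf = 50_000   # hypothetical MTBF with no PM visits (hours)
mttr = 8             # hypothetical mean time to repair (hours)

no_pm   = availability(base_mtbf, mttr)
with_pm = availability(23 * base_mtbf, mttr)  # two PM visits per year
print(f"no PM: {no_pm:.5f}, with PM: {with_pm:.6f}")
```

Even with optimistic baseline numbers, the expected downtime per failure interval shrinks by the same 23x factor, which is the business case for the service contract.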


#5: Heat- and water-related
• 35% of outages caused by water incursion
• 33% of outages are heat-related

How?
• Cooling leaks and chilled water distributed in-row
• Repairs to in-row cooling cause chilled water leaks
• Server densities are rising, and so is the heat

• As densities increase, cooling is brought closer to the IT load
  – For some in-row cooling products, water is on top of, next to and below critical electrical equipment
  – Solving the heat problem, but causing a water problem



#5: Heat- and water-related
Best Practice: Refrigerants, easier maintenance and leak detection monitoring
• R410A and glycol for row-based units
  – Eliminate the need for water in the row (refrigerant-based high-density cooling)
• Monitor for leaks under the floor (point or zone detection)
• Easy maintenance is important for row-based chilled water (CW) units
  – Do you need to remove the in-row unit for repair? (Front and rear parts access)


#5: Heat- and water-related
Best Practice: Optimized airflow
• Containment
  – Increases cooling capacity and energy efficiency
• Temperature sensors
  – Supply and return
  – Rack-level
• Utilize temperature data to control and optimize cooling output
  – Variable speed drives
  – Digital scroll compressors

White Paper: Combining Cold Aisle Containment with Intelligent Control to Optimize Data Center Cooling Efficiency
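The "utilize temperature data to control cooling output" step above can be sketched as a simple proportional control loop driving a variable speed fan. The setpoint, gain and speed limits here are hypothetical illustration values, not vendor parameters.

```python
# Sketch of sensor-driven variable fan speed control: raise fan speed
# as the hottest rack's supply air drifts above a setpoint. Setpoint,
# gain and speed limits are hypothetical.

def fan_speed_pct(rack_supply_temps_c, setpoint_c=24.0, gain=8.0,
                  min_pct=30.0, max_pct=100.0):
    """Proportional control on the hottest rack's supply temperature;
    returns a fan speed command in percent."""
    hottest = max(rack_supply_temps_c)
    error = hottest - setpoint_c
    speed = min_pct + gain * max(error, 0.0)  # never slow below the floor
    return min(speed, max_pct)

print(fan_speed_pct([22.5, 23.8, 26.0]))  # hottest rack is 2 C over setpoint
```

Running fans only as fast as the hottest rack requires is exactly where the efficiency gain of variable speed drives comes from.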


#5: Heat- and water-related
• Optimized airflow not only prevents heat-related outages, it improves cooling efficiency
• Digital compressor and variable speed fan:
  – Require less fan power per kW of cooling
  – Leverage variable fan speed control
  – Operate with digital scroll technology for variable capacity control
  – Up to 33% efficiency gain


What could be done to prevent unplanned outages in the future?
• How to make the case for more resources and budget?
• What can be done short-term?


Next steps
1. Educate your senior leaders on the frequency and impact of downtime on your business
   – 56% of senior leaders think downtime doesn't happen often
2. Utilize cost-of-downtime data to justify infrastructure improvements
   – Develop a business case or your own ABC model
3. Grab the "low-hanging fruit"
   – It costs nothing to ensure IT staff don't bring a Big Gulp onto the server floor
4. Conduct assessments and audits
   – Assess batteries, capacity and airflow; vendors can help
5. Talk to your infrastructure vendors
   – Service contracts, new technology, more best practices



Q & A, further reading

Dr. Larry Ponemon, Founder and President, Ponemon Institute • National Survey on Data Center Outages • Coming Soon: Cost of Data Center Outages

Peter Panfil, Vice President and General Manager, Liebert North America AC Power, Emerson Network Power • Addressing the Leading Root Causes of Downtime


