Do you understand the impact of failure in your critical engineering infrastructure? A risk based approach.
List of Contents
Summary
1.1 Introduction
1.2 Risk Management
2 Critical Engineering Risk Studies
2.1 Compliance Risk Studies
3 Business Continuity
3.1 Real Time Risk Monitoring Tools
3.2 Cost benefits analysis
Conclusion
References
Summary
This paper addresses the failure of companies to fully understand and mitigate the risk from the critical engineering infrastructure, processes and resource controls that support their businesses. Failing to understand these risks, or to take the correct actions to remove or reduce them, can result in a high cost to the business.
Uptimeplus proposes a Risk Model that combines traditional risk management techniques with real time risk status software for the critical engineering infrastructure, incorporating the three key elements of People, Process and Critical Infrastructure, as follows:
A site specific critical infrastructure visual risk dependency model, which utilises live feeds and workflow streams to provide a real time status of both the operational and capacity risks of the critical infrastructure, whether electrical or mechanical. It can be accessed from any PC, and through a bespoke dashboard from any mobile device, providing data centre managers with a real time view of operational risk.
A site specific compliance visual risk dependency model, which utilises workflow streams with automated date monitoring and escalation processes. It can be accessed from any PC, and through a bespoke dashboard from any mobile device.
A site specific uptimeplus processes visual risk dependency model, which tracks the implementation of uptimeplus CEM processes and provides automated date monitoring and escalation processes. It can be accessed from any PC, and through a bespoke dashboard from any mobile device.
1. Introduction
Failure to adequately identify and manage risks can have a devastating reputational and financial impact, as has been aptly demonstrated within the finance industry in recent years. While risk management processes are widely available and used within business organisations of all sizes, it is unusual for those processes to be adequately documented and implemented for critical systems engineering, where two main issues arise. Firstly, it can be unclear to the managing teams what impact the failure of an asset or a process will have on its dependants and ultimately on business operations. Secondly, failures occur and are either not reported or are reported without sufficient clarification of the risk to dependants and ultimately to business operations.
This situation arises from the wide range of experts employed in the construction and management of a critical systems environment, and from the often incorrect assumption that all risks have been mitigated during the design and construction phase. While this is less prevalent in the datacentre environment, it is more common in smaller businesses running their own critical infrastructure. Smaller critical environments are often designed and constructed by suitably qualified teams and then handed over to a building/facilities manager for day to day management. It is unusual to find a building/facilities manager who has been trained across the full range of disciplines involved, i.e. mechanical and electrical engineering, IT engineering, risk management and facilities management. It is not the aim of this document to detract from the role of the building/facilities manager, but it is clear that managers who are not technically trained are heavily reliant on the process documentation supplied during construction, and on the incumbent maintenance suppliers, to identify and manage risks.
Both the physical engineering systems and the human systems supporting them must be evaluated to ensure the total system meets the business need, with clear accountability and a full audit trail of issues raised and resolutions made.
The current financial crisis within the UK has resulted in businesses and organisations of all sizes looking to reduce operating costs, including the maintenance and operation of mechanical and electrical (M&E) assets. This has produced a very competitive M&E maintenance environment in which maintenance companies look to reduce costs through multiskilling and by reducing the number of time based maintenance visits, but often fail to implement a predictive maintenance scheme to detect potential asset failures. While this is sufficient for general building maintenance, a site engineer is unlikely to fully understand when the failure of an asset has actually put the business at risk, and may not even report the failure, especially if a standby unit has started and kept systems operational.
Taking all the above into consideration, it is clear that any person responsible for a critical environment must have robust operational and reporting processes in place, together with clear line of sight of the impact of an asset’s failure on its dependants.
1.2 Risk Management
The Oxford Dictionary defines risk management as “The forecasting and evaluation of financial risks together with the identification of procedures to avoid or minimize their impact”. [1] A robust risk management process will identify the risk and evaluate its impact, in conjunction with its probability, on business operations and assets. It will also identify mitigating controls to reduce or remove the risk, and provide some form of monitoring to ensure that the necessary actions and resolutions are implemented and recorded. Before a risk can be identified, its key drivers need to be understood, as shown in Fig 1.
Fig 1. Risk Drivers [2]
This paper concentrates on operational risks related to the M&E critical systems and the internal information systems that are required to identify, assess and report any risk to business operations. The two recognised methods for identifying risks are the quantitative and qualitative approaches. The qualitative risk assessment is generally considered to be a very straightforward process based on judgement, requiring no specialist skills or complicated techniques.
Risk assessment of critical systems engineering will be quantitative, where a numerical estimate is made of the probability that a defined harm will result from the occurrence of a particular event. Various methods are used to determine the numerical value, including the following:
Comparative Methods: Checklists; Audits
Fundamental Methods: Deviation Analysis; Hazard and Operability Studies; Energy Analysis; Failure Modes & Effects Analysis
Failure Logic: Fault Trees; Event Trees; Cause-Consequence diagrams
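As an illustration of how one of the failure logic methods produces a numerical estimate, the following is a minimal sketch of a fault tree calculation. It assumes that the basic events are independent, and the tree structure, event names and probabilities are hypothetical figures chosen for the example, not measured failure rates or part of the uptimeplus model.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class BasicEvent:
    name: str
    probability: float  # hypothetical probability that the event occurs


@dataclass(frozen=True)
class Gate:
    kind: str      # "AND": all inputs must occur, "OR": any one input is enough
    inputs: tuple  # BasicEvent or Gate objects


def probability(node):
    """Probability of the node's event, assuming independent basic events."""
    if isinstance(node, BasicEvent):
        return node.probability
    inputs = [probability(child) for child in node.inputs]
    if node.kind == "AND":
        return prod(inputs)
    if node.kind == "OR":
        return 1.0 - prod(1.0 - p for p in inputs)
    raise ValueError(f"unknown gate kind: {node.kind}")


# Illustrative figures only, deliberately simplified.
mains_fails = BasicEvent("mains supply fails", 0.02)
generator_fails = BasicEvent("standby generator fails to start", 0.05)
ups_fails = BasicEvent("UPS fails", 0.01)

# Standby power is lost if either the generator or the UPS fails.
standby_fails = Gate("OR", (generator_fails, ups_fails))
# The critical load loses power only if mains AND standby power both fail.
loss_of_power = Gate("AND", (mains_fails, standby_fails))

print(f"P(loss of power to critical load) = {probability(loss_of_power):.5f}")
```

The independence assumption is the main limitation of a sketch like this: common cause failures, such as a flood affecting both mains and standby plant, would need to be modelled as their own events.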
Once risks have been identified and evaluated, an action plan should be created and reviewed before implementation, typically by asking:
Will the revised controls lead to tolerable risk levels?
Are new hazards created?
Has the most cost-effective solution been chosen?
What do people affected think about the need for, and practicality of, the revised preventive measures?
Will the revised controls be used in practice, and not ignored in the face of, for example, pressures to get the job done?
A variety of software packages are on the market for qualitative and quantitative risk assessment. These provide a way of quantifying and managing risk to ensure that any identified mitigation procedures and processes are implemented, and that this implementation, or the lack of it, is recorded. However, these systems can give a building manager a false sense of security, especially where critical engineering systems are concerned. The typical modus operandi is that the risk assessment of the critical systems is made and processes and procedures are implemented, and then there is a long interval before the risks and procedures are reviewed, if at all. Risk assessment should be seen as a continuing process. Thus, the adequacy of control measures should be subject to continual review and revised if necessary.
2 Critical Engineering Risk Studies
The first step to disaster tolerance is risk avoidance, and the way to avoid risk in critical engineering is to identify and remove or mitigate single points of failure. In critical engineering risk the three key elements are Technology, People and Process, and a Single Point of Failure study should be carried out on all three elements. To ensure these studies provide an accurate assessment, the following key points should be understood and clarified with the client before the survey starts.
1. List of critical areas and supporting services needed to maintain business operations.
2. Original design intent of the critical engineering system.
3. Number of staff needed to maintain business operations.
4. External IT equipment and links required to maintain operations.
5. Client’s business continuity plans and timescales before they are implemented.
6. Cost of implementing the business continuity plan.
7. Value of loss caused by loss of business operations.
Unless all the above are clearly understood, there is a real danger that the Single Point of Failure study will identify risks, and appropriate measures to mitigate them individually, that are actually unjustified when compared to the value of the loss of business operations or the cost of implementing the business continuity plan. The single points of failure survey will review the following areas to identify the impact of failure on dependent assets or business operations.
Internal: Standby Power systems; Power; Cabling; Cooling; Segregation of Critical systems; Fire Suppression & Detection; Flood prevention; Training; Personnel; Emergency operating procedures
External: Supply Power; Flood Risks; Security; Transport links
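The idea behind a single points of failure survey can be sketched in code. The example below is a minimal illustration, assuming a hypothetical site in which every asset lists the groups of redundant alternatives it relies on. The asset names, the model layout and the function names are assumptions made for the example and do not describe any particular site or product.

```python
# Hypothetical site model: each asset maps to a list of requirement groups.
# An asset is available when it has not failed and, for every group, at least
# one asset in that group is available (members of a group are redundant).
SITE = {
    "server_room": [["ups_a", "ups_b"], ["chiller_1"]],
    "ups_a": [["switchboard"]],
    "ups_b": [["switchboard"]],
    "chiller_1": [["switchboard"]],
    "switchboard": [["mains", "generator"]],
    "mains": [],
    "generator": [],
}


def is_available(asset, model, failed):
    """True if the asset still works given the set of failed assets.

    Assumes the dependency model contains no cycles.
    """
    if asset in failed:
        return False
    return all(
        any(is_available(supplier, model, failed) for supplier in group)
        for group in model[asset]
    )


def single_points_of_failure(target, model):
    """Assets whose individual failure alone takes the target out of service."""
    return sorted(
        asset
        for asset in model
        if asset != target and not is_available(target, model, {asset})
    )


print(single_points_of_failure("server_room", SITE))
```

In this hypothetical model the two UPS units and the mains/generator pair are redundant, so only the single chiller and the shared switchboard are reported. People and process elements could be represented in the same way, for example a procedure that only one engineer is trained to carry out.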
Carrying out single points of failure surveys is common practice within the industry, and there is no argument that a completed survey gives a building manager an understanding of the risk and of what is required to remove or mitigate it. What it does not provide is a real time view of the actual risk to the systems when an asset fails, so that the manager can correctly evaluate the possible impact and decide what actions need to be carried out. Depending on the quality of processes and personnel, the building manager is often left unaware that an asset has failed that could in time affect business operations.
2.1 Compliance Risk Studies
For businesses and organisations, failure to comply with statutory regulations can result in both financial and reputational loss as well as the risk of prosecution. The UK is heavily regulated, and ensuring compliance of the regulated tasks through planned maintenance, continual monitoring and completion of identified actions is a high burden on resources. Regulated tasks include:
Periodic Electric Review
Boiler Certification (oil, gas & LPG)
PAT Testing
Landlords Gas Safety Certificate
Fire Certification
Energy Performance Certification
Fire Alarm Testing
Asbestos Surveys
Emergency Light Testing
Air Conditioning Servicing
Fire Extinguishers
Lightning Protection Equipment
Fire Risk Assessment
The Health and Safety Laboratory, an agency of the Health and Safety Executive (HSE), was tasked by HSE’s Legionella Committee in September 2011 to gather data on outbreaks of Legionnaires’ disease in Great Britain. This was completed for the 10 year period to August 2011, to identify the relationship with a range of factors.
Fig 2. Legionella Enforcements [3]
It can be seen from the above that 63% of the enforcements were due to legionella outbreaks on hot and cold water systems. Building managers who fail to understand and control the risk of regulatory compliance tasks will find themselves not only part of the statistics but also, depending on the size of the impact, in the headlines! As previously identified, there are a number of software packages for managing risk and compliance. However, very few businesses and organisations invest the time and money to ensure they are operated effectively, and time based audits will always find issues such as missed inspection dates or corrective actions not completed. It has become accepted that time based audits will always find issues, and that they are a way of checking on incumbent maintenance providers and pushing them to get tasks completed.
3 Business Continuity
Business Continuity Management (BCM) is the process of planning to ensure that your business can return to "business as usual" as quickly and painlessly as possible in the event of a major disruption. “Around half of all businesses experiencing a disaster with no effective plans for recovery fail within the following 12 months” [4]. Businesses and organisations have a range of software packages and consulting companies to assist them with devising and implementing a business continuity plan, but they all use the following basic planning and implementation steps for ensuring business continuity.
Step 1: Analyse your business
Step 2: Assess the risks
Step 3: Plan and prepare
Step 4: Communicate your plan
Step 5: Test your plan
To be effective a business continuity plan must have been tested; unless full testing is completed, documented and assessed, a business will never fully understand whether its contingency planning is sufficient to mitigate disaster. Having an effective, tested business continuity management plan will provide insight into the level of resilience required of the critical engineering infrastructure. If we take the following extreme cases, Business 2 will clearly require a far more resilient infrastructure.
Business 1 - Provides finance solutions to businesses on a software platform that has 5 mirrored servers in five countries, and business operations will only be impacted if all 5 servers are down at the same time.
Business 2 - Provides finance solutions to businesses on a software platform that has 1 server in 1 country.
However, if Business 1 has decided, due to its operating model, that it does not need resilient infrastructure but has not fully tested that it can operate utilising only one server, then clearly it is leaving itself at risk. If a business relies on mirror sites as part of its contingency plans, it must ensure that its testing is effective and complete, and for each site it would have to carry out the following:
Shut down IT servers
Remove all power to the property
Disconnect all data-links to the property
Very few firms can demonstrate that they have gone to these lengths to simulate the loss of an entire building, often choosing software data transfer and testing as an alternative. Businesses must have a global view of their operations, understand the risks across their entire portfolio, and have a means to identify the impact on resilience as failures occur.
3.1 Real Time Risk Monitoring Tools
Being able to identify key risk issues and illustrate these clearly and concisely to colleagues and business leaders, who are often non-technical, is a key requirement in the decision making and management process [5]. Key factors in the success of critical engineering environments are:
Visibility
Transparency
Accountability
Auditability
Ability to communicate quickly and accurately
It is uptimeplus's proposal that, for businesses to fully understand their risk across a range of systems, real time monitoring and modelling systems are the way forward.
Critical Systems
Linking live status information from critical engineering systems to a visual risk dependency model will provide the building manager with accurate real time information regarding the operating status of the plant. In addition, the visual risk dependency model would provide a clear indication of the risk that failed assets pose to their dependants and ultimately to business operations. Having this information available to key staff will ensure that consensus is quickly reached on the correct course of action, if any, to mitigate or remove the risk, whether that is changes to the M&E systems themselves or moving critical workflows to other sites.
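To make the link between live status information and a risk dependency model concrete, the following is a minimal sketch under stated assumptions. The asset names, the four status labels and the shape of the live feed (a simple mapping of asset name to "ok" or "failed") are all hypothetical and are not taken from the uptimeplus software.

```python
# Hypothetical dependency model: each asset lists groups of redundant suppliers.
MODEL = {
    "data_hall": [["ups_a", "ups_b"], ["cooling"]],
    "ups_a": [["mains", "generator"]],
    "ups_b": [["mains", "generator"]],
    "cooling": [["mains", "generator"]],
    "mains": [],
    "generator": [],
}


def evaluate(model, live_status):
    """Return FAILED, DOWN, AT_RISK or NORMAL for every asset.

    live_status maps asset name to "ok" or "failed"; assets missing from the
    feed are assumed to be ok. The model is assumed to contain no cycles.
    """
    result = {}

    def status(asset):
        if asset in result:
            return result[asset]
        if live_status.get(asset, "ok") != "ok":
            state = "FAILED"
        else:
            state = "NORMAL"
            for group in model[asset]:
                working = [a for a in group if status(a) in ("NORMAL", "AT_RISK")]
                if not working:
                    state = "DOWN"  # every alternative in this group is lost
                    break
                # Redundancy has been reduced, or a remaining supplier is
                # itself running at risk: still operational, but exposed.
                if len(working) < len(group) or any(
                    status(a) == "AT_RISK" for a in working
                ):
                    state = "AT_RISK"
        result[asset] = state
        return state

    for asset in model:
        status(asset)
    return result


# A standby generator fault while mains is healthy: nothing has stopped,
# yet every dependant is now one failure away from an outage.
print(evaluate(MODEL, {"generator": "failed"}))
```

The design choice worth noting is that an asset is flagged as at risk as soon as any redundancy beneath it is lost, even though it is still running. This is the situation described earlier in which a standby unit masks a failure that might otherwise go unreported.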
Compliance
Providing businesses with real time visual data for regulatory compliance, coupled with workflow systems that automatically issue reminders of inspection and testing dates, will reduce the need for frequent time based audits and so reduce the resource required.
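Automated date monitoring and escalation of this kind can be sketched as follows. This is a minimal illustration only: the 30 day reminder window, the 14 day escalation threshold, the status labels and the due dates are assumptions made for the example, while the task names are taken from the list of regulated tasks in section 2.1.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ComplianceTask:
    name: str
    due: date  # date by which the inspection or test must be completed


def escalation_level(task, today, reminder_days=30, escalate_after_days=14):
    """Classify a task by how close to, or how far past, its due date it is.

    The thresholds are hypothetical defaults, not regulatory requirements.
    """
    days_left = (task.due - today).days
    if days_left < -escalate_after_days:
        return "ESCALATED"  # long overdue: raise with senior management
    if days_left < 0:
        return "OVERDUE"    # missed: chase the responsible provider
    if days_left <= reminder_days:
        return "REMINDER"   # due soon: issue an automatic reminder
    return "OK"


# Hypothetical due dates for illustration.
tasks = [
    ComplianceTask("Fire Alarm Testing", date(2024, 3, 15)),
    ComplianceTask("PAT Testing", date(2024, 2, 20)),
    ComplianceTask("Asbestos Surveys", date(2024, 1, 10)),
    ComplianceTask("Emergency Light Testing", date(2024, 9, 1)),
]

today = date(2024, 3, 1)
for task in tasks:
    print(f"{task.name}: {escalation_level(task, today)}")
```

Running a check like this daily against the full compliance register, rather than waiting for a periodic audit, is what allows missed inspection dates to be flagged as they arise.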
Operating Procedures
Preventing an incident from escalating from a risk to a disaster requires standard and emergency operating procedures to be in place and utilised. Standard Operating Procedures are required to reduce the risk of an incident occurring by providing forward planning of staffing resource, staff training and technical operation of the critical systems. Emergency Operating Procedures are required to ensure that when an incident does occur the correct action is taken by the onsite teams to prevent a disaster. Providing businesses with real time visual data regarding the status of all operating procedures will give them assurance that the required procedures are in place, and also give them access to those procedures so they can familiarise themselves with emergency requirements.
Global View of Business Operations
Providing a global view of the real-time risk levels across an entire business portfolio will ensure that appropriate decisions are made with respect to any implementation that may compound an identified risk. If a single critical site is at risk and this is immediately highlighted, understanding what has caused the risk a) helps ensure that it is not repeated at other sites and b) gives you the opportunity to stop scheduled work that may impact your contingency.
Providing real time risk monitoring tools with a clear visual indication of status will give businesses confidence that their risk has been removed or reduced to an acceptable level, and will also provide the visibility, transparency, accountability and auditability required for critical environments. As the information is globally available, it will increase the ability to communicate quickly and accurately so that the correct decisions can be made when an incident occurs. Ultimately this would reduce resource for both the business and its support staff, as the system would be self-policing.
3.2 Risk Impact Cost analysis
The amount of money a business invests in its critical engineering and business continuity plans will be representative of the losses that could be incurred in the event of a disaster. “In one case, the cost of a single interruption mounted to over €40 million. The total annual cost of the power interruptions in this company’s case was estimated to be in the region of €88 million”. Clearly this business had not carried out sufficient risk management of its critical engineering to protect against this loss; however, it may have been that the cost of implementing risk mitigation far exceeded the cost of any losses. Before any risk mitigation is carried out, whether for people, processes or technology, a risk impact cost analysis must be carried out.
The risk impact cost benefit analysis will identify the cost to restore services in a given time frame compared to the financial losses caused by downtime. This will provide you with details of the maximum cost benefit; however, other points need to be factored in, such as the likelihood of repeat failures and the reputational loss caused by even one failure. These extra factors may mean that a business will invest heavily in risk mitigation to ensure impact costs are minimal, even though this is not the most cost beneficial approach. For a risk impact cost benefit analysis to be useful, the business must have a business continuity plan and completed Critical Engineering Risk Studies to identify the risks. Real time risk information will then enable businesses to produce effective models to ensure their money is spent wisely.
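The comparison at the heart of this analysis can be sketched with simple arithmetic. The example below is a minimal illustration in which every figure is hypothetical: the mitigation options, their annual costs and restoration times, the hourly loss and the failure frequency are assumptions chosen to show the calculation, not data from any real business.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    annual_cost: float    # yearly cost of providing this level of mitigation
    restore_hours: float  # time to restore services after a failure


def annual_total(option, loss_per_hour, failures_per_year):
    """Mitigation cost plus the expected yearly loss from downtime."""
    expected_loss = option.restore_hours * loss_per_hour * failures_per_year
    return option.annual_cost + expected_loss


# Hypothetical options and figures for illustration only.
options = [
    Option("Do nothing", 0, 24),
    Option("Standby generator", 40_000, 4),
    Option("Mirrored site", 250_000, 0.5),
]
loss_per_hour = 10_000   # assumed financial loss per hour of downtime
failures_per_year = 0.5  # assumed: one failure every two years

for option in options:
    total = annual_total(option, loss_per_hour, failures_per_year)
    print(f"{option.name}: {total:,.0f} per year")

best = min(options, key=lambda o: annual_total(o, loss_per_hour, failures_per_year))
print(f"Lowest total annual cost: {best.name}")
```

A calculation of this kind only captures direct downtime losses. Reputational loss and the likelihood of repeat failures, as noted above, sit outside it and may justify choosing an option that is not the cheapest on paper.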
Conclusion
The proposed model of real time risk modelling provides complete transparency of critical systems, compliance and operating procedures for both businesses and maintenance providers. This will provide the visibility, accountability and auditability required for critical environments. The visual risk dependency model will allow all businesses to understand the impact on operations of an asset failure, and any increased risks that may arise during maintenance periods. In addition, this transparency makes the systems self-policing and will reduce the resource required for time based audits.
About Authors: Andrew Dutton CEM Director uptimeplus 1290, Aztec West Almondsbury, Bristol BS32 4SG Email: Andrew.Duttton@integral.co.uk
Robert Clayton Senior Critical Infrastructure Manager uptimeplus 1290, Aztec West Almondsbury, Bristol BS32 4SG Email: robert.clayton@integral.co.uk
Web: http://uptimeplus.co.uk
References
[1] Oxford Dictionary.
[2] Institute of Risk Management - A Risk Management Standard.
[3] http://www.hse.gov.uk - hex1207.pdf - Legionella outbreaks and HSE investigations.
[4] BSCM Operational Risk - Critical Engineering.
[5] Greater London Authority.