Operational Resilience - Operational Risk Sound Practice Guide by Institute of Risk Management


Foreword The Institute of Operational Risk (IOR) was created in January 2004 and became part of the Institute of Risk Management in 2019. The IOR’s mission is to promote the development of operational risk as a profession and to develop and disseminate sound practice for the management of operational risk. The need for effective operational risk management is more acute than ever. Events such as the global financial crisis or the COVID-19 pandemic highlight the far-reaching impacts of operational risk and the consequences of management failure. In the light of these and numerous other events organisations have to ensure that their policies, procedures, and processes for the management of operational risk meet the needs of their stakeholders. This guidance is designed to complement existing standards and codes for risk management (e.g. ISO31000). The aim is to provide guidance that is both focused on the management of operational risk and practical in its application. In so doing, this is a guide for operational risk management professionals, to help them improve the practice of operational risk in organisations. Readers looking for a general understanding of the fundamentals of operational risk management should start with the IOR’s Certificate in Operational Risk. Not all the guidance in this document will be relevant for every organisation or sector. However, it has been written with the widest possible range of organisations and sectors in mind. Readers should decide for themselves what is relevant for their current situation. What matters is gradual, but continuous improvement.

The Institute of Operational Risk Sound Practice Guidance Although there is no one-size-fits-all approach to the management of operational risk, organisations must benchmark and improve their practice regularly. This is one of a series of papers, which provides practical guidance on a range of important topics that span the discipline of operational risk management. The objectives of these papers are to: • Explain how to design and implement a ‘sound’ (robust and effective) operational risk management framework • Demonstrate the value of operational risk management • Reflect the experiences of risk professionals, including the challenges involved in developing operational risk management frameworks

Contents 1. Introduction

2. Definitions

2.1 Organisational vs Operational resilience

2.2 Important business services

2.3 Impact tolerance

2.4 Operational resilience ownership and governance

2.4.1 Ownership

2.4.2 Governance

3. The risk management lifecycle

4. Identifying important business services

4.1 Identifying all business services

4.2 Mapping processes

4.3 Measuring service impact

4.4 Defining important business services

5. Setting impact tolerance

6. Operational resilience scenarios

6.1 Defining scenarios

6.2 Scenario testing

6.3 Making effective use of the outputs

7. Operational resilience monitoring and control

7.1 Operating level monitoring

7.2 Board level monitoring

8. Implementation challenges

9. Conclusions

Section 1 - Introduction

Operational resilience is defined as the ability of an organisation to deliver critical operations through disruption. It is an outcome rather than a risk. As such, it has historically been managed implicitly: proactively through an organisation’s Operational Risk Management Framework (ORMF) and IT/cyber monitoring, and reactively through incident response, business continuity, crisis and recovery frameworks. The increasing globalisation of business, particularly organisations’ use of and reliance on technology (in-house or outsourced), has led to a greater focus on operational resilience as an outcome to be explicitly monitored and reported upon at Board level.

The original organisational resilience principles in ISO 22316 (2017) have been supplemented by more detailed requirements for approach and quantification in financial services in the UK Prudential Regulation Authority’s DP01/18 and CP29/19 and, most recently, the Basel Committee on Banking Supervision’s (BCBS) principles for operational resilience (Nov 2020). Covid-19 has provided a live test case for the application of these principles, and experiences continue to shape the debate on how operational resilience is managed. Whilst some organisations have set up operational resilience teams and functions and implemented new frameworks, the Institute (and BCBS) contend that operational resilience is a component of operational risk and advocate its delivery by leveraging existing ORM and continuity frameworks.

Operational resilience is an area that attracts diverse views among operational risk practitioners. Depending on the sector, scale and risk profile of an organisation, operational resilience approaches range in complexity and scope. For these reasons, this paper does not recommend a one-size-fits-all solution. Rather, it outlines the key elements of operational resilience and a variety of good practices, using existing ORMF components, from which a collection of appropriate, relevant and proportionate ideas may be drawn. In addition, we recommend reading the IRM’s paper Organisational resilience: A Risk Managers Guide for practical guidance and insight on organisational resilience in practice.

Thank you to the author, Aidan Brock, and to the IOR’s Advisory Committee who reviewed this report.

Section 2 - Definitions

Section 2.1 - Organisational vs Operational resilience

The International Organization for Standardization (ISO) defines organisational resilience as “the ability of an organisation to absorb and adapt in a changing environment”[1]. Its key attributes can be broadly categorised as strategic/cultural or operational, and this paper focuses on the latter.

Strategic / cultural attributes
• Shared vision and clarity of purpose
• Effective and empowered leadership
• A culture supportive of organisational resilience
• Shared information and knowledge
• Resource availability
• Understanding internal and external environment to make more effective strategic decisions about the priorities for resilience

Operational attributes
• Development and co-ordination of management disciplines
• Supporting continual improvement through monitoring and evaluation
• Ability to anticipate and manage change

Table 1: Organisational resilience attributes

Operational resilience is defined as “the ability of an organisation to deliver critical operations through disruption”. It has historically been managed proactively through the ORMF and reactively through continuity management and related frameworks; however, it has not necessarily been explicitly quantified or reported upon. Proactive resilience is characterised by operational risk management at the business unit and operations level, with areas of particular cross-organisation importance, such as IT/cyber and third party risk, meriting their own consideration and governance.

Figure 1: Operational resilience is an outcome, not a risk

Reactive resilience is most commonly managed through business continuity, incident response management and crisis / recovery planning, which are backward looking and focus on recovery from specific high-impact individual risk events (e.g. power outage, cyber incident). Operational resilience is the outcome of the effectiveness of these risk management activities, so its management requires co-ordination and understanding across them all. The Operational Risk Management Framework is key, as it is used to understand, classify and map risks across the value chain of products to identify interdependencies and operational resilience gaps or areas for further investment.

Section 2.2 - Important business services

Because operational resilience is an outcome, not a risk, it challenges organisations to look more holistically at how risks could manifest across business, functional and operational areas. This is done by assessing risk through a service lens, as opposed to existing frameworks’ focus on managing risks by asset (e.g. IT system) or function. The UK and European financial regulators have produced consultative documents on operational resilience and define a ‘business service’ as “a service that an organisation provides to an external end user”[2]. The most important point here is that business services deliver a specific outcome or service to an identifiable user. They are not business lines, which are typically a collection of services and activities. A business service is deemed important if its disruption would materially impact an organisation’s (financial or operational) viability, cause significant customer harm or impact its ability to deliver its Board approved strategy.

Examples of important business services may include:
• A bank’s payment services
• A manufacturer’s production capability
• An e-commerce provider’s fulfilment process

Section 4 provides further guidance on identifying business services.

Section 2.3 - Impact tolerance

Impact tolerance is a quantification of the level of disruption an important business service can absorb before it materially impacts an organisation’s (financial or operational) viability or causes significant customer harm. Financial services regulators require impact tolerance to be quantified in terms of (service outage) time; however, more than one measure may be used. Indeed, looking at service delivery through a number of lenses may improve the organisation’s agility and responsiveness in reacting to outages. Section 5 provides further guidance on tolerance measures and setting impact tolerances.

Section 2.4 - Operational resilience ownership and governance

There is no ‘one size fits all’ solution to operational resilience ownership and governance. Approaches will depend on the size and complexity of the organisation and on how well developed and integrated its existing ORM and continuity frameworks are. Given that operational resilience challenges organisations to look at risk through a service lens, co-ordination of activities across business and service lines is vital and, irrespective of ownership, the operational risk function has a key role to play in ensuring this happens. Governance should also utilise the existing framework wherever possible: an approach that avoids adding complexity will be easier for people to understand and for owners to embed.

Section 2.4.1 - Ownership

The Board

The Board is the ultimate owner of operational resilience and is responsible for:
• Taking an active role in establishing a broad understanding of the organisation’s operational resilience approach
• Reviewing and approving the organisation’s operational resilience expectations, considering its risk appetite, risk capacity and risk profile
• Approving important business services, impact tolerances and scenarios identified by the organisation
• Satisfying itself that the organisation is meeting the requirements for having suitable strategies, processes and systems for identifying and managing operational resilience
• Constructively challenging Senior Management on the organisation’s operational resilience
• Approving annual operational resilience self-assessments

In many organisations current practice is for the Board to consider Important Business Services (IBS) and risk appetite statements drafted by Senior Management. This practice can result in anchoring and is open to challenge by supervisory authorities and during Board effectiveness reviews. The Institute feels that, where possible, Boards should be more involved in the process. This may not always be practicable in large and complex organisations. Nevertheless, the overarching principle should be that operational risk practitioners facilitate the process (whether at Board or Senior Management level) and that all decisions are supported by operational data. This is explored further in Sections 4 and 5.

Senior Management

Senior Management are responsible for delivering the Board approved strategy and, by inference, for ensuring that IBSs run within stated tolerances and that the organisation is prepared to take remedial action should internal or unforeseen external events cause them not to. Overall responsibility for operational resilience will usually sit with the COO or CRO (or equivalent).

In financial services, the SMF24 role holder (COO or equivalent) assumes responsibility for implementing the framework and reporting. In such instances, there may be a dedicated operational resilience resource or team within the operations function, but oversight and governance would be expected to be executed through the risk management framework and associated Committees. Consequently, the CRO should be responsible for developing and maintaining the framework, training, the annual assessment and oversight and challenge.

It is essential that responsibility for the resilience of each IBS is explicitly assigned to a senior owner. The managerial level at which this is allocated will depend on the size of the organisation; however, the overarching principle should be that the owner is close enough to operations to react quickly to restore an outage and is empowered to do so. Senior Management responsible for operational resilience must carry out annual self-assessments and present them for Board approval.

Business Management

Managers across an organisation will be involved in the day-to-day management of a wide variety of operational risks and hence, implicitly, of resilience. Some may be designated IBS owners to reflect their responsibilities for effective risk management. Business managers do not normally get involved in determining an organisation’s operational resilience appetite, given that this is part of a Board’s governance responsibilities. However, they may be involved in determining operational level tolerances and thresholds for key parts of an IBS that will, in aggregate, inform impact tolerance. Where they are involved in setting risk tolerances, these should not contradict the overarching operational risk or resilience appetite.

The operational risk management function or equivalent

The operational risk function has three roles:
• Supporting the work of the Board (see above)
• Overseeing the work of business managers and IBS owners in determining impact tolerance thresholds
• Developing and maintaining the framework, training, the annual assessment and oversight and challenge

In overseeing the work of business managers and IBS owners, the operational risk function should balance the activities and objectives of specific business units, departments or functions with the requirement to manage the operational resilience appetite set by the Board. The operational risk function has oversight of all business and functional areas through which IBSs are delivered. It should therefore understand end-to-end operational risks and be in a position to flag emerging risks and to challenge tolerance limits where it is concerned about consistency. Where applicable, the risk or operational risk committee can be used to support this oversight.

Section 2.4.2 - Governance

Operational resilience governance is the architecture (policies, structures, reporting arrangements, etc.) through which the management of operational resilience is monitored and controlled. The role of this architecture is to facilitate the oversight of operational resilience activities across an organisation, and to ensure that operational resilience asset allocation decisions are made consistently and that these decisions support, rather than interfere with, the achievement of an organisation’s objectives.

The Institute contends that operational resilience is a component of the ORMF and should be managed within it. That said, some larger organisations have decided to set up operational resilience teams and/or functions independently from operational risk and other supporting frameworks (business continuity, incident management, IT & cyber, etc). When deciding which approach is best for your organisation, consideration should be given to:
• Size and complexity of the organisation
• Breadth of services the organisation offers
• Relative maturity of existing proactive (e.g. operational risk management, IT & cyber, change management, third party & outsourcing) and reactive (business continuity, incident response, disaster recovery) risk management frameworks

The key elements of operational resilience, which any governance structure must address, are listed below.

Element | Description | Existing activities / resources that will inform the activity
Ownership | See 2.4.1 |
Service identification | Identifying an organisation’s business services | Customer/product journey
Service mapping | Assess importance of and identify risks in delivery of the service | RCSA, process maps
Impact Tolerance | Agree important business services and set impact tolerances for them | Board level KRIs and KCIs, IT & cyber MI, Operational MI, Incident log, Risk Appetite
Scenarios | Define operational resilience scenarios, test them, learn from the results | Business continuity, IT & cyber, incident management and recovery plans and scenarios, operational risk scenarios
Monitoring and reporting | Ongoing performance reporting through governance fora | Board risk appetite and KRIs, KCIs
Training | Internal training | Existing risk training programme
Review & oversight | Requirements for ongoing review and self-assessment | Audit

Table 2: Key elements of operational resilience

Section 3 - The risk management lifecycle

Operational resilience requires organisations to look at risk through a service lens, not just by system, business or operational area. This requires a shift in mindset. However, operational resilience is an outcome, not a risk, so the key activities to manage the risks that in aggregate determine operational resilience should follow existing risk management processes and use existing frameworks where it is practicable to do so.

Figure 2: The risk management lifecycle - as applied to business services

The above approach means that activities can be integrated into an organisation’s existing operational risk management framework and processes (RCSA, operational risk dashboards, etc), which will have the following benefits:
• Avoids additional complexity
• Quick to implement
• Less time consuming to execute assessments and keep updated
• Easy to understand and teach

It is important that activities and reporting associated with operational resilience are designed for simplicity and are adaptive as they need to support swift response to changes in the internal and external environment. Guidance on sound practice for each of the activities in figure 2 follows in sections 4 to 7.

Section 4 - Identifying important business services

A business service is defined as “a service that an organisation provides to an external end user”[3]. It is deemed ‘important’ if its disruption would materially impact an organisation’s (financial or operational) viability, cause significant customer harm or impact its ability to deliver its Board approved strategy.

As previously stated, the Board is ultimately responsible for identifying important business services. In practice, however, the COO and CRO (or equivalent) will be expected to have undertaken a joint assessment, backed up with data from the ORMF and operational / business areas, to support or challenge the Board’s conclusions on the organisation’s IBSs. Proposals should be presented to and approved by an organisation’s Executive and Board Risk Committees (or equivalent).

The initial process of determining IBSs sets the tone for how operational resilience will be managed, so it is important that there is collaboration between risk, operational and business areas from the outset. This ‘blended’ approach retains clear responsibilities for implementation, design and assurance, but encourages collaboration (a degree of overlap) between these roles. See the IOR’s Sound Practice Guidance on operational risk governance for further guidance.

Figure 3: Steps to identify important business services

Section 4.1 - Identifying all business services

In an ideal world, an organisation will have a list of the services it provides and accompanying process maps and documents. In the absence of this, or where there is documentation at a business and operational level but nothing joined up by service, the COO and CRO should coordinate activities to complete this exercise.

To help understand how collaboration can improve operational resilience governance, it is important to remember that while the operational risk and internal audit functions may contain technical experts in operational risk, it is business management that will typically have a superior understanding of the day-to-day operations of the organisation. By combining this risk expertise and operational experience, operational resilience activities can be made more effective and resource-efficient, adding value to the organisation by enhancing its ability to achieve its objectives and be agile in responding to present and future risks.

A good starting point when considering what services an organisation provides is to look at the customer journey or, in manufacturing, the production (and distribution) processes. Other sources of information to leverage when identifying services include but are not limited to:
• Risk registers
• Critical asset and third party supplier lists and risk assessments
• Business continuity and recovery plans

Once this activity is completed, an initial assessment based on expert judgement of which services are important will help to prioritise resource allocation for mapping services, which in turn informs the quantitative analysis in the ‘measure’ phase.

Section 4.2 - Mapping processes

Mapping enables organisations to identify, and manage, the key operational risks (people, processes and systems) associated with delivering a service. Understanding how a service is delivered and how it could be disrupted enables organisations to put proportionate measures in place to prevent service outage and, as a result, may create value through rationalisation of existing silo-based control activities. The mapping exercise should involve SMEs from the business and operational areas, so it is recommended that each identified service is allocated a Senior Management level owner who is responsible for managing this process and for ensuring the ongoing resilience of that specific service.

This process will be a significant undertaking for organisations, especially where services are delivered across jurisdictions and entities. There is currently no globally accepted approach for how this should be achieved; however, the UK financial regulators (PRA and FCA) have indicated that they will require firms to “identify and document resources required to deliver each of their important business services and to identify the resources that are critical to delivering a service”[4].

Mapping should reference risks in the organisation’s Risk Register to identify (internal and external) interdependencies and interconnectedness. This is where the knowledge of the operational risk / operational resilience manager is key. They will have a more holistic view of risks than colleagues in purely operational or business roles, so should, where possible, facilitate workshops and ensure consistency of approach. Section 3.4 in the IOR’s Sound Practice Guidance paper on scenario analysis, stress and reverse stress testing provides guidance on conducting workshops.

It is easy to get bogged down in detail when undertaking mapping. This is why the risk register is important: it enables you to focus on the largest risks. When doing this, participants should assume that risks will happen, so focus on impact, not likelihood. It can also be useful to refer back to operational risk first principles to get a helicopter view and check that all areas of operational risk have been appropriately considered:

Source | Comment
People | Historically people risk has been overlooked; however, Covid-19 and home working have demonstrated the importance of people (risk) in delivering services. Organisations should assess vulnerabilities and set resilience requirements or impact tolerances for people risk, not just focus on systems availability.
Process | Consider the resource inputs into the process as well as the process itself. These could include: raw materials in a production process; goods or services output from another internal service; services provided by a third party.
Systems | This is not just IT and communications systems; it could be production machinery or an ATM, for example. A key consideration here is interconnectedness and interdependencies, especially in relation to IT systems, outsourcing, use of Cloud technology and data security.
External events | This is somewhat of a catch-all; however, these could include: regulation; geopolitical; climate; markets (financial or other).

Table 3: Useful risk ‘checklist’
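As an illustration of how service mapping can surface interdependencies, the following is a minimal Python sketch. It assumes each service has already been mapped to the resources (people, processes, systems, third parties) it relies on. The service names, resource names and the function shared_resources are hypothetical and are used only for illustration; they are not part of this guidance.

```python
from collections import defaultdict

# Hypothetical output of a mapping exercise: service -> resources it depends on.
# All names are illustrative assumptions, not taken from the guidance.
service_resources = {
    "Payments": {"core banking system", "payments operations team", "cloud provider A"},
    "Online account opening": {"customer portal", "cloud provider A", "identity verification vendor"},
    "Customer helpline": {"contact centre staff", "telephony system"},
}


def shared_resources(mapping):
    """Return the resources used by more than one service, with those services."""
    users = defaultdict(set)
    for service, resources in mapping.items():
        for resource in resources:
            users[resource].add(service)
    return {r: sorted(s) for r, s in users.items() if len(s) > 1}


shared = shared_resources(service_resources)
for resource, services in sorted(shared.items()):
    print(f"{resource}: {', '.join(services)}")
```

A resource that appears against several services is a candidate single point of failure and may merit closer attention in workshops. In practice the mapping would be drawn from the risk register and process maps rather than typed in by hand.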

Section 4.3 - Measuring service impact

There is no prescriptive requirement to quantify impact using particular metrics. Nevertheless, the measures should be:
• Appropriate to the services that the organisation is delivering and to regulatory requirements
• Consistent with the organisation’s strategic objectives and risk appetite
• Measurable at an operational level

When deciding what metrics to use, it is important to consider the end goal: resilience is dynamic, so measures should, where possible, be data-driven and measurable in as close to real time as possible, to support swift action should an operational event occur that impacts service delivery. An organisation’s existing KRIs should help inform what measures to use. Bear in mind, however, that this assessment will have to sit alongside and be relatable to the RCSA process and continuity activities, so design for simplicity.

Table 4 below gives some example measures. This is not an exhaustive list, and the number of measures and the assessment scale used will depend on the size and complexity of the organisation. Nevertheless, the guiding principle should be to use fewer, better measures whenever possible.

Measure | Description | Comment
Client harm | Requires a subjective assessment; regulatory considerations may also have to be factored in | One of the key measures stipulated by UK financial services regulation. In banks the provision of payment services would be deemed high impact (i.e. cause significant harm to customers if not available) but provision of ancillary account services may not
Volume | Could relate to customer numbers in a service industry, capacity in a manufacturing context | This measure gives an indication of the relative size of the service and can be sourced from / informed by existing operational KRIs
Financial impact | The impact from a revenue perspective, not cost to remedy | Where possible this should be linked to financial risk appetite KRIs
Regulatory / reputational impact | Assessment of the service’s importance from the point of view of regulatory censure (including fines) and reputational risk which could impact an organisation’s value | Some services may not be important from a financial or a volume perspective but, from a regulatory viewpoint, be licence-to-operate issues and so have significant impact from a continuity perspective

Table 4: Example business service impact measures

Section 4.4 - Defining important business services

Once the initial assessment of services has been completed, the services can be compared. This may be done by apportioning scores based on a number of individual measures (example below) or according to the one measure (e.g. client impact) deemed to be of the greatest significance. The results of this exercise should be reviewed and validated against assessments of important business services, based on expert judgement, from the Board or relevant governing body.

Rating (score) | Client harm | Capacity | Financial impact | Regulatory impact
Low (0) | No material impact on clients | Impacts <5% of clients / production | <£1,000 | Little impact
Medium (1) | Some client inconvenience | Impacts 5-20% of clients / production | £1,001-£1m | Reportable, possible fine but no censure
High (3) | Causes significant harm to clients | Impacts >20% of clients / production | >£1m; viability of the business at risk | Significant fine and censure likely

Table 5: Example business service impact measure assessment
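The comparison of services described in this section can be sketched in a few lines of Python. The sketch assumes the Low (0), Medium (1) and High (3) scores from the example assessment. The three services, their ratings and the function names are hypothetical and included only to show the mechanics. It illustrates both approaches mentioned above: ranking by total score across measures, and flagging services that rate High on a single measure chosen as the most significant (regulatory impact is used here purely as an assumption).

```python
# Scores per rating, following the example assessment (Low 0, Medium 1, High 3).
RATING_SCORES = {"low": 0, "medium": 1, "high": 3}

# Hypothetical ratings for three illustrative services.
assessments = {
    "Payments": {
        "client_harm": "high", "capacity": "high",
        "financial": "medium", "regulatory": "high",
    },
    "Statement printing": {
        "client_harm": "low", "capacity": "medium",
        "financial": "low", "regulatory": "low",
    },
    "Regulatory reporting": {
        "client_harm": "low", "capacity": "low",
        "financial": "low", "regulatory": "high",
    },
}


def total_score(ratings):
    """Sum the scores of a service's ratings across all measures."""
    return sum(RATING_SCORES[rating] for rating in ratings.values())


def rank_services(services):
    """Return (service, total score) pairs, highest score first."""
    scored = [(name, total_score(ratings)) for name, ratings in services.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def rated_high_on(services, measure):
    """Return the services rated High on one chosen measure."""
    return [name for name, ratings in services.items() if ratings[measure] == "high"]


ranking = rank_services(assessments)
flagged = rated_high_on(assessments, "regulatory")
print(ranking)
print(flagged)
```

In this made-up data, the regulatory reporting service has a low total score but is still flagged on the single regulatory measure. This is one reason to validate scored results against expert judgement rather than rely on totals alone.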

Section 5 - Setting impact tolerance

As noted in section 2.4, the Board has ultimate responsibility for determining and setting impact tolerances for important business services. To set service tolerances, the Board will need to identify the level of service disruption that could pose a risk to the organisation’s safety and soundness or financial stability. Impact tolerances are set assuming disruption occurs, so unlike other appetite measures they do not take account of likelihood. Consequently, impact tolerances should exceed risk appetite. As such, impact tolerances provide useful input into Board level discussions on investment prioritisation and the change agenda.

Tolerance will usually be expressed in terms of service outage time (this will be mandatory for UK financial services organisations) but time can also be used in combination with other relevant metrics (for example number of clients impacted, production volume). Tolerances should be set at or before the point where disruption would cause an intolerable risk:
• Of harm to consumers or market participants
• Of financial harm to the organisation
• Of harm to the organisation’s licence to operate

Techniques that may be used to set impact tolerance thresholds include:
• Looking at historic trends in data series to understand normal versus exceptional, and intolerable, values
• Benchmarking against industry standards, for example, the regulatory maximum allowable time for processing customer payments
• Benchmarking between similar services within the organisation. This can be especially useful when they share common critical resources

Where trends or benchmarking information are not available, thresholds should be set using ‘expert judgement’, with assumptions documented and signed off, and the thresholds refined as additional information becomes available. This is not an exact science; however, the existing Board-approved risk appetite statement and accompanying KRIs can provide a useful point of reference from which to calibrate tolerances.

Once the impact tolerance has been defined, the operational level KRIs that will be used to manage the service’s critical resources (identified in the service mapping stage) can be derived. The operational risk management, continuity and IT/cyber frameworks will already include many of the required operational level KRIs, as the risks identified during service mapping will already have been assessed and appetite set, albeit through a functional or systems lens rather than a service one.
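One way to look at historic trends is to summarise a data series into bands. The Python sketch below is a minimal illustration under stated assumptions: the outage durations are invented, and treating values up to the median as ‘normal’ and values above the 90th percentile as ‘exceptional’ is an arbitrary choice made for the example, not a recommendation of this guidance. The intolerable level remains a matter for Board judgement.

```python
import statistics

# Hypothetical history: duration in minutes of past outages of one service.
outage_minutes = [5, 8, 12, 15, 15, 20, 22, 30, 35, 45, 60, 90, 120, 240]


def trend_bands(history):
    """Summarise a history into a 'normal' ceiling and an 'exceptional' floor.

    The median and 90th percentile cut-offs are illustrative assumptions.
    """
    deciles = statistics.quantiles(history, n=10)  # nine cut points
    return {
        "normal_up_to": statistics.median(history),
        "exceptional_above": deciles[8],  # 90th percentile
    }


bands = trend_bands(outage_minutes)
print(bands)
```

Bands of this kind only describe what has happened in the past. They can inform, but do not replace, the documented expert judgement and benchmarking described above.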

Figure 4 below shows the sorts of information that can be used to inform impact tolerance setting.

Figure 4: Existing metrics that can inform impact tolerance

The interaction between these information sources is best demonstrated using practical examples:

Practical Examples

Example 1: An organisation wishes to set impact tolerance for payments (identified as an important business service). The regulation stipulates that all payments must be processed within 2 working days of inception. The Board has no appetite for regulatory breaches. Experience of payment outages in the industry indicates that customers will close accounts and investors may stop funding when outages last over 3 days. The organisation decides that outages of more than 3 days could impact its safety and soundness and cause significant customer harm, so tolerance should be set no higher than this level. The organisation decides to set impact tolerance at 72 hours, above regulatory risk appetite and at the 3-day breaking point.

Example 2: An organisation wishes to set impact tolerance for the manufacturing process of product X. On average it holds 3,000 units of finished goods in stock to cover demand and forward orders and assesses that significant harm will be caused to the organisation if this buffer is removed. Historical performance data indicate that daily production averages 1,500 units (accounting for maintenance periods) and daily sales average 1,250 units. Production must therefore run at 80% capacity to fulfil daily orders without impacting stock levels, and the stock would be fully depleted if production stopped for longer than 2.5 days. The organisation decides to set impact tolerance based on a full production outage and on production capacity being reduced to 50% (one of the two production lines going down). Impact tolerances are therefore set at 2.5 days at 0% capacity and 6 days at 50% capacity.
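The arithmetic behind Example 2 can be sketched as follows. This is an illustration only: the function and variable names are hypothetical, and it assumes that the average daily production of 1,500 units represents full capacity and that sales continue at the average rate throughout the outage. Working directly from the stated averages gives a required utilisation of roughly 83% and a buffer of about 2.4 days for a full outage, so the 80% and 2.5-day figures quoted in the example read as rounded values; the 6-day figure at 50% capacity is reproduced exactly.

```python
def days_until_stock_depleted(stock_units: float, daily_sales: float,
                              daily_capacity: float, capacity_fraction: float) -> float:
    """Days before the finished-goods buffer is exhausted at a given production level."""
    net_daily_drawdown = daily_sales - daily_capacity * capacity_fraction
    if net_daily_drawdown <= 0:
        # Production covers sales, so the buffer is never depleted.
        return float("inf")
    return stock_units / net_daily_drawdown


STOCK_UNITS = 3_000      # finished goods held on average
DAILY_CAPACITY = 1_500   # assumption: average daily production equals full capacity
DAILY_SALES = 1_250      # average daily sales

required_utilisation = DAILY_SALES / DAILY_CAPACITY
full_outage_days = days_until_stock_depleted(STOCK_UNITS, DAILY_SALES, DAILY_CAPACITY, 0.0)
half_capacity_days = days_until_stock_depleted(STOCK_UNITS, DAILY_SALES, DAILY_CAPACITY, 0.5)

print(f"Utilisation needed to hold stock level: {required_utilisation:.0%}")  # 83%
print(f"Days of cover at 0% capacity: {full_outage_days:.1f}")                # 2.4
print(f"Days of cover at 50% capacity: {half_capacity_days:.1f}")             # 6.0
```

Expressing the calculation as a function makes it straightforward to test other partial-capacity cases when calibrating the tolerance.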

Section 6 - Operational resilience scenarios

The (operational) risk function is responsible for the design and implementation of an organisation’s scenario analysis and stress/reverse stress test processes for operational risk events. The Institute recommends that this is also the case for operational resilience scenario design and implementation/testing; however, where an operational resilience team exists, it may undertake this work instead. The responsible function should ensure that these processes are effective and periodically review their design and implementation.

Scenarios are widely used in reactive resilience activities to test the effectiveness of business continuity, IT & cyber and disaster recovery frameworks and plans. The process of defining scenarios also provides an opportunity for process/service owners and those involved in scenario planning to:
• Test the resilience of their organisation in relation to operational events and provide an opportunity to discuss, in advance, how to respond to them
• Provide a forward-looking perspective, by focusing managers’ attention on operational events that may differ from those in the past
• Offer a break from day-to-day risk management activities, helping managers to think creatively about future operational events and to share their knowledge and expertise in a less time-pressured environment
• Improve the control environment, where potential gaps or weaknesses in existing controls are identified as part of the analysis

Section 6.1 - Defining scenarios

Traditional operational risk scenarios focus on risk prevention and use KRIs, KCIs and RCSAs to understand risk and control effectiveness. Impact tolerance assumes a service disruption has occurred, and so operational resilience scenarios, which test an organisation’s ability to stay within tolerance, focus on response and recovery actions.
The steps for defining operational resilience scenarios are the same as those for operational risk scenarios, but the output will be different. For more on the process please refer to section 3 of the IOR’s Sound Practice Guidance on Scenario Analysis, Stress and Reverse Stress Testing (but ignore assessing probability). Scenarios for each IBS should be informed by the service mapping, which identifies the critical resources on which the service relies and hence the resources that the scenario should impact. A significant amount of work will likely have already been undertaken to define and test business continuity, incident management and recovery plans, so these should be reviewed and, where appropriate, used or enhanced to avoid unnecessary duplication. The level to which these plans and resilience scenarios are integrated will depend on the size and complexity of the organisation and how operational resilience activities are governed (see section 2.4.2).

Variable – Explanation
• Important Business Service – A description of the service against which the scenario is running
• Impact tolerance – The Board approved impact tolerance for the service
• Scenario description – A brief description of the narrative (storyline) of the scenario in question. What has happened and in what context (e.g. failure of a key supplier during a recession, business disruption during a pandemic, etc)
• Impact – Quantification of the service impact (should relate back to impact tolerance metrics)
• Causes – The events that lead up to the scenario, including people, process and systems failures or external events. Even though the assumption is that the event has happened, it is important to spend some time looking at causes, as knowing what went wrong will inform and shape the solution
• Effects – The effects of the scenario, notably how the outage impacts service, as well as potential impacts on people (e.g. health and safety or employee morale) and interconnected / interdependent assets and services
• Response – A description of how the organisation is to respond to the incident. Communication (internally and to clients and, where appropriate, regulators) and timely action are key, and key activities may already be outlined in existing incident management and recovery plans
• Recovery actions – An assessment of what actions are needed to return the service to acceptable operating levels. Where appropriate the response may cross refer to business continuity, IT & cyber and other incident management plans or recovery playbooks
• Outcome – Quantification of the post-recovery service levels (should relate back to impact tolerance metrics)

Table 6: Example resilience scenario output variables

Once the scenario exercise has been completed the outputs should be validated, wherever possible. This may be undertaken with reference to:
• Internal experience – whilst the organisation may not have experienced the exact scenario, it may have experience of certain facets of it. This is particularly true where the scenario includes IT and systems outages
• External benchmarking against known events
• Information sharing through practitioner fora

The scenarios will also be subject to periodic review by the Board and internal audit function.

Section 6.2 - Scenario testing

The nature and frequency of an organisation’s testing should be proportionate to and commensurate with its size and complexity. Testing can be through paper-based assessments, simulations or live-systems testing. The scenario causes and effects will determine which type of testing is appropriate. Nevertheless, all testing should include realistic assumptions and evolve as the firm learns from previous testing.
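The scenario output variables listed in the table above, together with the tolerance they are tested against, could be captured in a simple record so that results can be compared consistently across tests. The sketch below is one possible layout, not a prescribed format: the field names are hypothetical, impact and tolerance are both assumed to be expressed as outage hours, the post-recovery outcome is assumed to be a service level between 0 and 1, and the scenario figures are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResilienceScenario:
    """Scenario output variables for one test (hypothetical field names)."""
    important_business_service: str
    impact_tolerance_hours: float   # Board-approved tolerance, as outage time
    description: str                # the scenario storyline
    impact_outage_hours: float      # quantified impact, same metric as the tolerance
    outcome_service_level: float    # post-recovery service level, 0.0 to 1.0
    causes: List[str] = field(default_factory=list)
    effects: List[str] = field(default_factory=list)
    response: List[str] = field(default_factory=list)
    recovery_actions: List[str] = field(default_factory=list)

    def within_tolerance(self) -> bool:
        """True when the quantified impact does not exceed the tolerance."""
        return self.impact_outage_hours <= self.impact_tolerance_hours


def scenarios_outside_tolerance(scenarios: List[ResilienceScenario]) -> List[str]:
    """Descriptions of the scenarios whose impact exceeds tolerance."""
    return [s.description for s in scenarios if not s.within_tolerance()]


# Invented figures for illustration only.
tested = [
    ResilienceScenario("payments", 72, "Failure of a key supplier", 96, 0.9,
                       causes=["supplier insolvency"],
                       recovery_actions=["switch to alternative supplier"]),
    ResilienceScenario("payments", 72, "Data centre outage", 24, 1.0,
                       causes=["power failure"],
                       recovery_actions=["fail over to secondary site"]),
]
print(scenarios_outside_tolerance(tested))
```

A list of the scenarios that fall outside tolerance is the kind of summary that can feed the reporting described in the next section.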

Section 6.3 - Making effective use of the outputs

Given the resources required, it is important to make full use of the outputs from any scenario analysis process. This will include using these outputs for governance and compliance purposes and to support strategic and operational decision making.

Boards should receive reports on completed operational resilience scenario analyses, especially where these relate to events and effects that could impact the strategy, business plan and financial viability of an organisation. Senior management and, where relevant, the risk committee should also receive reports on the output, including the actions being taken to mitigate the probability and impact of the operational risk events analysed as part of this process.

Reports should not contain any unnecessary detail. Boards and senior management have limited time and must allocate this to a wide range of tasks. The focus of these reports should be on the potential impacts of events (financial or reputational) and the implications for the organisation’s financial position and business plan. Where appropriate, information might also be provided on the actions taken to mitigate identified control weaknesses. This is especially relevant for senior management and the risk committee or equivalent.

At an operational level, the results of operational resilience scenario analysis can be used to inform risk and control self-assessments. This is especially the case for assessments of inherent (gross) risk. Inherent risk assessments reflect a hypothetical level of exposure, assuming the absence or ineffectiveness of key controls, and management can find it hard to produce reliable assessments of inherent risk given its hypothetical nature.

Section 7 - Operational resilience monitoring and control

Operational resilience requires us to look at risk through a service lens. This is contrary to existing practice, which primarily focusses on management by risk category.

Table 5: Example resilience scenario output variables

Section 7.1 - Operating level monitoring

Integrating operational resilience into existing risk fora and reporting may seem impossibly complicated; however, the building blocks to support operational resilience (reporting) largely exist already. Resilience is already managed through the functional / asset lens, so the operational level KRIs and KCIs already exist; they just need to be linked to business services, where appropriate. To do this, refer back to the service mappings, which (through reference to the risk register) identify the key risks that need to be monitored.

For the above reasons, the Institute recommends that operational resilience is considered in existing operational risk management fora and reporting. There is no set way of doing this, but the overarching rule should be to use existing information and, wherever possible, not to overcomplicate. The objective is to identify emerging risks that could impact an organisation’s ability to deliver IBS (or stay within impact tolerance in the event of an outage) and, where appropriate, take action to mitigate them.

Managing operational resilience is about anticipating change and reacting to external events, so monitoring should give due consideration to:
• Interdependencies and interconnectedness between systems and services
• Third-party dependencies and outsourcing – this is particularly important in the context of technology and cloud adoption [5]
• Change management and its impact on services
• The external environment

Section 7.2 - Board level monitoring

The Board is responsible for defining important business services and setting impact tolerances. Boards should receive sufficient information to understand the organisation’s operational resilience and to make data-driven decisions on investment where services are operating close to or outside impact tolerances. This should include:

• Information on operational events that have impacted operational resilience and steps taken to recover
• Information on emerging risks and regulation that could impact future resilience and require consideration or an investment decision
• Reports or summary information on completed operational resilience scenario analyses, especially where these relate to events and effects that could impact on the strategy, business plan and financial viability of an organisation
• Results of resilience audits

This should be a distillation of information from, and decisions made in, Executive Risk Committee(s) and be included in the organisation’s Board Risk pack.
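The linkage described in Section 7.1, where existing operational level KRIs are tied to business services through the service mapping, might be sketched as follows. This is an illustration under stated assumptions: the service names, resources, indicators and thresholds are all invented, the class and function names are hypothetical, and a KRI is assumed to be breached when its value exceeds its threshold.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class KRI:
    """An existing operational level indicator (hypothetical structure)."""
    name: str
    resource: str     # the critical resource the indicator monitors
    value: float
    threshold: float  # assumption: higher values are worse

    @property
    def breached(self) -> bool:
        return self.value > self.threshold


def breached_kris_by_service(service_map: Dict[str, Set[str]],
                             kris: List[KRI]) -> Dict[str, List[str]]:
    """For each service, list the breached KRIs on the resources it relies on."""
    return {
        service: [k.name for k in kris if k.resource in resources and k.breached]
        for service, resources in service_map.items()
    }


# Invented service mapping: two services share one critical resource.
service_map = {
    "payments": {"core banking platform", "payment gateway supplier"},
    "online account opening": {"core banking platform", "identity check supplier"},
}
kris = [
    KRI("Platform unplanned downtime", "core banking platform", value=6, threshold=4),
    KRI("Supplier SLA misses", "payment gateway supplier", value=1, threshold=3),
]
print(breached_kris_by_service(service_map, kris))
```

Because both services rely on the same platform, a single breached indicator surfaces against each of them, which is one way the interdependencies noted above can become visible in reporting.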

Section 8 - Implementation challenges

Introducing the concept of operational resilience and implementing structures that support the management of it as an outcome is challenging. The temptation is to design a new framework and present it when completed; however, this will generally lead to resentment (‘yet another risk task to be completed on top of my day job’) and make it difficult to embed. Early engagement with key stakeholders is important, as are training and communication that highlights the benefits of adopting a service focussed approach to operational risk management.

Implementation and embedding will be made easier if the following are considered from the outset:
• Design for simplicity – operational resilience requires the organisation to be agile and forward-looking, able to react to a changing environment and threats. An overly complicated and bureaucratic approach to managing for resilience will hamper this
• Clear roles and responsibilities – agree on these at the outset. This should include agreeing on principles for ownership that will help engagement and embedding, e.g. that the service owner needs to be close enough to operations to be able to react quickly to an outage or issue and is empowered to do so
• Collaborative approach – operational resilience is an organisation-wide responsibility, so governance arrangements should support collaborative working across risk, operational and functional areas. Section 3.5 of the IOR’s Sound Practice Guidance on Operational Risk Governance outlines such a ‘blended’ approach
• Data-driven – the approach to managing operational resilience should wherever possible be data-driven. This will make it agile and easier to embed, as the benefits to (customer) service of operational resilience activities can be quantified

Section 9 - Conclusions

Operational resilience is an outcome, not a risk. Covid-19 and the speed of technological change across all industries have brought it, and the importance to organisations of managing resilience in a more agile and service focussed way than previously, into the spotlight.

There is no cross-industry consensus as to how operational resilience management should be implemented; however, the Institute recommends that operational resilience is part of the organisation’s Operational Risk Management Framework, not a separate risk management silo. Organisations should look to their existing risk management practices in the first instance, as they already contain much of the information and the proactive and reactive risk management practices required to manage operational resilience.

This is an evolving discipline, and operational resilience management frameworks and policies should be designed to support agile decision making and be adaptable to respond to emerging threats, issues and regulations.

This paper should provide the operational risk practitioner with an understanding of the key foundational items and activities required to manage operational resilience and practical guidance on their content and implementation. Practitioners should also refer to the IRM’s paper Organisational resilience: a Risk Managers Guide for practical guidance and insight on organisational resilience in practice.

[1] ISO 22316 (2017): Security and resilience – Organizational resilience – Principles and attributes
[2] PRA - Operational Resilience Part 1
[3] PRA - Operational Resilience Part 1
[4] CP29/19 Operational resilience: Impact tolerances for important business services
[5] There is an increasing amount of regulation that deals with these topics and needs to be considered – for example, the PRA’s consultation paper on outsourcing and third-party risk management (CP30/19) and the European Commission’s draft regulation on digital operational resilience (DORA)

www.theirm.org

Developing risk professionals