
2 Incident management process


2.1 Overview and process diagram

The process of incident management is shown in Figure 1.

Incident management is one of the processes most visible to the user and as such represents an excellent opportunity to promote a positive view of the [Service Provider] amongst the user population.

The process of incident management is delivered via the Service Desk function.

Typically, incidents will be reported to the service desk by users via telephone. Once the identity of the user has been confirmed, the details of the incident will be recorded in the IT service management system by the Service Desk Analyst.

The incident will then be categorised and prioritised in conjunction with the user. An attempt will be made to resolve the incident at first point of contact, failing which it will be escalated to an appropriate support group until resolution is achieved.

The incident will be resolved by the support group that believes it has provided the fix and successful resolution will be confirmed by the service desk prior to closure.

Major incidents will be managed by a separate but related process which provides for more urgent allocation of resources and more formal communication procedures. See the document Major Incident Management Procedure.

Figure 1: Incident management process

Source: ITIL Service Operation Book 2011. Copyright © AXELOS Limited 2011. Reproduced under license from AXELOS

2.2 Process triggers

The incident management process is initiated as a result of one or more of the following triggers:

• An event occurs which is recognised by the event management process as an incident requiring investigation
• An interaction from a user of a service within scope via one of the following methods:
  o Telephone
  o Email
  o Web portal
  o In-person visit
  o Note: additional methods may be introduced as technology evolves
• Notification from another part of the support organization, including second- and third-line teams and suppliers

2.3 Process inputs

The process of incident management requires a number of inputs in order to be able to function effectively. These may not always be available but will ideally be:

• Information about planned service outages
• Details of the scope and timing of changes being implemented under change management (including those for which an outage is not expected)
• Access to the problem management and known error databases, including workarounds
• Access to the Configuration Management System (CMS)
• Key business operation cycles e.g. peak times at which incidents may become more urgent
• Organizational and contact information
• Availability of key support resources
• Service Level and Operational Level Agreements
• Filtered event information from monitored Configuration Items (CIs)
• User and customer communication about incident symptoms, priority and satisfaction levels

2.4 Process activities

The individual process activities at each step are detailed as follows.

2.4.1 Incident identification

There are two main methods by which incidents will be detected: via event monitoring and by the user. One of the goals of effective service management is to detect and fix incidents before they impact the user but in many cases this may not be possible.

The identification of incidents as part of event management is covered in the document Event Management Process. Not all events that are generated will result in an incident being logged and the criteria used to make this decision will be fine-tuned over time.

For incidents reported by users over the telephone there is an opportunity for the service desk to ensure that sufficient detail is obtained so that a good explanation of the incident can be recorded within the service desk system. If the incident is reported via self-service or by email, however, the user may not provide all the required detail. The service desk will then need to contact the user and request the additional information.

2.4.2 Is this really an incident?

It is important that only items that meet the definition of an incident are logged and managed via the incident management process. Other types of communication from users such as service requests and change proposals will be subject to different SLA metrics and, if processed via the incident management system, will affect the validity of the reports produced. Service requests will be routed through the request fulfilment process and change proposals through the change management process.

If a service request has already been logged in the incident management system by a user (perhaps via self-service) then the incident record will be closed, and a service request logged instead. The user must be informed of this action. The closure category of such incidents will indicate that it was logged in error and these incidents will be omitted from SLA performance reports. They may however be reported on separately in order to gain a clear picture of how often this is happening and as input to service improvement activities.

In the event of a change proposal being logged in error by a user, the service desk will inform the user of this fact and request that the user log the change proposal via the correct channel. This will give the user the opportunity to provide the higher level of detail that will be required of a change proposal compared to an incident. The incorrectly logged incident will then be closed by the service desk using an appropriate closure category that puts it outside of SLA reporting. Again, such instances should still be reported on for service improvement purposes.
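The reporting rule described above (records logged in error are left out of SLA performance reports but still counted separately) could be sketched as follows. This is a minimal illustration only: the closure category names and the dictionary shape of an incident record are assumptions, not values defined by this process.

```python
from collections import Counter

# Hypothetical closure categories marking records that were never true incidents.
LOGGED_IN_ERROR = {
    "Logged in error: service request",
    "Logged in error: change proposal",
}


def split_for_reporting(incidents):
    """Separate genuine incidents from records logged in error.

    Each incident is assumed to be a dict with a 'closure_category' key.
    Returns a tuple: (incidents that count towards SLA reports,
    Counter of misrouted records per closure category).
    """
    sla_incidents = []
    misrouted = Counter()
    for incident in incidents:
        category = incident["closure_category"]
        if category in LOGGED_IN_ERROR:
            misrouted[category] += 1
        else:
            sla_incidents.append(incident)
    return sla_incidents, misrouted
```

The separate counter is what would feed service improvement activities, for example to show how often users log service requests through the incident channel.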

2.4.3 Incident logging

It is a fundamental principle that all incidents must be logged. This will allow accurate reporting of incident volumes and performance against the SLA and provides clearer justification of the IT resources allocated to incident management. In the event that an IT analyst attends onsite to resolve one incident and is asked whilst there to resolve others, then all of the incidents reported (whether resolved at the time or not) must be individually logged.

An Incident will be recorded against the name of the user that raises it. For telephone and face-to-face interactions the contact details held within the service desk system should be confirmed with the user in order to capture changes of location, department etc. in a proactive way.

Whichever method is used to report the incident, the service desk will have available a set of scripts which set out the information required according to the subject of the incident. For example, for a situation where the user cannot print, the name of the printer will be required.

Incident models will be used where appropriate to automate the prompting of information required and the setting of incident category and priority. Where the user is on the telephone, these details should be checked (particularly priority) to ensure that the business need is being met.

2.4.4 Incident categorisation

Three levels of categorisation will be used for incidents. Where an incident model is used, these will be set automatically according to the specific model selected. Category hierarchies will be available within the service desk system and will be reviewed on a regular basis as part of process improvement activities. Changes to the categories will be managed carefully so that the implications to SLA reporting and problem management are understood and catered for accordingly.

The process manager will review on a regular basis the use of categories in logging incidents to ensure that they are used consistently by all parties, particularly by first line, but also by second and third-line teams.

The initial incident categorisation should be maintained throughout the lifecycle of the incident; in the event that it becomes clear that the incident should have been categorised differently then this will be reflected in the closure categorisation which should be a separate field on the service desk system.
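One way to picture the separation between the initial and the closure categorisation is a record that holds both as distinct three-level fields. The sketch below is illustrative only; the class name, field names and example categories are assumptions and do not describe any particular service desk system.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# A category is assumed to be a (level 1, level 2, level 3) tuple.
Category = Tuple[str, str, str]


def _check(category: Category) -> None:
    """Reject categories that do not have exactly three non-empty levels."""
    if len(category) != 3 or not all(category):
        raise ValueError(f"expected three non-empty levels, got {category!r}")


@dataclass
class IncidentCategories:
    initial: Category                   # set at logging, then left unchanged
    closure: Optional[Category] = None  # set separately at resolution

    def __post_init__(self) -> None:
        _check(self.initial)

    def close_with(self, category: Category) -> None:
        """Record the closure category without touching the initial one."""
        _check(category)
        self.closure = category

    @property
    def recategorised(self) -> bool:
        """True when the closure category differs from the initial one."""
        return self.closure is not None and self.closure != self.initial
```

Keeping both values makes it possible to report on how often the first-line categorisation turned out to be wrong, which supports the regular category reviews mentioned above.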

2.4.5 Incident prioritisation

The priority of an incident will determine the order in which it is addressed by the service desk and subsequent teams involved in its resolution. This will be based on a combination of two factors:

• Impact: A measure of the effect of an incident on business processes
• Urgency: A measure of how long it will be until an incident has a significant impact on the business

Both impact and urgency will be assessed on a scale of high, medium and low. The priority of an incident will then be calculated based on the rating of its urgency and impact as follows:

IMPACT/URGENCY    High    Medium    Low
HIGH              1       2         3
MEDIUM            2       3         4
LOW               3       4         5

Table 1: Determination of priority

The priority of an incident will be calculated automatically by the service desk system based on the above rules.
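The automatic calculation can be illustrated with a simple lookup that mirrors Table 1. This is a sketch rather than the behaviour of any specific service desk system; the function and constant names are hypothetical.

```python
# Table 1 expressed as a lookup keyed by (impact, urgency).
PRIORITY_MATRIX = {
    ("high", "high"): 1, ("high", "medium"): 2, ("high", "low"): 3,
    ("medium", "high"): 2, ("medium", "medium"): 3, ("medium", "low"): 4,
    ("low", "high"): 3, ("low", "medium"): 4, ("low", "low"): 5,
}


def calculate_priority(impact: str, urgency: str) -> int:
    """Return the priority (1 to 5) for the given impact and urgency ratings."""
    key = (impact.strip().lower(), urgency.strip().lower())
    if key not in PRIORITY_MATRIX:
        raise ValueError(f"impact and urgency must be high, medium or low, got {key!r}")
    return PRIORITY_MATRIX[key]
```

For example, `calculate_priority("High", "Medium")` returns 2. Because the matrix is symmetric, swapping impact and urgency gives the same priority.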

The definitions of each priority level are as follows:

PRIORITY  TITLE     DESCRIPTION

1         Critical  Significant disruption to the business
                    Examples:
                    • All communications links to Site A are down
                    • Transaction-processing application is unavailable to all users
                    • E-Commerce website is unavailable to customers

2         High      Significant disruption to parts of the business
                    Examples:
                    • Warehouse running at reduced capacity
                    • One floor of office building without IT access
                    • Non real-time system unavailable

3         Medium    Localised disruption affecting one or more users
                    Examples:
                    • Single user unable to work
                    • System running slowly for a few users
                    • Intermittent network problems

4         Low       Localised inconvenience affecting single user
                    Examples:
                    • User unable to print to specific printer
                    • Power supply failed on PC
                    • Desktop software corruption issue

5         Planning  Very minor inconvenience or non-urgent query
                    Examples:
                    • Bug in infrequently used software
                    • PC runs slowly on occasion
                    • Personal printer has intermittent fault

Table 2: Priority definitions

The examples given in Table 2 are for guidance only; there may be circumstances where an incident affecting a single user has a significant business impact. The priority should therefore be set in consultation with the user.

2.4.6 Major incidents

In the event that an incident has sufficient impact and urgency to be classified as a priority 1, the Major Incident Management Procedure will be invoked. This procedure provides for the appointment of a major incident manager who will ensure that all necessary resources are allocated to the resolution of the incident and that appropriate communication channels are opened to keep the business and IT management informed.

In the event of doubt regarding whether an incident should be a priority 1 or whether the major incident procedure should be invoked, the Service Desk Supervisor should be consulted.

In the course of a major incident the activities set out in this incident management process will still largely be followed, including the regular updating of the incident record, but additional activities (such as the consideration of invoking the service continuity plan) will be carried out in parallel.

2.4.7 Initial diagnosis

If, after prioritisation, it is established that the incident is not major, then the service desk analyst will attempt to resolve the user’s issue immediately, if possible, using remote control tools.

The service desk knowledgebase may be referenced for articles that detail resolutions for incidents with similar symptoms. Known errors and workarounds that relate to problems that have been diagnosed but not yet fixed may also be found in the Known Error Database (KEDB).

Once an Incident has been logged, all activities performed with respect to that incident should be recorded as actions against the incident record e.g. adding notes, referring to supplier. Where appropriate, the option to send an update email to the user should be selected in order to keep the user informed of progress at all times.

If the incident can be resolved by the service desk analyst whilst the user is still in contact e.g. on the telephone, then successful resolution should be confirmed with the user and the incident resolved and then closed.

If the incident cannot be resolved while the user is on the phone, the service desk analyst will inform the user of the reference number of the incident and the target resolution time and end the call. It may be that first line is able to resolve the incident without escalation to additional support teams, in which case it will be processed, resolved and closed in the normal way.

2.4.8 Escalation

Incidents which cannot be resolved by the service desk, possibly through lack of knowledge or access, should be escalated to the most appropriate second-line team. This is functional escalation. The starting point for the decision regarding which team to escalate to will be the categorisation of the incident. A list of categories and the teams to escalate to will be held and maintained by the service desk with regular input and review from other teams.
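The category-to-team list could be represented as a lookup that prefers the most specific matching category. The sketch below is an illustration under assumptions: the categories, team names and the fallback of keeping the incident with the service desk are all hypothetical.

```python
# Hypothetical routing list: category prefix -> second-line team.
ESCALATION_ROUTES = {
    ("Hardware",): "Desktop Support",
    ("Hardware", "Printer"): "Print Services",
    ("Network",): "Network Team",
    ("Software", "Email"): "Messaging Team",
}

# Assumed fallback when no route matches: the incident stays with the service desk.
DEFAULT_TEAM = "Service Desk"


def escalation_team(category):
    """Return the team for the most specific matching category prefix.

    `category` is a tuple of up to three levels, e.g. ("Hardware", "Printer", "Paper jam").
    """
    for depth in range(len(category), 0, -1):
        team = ESCALATION_ROUTES.get(tuple(category[:depth]))
        if team is not None:
            return team
    return DEFAULT_TEAM
```

Matching from the most specific level downwards means a new specialist team can be added for one sub-category without changing the route for the rest of its parent category.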

Details of the methods and timescales for escalation will be documented in operational level agreements (OLAs) with each separate group. Where a second-line support group is external to the organization this information will be agreed in an underpinning contract (UC). Copies of all relevant OLAs and UCs will be held within the service desk knowledgebase and where possible incorporated into the configuration of the service desk system.

In line with ITIL best practice, the ownership of an escalated incident will remain with the service desk as the primary interface with the user.

In addition to the hierarchic escalation involved in a major incident, there may be occasions when an incident of lower priority than 1 needs to be escalated to management. This may, for example, be when a priority 2 incident has exceeded its SLA timescale. Hierarchic escalation allows for communication to management of a potential issue and for the allocation of additional resources if they are felt to be warranted. Escalation via email will be achieved automatically via the service desk system.

2.4.9 Investigation and diagnosis

Beyond initial diagnosis, further investigation and diagnosis may be carried out either by the service desk or by a second- or third-line team. When working on an incident, analysts will temporarily assign it to themselves so that other analysts are aware that they are working on it and there is no duplication of effort.

If the current support team cannot resolve the incident the analyst may opt to escalate it further to an external supplier. If the external supplier has no access to the service desk system and no electronic interface is in place between [Service Provider] system and that of the supplier, the incident remains assigned to the internal team that escalated it. In this case it is the analyst’s responsibility to ensure that the incident is updated on a regular basis based on feedback from the external supplier.

During the diagnosis and resolution of an incident there are a number of actions that should always be carried out as a matter of course by each member of [Service Provider]. These are:

• Always update the incident record with actions carried out, even if it is only the fact that the user was telephoned without success. This shows activity on the incident which is useful to the service desk if the user calls for an update
• Always set the incident to the resolved status as soon as this is thought to be the case so that a fair picture is obtained from later service level reporting
• Be descriptive when entering the resolution text so that the incident record may be useful in future as part of the knowledge base. “Fixed” or “Resolved” is not acceptable
• Make sure the incident is set to the correct status at all times, including when waiting for feedback from the user as the clock will be stopped at this point

Following these rules will mean that a clear, concise picture of the status of every incident is available at all times.

The length of time it takes to resolve an incident is of course a key metric when determining the level of service being delivered. There are times however when the duration of the incident is affected by circumstances outside the control of [Service Provider] and therefore the elapsed time does not represent a true picture of service provided. In this situation it is acceptable to stop the clock and restart it again when control is once again within [Service Provider].

These circumstances are:

• Waiting for further information from the user without which diagnosis and resolution of the incident cannot progress e.g. which printer is not working, or what the user means by “broken”
• Waiting for the user to test whether the incident is resolved

Waiting for information or assistance from a third-party supplier will not stop the clock, as this is part of the service provided by [Service Provider]. Contracts with suppliers will be aligned with the SLA so that they provide support within the SLA timeframe agreed with the users.
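The stop-the-clock rules above can be sketched as a calculation over an incident's status history. This is a simplified illustration: the status names are hypothetical, and it measures plain elapsed time without accounting for business hours, which a real SLA calculation would normally need.

```python
from datetime import datetime, timedelta

# Hypothetical statuses during which the SLA clock does not run.
# Waiting on a supplier is deliberately absent, so the clock keeps running.
CLOCK_STOPPED = {"awaiting_user_info", "awaiting_user_test", "resolved", "closed"}


def sla_elapsed(history, now):
    """Return the time counted against the SLA.

    `history` is a list of (timestamp, status) tuples in ascending time order;
    each status applies until the next entry, and the last one until `now`.
    """
    elapsed = timedelta()
    # Pair each status change with the time of the next one (or `now` for the last).
    end_times = [timestamp for timestamp, _ in history[1:]] + [now]
    for (start, status), end in zip(history, end_times):
        if status not in CLOCK_STOPPED:
            elapsed += end - start
    return elapsed
```

As a usage example, an incident that is open for one hour, waits two hours for the user, then waits one hour for a supplier before being resolved would count two hours against the SLA.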

2.4.10 Resolution identified?

Based on the investigation and diagnosis carried out, a potential resolution to the incident may be identified and this may need to be tested and confirmed in the next step of the process.

However, if no potential resolution has been found a further step of functional escalation may be required. This could be to the next level in the hierarchy of technical teams or could be to another team at the same level e.g. if the server team has ruled out a server issue and passes the incident to the networking team so that possible network causes may be investigated.

In addition, hierarchical management escalation may be required if the incident is proving difficult to resolve and all existing avenues of support have been explored. It may be necessary to involve a specialist external organization which is not part of the normal escalation process.

2.4.11 Resolution and recovery

Once a potential resolution has been identified, this must be tested and implemented. If the required resolution action comes under the scope of the change management process, then a change request must be raised and assessed in the normal way before implementation. If the priority of the incident justifies it then an emergency change may be raised so that the corrective action can be applied as soon as possible.

Sufficient testing must be carried out post-resolution to confirm that the action has had the desired effect. Once satisfactorily resolved, the incident record must be updated with the details of the resolution (including the correct closure category) and passed back to the service desk for closure.

2.4.12 Incident closure

Before setting the incident to the closed status, the service desk should contact the user to confirm that the user also considers the incident resolved. If the user cannot be contacted after reasonable efforts (which should be recorded on the incident record), the incident may be set to closed without user confirmation.
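The closure rule can be expressed as a small guard, sketched below. The record shape and status names are hypothetical, and the threshold of three recorded contact attempts is purely an assumption standing in for "reasonable efforts", which this process does not quantify.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: three recorded attempts count as "reasonable efforts".
REASONABLE_ATTEMPTS = 3


@dataclass
class Incident:
    status: str = "resolved"
    user_confirmed: bool = False
    # Notes of attempts to reach the user, as recorded on the incident.
    contact_attempts: List[str] = field(default_factory=list)


def close_incident(incident: Incident) -> bool:
    """Close a resolved incident if the closure rule is satisfied.

    Returns True and sets the status to 'closed' when the user has confirmed
    resolution, or when enough contact attempts have been recorded.
    Returns False (leaving the incident unchanged) otherwise.
    """
    if incident.status != "resolved":
        raise ValueError("only resolved incidents can be closed")
    enough_attempts = len(incident.contact_attempts) >= REASONABLE_ATTEMPTS
    if not (incident.user_confirmed or enough_attempts):
        return False
    incident.status = "closed"
    return True
```

Requiring the attempts to be present on the record, rather than accepting a simple flag, reflects the requirement that contact efforts be recorded before closure without confirmation.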

The user will be automatically emailed to inform them of resolution and closure and a link to a satisfaction survey will be included in X% of cases.

If the incident was resolved without the root cause having been identified (e.g. the server was rebooted, and it seemed to work afterwards) then it may be appropriate for a problem record to be raised so that the underlying root cause can be investigated. In this case the new problem record should be linked to the relevant incident record(s) so that a history is available to problem management.

2.5 Process outputs

The outputs of the incident management process will be the following:

• Resolved incidents
• Complete and accurate incident records
• Feedback from customers and users regarding levels of satisfaction
• Communication and feedback to other service management processes such as business relationship management, service level management and problem management
• Reports to management regarding incident volumes and process effectiveness
• Problem records for incidents where no root cause was identified

2.6 Incident management tools

There are a number of key software tools that underpin an effective incident management process. These are subject to change as requirements and technology are updated and so specific systems are not described here. However, the main types of tools that play a significant part in the process within [Organization Name] are as follows.

2.6.1 Service desk system

The service desk system provides the workflow engine and database to implement the core activities within incident management. These include:

• Incident logging
• Routing and assignment of incidents to teams and individuals
• Recording of actions against incidents
• Updating of incident status from open through to closed
• Definition and selection of incident models
• Assessment of impact and urgency and auto-calculation of priority
• Email communication with users from within incident records
• Incident categorisation to multiple levels
• Reporting
• Definition of SLA targets for incidents
• Automated incident escalation according to SLA
• Provision of self-service interface for users to log incidents and view status of open incidents
• Search capabilities for previously encountered incidents
• Known Error Database (KEDB) accessible from within an incident
• Facility to create incident records from an email mailbox

The service desk system is integrated with the systems that support various other processes, including problem, change and configuration management.

2.6.2 Telephony

As well as providing the basic functionality to make and receive telephone calls, the telephony system allows for the following additional features:

• Recording of front-end messages for users
• Call queuing
• Allocation of calls to service desk analysts according to predetermined rules
• Ability to log in and out of the system to signify availability
• Allowance for call wrap-up time
• Real-time monitoring of call volumes and waiting times
• Reporting on call volumes, length, sources and other parameters

The telephony system is also integrated with the service desk system to provide automation capabilities based on the calling number (CTI).

2.6.3 Remote control

The remote-control tool provides a method for the service desk analyst to take control of the keyboard and mouse of the user’s computer and see what the user sees. This is extremely useful in shortening the time taken to resolve incidents as it removes the need for the user to describe what they are seeing and gives the service desk analyst more certainty about what actions have been taken to resolve the incident.

2.6.4 Email

The email system is key to communication between the user and the service desk. In addition to allowing users to email in their incidents, it also provides an easily automated method of keeping users informed of the progress on the incidents they have logged.

2.6.5 Intranet

The intranet provides a window for the user into the IT organization in general and the service desk in particular. The self-service portal is accessible from the intranet and can be supported by helpful articles on how to resolve common incidents without contacting the service desk at all, thus speeding up resolution time for the user.

Reports on incident management performance against the SLA can also be made available via the intranet.

2.6.6 Configuration management system

The CMS provides real-time information about the hardware and software within the IT environment and allows the service desk and other teams to view any changes that have been implemented on key components that are under consideration with regard to an incident. It allows the installed software and its versions to be viewed without the need to access the user’s computer remotely as well as helping the service desk analyst understand the relationships between service components.

2.7 Communication and training

The incident management process is all about communication and there are various forms of this that must take place for the process to be effective.

These are described below.

2.7.1 Communication with users

In addition to the initial communication that will take place when an incident is reported by a user, it is vital that a two-way dialogue with the user is maintained regarding the progress of an incident, in particular:

• Whether the user needs to do anything to assist e.g. test a resolution or provide further information
• Whether the target timescale for the resolution of the incident is likely to be met – if not, the user may need to make alternative arrangements and as much notice of this as possible is likely to be appreciated
• If the user finds that the incident has resolved itself and no longer needs to be investigated (although a problem record may need to be raised)
• If the user will be unavailable for a significant period of time perhaps due to holiday or assignment

Emails that are exchanged with the user should be incorporated into the incident record so that a full audit trail of all communication is kept and is available to whoever is working on the incident.

2.7.2 Communication between shifts

Although accurate and complete status information should be available as part of each incident record, it is also important that any key items of information should be passed on when service desk (and other support team) shifts change. This should include notification of ongoing major incidents, their status and next actions and a general summary of the position of incident management at the end of the shift e.g. unusual call volumes, specific types of incident being logged more frequently than usual, or priority activities that were not able to be completed by the previous shift.

2.7.3 Process performance

It is important that the performance of the incident management process is monitored and reported upon on a regular basis in order to assess whether SLAs are being met and whether the process is operating as expected. The content of performance reports is set out in section 6 of this document, but it is vital that the reports are not only produced but are also communicated to the appropriate audience.

This will include the customers of the IT service with regard to SLAs and the management of IT concerning resource utilisation and allocation. Depending on the health of the process it may be appropriate to hold regular meetings with customers and IT management to discuss the performance and agree any actions to improve it.

2.7.4 Communication related to changes

Changes to the IT environment can have a significant impact on the delivery of an effective IT service and the service desk must be aware of any changes that are due to take place prior to them happening. This will allow the incident management process to diagnose related incidents more quickly and so potentially minimise disruption to users.

The incident management process manager must have visibility of the change management schedule as a minimum and ideally will be briefed on any changes with the potential to give rise to incidents. This may be a regular meeting or carried out on an ad-hoc basis according to the frequency of occurrence of such changes.

2.7.5 Major incident communication

The incident management process includes the opportunity to recognise that a major incident has occurred or is likely to occur and will communicate with the designated major incident manager from the initiation of the major incident management procedure through to its conclusion and beyond.

This communication will largely be under the control of the major incident manager, but updates should be given by the service desk to the individual in that role if things change e.g. if similar incidents begin to be reported from new areas of the business.

2.7.6 Training for incident management

In addition to a well-defined process and appropriate software tools it is essential that the people aspects of incident management are adequately addressed. The process requires that training be provided to all participants in order that it runs as smoothly as possible.

The main areas in which training will be required for service desk and other IT support staff are as follows.

• The incident management process itself, including the activities, roles and responsibilities involved
• Incident management tools such as the service desk system, remote control and telephony
• Soft skills such as customer service, dealing with difficult conversations and avoiding technical jargon
• The basics of the technology and how it is implemented within [Organization Name]
• The business, its structure, locations, priorities and people

In addition, training should be provided to the user population regarding how to access the IT support function, including:

• The difference between an incident, a service request and a change proposal and how they are handled
• How to log an incident via the various means available
• Use of the self-service portal, including logging, updating and tracking an incident
• Use of the self-help service available via the intranet

This training may be provided via short workshops and supplemented by on demand resources such as videos and user guides.
