Major IT Incident Management Process
Version 3.3, 25/07/2013
Service Process team
Information Systems and Services

What is a Major Incident?
A major incident is an incident with a high impact, or potentially high impact, which requires a response that is above and beyond that given to normal incidents. Typically, these incidents require inter-team coordination, management escalation, quick mobilisation of resources and increased communications.

Purpose
The purpose of the Major IT Incident Process is, in the event of a major incident, to:
- establish clear ownership (accountability) and working group (responsibilities) for the incident
- establish a clear communications plan to operate for the duration of the incident
- minimise the impact to the University
- restore normal service as quickly as possible
- identify “lessons learned”
- identify areas for further investigation in order to prevent a recurrence

How do we identify a Major Incident?
Note: Please also refer to the relevant service definition(s) in the Service Catalogue.
If the answer to any of the following questions is “yes”, it’s a Major Incident:
- Are more than the acceptable number of users (as stated in the service definition) affected?
- Is there a significant risk to the University’s finances, reputation, security or health and safety?
- Are we in a critical period for the affected service?
- Will a key service be unavailable for longer than the stated acceptable period (as stated in the service definition)?
- Has a member of the Directorate team or the Service Owner deemed this to be a Major Incident?
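
The "any yes" rule above can be illustrated with a short sketch. This is only an illustration, not part of the process or of any Service Management tool: the class and field names are hypothetical, and the answers themselves still have to come from the service definition and from staff judgement.

```python
from dataclasses import astuple, dataclass


@dataclass
class IncidentAssessment:
    """Answers to the five identification questions (hypothetical field names)."""
    users_affected_exceeds_acceptable: bool = False
    significant_risk_to_university: bool = False
    in_critical_period: bool = False
    outage_exceeds_acceptable_period: bool = False
    deemed_major_by_directorate_or_owner: bool = False


def is_major_incident(assessment: IncidentAssessment) -> bool:
    """Return True if the answer to any of the questions is "yes"."""
    return any(astuple(assessment))
```

For example, `is_major_incident(IncidentAssessment(in_critical_period=True))` returns `True`, because a single "yes" is enough; an assessment with every answer "no" returns `False`.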

The Process

Scope
This process relates to all incidents which occur during staffed Service Desk hours: currently 08:00–17:00, Monday to Friday, excluding University closure periods.

When a Major Incident is identified
1. If the Major Incident is identified outside of the Service Desk then the Service Desk should be contacted on 8085.
2. The incident is logged by the Service Desk and declared as a major incident.
3. If the Service Owner or Service Lead are not the identifiers of the Major Incident then they should be notified by telephone. Where a duty phone is in use, this should be the first means of communication. If staff cannot be contacted on their office number, mobile numbers will be tried. Failing that, someone should physically go to find them.
   Note: If the Service Owner or a Service Lead cannot be contacted for any reason, the issue will be escalated to the relevant Assistant Director who will then assume the role of Service Owner.
4. In all cases an email will be sent to the Service Owner advising them that the incident is in progress.
   Note: If the Service Management tool is unavailable, the process should still be followed on paper and a retrospective record must be created.
   Note: On some occasions, an incident may have been incorrectly identified as being a Major Incident; the Service Owner has the right to reverse any such judgement.

Once the Major Incident has been logged
5. When a Major Incident is declared in NU Service an email will be sent automatically to all ISS staff stating the incident number and summarising the issue.
6. The Service Owner assumes leadership for the incident or assigns an Incident Lead.
7. The Incident Lead forms a Major Incident team comprising staff with appropriate skills, knowledge and capabilities from whichever ISS teams are necessary; this team must include:
   - The Service Desk Team Leader or a member of the Service Desk team
   - The Service Process Manager or a member of the Service Process team
8. A brief initial Major Incident team meeting is held to:
   - Ensure that the Major Incident team is aware of the impact and priority of the incident.
   - Establish a plan of action, ensuring that the main focus is on minimising disruption, restoring normal service as quickly as possible and providing useful and timely communications to customers.
   - Agree a schedule for keeping the Service Desk updated during the Major Incident.
   - Nominate a Communications Lead to be responsible for updating the incident record and communicating with the Service Desk.
   Note: At the Service Owner or Incident Lead’s discretion, staff can be working towards a fix whilst the meeting is in progress; it is not necessary to delay working on the incident until after a meeting has been held.
9. The Service Desk Team Leader is responsible for all information dissemination to customers (and other stakeholders) and for notifying senior management if appropriate.
   Note: Use of the University’s staff-announce mailing list, which is received by all members of staff, is restricted. It will only be used with valid reason and on approval of a member of the Directorate team.
10. If necessary, a follow-up email is sent by the Service Desk to iss-all@ncl.ac.uk with more details of the issue and the impact of the incident.

During the Major Incident
11. The Major Incident team works to resolve the incident.
12. The Communications Lead regularly updates the incident record in line with the timescales agreed in the initial meeting of the Major Incident team. If the Communications Lead physically cannot update the incident record (for example, if they are away from a computer dealing with the issue), they should provide updates to the Service Desk on 8085 and the Service Desk will annotate the incident record appropriately.
    Note: Even if there is no progress with the incident, the incident record must be updated at the agreed frequency.
    Note: The incident record is the definitive source of information for all ISS staff.
13. The Service Desk Team Leader remains responsible for on-going customer communications.
    Note: Where appropriate, members of the Directorate team, Business Relationship Managers or Service Owners should also be involved in customer communications.

Immediately following resolution of the Major Incident
14. The Incident Lead ensures that all of the Major Incident team is aware that the Major Incident is resolved – the ‘stand-down’ notification. This must happen immediately or as soon as possible after the incident is considered to be resolved.
15. The Communications Lead must inform the Service Desk (via 8085) and the Service Process team that the incident is resolved.
16. Service Desk communicates the resolution as appropriate, including confirming the resolution of the incident with specific customers, if necessary.
17. The Service Desk resolves the incident record and sends an email to iss-all@ncl.ac.uk confirming resolution.
18. Any outstanding issues which are not urgent are logged as regular incidents.

Major Incident Review
19. The Incident Lead and the Service Desk will write an overview of events, using the Major IT Incident Review template, for the Service Process team, within two working days.
20. The Service Process Manager will then coordinate a post-incident review meeting of all relevant staff (including a member of the Service Desk).
    Note: Under certain circumstances, a review meeting may be deemed unnecessary based on the following criteria:
    - Small number of users actually affected
    - Short duration of service outage at non-critical time
    - No significant impact to the University’s finances, reputation, security or health and safety
    - The cause is understood
    The Service Owner or a member of the Directorate team always has the right to request a review meeting.
    Note: The incident record and Service Owner’s overview will be used as the basis for discussion and review.

After the Review
21. A report, following the Major IT Incident Review template, will be circulated to the Directorate team and made available to ISS staff within six working days of the meeting. Prior to the report being published, those present at the review meeting will have had the chance to give feedback on a draft report.
22. The Service Process team will add the actions from the report to an actions log, which will be available to be viewed by all of ISS.
    Note: The person responsible for each action is responsible for keeping the actions log updated with the status of their actions.
23. The Directorate team will regularly review the actions log and the recommendations from the reports. Decisions on recommendations should be fed back to the staff involved in the meeting that produced the recommendations.

Major Incident Review timetable

Details | Responsible | Timing
The Incident Lead and the Service Desk will write an overview of events, using the Major IT Incident Review template, for the Service Process team | Incident Lead | Within 2 working days of the stand down of the Major Incident
Major Incident Review meeting | Service Process | Within 5 working days of the stand down of the Major Incident
A draft report will be produced following the Major IT Incident Review meeting and distributed to the relevant participants for feedback and comments | Service Process | Within 2 working days of the Major Incident Review meeting
All feedback and comments on the draft report to be returned to the Service Process Team | Incident Lead | Within 2 working days of the distribution of the draft report
A final report, following the Major IT Incident Review template, will be circulated to the Directorate team and made available to all ISS staff | Service Process | Within 2 working days of the deadline for feedback and comments on the draft report
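
The timetable chains several working-day limits together, so a small worked sketch may help when checking dates. It rests on assumptions that the process does not state: working days are taken to be Monday to Friday, counting starts on the day after the triggering event, each step is assumed to happen on its latest permitted day, and the function names and the optional `closures` set of University closure dates are hypothetical.

```python
from datetime import date, timedelta


def add_working_days(start: date, days: int, closures=frozenset()) -> date:
    """Return the date that falls `days` working days after `start`.

    Assumption: Saturdays, Sundays and any date in `closures` are skipped.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5 and current not in closures:
            remaining -= 1
    return current


def review_deadlines(stand_down: date, closures=frozenset()) -> dict:
    """Latest dates for each timetable step, if every step uses its full allowance."""
    overview = add_working_days(stand_down, 2, closures)
    meeting = add_working_days(stand_down, 5, closures)
    draft = add_working_days(meeting, 2, closures)
    feedback = add_working_days(draft, 2, closures)
    final = add_working_days(feedback, 2, closures)
    return {
        "overview": overview,
        "review_meeting": meeting,
        "draft_report": draft,
        "feedback": feedback,
        "final_report": final,
    }
```

Under these assumptions, a stand-down on Monday 8 January 2024 gives a review meeting by 15 January and a final report by 23 January. That is six working days after the meeting (2 + 2 + 2), which matches the limit stated in step 21.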

Out of Hours Major Incident Process
1. If a Major Incident is identified out of working hours, ISS staff will make reasonable endeavours to resolve the incident and communicate appropriately. The hotline number for informing and updating the NorMAN Out of Hours Helpline about major incidents is: 0191 243 7300.
   Note: This number is for the use of ISS technical staff only in the case of a major incident; it should not be shared outside of ISS. For all other purposes the usual 0191 222 5999 number should be used (and shared with customers).
2. If the Major Incident is not resolved by the start of the next working day, the full Major IT Incident process is invoked.
3. If a Major Incident is fully resolved out of hours, the Service Desk must still be informed at the start of the next working day so that they can provide any necessary communications to customers.
4. A Major Incident Review will still be held unless the Service Process Manager deems it to be unnecessary.
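
Which of the two processes applies depends on whether the incident is identified during staffed Service Desk hours. The sketch below shows one way to express that check. It assumes the staffed hours given in the Scope (08:00 to 17:00, Monday to Friday), treats 17:00 itself as out of hours, and takes University closure periods as a caller-supplied set of dates; the function names and return strings are hypothetical.

```python
from datetime import date, datetime, time

# Staffed Service Desk hours as given in the Scope; the exclusive end is an assumption.
SERVICE_DESK_OPEN = time(8, 0)
SERVICE_DESK_CLOSE = time(17, 0)


def is_staffed(moment: datetime, closure_dates=frozenset()) -> bool:
    """Return True if `moment` falls within staffed Service Desk hours."""
    if moment.weekday() >= 5:  # Saturday or Sunday
        return False
    if moment.date() in closure_dates:  # University closure period
        return False
    return SERVICE_DESK_OPEN <= moment.time() < SERVICE_DESK_CLOSE


def applicable_process(moment: datetime, closure_dates=frozenset()) -> str:
    """Name the process to follow for an incident identified at `moment`."""
    if is_staffed(moment, closure_dates):
        return "Major IT Incident Process"
    return "Out of Hours Major Incident Process"
```

For instance, `applicable_process(datetime(2024, 1, 6, 9, 0))` names the out-of-hours process because that date is a Saturday.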