INNOTRAIN IT
IT Service Management QUICK – SIMPLE - CLEAR Preview
Extract
Chapter 3.2
2011
IT Service Management
Authors
Dr. Mariusz Grabowski, Universität der Wirtschaft Krakau
Dr. Claus Hoffmann, Beatrix Lang GmbH
Philipp Küller, Hochschule Heilbronn
Elena-Teodora Miron, Universität Wien
Dr. Dariusz Put, Universität der Wirtschaft Krakau
Dr. Piotr Soja, Universität der Wirtschaft Krakau
Dr. Janusz Stal, Universität der Wirtschaft Krakau
Marcus Vogt, Hochschule Heilbronn
Dr. Eng. Tadeusz Wilusz, Universität der Wirtschaft Krakau
Dr. Agnieszka Zając, Universität der Wirtschaft Krakau
3.2 Service operations
3.2.1
Service and infrastructure operations
"Help, my screen went black!" We are all familiar with this call from users. Every day, system operation brings a wide variety of such incidents, i.e. deviations from the plan. This chapter explains how to handle these incidents in a structured manner. But first, let's briefly explain the relevant terminology.

The service desk is the central contact for all inquiries from users, according to the principle of "one face to the customer." This provides the customer with a single point of contact for all IT-related inquiries (e.g. hotline, ticket system). In a few cases, companies even expand their service desk to handle other inquiries (such as event management). The purpose of this single point of contact is to handle service requests, record incidents and accept requests for change. The service desk can be understood as a kind of funnel that collects all messages and then steers them to the correct processes. Initially, enquiries of all types are treated as incidents.
Service desk
The service desk is the central function of ITSM. It is the link between the IT service and business operations. All enquiries from employees, and all support provided to them, are handled through this function.
The incident is an unplanned interruption (such as a workstation computer that won't work) or reduction of the quality of a service (like a slow Internet connection). Failures of elements of the configuration (Configuration Items) can be treated as an incident without a direct effect on the service. In this case, one example would be the failure of a mirrored hard drive, where the server is still up and running. In contrast, Service Requests are enquiries from users (with regard to information, consultation, standard changes, access) that do not have an effect on the service. They are one way to satisfy customers' needs. One example is a request for more printer toner when the printer indicates that it will run out soon.
Disruption or incident?
An incident is defined as an IT disruption or IT service enquiry. Examples of an incident could be: "My Excel is crashing" or "I need to create a PDF from Excel. How do I do that?". All incidents should be processed by the central service desk and the status updated to enable later evaluation.
The process for managing and processing incidents is usually known as incident management. It passes through various phases:
1. Entering and classifying the incident
2. Diagnosing the incident
3. Escalating the incident
4. Closing out the incident
The current Version 3 of the IT Infrastructure Library (ITIL) provides a sample process as a best practice. This can provide the basis for a company to develop its own processes, as the requirements differ enormously from company to company. For smaller companies, it is surely good enough to just pick up the phone without any overarching workflow, but in medium to large enterprises, this results in repeated interruptions of employees' core tasks. In this case, it is worthwhile to take the more formal route. In both cases, however, it makes sense to document the incidents and analyse them as part of an improvement process. Ideally, the incidents should be entered into a ticket system, but a simple list might be good enough at the beginning. ITIL recommends the following procedure:
Figure 9 - Incident process based on ITIL V3
Phase I - Identifying, entering and categorising the disruption
The identification of incidents is normally shared among the large number of users. Most deviations can be identified easily, and because of their high relevance, users gladly accept the effort required to report them. In many cases, however, a proactive configuration is responsible for identifying an incident or imminent incident. For example, deviations can be identified and reported on the hardware level (such as the failure of a hard drive in a RAID array). On a system level, the use of monitoring solutions has become widespread. This allows, for example, the function of an e-mail server to be checked by having the monitoring application send an e-mail and measuring how long it takes to receive a response. If a threshold value is exceeded during this process, this is designated as an event and an alarm is triggered.

Depending on the systems used and the corporate culture in place, actually entering the disruption could be the responsibility of the user (in a ticket system, for example) or the service desk agent (for phone calls or e-mails).

In large companies, correctly categorising the ticket ensures that it will be forwarded to exactly the right specialist. However, even in small companies, once a critical mass of disruptions has been reached, categorisation makes it possible to identify weak points and areas for improvement. For example, if a particularly large number of problems with an office application are reported, it may be worthwhile to provide the users with better training or replace the application. Like the entry of the disruption, its categorisation can also be carried out by the user or an IT employee. Based on the collected data, a decision can be made as to whether the entry pertains to a disruption or a service request, which triggers a separate process.
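The e-mail round-trip check described in this phase can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the threshold of one second and the simulated probe are hypothetical, and a real monitoring solution would send an actual test e-mail and wait for the reply instead of sleeping.

```python
import time
from typing import Callable


def check_round_trip(probe: Callable[[], None], threshold_s: float) -> dict:
    """Time one probe run and flag an event if the threshold is exceeded."""
    start = time.monotonic()
    probe()  # hypothetical probe: send a test e-mail and wait for the reply
    elapsed_s = time.monotonic() - start
    return {"elapsed_s": elapsed_s, "event": elapsed_s > threshold_s}


def simulated_probe() -> None:
    """Stand-in for a real mail round trip; it only sleeps briefly."""
    time.sleep(0.05)


result = check_round_trip(simulated_probe, threshold_s=1.0)
if result["event"]:
    print(f"ALARM: mail round trip took {result['elapsed_s']:.2f} s")
else:
    print(f"OK: mail round trip took {result['elapsed_s']:.2f} s")
```

In practice such a check would run on a schedule, and an event would typically be entered as an incident in the ticket system rather than printed.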
Phase II - Prioritising
The priority assigned to an incident specifies how the incident is handled by the employees and tools of the service desk. The prioritising process often holds a large potential for conflict between users and service providers, as by nature users will always assign top priority to their own incidents. Experience has shown that in many cases in which users determine the priority themselves, the priority differs significantly from reality. Therefore, it is better to have the incident classified by the service desk employee, as only he or she has the necessary overview of the current situation in the company. The definition of the priority is based on the effect on the supported business processes and the urgency, i.e. the time remaining until this effect takes hold.

Figure 10 – Example of prioritising incidents
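A common way to derive the priority from these two variables is a small matrix. The following Python sketch assumes a three-step scale for both impact and urgency and five priority levels; this scale and the simple additive rule are assumptions for illustration, not values taken from this guide, and every company should define its own matrix.

```python
from enum import IntEnum


class Level(IntEnum):
    """Shared scale for impact and urgency (an assumed three-step scale)."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> int:
    """Return a priority from 1 (highest) to 5 (lowest)."""
    return int(impact) + int(urgency) - 1


# Print the full matrix, impact in rows and urgency in columns.
for impact in Level:
    row = [priority(impact, urgency) for urgency in Level]
    print(f"{impact.name:<6}", row)
```

Because the service desk employee sets impact and urgency rather than the priority itself, the result stays comparable across incidents.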
Based on the priority assigned to the incident, response times (time until troubleshooting begins) and solution times (time until regular operation is restored) can be defined.

Phase III - Diagnosing and possibly escalating
The objective of the initial diagnostics is to gather all relevant facts (environment data, symptoms etc.). In many cases, this takes place in direct communication between the service desk employee and user. If the problem is simple or known, the employee will try to resolve it immediately. If this is not possible because there is not enough time or the necessary detailed technical knowledge is lacking, the incident has to be escalated for further handling. Two types of escalation can be distinguished:

Functional escalation means passing the incident on to another authority (person or team) with greater experience. The forwarding can be either internal (to in-house IT employees) or external (e.g. to a vendor's support staff). Nowadays, this is often referred to as "second-level support".

Hierarchical escalation refers to notifying and involving higher management levels to support the escalation. In this process, the higher-level manager is called upon to overcome organisational hurdles or mobilise additional resources to solve the problem in a timely manner.

The appropriate specialists now have to create the diagnosis or escalate the incident further until the final diagnosis is reached. Regardless of the escalation level, the service desk remains responsible for the incident, co-ordinates the activities and provides users with regular updates about the progress of their incidents.

Phase IV - Remedying the disruption
Once a diagnosis for the disruption has been identified, it can be remedied and the normal state restored. The solutions should always undergo corresponding testing. For example, printing a test page after removing a paper jam can provide immediate information as to whether other problems exist.
When applications are adapted by means of what are known as hot fixes or patches, the possible interactions should be examined before making them available on a large scale. After a successful resolution and restore, the incident can be closed out. In doing so, the service desk should ensure that the user is satisfied with the solution. In many systems, this is implemented by having the service desk change the status of the incident to resolved, while the user closes out the incident. In many cases, the next step is a brief survey with a few questions (3-5) to evaluate the quality of the service desk.

Problem Management
Incident resolved – is that all there is to it? Of course not. In many cases, the incident is resolved quickly using the process we just described, but the cause is not eliminated and may result in further problems. For example, if paper jams occur frequently in a certain type of printer, this type could have a manufacturing defect or be incompatible with the paper being used. Problem management is concerned with just these kinds of root causes.
Problem
A problem exists when multiple incidents indicate a pattern. Central management of the incidents by the Help Desk allows recurring problems to be identified (e.g. Excel always crashes for user XY whenever he or she has Word open at the same time) and long-term solutions to be found.
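If incidents are recorded centrally, spotting such patterns can be as simple as counting. The sketch below is a hypothetical illustration: the record fields `user` and `category` and the threshold of three occurrences are assumptions, and a real ticket system would offer its own reporting for this.

```python
from collections import Counter


def problem_candidates(incidents: list[dict], min_count: int = 3) -> dict:
    """Return (user, category) pairs that occur at least min_count times."""
    counts = Counter((i["user"], i["category"]) for i in incidents)
    return {key: n for key, n in counts.items() if n >= min_count}


# Hypothetical ticket records; real tickets would carry many more fields.
incidents = [
    {"user": "XY", "category": "Excel crash"},
    {"user": "XY", "category": "Excel crash"},
    {"user": "AB", "category": "Paper jam"},
    {"user": "XY", "category": "Excel crash"},
]
print(problem_candidates(incidents))
```

Grouping by configuration item or application instead of by user works the same way and is often more useful for finding root causes.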
A problem, i.e. a root cause of one or more incidents, is handled by the problem management process in multiple steps. Again ITIL provides an adequate reference process:
1. Identifying the problem; this is done by the employees of the service desk, technical support team or event management.
2. Entering the problem, providing links to the corresponding malfunctions, including a categorisation for later reporting and the prioritisation of the problem, in a way similar to incident management.
3. Diagnosing the problem with the objective of identifying the root cause. If the cause has been identified but no solution is yet available, a workaround (e.g. restart printer) has to be defined. This is entered as a known error and made available to the service desk so that it can remedy the corresponding disruption more quickly.
4. Finding a solution with the objective of implementing it as quickly as possible. However, if a change is necessary for final resolution, this should be done using the procedure defined in the change management system. This structured procedure limits and keeps a check on the possible side effects (for more information, refer to Chapter 5).
Both incident management and problem management are based on identical concepts with regard to personnel and tools. In larger organisations, it is recommended to establish a separate team that runs the service desk function. These organisations can consider concepts such as the centralised or decentralised service desk, virtual service desk (e.g. in collaboration with a supplier) or even corresponding time zone concepts for international companies (follow-the-sun principle). In small IT organisations, the function can also be entrusted to an employee who is responsible for the service desk and is supported by his or her colleagues. Ideally, this should implement the concept of "one face to the customer" or, in other words, one contact person for the user in all matters. It makes it easier for users to communicate with IT, intercepts trivial enquiries directly and enables the remaining employees (e.g. developers or administrators) to concentrate on their core topics.
On the tool side, numerous commercial and open source solutions are available today. Ideally, the service desk should have the following applications available, which are integrated into one solution or linked to each other via logical interfaces:
1. A ticket system that manages and documents a disruption or problem over its entire life cycle. It should also enable communication with the user (e.g. via a Web interface or by e-mail).
2. A database for collecting known errors and solutions (known error database, KEDB). This does not always need to be an elaborate solution. For smaller organisations, a simple list is usually sufficient.
3. A configuration management database (CMDB), a tool that supports many areas. The database supplies data and information about the entire IT landscape and thus helps to understand the context and identify problems more easily. For example, you can read which employee uses which type of printer at his or her workstation. For more information on this topic, refer to Chapter 4.
3.2.1
Systems & outsourced services
Hands on the keyboard: is the heart of your IT still beating? This chapter is all about the heart of information technology – the applications, systems, networks and hardware. However, a wide variety of activities are required in order to set up, maintain and operate this complex configuration. Have you already outsourced everything? Even if you have, this chapter provides valuable information. Before we really get down to business, let's stick with the subject of management for a bit. Many IT folks consider managing availability and capacity to be a strategic or tactical task. In smaller IT organisations, however, the usual scenario is that the specialist knows his or her systems in detail while also providing them with conceptual support; both topics are shifted to the operational level. Availability management is responsible for all aspects that pertain to the availability of a service. Generally speaking: when required by the customer, a service provides the needed and planned function as set forth in the SLA [Service Level Agreement]. Concretely put: when the user wants to retrieve his or her e-mail, the corresponding e-mail server has to be working. Therefore, availability management serves as a monitor to ensure adherence to the objectives defined in the SLA and provides the necessary and possible improvements in terms of availability. In doing so, availability management can make use of reactive and proactive means:
Reactive:
• Monitoring, measuring, analysing, reporting and verifying the availability
• Examining the non-availability

Proactive:
• Risk assessment and management
• Implementing cost-appropriate countermeasures
• Planning, designing and testing new or changed services
• Testing the availability and failure mechanisms
Service providers often attempt to attract customers by promising 99% availability of the service. This availability in percent is calculated by dividing the actual availability of the service by the agreed service time:

\[
\text{Availability [\%]} = \frac{\text{Agreed service time} - \text{Time of service unavailability}}{\text{Agreed service time}} \times 100
\]
At first glance, the value of 99 percent availability seems very high. However, let's convert this to minutes and days and see what the results are. Relative to one day, 99-percent availability means less than 15 minutes of downtime (1 percent of 1,440 minutes is 14.4 minutes). Calculated over the entire year, these periods add up to roughly 3.65 days. On this basis, a decision can be made as to whether 99 percent is a realistic level or not. In retrospect, it is worthwhile to check that the promise has actually been kept.

Another critical point is to consider service availability (regardless of whether in-house or outsourced) end-to-end, from provision of the service to its consumption. For example, if we measure provision of a business application based on server uptime, other circumstances between server and user (e.g. failure of the network) can cause downtime that is not taken into consideration. Accordingly, the measurement should be carried out as close to the receiver as possible in order to take all eventualities into account.

If a failure occurs despite all preventive measures, availability management provides two additional metrics:
• Response time – Time between the report of a disruption and the beginning of troubleshooting.
• Restore time – Time between the report of an incident and restoration of the service.
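The availability calculation and the two metrics above can be sketched together in Python. This is a minimal illustration, assuming a service time of 24 hours on 365 days; the function names and the example timestamps are hypothetical and chosen only to show how a fast response can coincide with a slow restore.

```python
from datetime import datetime, timedelta

MINUTES_PER_DAY = 24 * 60


def availability_percent(agreed_minutes: float, downtime_minutes: float) -> float:
    """Availability in percent of the agreed service time."""
    if agreed_minutes <= 0:
        raise ValueError("agreed service time must be positive")
    return (agreed_minutes - downtime_minutes) / agreed_minutes * 100


def allowed_downtime_minutes(availability_pct: float, period_minutes: float) -> float:
    """Downtime that a promised availability still permits within a period."""
    return period_minutes * (100 - availability_pct) / 100


def response_time(reported: datetime, work_started: datetime) -> timedelta:
    """Time between the report and the beginning of troubleshooting."""
    return work_started - reported


def restore_time(reported: datetime, restored: datetime) -> timedelta:
    """Time between the report and restoration of the service."""
    return restored - reported


per_day = allowed_downtime_minutes(99, MINUTES_PER_DAY)
per_year_days = allowed_downtime_minutes(99, 365 * MINUTES_PER_DAY) / MINUTES_PER_DAY
print(f"99 % allows {per_day:.1f} min per day, {per_year_days:.2f} days per year")

# Hypothetical incident: quick response, but a week until the service is back.
reported = datetime(2011, 5, 2, 9, 0)
print("response:", response_time(reported, datetime(2011, 5, 2, 9, 10)))
print("restore: ", restore_time(reported, datetime(2011, 5, 9, 9, 0)))
```

If the agreed service time covers only business hours, the same functions apply with a correspondingly smaller number of agreed minutes.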
If service management is outsourced, the most important aspect to be considered when selecting the service provider is the restore time. Otherwise, the following case can occur: After a hardware defect, the provider responds within a few minutes and orders the spare part. However, if the spare part is not available and a week passes until delivery, the service cannot be offered again for several days.

Another management topic is managing the available capacity and the capacity needed in the future (capacity management). The Capacity Manager acts as the "fortune teller" of corporate IT. He or she does not look into a crystal ball, but instead analyses the current demand, monitors the company's development and, based on the corporate strategy, derives the future demand for services and the underlying infrastructure. He or she must ensure that the needed capacity is available in the planned quality at all times. Capacity management consists of three subareas:
• Business Capacity Management includes all activities intended to identify future business requirements and reflect them in the capacity plan.
• Service Capacity Management refers to the activities that provide insight as to the capacities of the IT services required in the future.
• Component Capacity Management includes all activities that monitor the capacity, performance and utilisation of the individual configuration elements (e.g. PC, printer, telephone, server).
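Component capacity management in particular lends itself to simple forecasting. The following Python sketch fits a straight line to monthly usage figures and estimates when a given capacity would be reached. The linear trend is an assumption made for illustration (real growth may be seasonal or stepwise), and the function names and sample figures are hypothetical.

```python
def linear_trend(values: list[float]) -> tuple[float, float]:
    """Least-squares slope and intercept for values observed at x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    return slope, mean_y - slope * mean_x


def months_until_full(usage_gb: list[float], capacity_gb: float):
    """Estimated months from the last observation until capacity is reached.

    Returns None if usage is not growing.
    """
    slope, intercept = linear_trend(usage_gb)
    if slope <= 0:
        return None
    full_at = (capacity_gb - intercept) / slope
    return max(0.0, full_at - (len(usage_gb) - 1))


# Hypothetical monthly disk usage in GB on a 200 GB volume.
usage = [100, 110, 120, 130]
print("months until full:", months_until_full(usage, capacity_gb=200))
```

The estimate is only as good as the trend assumption, so it is best treated as a trigger for a closer look rather than as a purchase date.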
We can put it most simply by saying that the future requirements of business for the services, and the demand of the services for the resources, have to be taken into account and reflected in the capacity planning. Based on this plan, actions are possible to ensure that the goals of the SLA are also met in the future. For example, the growth of the amount of disk space needed can be documented, a forecast derived from this and additional disk space purchased in a timely manner. This ensures that a cost-appropriate IT capacity can be maintained.

Up to this point, we have only talked about management of IT and IT services. However, we must not forget the specialists who install and maintain the applications and systems. Depending on the size of the company, these technical operations are divided into various teams and responsibilities. The common differentiation is between responsibility for systems and applications.

In system support, the company's administrators are concerned with all hardware-related topics. In the ITSM environment, this task is often given the title IT operations management and includes management of the physical IT infrastructure (typically in data centres or computer rooms). The foremost goal is safeguarding and optimising the current, stable condition of the infrastructure. Examples of the tasks of IT operations management include:
• System administration and running operational activities and events
• Console management and job scheduling of the servers
• Backup and restore
• Print management
• Performance measurement and optimisation
• Maintenance activities
• IT facility management (climate control system, power supply etc.)
Application management, on the other hand, is responsible for designing, developing, testing and improving business applications. The areas of responsibility can vary greatly from company to company. If the software is developed in-house, the range of application management responsibilities widens. The other option is to outsource application development. Of course, there are many increments between these two solutions (e.g. standard software with in-house adaptations). The tasks of application management are defined as follows:
• Supporting the company's applications
• In some cases, designing, developing, testing and improving applications
• Supporting IT operations management
• Training employees
3.2.2
IT procurement
The rapid development of information technology poses constant challenges to the IT departments of small and medium-sized enterprises: there are new kinds of technologies, changed services and innovative products. Do these have the potential to add value to the business or are they merely self-serving? Many calculation options are available for answering this question:
• Total cost of ownership (TCO)
• Total benefits of ownership (TBO) / Total value of ownership (TVO)
• Static or dynamic investment calculation
• Return on investment (ROI)
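To make the relationship between these figures concrete, here is a minimal Python sketch. It is deliberately simplified and rests on assumptions: TCO is reduced to purchase price plus yearly operating costs, benefits are expressed purely in money, and all amounts are invented. A real assessment would include further cost items and non-monetary benefits.

```python
def total_cost_of_ownership(purchase: float, yearly_operating: float, years: int) -> float:
    """Simplified TCO: purchase price plus operating costs over the service life."""
    return purchase + yearly_operating * years


def roi_percent(total_benefit: float, total_cost: float) -> float:
    """Return on investment in percent: net benefit relative to total cost."""
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return (total_benefit - total_cost) / total_cost * 100


# Hypothetical figures for a system with a service life of four years.
tco = total_cost_of_ownership(purchase=10_000, yearly_operating=2_000, years=4)
tbo = 6_000 * 4  # assumed yearly benefit, expressed in money
print(f"TCO {tco:.0f}, TBO {tbo:.0f}, ROI {roi_percent(tbo, tco):.1f} %")
```

A positive result only indicates that the assumed benefits exceed the assumed costs; the quality of the answer depends on how completely both sides were captured.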
It is, in fact, true that these options provide the company with correct results in subareas. Viewed separately, however, they do not provide valid results in the majority of cases. For example, there is no correct comparison of all costs and benefits, or only purely monetary variables are used. Ultimately, it is necessary to clarify whether the total benefit (TBO/TVO) to be expected justifies the total costs (TCO) to be expected over the service life or even creates a profit situation. In other words, what is needed is a return on investment consideration that is not limited to the investment costs and the monetary benefits, but considers all costs and benefits.

Once the investment decision has been made, "all" that is left is to purchase the new IT components. All too frequently, however, this plan proves to be extremely complex. Not without reason, as IT procurement processes affect multiple areas of an organisation – including those outside IT – and include services of external providers, such as suppliers. Accordingly, close co-operation should be pursued and open communication maintained. Supplier management within the IT organisation has the following objectives:
• Regularly observing the procurement market and monitoring trends and innovations
• Selecting suppliers, taking into account the strategic significance for the company's business processes
• Negotiating contracts and agreeing on a fixed scope of services with the suppliers
• Ensuring and continuously increasing the quality of the purchased service
• Managing relationships with suppliers
• Documenting all suppliers, contracts and relationships
In many cases, the tasks are also divided up between the purchasing department as such and the IT organisation. In doing so, IT co-ordinates all technical aspects in the cycle, while purchasing handles structuring the contracts and pricing. The greater a supplier's strategic significance for the company, the more long-term the business relationships should be. The significance can be defined based on two variables:
• Value contribution and importance
• Risk and influence

Diagram 11 - Classifying suppliers
In most cases, a long-term, close-knit co-operation pays off. For example, blanket purchase agreements often allow more favourable terms and conditions when buying components (e.g. for the expected quantity of desktop computers in one year), while also allowing optimisations and relieving workload in the procurement process. Over the medium term, consistent standardisation can achieve additional economies of scale.
Is IT procurement not a relevant topic to companies that have outsourced all services? Even when full outsourcing is used, it is important for there to be a responsible contact person in the company; here, too, the customer–supplier relationship has to be maintained, quality monitored and the market observed regularly.
3.2.3
Security and environment
"Sony says sorry - the Playstation manufacturer has apologised for the massive data theft in its networks and promised free games as compensation and better security measures. (...)" These or similar words were used by many daily newspapers to relate the story in spring 2011. Criminality in the IT environment is nothing new. As a small company, one could surely ask: who could profit from my data anyway? However, the topic of IT security is more varied than one might think, and certainly also relevant for smaller companies:
• First names, car marques, birthdays or the favourite football club: many people use easy-to-remember terms as passwords. Is a corresponding guideline in place in the company?
• Is the company's administrator password securely stored with the administrator's supervisor in case the admin is absent?
• Are virus scanners installed and are they updated on a regular basis?
• Are hard drives securely deleted (wiped) before being disposed of?
• Are important servers stored where they are safe from water or heat damage?
• What happens to e-mails when an employee is on vacation?
• Who is permitted to use his or her personal mobile phone in the company?
Numerous statistics indicate that approximately half of security-related incidents are triggered not by external parties, but by the company's own employees. In almost all cases, this is accidental, usually out of ignorance, a lack of training or carelessness. Accordingly, SMEs should also analyse the possible hazards and take countermeasures. In doing so, all possible risks should be taken into account:
• Protecting the information from unauthorised access and malware (e.g. viruses, hacker attacks, espionage)
• Provisioning the information to authorised persons (Access Management)
• Securing the infrastructure against influences from the area surrounding the IT (e.g. overvoltage in the power supply network or power failure, flood, heat or even fire)
The measures taken (e.g. providing a firewall, using a climate control system) are to be considered preventive. They should be in proportion to the possible harm. Operating a server in the supply closet next to chemicals and moist rags is surely negligent. However, an autonomous, earthquake-proof data centre is surely also not the right choice for a small company. One hundred percent protection is possible only in rare cases, or it is associated with high costs that are justified in only a few application areas.

However, the possible risks should be specified accordingly and measures planned in case the risks do occur. This is done in what is known as an IT recovery plan for various scenarios. The objective is to restore normal operation of the disrupted service(s) as quickly as possible. If, for example, the servers have to be shut down during a long-term power failure, the recovery plan should describe the systematic procedure for starting them up again, so that all dependencies between the systems are taken into account and no further delay or even damage occurs.
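A startup sequence that respects such dependencies can be derived mechanically once the dependencies are written down. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (available from Python 3.9). The systems and their dependencies are invented for illustration; a real recovery plan would take them from the company's own documentation, for example the CMDB.

```python
from graphlib import TopologicalSorter

# Hypothetical systems; each entry maps a system to the systems
# that must already be running before it is started.
dependencies = {
    "storage": set(),
    "directory service": set(),
    "database": {"storage"},
    "mail server": {"storage", "directory service"},
    "business application": {"database", "directory service"},
}


def startup_order(deps: dict) -> list:
    """Return one valid start sequence in which prerequisites come first.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(deps).static_order())


for step, system in enumerate(startup_order(dependencies), start=1):
    print(step, system)
```

Several valid sequences may exist; the important property is that no system is started before the systems it depends on. Reversing the sequence gives a plausible order for a controlled shutdown.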