SmartCare Service Desk and Incident management
Standard Operating Procedure Version 1.9
9 July 2024
Overview
Zambia’s SmartCare (SC) Health Information System operates in over 1,500 health facilities throughout the nation, representing nearly 100% electronic coverage of clients currently on HIV treatment. Four SC types have been deployed: eLast, eFirst, SmartCare Plus (SC+), and SmartCare Pro. In its current state, over 1,300 SC facilities operate eLast, a retrospective system supported by data entry clerks. By 2018, a prospective SC electronic medical record system (eFirst) was introduced at the health facility point of care, with coverage in approximately 200 high-volume sites. By 2020, the SC legacy system was re-engineered into SC+, an electronic medical record package characterized by enhanced features for improved decision support at the clinical point of care; interoperability of information systems to improve health information exchange; and a connectivity and alternative power package to sustain reliable operations. SC+ replaced eFirst operations and, over time, will also replace a proportion of eLast sites. In 2022, an improved version of SmartCare with a centralised server was developed; it was first deployed in 2023 and is called SmartCare Pro. Each SC type is guided by specific standard operating procedures to assure optimal daily use for clinical management at the point of care and routine national decision-making.
To assure the SC system is performing as expected, the SmartCare Service Desk was introduced to monitor its operations in real time. The SmartCare Service Desk is an incident management system used to detect and manage interruptions to operations that may require a response to restore expected performance. The Service Desk is a single, reliable point of contact between the various users of the SC system and service providers. Its primary objective is to resolve end user/requestor incidents as quickly as possible. To support active management, SC users can create incident tickets to report interruptions to routine operations that may require further investigation, so that normal operations resume promptly.
The Service Desk’s effectiveness lies in providing rapid responses to incidents reported by SC users. It is not designed to be an avenue for detailed SC training; such queries are redirected to the relevant Implementing Partner, Provincial Health Office, or IHM Capacity Building and Adoption team for follow-up. Reported incidents that cannot be immediately resolved by the Service Desk team are escalated to personnel experienced in the specific applications. The Service Desk provides SC support in facilities and will assist in troubleshooting system and connectivity incidents.
Any downtime to the SC system, system technical difficulties, or needed enhancements that cannot be solved at the local level may represent an SC security risk, a patient safety risk, or both. Incidents should be reported to the Service Desk immediately by the person who discovers the incident.
- Objectives Of The Service Desk
The following are the objectives of the Service Desk system:
- Facilitate the efficient resolution of end user/requestor problems related to the SC system
- Maintain a historical log of IT support provided to end user/requestors of the system
- Act as a record of software and hardware incidents encountered by systems users
- Maintain a record of equipment needs
- Ensure incidents do not fall through the cracks as would be the case if reported in non-standard, ad hoc ways, or outside the guidance of this document
- Maintain a record of areas with training gaps to improve SC operations
- Generate statistics intended for active systems monitoring and reporting to stakeholders
- Hours Of Operation
The Service Desk hours of operation are 8:00 AM to 5:00 PM CAT, Monday to Friday, excluding all Zambia statutory holidays.
- Service Desk Service Operation
The Service Desk will endeavour to do the following to maintain a high level of quality customer service:
- Seek end user feedback and act on the results.
- Fulfill this service level agreement (SLA) with the MoH, with a minimum incident management success rate of 96%.
- Decrease customer downtime and incidents
- Apply current industry best practices
The Service Desk will achieve the above using the following management protocol:
- Service Desk staff receive communication from end users/requestors through calls, emails, and WhatsApp, and log these into the Service Desk system
- Trained Service Desk users, such as Implementing Partners or Ministry of Health staff, log into the system and submit issues
- Service Desk staff allocate priority status to all reported incidents and respond based on assigned priority status.
- An incident is assigned a priority code as described in Table 3, below
- Service Desk staff solve the problem or escalate the incident to the relevant IP, Ministry of Health Office, or IHM personnel
- Service Desk staff solve the problem, document the solution, notify the end user/requestor, and support the end user/requestor to implement the solution for the facility
- The end user/requestor closes out the incident to complete the incident management cycle
- Benefits Of A Service Desk
The following benefits are expected from implementing the current Service Desk system:
- Efficient and tracked incident management
- Higher accountability from Implementing Partners, supporting staff, and users
- Aids the Service Desk unit, and supporting staff, to provide guidance to end users on possible solutions to ensure continued service provision
- Ensures that incidents are escalated appropriately to the next level of service provision if the incident cannot be resolved at a lower level or the service level agreed time lapses
- Follow up to ensure the incident or request has been resolved and the ticket is closed
- Gives the Service Desk a way to provide updates to users on the status of logged incidents
- Assists in determining if reported SC incidents are systematic, or non-systematic, for the routine assessment of overall systems performance
- When To Contact The Service Desk Or Submit An Incident Ticket
As a rule, if the functionality of the SC system is not providing the results expected, then contact the Service Desk.
Contact the Service Desk if:
- SC is down and you cannot solve the problem easily
- Your facility champion cannot solve the SC problem, you have investigated as much as possible, but have not found the solution.
- You have identified an SC problem that you can work around, but it creates difficulties in using the system
- You have any training requests, or have identified any training gaps, in using the SC system
- You need any information regarding SC to improve local use
- There is a problem with using SC and you are not sure what it is
- You found a non-critical incident or problem, which does not affect system use, but should be fixed
- You have a suggestion for including new functionality in the system in line with National HIV care and reporting guidelines
- Your facility is participating in testing a new SC patch, upgrade, or feature and there are systems errors
- You have queries about the latest version of SC you should be operating, or about the generation and submission of TDBs
- Roles And Responsibilities For Incident Management
Table 1 below articulates the different roles within incident management and the responsibilities associated with each role.
Table 1: Roles and Responsibilities
Role | Responsibilities | Who is accountable |
End User/Requester | Contact the Service Desk to raise a new incident request. Log incidents through the Tuso self-help portal. Follow up on an existing request. Clearly communicate all the required information to technicians or the Service Desk team. Acknowledge the restoration of service and completion of the ticket. Respond to follow-up surveys after ticket resolution, completing the feedback loop. Close out the incident. | Healthcare provider or any other system user |
Service Desk Officer | Log all incoming incident requests with appropriate parameters such as category, urgency, and priority. Assign tickets to technicians. Analyze and resolve an incident to restore service. Escalate unresolved incidents to the relevant technicians. Gather all required information from the requesters and send them regular updates on the status of their request. Act as a point of contact for requesters and, if needed, coordinate between the Service Desk and requesters. Verify the resolution with the end user and collect feedback. | Service Desk Officer |
Service Desk Manager | Take accountability for the overall process of issue and incident management. Define key performance indicators (KPIs) and align them with critical success factors (CSFs). Review KPIs and ensure that they meet business goals and CSFs. Design, document, review, and improve processes. Establish continuous service improvement (CSI), wherein the procedures, policies, roles, technology, and other aspects of the incident management process are reviewed and improved upon. Stay informed about industry best practices and incorporate them into the incident management process. | IT Services Manager |
Training Coordinator | Responsible for training needs, knowledge gaps, and competency improvement among users of LSC, SC+, and SC Pro. Train technical users, super users/champions, and end users/frontline healthcare providers. Perform training needs assessments. Conduct training mop-ups. Conduct provider shadowing and follow-up competency assessments. | Capacity Building and Adoption Manager |
Hardware IT Technician | Provide technical support for computer hardware. Troubleshoot and resolve hardware incidents. | IT Officers |
Software Engineer | Design and develop software applications. Resolve software-related incidents. Enhance applications by identifying opportunities for improvement, making recommendations, and designing and implementing systems. Maintain and improve existing codebases and peer review code changes. Liaise with colleagues to implement technical designs. Investigate and use new technologies where relevant. Provide written knowledge transfer material. | Software Engineer |
Senior Software Engineer | Lead and manage SC software development projects. Research and provide necessary technical guidance. Provide LOE for software application development. Account for software application incidents and their resolution. | Principal Software Engineer |
Database Administrator | Troubleshoot SC+ and LSC database-related issues and incidents. Ensure that the back end is operational and highly available, and troubleshoot any database problems, including breakage and corruption of base tables and records. Proactively monitor database systems to ensure minimum downtime, provide trend analysis and reporting, ensure database integrity, maintain database backup procedures, and create and maintain process and procedure documentation. | Database Administrator |
System Administrator | Troubleshoot and resolve issues and incidents related to system downtime. Actively resolve problems and issues with server systems to limit work disruptions at facility level. Responsible for the maintenance, configuration, and reliable operation of computer systems and servers for the SC system. Participate in research and development to continuously improve and keep up with the IT business needs of system operation. | Infratel Sysadmin |
Network Administrator | Responsible for maintaining computer networks and solving any incidents and problems that may occur with them. Provide network administration and support. Monitor computer networks and systems to identify how performance can be improved. | IT Officers |
Cyber Security | Responsible for building in security during the development stages of software systems, networks, and data centers. Look for vulnerabilities and risks in hardware and software and, when vulnerabilities and breaches are found, close them off. | Cybersecurity Specialist |
- Blanket Service Level Agreement
Based on the specific SLA with the MoH, Table 2 presents an incident priority matrix which defines priority of incidents based on their impact on SmartCare system and disruptions of healthcare services. Table 3 outlines the definition of levels of incidents, their prioritization, response times, and feedback protocols. The Service Desk will strive to consistently meet these timelines when logging all incidents and providing feedback.
- Incident Prioritization Matrix
Table 2 below illustrates the different definitions of prioritization of the incidents as they are received.
Table 2: Incident priority matrix
Impact | System wide and all live facilities | Selected health facilities affected | Single facility disrupted | Department within a facility | Individual user/HCP |
Clinical services disrupted | | | | | |
Degraded services | | | | | |
Clinical work not affected | | | | | |
- Severity /Priority Level
Table 3: Service Desk Priority Status and Feedback Protocol
Priority level | Priority | Definition | Response Times | Resolution Time | Feedback Protocol |
P1** | Critical | | 15 minutes | 4 hours | First update to end user/requestor within 1 hour, whenever possible. Further updates at 1-hour intervals. Escalate to IT Officer and Training Officer at end of the day if the problem is not resolved. |
P2* | High | | Within 4 hours | 8 hours | First update to end user/requestor within 2 hours. Further updates at 2-hour intervals until the incident or issue has been resolved. |
P3 | Moderate | | Within 8 hours | 7 days | |
P4 | Normal | | The request is forwarded to the relevant IP/Provincial Office/IHM within 1 business day | 14 days | |
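The response and resolution targets in Table 3 can be turned into concrete deadlines for a ticket. The sketch below is illustrative only: the dictionary layout and the function name `sla_deadlines` are assumptions, it counts elapsed clock time rather than Service Desk operating hours, and it treats the P4 "1 business day" forwarding target as 24 hours.

```python
from datetime import datetime, timedelta

# Targets transcribed from Table 3. The structure is hypothetical, and
# P4's "1 business day" is simplified to 24 elapsed hours.
SLA_TARGETS = {
    "P1": {"response": timedelta(minutes=15), "resolution": timedelta(hours=4)},
    "P2": {"response": timedelta(hours=4), "resolution": timedelta(hours=8)},
    "P3": {"response": timedelta(hours=8), "resolution": timedelta(days=7)},
    "P4": {"response": timedelta(days=1), "resolution": timedelta(days=14)},
}


def sla_deadlines(priority: str, logged_at: datetime) -> dict:
    """Return the latest acceptable response and resolution times for a ticket."""
    target = SLA_TARGETS[priority]
    return {
        "respond_by": logged_at + target["response"],
        "resolve_by": logged_at + target["resolution"],
    }


if __name__ == "__main__":
    logged = datetime(2024, 7, 9, 8, 0)
    print(sla_deadlines("P1", logged))
```

A production version would also need to skip weekends, statutory holidays, and hours outside 8:00 AM to 5:00 PM CAT if the SLA clock is meant to run only during operating hours.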
- System Availability Calculation Matrix
Table 4 below depicts how the metrics and standards for system availability are calculated.
Table 4: System Availability Calculation Matrix
Service Level Target (allowed downtime per week, in minutes)
Hours per Day | Days per Week | 95% | 98% | 99% | 99.4% | 99.94% |
8 | 5 | 120 | 48 | 24 | 14 | 1.44 |
12 | 5 | 180 | 72 | 36 | 22 | 2.16 |
18 | 5 | 270 | 108 | 54 | 32 | 3.24 |
24 | 5 | 360 | 144 | 72 | 43 | 4.32 |
24 | 6 | 432 | 173 | 86 | 52 | 5.184 |
24 | 7 | 504 | 202 | 101 | 60 | 6.048 |
Source: ITIL Release Management: A hands-on Guide, 2010 1
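The allowed downtime for any service window follows from one formula: total service minutes per week multiplied by the share of time that is allowed to be unavailable. The sketch below is a minimal illustration; the function names are assumptions and are not part of any Service Desk tooling.

```python
def allowed_downtime_minutes(hours_per_day: float, days_per_week: float,
                             target_pct: float) -> float:
    """Weekly downtime budget, in minutes, for a given service level target."""
    total_minutes = hours_per_day * days_per_week * 60
    return total_minutes * (100 - target_pct) / 100


def achieved_availability_pct(hours_per_day: float, days_per_week: float,
                              downtime_minutes: float) -> float:
    """Availability actually achieved in a week, given observed downtime."""
    total_minutes = hours_per_day * days_per_week * 60
    return 100 * (total_minutes - downtime_minutes) / total_minutes


if __name__ == "__main__":
    # An 8-hour, 5-day service window with a 95% target.
    print(allowed_downtime_minutes(8, 5, 95))
    # Availability achieved if the same window had 60 minutes of downtime.
    print(achieved_availability_pct(8, 5, 60))
```

For an 8-hour, 5-day window the week has 2,400 service minutes, so a 95% target allows 120 minutes of downtime, and 60 minutes of observed downtime corresponds to 97.5% availability.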
- Reporting Incidents Through The Service Desk
There are several ways to report incidents through the Service Desk. These have been put in place to make the process as simple, and as seamless, as possible.
- Tuso Self Service Portal
The Tuso Self Service Portal lets users submit incidents, request services, view announcements, chat with support staff, consult its Knowledge Base for self-help, reset domain passwords or unlock accounts, and more. The Portal provides end users with 24/7 self-service and self-help capabilities accessible from computers and mobile devices.
The Service Desk portal can be accessed via the web: https://tuso.ihmafrica.com
To access the portal, credentials must be created for the end user/requestor by the Service Desk unit. IHM will provide orientation to IPs on use of the portal. Additional requests for orientation can be made via the Service Desk team.
- Phone Call
Service Desk incidents can be logged between 8:00 AM and 5:00 PM using the following numbers:
Toll Free
- 8080
Chargeable numbers
- +260979655211
- +260762436771
To log an incident via WhatsApp, use the following numbers:
- +260979655211
- +260762436771
All WhatsApp messages must include the name and email address of the end user/requestor, the province, the district, and the name of the facility.
Service Desk incidents can also be logged by emailing: support@ihmafrica.org
Incidents will automatically be logged, and the end user/requestor will receive an automated response with an assigned ticket number.
- Information To Have Ready When Contacting The Service Desk
For a productive and hassle-free experience with the Service Desk, end users/requestors must have the following information at hand when reporting an incident. This information helps to ensure the incident is logged properly, allowing for efficient resolution and feedback.
- Name of the person logging the incident/request.
- Province, District, facility name, facility HMIS code that incident/request is being reported for
- Specify if facility is a parent site or hub site.
- Version of SC being used and if reporting for eFirst, eLast, SC+ or SmartCare Pro operations.
- Date of submission of last TDB for eFirst and eLast sites
- Detailed description of the incident/request (Example: “I am unable to run MER report” or “A patient record was corrupted when I was trying to save a clinical interaction”)
- If the incident causes any error messages to appear, the end user/requestor must provide the exact text that is displayed and the module or service that was in use when the error occurred.
- How often the problem occurs and any pattern that has been noted leading up to the problem occurring.
- Include any supporting information such as screenshots, reports, and the NUPIN of the affected patient(s).
- Provide your organization’s name and your contact information (phone and email) to enable the Service Desk team to communicate and provide feedback.
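The checklist above can be checked mechanically before a ticket is submitted. The sketch below assumes a report is collected as a plain dictionary; the field names are hypothetical and do not correspond to any actual Tuso portal schema.

```python
# Hypothetical field names covering the core items in the checklist above.
REQUIRED_FIELDS = [
    "reporter_name",
    "province",
    "district",
    "facility_name",
    "facility_hmis_code",
    "sc_version",
    "description",
    "organization",
    "phone",
    "email",
]


def missing_fields(report: dict) -> list:
    """Return the required fields that are absent or blank in a report."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = report.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


if __name__ == "__main__":
    draft = {"reporter_name": "A. Banda", "province": "Lusaka", "district": " "}
    print(missing_fields(draft))
```

Optional items such as error message text, screenshots, or the date of the last TDB submission could be handled the same way with a second list of conditionally required fields.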
- Feedback And Follow-Up In Service Desk
Aside from the regular feedback protocol, the Service Desk team will endeavor to provide stakeholders with weekly pending status updates on reported incidents using the following channels:
- Phone
- The SC Self-Service portal
The Service Desk will also provide weekly emails to partners as routine status update and for outstanding tickets pending Implementing Partner intervention. All reported incidents will be assigned a ticket number. In order to make a follow-up, users must reference the ticket number allocated at the time the incident was reported upon initial intake to the Service Desk system.
- Procedural Steps For Incident Management
- Step 01: Incident Logging
Objective: This procedure describes the set of operations required to log an incident
Responsible: Service Desk staff and SC Pro system end-user or requestor
- Log incident or request through Tuso self-service portal immediately using https://tuso.ihmafrica.com
- Where the self-service portal is not available, log the incident or request by calling the Service Desk toll-free number, 8080.
- If the toll-free number is not going through, log incident or request through SMS or via WhatsApp live chat.
- Step 02: Ticket Creation
Objective: This procedure describes the set of operations required to create a ticket in Tuso Self-Service Portal
Responsible: SC Pro system end-user or requestor
- Log incident or request through Tuso self-service portal immediately
- Where the self-service portal is not available, the SmartCare end user or requester should send an email to support@ihmafrica.org (all emails must include the name of the end user/requestor, province, district, and name of facility)
- If anything prevents the SmartCare end user or requester from sending an email, they should log the incident or request by calling the Service Desk toll-free number.
- If the toll-free number is not going through, log incident or request through SMS or via WhatsApp live chat.
- Step 03: Incident Categorization
Responsible: SmartCare Pro end-user or requestor
- Tickets are categorized using three levels: 1st level category, 2nd level category, and 3rd level category.
- Aside from this tier categorization, during ticket creation the user also selects a category based on the type of incident: Software, Hardware, Network, Supplies, or Human Interference.
- Step 04: Incident Prioritization
Responsible: SmartCare Pro end-user or requestor
- During ticket creation, the end user/requestor picks the priority level according to the impact of the incident at hand.
- The priorities available in Tuso are:
- High Priority
- Normal Priority
- Medium Priority
- Low Priority
- Not Prioritised
- Refer to Table 2 for guidance and information on how each priority is managed.
- Step 05: Incident Assignment
Responsible: The Service Desk team
- Incidents are assigned to:
- IT Officers: issues that may need Hub input to resolve, such as an unavailable database.
- Software Developers: all software incidents relating to SmartCare that require development, and feature requests.
- Capacity Building and Adoption: incidents that call for a better understanding of SmartCare through training.
- Implementing Partners: primarily issues that require equipment replacement.
- Incidents are assigned after the Service Desk team has analysed the nature of the incident.
- Step 06: Task Creation And Management
Responsible: Support Provider assigned an incident
- The provider will manage tasks and provide feedback within the System.
- Refer to Table 2 for further guidance and information.
- Step 07: SLA Management And Escalation
Responsible: The Service Desk team
- Monitoring: Regular tracking and monitoring of SLAs to ensure compliance.
- Manual Escalation: Support staff can manually escalate tickets if they require urgent attention or higher-level intervention.
- Communication: Clear communication channels and protocols to ensure all parties involved are aware of the escalation and their roles in resolving the issue.
- Resolution Tracking: Continuous follow-up on escalated tickets to ensure they are resolved within the stipulated timeframes.
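The monitoring step can be sketched as a periodic check that flags open tickets nearing or past their resolution target. This is an assumption-laden illustration: the ticket dictionary layout, the `tickets_to_escalate` name, and the 80% warning threshold are all hypothetical, and only the resolution times come from the SLA table in this document.

```python
from datetime import datetime, timedelta

# Resolution targets from the Service Desk priority table (P1 to P4).
RESOLUTION_TARGETS = {
    "P1": timedelta(hours=4),
    "P2": timedelta(hours=8),
    "P3": timedelta(days=7),
    "P4": timedelta(days=14),
}


def tickets_to_escalate(tickets: list, now: datetime,
                        warn_fraction: float = 0.8) -> list:
    """Return IDs of open tickets that have used up warn_fraction of their SLA."""
    flagged = []
    for ticket in tickets:
        if ticket["status"] == "closed":
            continue
        elapsed = now - ticket["logged_at"]
        threshold = warn_fraction * RESOLUTION_TARGETS[ticket["priority"]]
        if elapsed >= threshold:
            flagged.append(ticket["id"])
    return flagged


if __name__ == "__main__":
    start = datetime(2024, 7, 9, 8, 0)
    queue = [
        {"id": "T-1", "priority": "P1", "status": "open", "logged_at": start},
        {"id": "T-2", "priority": "P3", "status": "open", "logged_at": start},
    ]
    print(tickets_to_escalate(queue, start + timedelta(hours=3, minutes=30)))
```

Flagging before the deadline, rather than at it, leaves time for a manual escalation to take effect; the threshold would need to be agreed with the Service Desk team.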
- Step 08: Incident Resolution
Responsible: Support Provider assigned an incident
- The Service Desk team follows up with the end user/requestor to get full details of the incident.
- The team then communicates with the support providers for solutions and feedback.
- The solution is then shared with the respective end user(s)/requestor(s).
- An incident will be resolved by support providers as per the escalation procedure.
- Refer to Table 1 for further guidance and information.
- Step 09: Incident Review And Corrective Measure
Responsible: Service Desk Team and Support Provider assigned an incident
- The Service Desk team reviews the incidents and follows up with the support providers.
- Weekly meetings are held with the developers.
- The provided resolutions are shared with the affected facilities for action on the incidents.
- Once the Service Desk team has shared the resolutions, feedback is received from facilities indicating the outcome of the resolution.
- Step 10: Incident Closure
Responsible: SmartCare Pro end-user or requestor
- Once a resolution is provided, the user should close the incident.
- The Service Desk team should get in touch with the end user/requestor to confirm that the incident is no longer occurring and that the fix provided is working
- The user should change the ticket status from its old status (i.e., open, or resolved and open) to closed.
- The user clicks Save to update the incident.
- Incident Management Metrics And KPIs
Table 5 below shows the different metrics used in ensuring an efficient and effective incident management system
Table 5: Metrics and KPIs for Incident Management
No. | KPI | Formula | Target |
1 | SC Pro system availability | SC Pro system availability reporting will include both scheduled and unscheduled outages or downtime incidents [availability of the system, in percent, for the period of interest (per week)]. The SC application is defined as a critical application. | ≥ 99.99% |
2 | Incident resolution rate (%) | Proportion of incidents that have been resolved among all incidents logged in the system. IRR = [Resolved incidents] / [Total logged incidents] | ≥ 95% |
3 | Average resolution time or average turn-around time (TAT) | The average time taken to resolve an incident. [Sum of the duration of all incidents resolved during the month] / [Total number of incidents resolved during the same month] | C ≤ 4 hrs; H ≤ 8 hrs; M ≤ 7 days; L ≤ 14 days |
4 | SLA compliance rate | The percentage of incidents resolved within an SLA. [Number of incidents with met SLAs resolved during the month] / [Total number of incidents resolved during the same month] | C ≥ 99.99%; H ≥ 95%; M ≥ 90%; L ≥ 80% |
5 | First call resolution rate (%) | Percentage of incidents resolved in the first call | ≥ 95% |
6 | First level resolution rate (%) | [Number of incidents resolved at first level in the week] / [Total number of incidents resolved during the same week] | ≥ 95% |
7 | Reopen rates | [Number of incidents resolved during the month that were reopened] / [Total number of incidents resolved during the same month]. An incident counts as reopened when its Reopen Count field is greater than 0. | C ≤ 1%; H ≤ 5%; M ≤ 10%; L ≤ 20% |
8 | Incident backlog | The number of incidents that are pending in the queue without a resolution. | There are always fewer than <maximum> unresolved problems. |
9 | Percentage of major incidents | The number of major incidents compared to the total number of incidents. | ≤ 10% |
10 | Mean time to acknowledge (MTTA) | The average amount of time between a system alert and a team member acknowledging the issue. For example, if there were 10 incidents with a total of 40 minutes between alert and acknowledgement across all 10, divide 40 by 10 for an average of four minutes: MTTA = 40 minutes / 10 incidents = 4 minutes. MTTA is useful for tracking responsiveness; it helps flag whether the team is suffering from alert fatigue and taking too long to respond. | C ≤ 15 minutes; H ≤ 30 minutes; M ≤ 1 hr; L ≤ 1 hr |
11 | Mean time to resolution (MTTR) | What it means: the average amount of time it takes to respond to or resolve an incident. What it can show: how quickly the team is able to respond to or resolve issues as they arise. Calculation: add up all the downtime in a specific period and divide it by the number of incidents. For example, if systems were down for a total of 30 minutes across two separate incidents in a 14-hour period, 30 divided by two is 15, so the MTTR is 15 minutes. MTTR = [Total unplanned repair or resolution time] / [Total number of incidents] = 30 minutes / 2 incidents = 15 minutes | C ≤ 4 hrs; H ≤ 8 hrs; M ≤ 7 days; L ≤ 14 days |
12 | Mean time between failure (MTBF) | The average time between failures, calculated by dividing the total uptime by the total number of failures. MTBF = Total uptime / Number of incidents. For example, with 2 hours of downtime in a 24-hour period, total uptime = 24 hrs - 2 hrs = 22 hrs; with 2 incidents in that period, MTBF = (24 - 2) / 2 = 11 hrs. | Increase the time between failures |
13 | Mean time to detect (MTTD) incidents | The average time taken to detect incidents or anomalies. | C ≤ 15 minutes; H ≤ 30 minutes; M ≤ 1 hr; L ≤ 1 hr |
14 | End user satisfaction on incident resolution | % of completed scores on the problem/incident resolution satisfaction survey with a rating of satisfied or very satisfied. | ≥ 95% |
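Several of the KPIs above are simple ratios and can be computed directly from ticket counts and durations. The sketch below reproduces the worked examples given for MTTA, MTTR, and MTBF; the function names are illustrative assumptions, not part of any reporting tool.

```python
def incident_resolution_rate(resolved: int, total_logged: int) -> float:
    """IRR as a percentage: resolved incidents over all logged incidents."""
    return 100 * resolved / total_logged


def mean_time(total_minutes: float, incident_count: int) -> float:
    """Shared form of MTTA and MTTR: total minutes divided by incident count."""
    return total_minutes / incident_count


def mtbf_hours(period_hours: float, downtime_hours: float, failures: int) -> float:
    """MTBF: total uptime in the period divided by the number of failures."""
    return (period_hours - downtime_hours) / failures


if __name__ == "__main__":
    print("MTTA (min):", mean_time(40, 10))        # 40 minutes over 10 incidents
    print("MTTR (min):", mean_time(30, 2))         # 30 minutes over 2 incidents
    print("MTBF (hrs):", mtbf_hours(24, 2, 2))     # 22 hours uptime, 2 failures
    print("IRR (%):", incident_resolution_rate(95, 100))
```

With the figures from the table, this gives an MTTA of 4 minutes, an MTTR of 15 minutes, and an MTBF of 11 hours.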
- After Action Review/Post-mortem/Post-Incident Reviews
After Action Review (AAR), or Post-Incident Review, is a critical incident review process that will be conducted to pinpoint weaknesses among the people responsible for designing, developing, deploying, implementing, and maintaining the SC Pro system and its interfaces. These reviews will also help identify weaknesses in the processes and tooling used to operate the system, with the aim of continuously improving the IMS and optimizing the performance of the SC Pro system and all other interoperable sub-systems.
“Don’t let a good crisis go to waste! Learn from it to be better next time. It’s all about getting better -not finding blame. Establishing a positive, blameless culture of post-incident evaluation is based on an honest and in-depth evaluation of the incident response. It signals to the organization that technology failure is the perfect opportunity to learn about your operating environment and make improvements to minimize future IT downtime”. 2
Any organisation that values IT should aim to continuously reduce Mean Time to Resolution/Repair (MTTR). 3 As the SCHISS project, we propose an After Action Review (AAR) that should be sustainable in the normative programmatic environment of the MoH.
- Justification For AAR
The AAR platform provides an opportunity to understand what went wrong and to draw lessons from it. Lessons learned may vary as much as the incidents themselves and may include, among other things:
- Certain incidents may help the project team and MoH see a blind spot in the system architecture and/or service delivery.
- Perhaps some mistakes were made in detecting the problem, and somebody learned a new technique with the monitoring or alerting tools as a result.
- Perhaps a junior member of the Incident Response Team (IRT) was covering the shift for a senior person, handled a tough event, and gained valuable experience and confidence as an Incident Manager.
- Perhaps a person or team chronically misses established incident response service level agreements (SLAs), thereby slowing the MTTA.
- AAR Process
The following are processes that should be followed in conducting an AAR or post-mortem.
- Gather Relevant Information
- Gather all records for the incidents under review in the period of interest (weekly review). The required data will include the following:
- Incident logs
- Communication recordings
- Scribe information from the incidents
- Incident timelines in hours, minutes, and seconds (all events by timestamp)
- Roster of accountable incident responders
- Discussion of possible incident resolutions
- Develop An Incident Timeline
Capturing timelines for all events that took place during incident response is also critical. Capturing timestamps, a summary of key events, and the discussions that supported the decisions made during an incident provides valuable insight into how people responded to the incident. This process also provides the data needed to improve how incident responders perform during an incident.
- Ensure Relevant Participation
Ensure relevant participation in the AAR and encourage active participation and discussion among the participants.
- AAR/RCA Generic 5 Questions
The AAR will be combined with RCA. To start these processes the AAR team will ask the following 5 questions.
Table 6: Questions for AAR
No. | Main questions | Probe questions |
1 | A description of the problem (symptoms) | What happened? |
2 | A brief description of the cause of the incident | What caused/contributed to the problem? This is somewhat subjective and may be quite complex based on your environment. The intent is to capture what caused a change from uptime to downtime. |
3 | Who responded to solve it, and what are the time stamps for their dispatch and arrival on the incident? Were the initial responders the right ones for the incident or was it necessary to escalate to more or different SMEs? | Were the right people assembled in the right spot to make the right decisions at the right time? |
4 | What solution was implemented? | Did the incident responders choose the right solution? |
5 | What was the MTTA for the right team of responders and what was the MTTR? | How long did it take to assemble and solve the problem? |
- Structured AAR Sheet
In addition to the aforementioned questions, the project team may use the following standardized template for carrying out an incident AAR or post-mortem.
IMS AAR Sheet
Goal: To improve incident response, determine what broke or went wrong and how people responded to the thing that broke, and determine what steps need to be taken to prevent a similar situation from happening again.
List IRT | Speciality of IRT |
Incident commander (IC) identified and announced:
IC transferred/changed and list the reason for transfer or change:
Coded responses: N=Not completed, Y=Completed, P=Partially completed or completed later
# | Review questions | Weight: N | Weight: Y | Weight: P | Rating |
1 | Was sizing up of the incident/problem completed, accurate and well articulated? | 0 | 10 | 5 | |
2 | Were appropriate SMEs requested? Was the SME response time acceptable? If not, list the reasons why. | 0 | 10 | 5 | |
3 | Was the SEV level for the incident identified and announced? | 0 | 5 | 1.5 | |
4 | Did the IC (Business Analyst) control the flow of the discussion and drive the incident towards resolution in an effective and timely manner? (Validate this against the targeted resolution time as per the SLA) | 0 | 35 | 17.5 | |
5 | Did the IC adhere to acceptable span of control numbers? (If exceeded, was this acceptable?) Did the IC control the extra numbers effectively? | 0 | 5 | 1.5 | |
6 | Did the IC establish effective communications? | 0 | 10 | 5 | |
7 | Is there an incident timeline and an estimated time to resolution? | 0 | 5 | 1.5 | |
8 | Were briefings, notifications and postings made at the appropriate time (i.e., every hour)? | 0 | 10 | 5 | |
9 | Did the IC develop a backup plan and/or consider second/third tier alternatives? | 0 | 5 | 1.5 | |
10 | Did the IC discuss notifying DR at 30 minutes? Did they make appropriate notifications at the 1-hour mark for potential DR? | 0 | 5 | 1.5 | |
Total | 100% |
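The sheet can be totalled mechanically once each review question has a coded response. The following is a minimal sketch that assumes the overall rating is simply the sum of the weight attached to each coded response; the sheet itself does not state how the rating is combined, so that rule is an assumption. The weights are copied as printed in the sheet, and the function and variable names are hypothetical.

```python
# Weights per review question (1 to 10), copied as printed in the AAR sheet.
# Each tuple is (weight for N, weight for Y, weight for P).
WEIGHTS = [
    (0, 10, 5),
    (0, 10, 5),
    (0, 5, 1.5),
    (0, 35, 17.5),
    (0, 5, 1.5),
    (0, 10, 5),
    (0, 5, 1.5),
    (0, 10, 5),
    (0, 5, 1.5),
    (0, 5, 1.5),
]
CODE_INDEX = {"N": 0, "Y": 1, "P": 2}


def aar_score(responses):
    """Sum the weights for a list of ten coded responses (N, Y or P).

    Assumption: the overall rating is the plain sum of the selected weights.
    """
    if len(responses) != len(WEIGHTS):
        raise ValueError("expected one coded response per review question")
    return sum(weights[CODE_INDEX[code]]
               for weights, code in zip(WEIGHTS, responses))


# Hypothetical review: everything completed except question 4 (partial)
# and question 9 (not completed).
example = ["Y", "Y", "Y", "P", "Y", "Y", "Y", "Y", "N", "Y"]
print(aar_score(example))  # 77.5
```

With every question coded Y the total is 100, which matches the 100% shown in the Total row.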
- Deduce Relevant Recommendations For Improvement
The main purpose of the AAR is to pinpoint weaknesses, learn from the process and identify relevant solutions or continuous improvements to the people part of the response as well as to the technology problems encountered. Schnepp, Vidal & Hawley (2017) suggest that, for recommendations and improvements concerning the nontechnical aspects of the IMS and the system being serviced, the acronym TALENT (Training, Accountability, Leadership, Empowerment, and Notification) should be applied as a framework. Based on the relevant recommendations, data-informed actions or activities should be operationalized into work schedules for the responsible individuals.
Table 7: AAR Recommendation and accountability matrix
Action | Timeline | Responsible | Assigned Department | Notes |
- Definition Of Terms/Glossary For Incident Management
Table 8 below is a glossary of key terms used in this document.
Table 8: Glossary of Key Terms
No. | Term | Definition |
1 | Incident | An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item, even if it has not yet affected a service, is also an incident (e.g., failure of one disk from a mirror set). |
2 | Incident identification | The process of discovering an incident. |
3 | Incident logging | Creating and maintaining a record of an incident in the form of a ticket. |
4 | Incident categorization | Recording an incident with due diligence so that it is placed under the appropriate category. |
5 | Incident closure | Closing an open incident ticket once the incident has been resolved. |
6 | Incident escalation rules | A set of rules defining the hierarchy for escalating incidents, including triggers that lead to escalations. Triggers are usually based on incident severity and resolution time. |
7 | Incident management | Managing the life cycle of all incidents to restore normal service operation as quickly as possible and minimize business impact. |
8 | Incident management report | A series of reports produced by the incident manager for various target groups (e.g., teams responsible for IT management, service level management, other service management processes, or incident management itself). |
9 | Incident manager | The person responsible for the effective implementation of the incident management process and for carrying out reporting. Also represents the first stage of escalation if an incident cannot be resolved within the agreed service level. |
10 | Incident model | Contains the predefined steps that should be taken to deal with a particular type of incident. |
11 | Incident monitoring | Tracking the processing status of outstanding incidents so that countermeasures may be introduced as soon as possible if service levels are likely to be breached. |
12 | Incident prioritization | Assigning priorities to incidents and defining what constitutes a major incident. |
13 | Incident record | A collection of data with all details of an incident, documenting the history of the incident from registration to closure. |
14 | Incident report | A report that includes information about incidents, how they were handled, and other data that can help measure the performance of the incident management process. |
15 | Incident resolution | The workaround or correction that fixes the incident and restores service to its best quality. |
16 | Incident status | How far along an incident is in the incident management process. Common statuses include:
|