AL Refs: INC02, GOV35
Purpose
The purpose of effective incident management is to protect the interests of customers and users of your services, as well as your reputation and business continuity. This policy should be used in conjunction with the Incident Response Plan, Disaster Recovery Plan and Business Continuity Plan (as applicable). The Incident Management Policy covers the management of all incidents (adverse, unplanned, material events) of every type. It sets out the overall management activities, the approach to initial triage assessment, the key management responsibilities, and the links to the aforementioned plans, which address major incidents and specific event types that exceed a general service desk or minor incident response.
Example Incident Management Policy
Responsibilities
Chief Operating Officer
Responsible for all aspects of the implementation and management of this Incident Management Policy, including its enforcement, revision and communication.
AssuranceLab Management
Managers and supervisors are responsible for the implementation of these arrangements within the scope of their responsibilities and must ensure that all staff under their control understand and undertake their responsibilities accordingly.
All Employees
All employees and contractors are responsible for understanding their part in identifying, reporting, and assisting in the assessment and communication of incidents as these relate to their own role. All employees and contractors should be aware of what constitutes an incident and where incidents may have an adverse impact on AssuranceLab’s employees, users, third-party suppliers and/or its own business operations and objectives.
Incident Identification
Incidents are identified through the following channels, listed in order of preference. Earlier identification, and identification by AssuranceLab employees, is preferred because it helps avoid downstream impacts.
- System testing: Bugs and defects identified prior to deployment into production are not considered incidents. However, incidents may be identified during post-implementation review as part of the change and release management process.
- System monitoring tools: Datadog, AWS GuardDuty and Pingdom identify system events and indicators that may require further investigation and alert AssuranceLab personnel to them. These alerts may prevent incidents or provide early identification.
- Internal user reporting: The development, client services and operations teams, or other users of the system, identify issues and raise them through JIRA Service Desk.
- Customer user reporting: Bugs, issues, complaints, and other AssuranceLab failures reported by customers and users of the system. These are raised through JIRA Service Desk.
All events identified through these channels that meet the definition of an incident should be logged into JIRA either automatically or manually.
Incident Classification
Incident classification is based on a Priority rating, which combines the urgency and the impact of the incident. The purpose of classification is to guide AssuranceLab employees through the following steps for handling the incident. Judgement and common sense should be applied when circumstances require, with retrospective reference to the policy to ensure all bases are covered.
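The policy does not define how urgency and impact combine into a Priority, so the sketch below is only one possible reading. It assumes a hypothetical three-level scale (HIGH, MEDIUM, LOW) for both inputs and a simple additive rule. The names `Level` and `classify`, and the thresholds, are illustrative assumptions and not AssuranceLab's actual matrix.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical rating scale used for both urgency and impact."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def classify(urgency: Level, impact: Level) -> str:
    """Combine urgency and impact into a Priority rating (P1 to P4).

    Assumed rule: add the two ratings; lower totals mean higher priority.
    """
    score = int(urgency) + int(impact)
    if score <= 2:   # HIGH + HIGH
        return "P1"
    if score == 3:   # HIGH + MEDIUM
        return "P2"
    if score == 4:   # MEDIUM + MEDIUM, or HIGH + LOW
        return "P3"
    return "P4"      # anything involving LOW with MEDIUM or LOW


print(classify(Level.HIGH, Level.MEDIUM))
```

Any such rule is a starting point only; as the policy notes, judgement should override the computed rating when circumstances require.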
Incident Handling
Incident handling has two key objectives:
- Mitigate the immediate impact to reduce the severity of the incident; and
- Resolve the underlying cause to remove the impact and prevent recurrence.
Target resolution time
Based on the Priority of the incident, the following target resolution times apply. These target timeframes should determine the steps taken, the priority over other operational duties, and escalation to senior stakeholders, as applicable.
P1 – 4 hours
P2 – 24 hours
P3 – Next sprint
P4 – Add to product backlog for general priority assessment
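As a rough illustration, the targets above could be turned into a deadline calculation. This sketch assumes the clock starts when the incident is logged and that P3 and P4 carry no fixed deadline because they are scheduled through the sprint and backlog. The names `RESOLUTION_TARGETS` and `resolution_deadline` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Fixed targets from the policy; P3 and P4 are scheduled work, not timed.
RESOLUTION_TARGETS = {
    "P1": timedelta(hours=4),
    "P2": timedelta(hours=24),
}
VALID_PRIORITIES = {"P1", "P2", "P3", "P4"}


def resolution_deadline(priority: str, logged_at: datetime) -> Optional[datetime]:
    """Return the target resolution time, or None when no fixed deadline applies."""
    if priority not in VALID_PRIORITIES:
        raise ValueError(f"Unknown priority: {priority}")
    target = RESOLUTION_TARGETS.get(priority)
    return logged_at + target if target is not None else None


logged = datetime(2024, 1, 1, 9, 0)
print(resolution_deadline("P1", logged))  # 2024-01-01 13:00:00
```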
Response
The response should be determined by the Priority rating and the Type of incident.
P1 – Enact the Disaster Recovery Plan where recovery of system functionality is required, and/or the Business Continuity Plan where operations are halted and AssuranceLab’s reputation is at risk.
P2/P3 – Follow the Incident Response Plan, with reference to the Disaster Recovery Plan if system or data restoration is required. Complete a post-incident review for lessons learned.
P4 – Perform a general service desk and customer support response and raise any system issues into the product backlog for resolution.
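The response rules above can be sketched as a small lookup. This is a minimal illustration, assuming two hypothetical flags (`needs_recovery` for system or data restoration, `operations_halted` for halted operations with reputational risk) that a responder would set during triage. The function name `plans_for` is also an assumption.

```python
from typing import List


def plans_for(priority: str, needs_recovery: bool = False,
              operations_halted: bool = False) -> List[str]:
    """Return the plans and follow-up actions the policy names for a Priority."""
    if priority == "P1":
        plans = []
        if needs_recovery:
            plans.append("Disaster Recovery Plan")
        if operations_halted:
            plans.append("Business Continuity Plan")
        return plans
    if priority in ("P2", "P3"):
        plans = ["Incident Response Plan"]
        if needs_recovery:
            plans.append("Disaster Recovery Plan")
        plans.append("Post-Incident Review")
        return plans
    if priority == "P4":
        return ["Service desk and customer support response"]
    raise ValueError(f"Unknown priority: {priority}")


print(plans_for("P2", needs_recovery=True))
```

An empty result for a P1 incident means neither trigger was flagged, which would itself be a prompt to revisit the triage assessment.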
Post-Incident Review (PIR)
Complete an assessment of the incident during or after the event to identify the cause, resolution, and any lessons learned that can prevent future occurrence. Refer to the Post-Incident Review template.
Governance
To ensure effective management of incidents, the following governance practices should be performed.
- Weekly/monthly operations team meetings to review new and open incidents, those resolved since the previous meeting, and the progress of fixes for past incidents.
- Reporting to the Senior Leadership Team on major incidents, key metrics on incident volumes and resolution times, and a summary of any significant lessons learned.
- Annual review of the Incident Management Policy and related plans to ensure lessons learned and practice developments are accurately captured in the documentation and implemented in practice.