Incident Management Policy

Purpose

Effective incident management protects the interests of customers and users of your services, as well as your reputation and business continuity. This policy should be used in conjunction with the Incident Response Plan, Disaster Recovery Plan and Business Continuity Plan (as applicable). The Incident Management Policy covers the management of all incidents (adverse, unplanned, material events) of every type. It sets out the overall management activities, the approach to initial triage and assessment, the key management responsibilities, and the links to those plans, which address major incidents and specific event types that exceed a general service desk or minor incident response.

Example Incident Management Policy

Responsibilities

Chief Operating Officer

Responsible for all aspects of the implementation and management of this Incident Management Policy, including its enforcement, revision and communication.

AssuranceLab Management

Managers and supervisors are responsible for the implementation of these arrangements within the scope of their responsibilities and must ensure that all staff under their control understand and undertake their responsibilities accordingly.

All Employees

All employees and contractors are responsible for understanding their role in identifying, reporting, and assisting in the assessment and communication of incidents as these relate to their own role. All employees and contractors should be aware of what constitutes an incident and where it may adversely affect AssuranceLab’s employees, users, third-party suppliers, or its own business operations and objectives.

Incident Identification

Incidents are identified through the following channels, listed in order of preference. Earlier identification, and identification by AssuranceLab employees, is preferred because it helps avoid downstream impacts.

System testing: Bugs and defects identified prior to deployment into production are not considered incidents. However, incidents may be identified during post-implementation review as part of the change and release management process.
System monitoring tools: DataDog, AWS GuardDuty and Pingdom identify system events and indicators that may require further investigation and alert AssuranceLab personnel to them. These alerts may prevent incidents or provide early identification.
Internal user reporting: The development, client services and operations teams, or other users of the system, identify issues and raise them through JIRA Service Desk.
Customer user reporting: Bugs, issues, complaints, and other AssuranceLab failures reported by customers and users of the system. These are raised through JIRA Service Desk.

All events identified through these channels that meet the definition of an incident should be logged into JIRA either automatically or manually.

Incident Classification

Incidents are classified by a Priority rating, which combines the urgency and the impact of the incident. The purpose of classification is to guide AssuranceLab employees through the subsequent steps for handling the incident. Judgement and common sense should be applied where circumstances require, with retrospective reference to the policy to ensure all bases are covered.
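As an illustration of combining urgency and impact into a Priority rating, the following is a minimal sketch. The three-level scale, the scoring rule and the names `Level` and `priority` are assumptions made for the example; the policy itself does not define a matrix, so the thresholds should be replaced with the organisation's own.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-level scale used for both impact and urgency."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> str:
    """Combine impact and urgency into a Priority rating.

    Assumed rule (not taken from the policy): add the two levels and
    map the total, from 2 (most severe) to 6 (least severe), onto P1-P4.
    """
    score = impact + urgency
    if score <= 2:
        return "P1"
    if score == 3:
        return "P2"
    if score == 4:
        return "P3"
    return "P4"


if __name__ == "__main__":
    print(priority(Level.HIGH, Level.HIGH))
    print(priority(Level.MEDIUM, Level.LOW))
```

A lookup of this kind gives a consistent starting point for triage; judgement can still override the result where circumstances require.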

Incident Handling

Incident handling has two key objectives:

Mitigate the immediate impact to reduce the severity of the incident; and
Resolve the underlying cause to remove the impact and prevent recurrence.

Target resolution time

Based on the Priority of the incident, the following target resolution times (service levels) apply. These target timeframes should determine the steps taken, the priority given over other operational duties, and escalation to senior stakeholders, as applicable.

P1 – 4 hours

P2 – 24 hours

P3 – Next sprint

P4 – Add to product backlog for general priority assessment
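The targets above can be sketched as a small lookup that turns a Priority into a resolution deadline. This is an illustrative sketch only: the function name `resolution_deadline` is hypothetical, and the assumption that the clock starts when the incident is logged is not stated in the policy.

```python
from datetime import datetime, timedelta
from typing import Optional

# Time-boxed targets taken from the policy.
TARGET_WINDOWS = {
    "P1": timedelta(hours=4),
    "P2": timedelta(hours=24),
}

# Targets that are handled through planning rather than a fixed deadline.
PLANNING_TARGETS = {
    "P3": "next sprint",
    "P4": "product backlog for general priority assessment",
}


def resolution_deadline(priority: str, logged_at: datetime) -> Optional[datetime]:
    """Return the target resolution time, or None for planning-based targets.

    Assumption: the target window is measured from the time of logging.
    """
    if priority in TARGET_WINDOWS:
        return logged_at + TARGET_WINDOWS[priority]
    if priority in PLANNING_TARGETS:
        return None
    raise ValueError(f"Unknown priority: {priority!r}")


if __name__ == "__main__":
    logged = datetime(2024, 1, 1, 9, 0)
    print(resolution_deadline("P1", logged))
    print(resolution_deadline("P3", logged), PLANNING_TARGETS["P3"])
```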

Response

The response should be determined by the Priority rating and the Type of incident.

P1 – Enact the Disaster Recovery Plan where recovery of system functionality is required, and/or the Business Continuity Plan where operations are halted and AssuranceLab’s reputation is at risk.

P2/P3 – Follow the Incident Response Plan, with reference to the Disaster Recovery Plan if system or data restoration is required. Complete a post-incident review for lessons learned.

P4 – Perform a general service desk and customer support response and raise any system issues into the product backlog for resolution.
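The response rules above could be expressed as a simple routing function, for example to pre-fill a checklist on a new ticket. This is a hedged sketch: the function `response_steps` and its two flags are hypothetical names, and the flags simplify the policy's conditions into yes/no inputs that a person would still need to assess.

```python
from typing import List


def response_steps(
    priority: str,
    needs_recovery: bool = False,
    operations_halted: bool = False,
) -> List[str]:
    """List the plans and steps the policy names for a given Priority.

    needs_recovery: system functionality or data must be restored.
    operations_halted: stands in for the policy's condition that operations
        are halted and reputation is at risk (a simplification).
    """
    if priority == "P1":
        steps = []
        if needs_recovery:
            steps.append("Disaster Recovery Plan")
        if operations_halted:
            steps.append("Business Continuity Plan")
        # The policy names no default plan for a P1 where neither condition
        # holds, so an empty list is returned and a person decides.
        return steps
    if priority in ("P2", "P3"):
        steps = ["Incident Response Plan"]
        if needs_recovery:
            steps.append("Disaster Recovery Plan")
        steps.append("Post-incident review")
        return steps
    if priority == "P4":
        return [
            "Service desk and customer support response",
            "Raise system issues into the product backlog",
        ]
    raise ValueError(f"Unknown priority: {priority!r}")


if __name__ == "__main__":
    for step in response_steps("P2", needs_recovery=True):
        print(step)
```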

Post-Incident Review (PIR)

Complete an assessment of the incident during or after the event to identify the cause, resolution, and any lessons learned that can prevent future occurrence. Refer to the Post-Incident Review template.

Governance

To ensure effective management of incidents, the following governance practices should be performed.

Weekly/monthly operations team meetings to review new and open incidents, those resolved since the previous meeting, and the progress of fixes for past incidents.
Reporting to the Senior Leadership Team on major incidents, key metrics on incidence and resolution times, and a summary of any significant lessons learned.
Annual review of the Incident Management Policy and related plans to ensure lessons learned and practice developments are accurately captured in the documentation and implemented in practice.

AL Refs: INC02, GOV35
