Overview

Teams running tech services today are expected to maintain 24/7 availability.

When something goes wrong, whether it's an outage or a broken feature, team members need to respond immediately and restore service. This process is called incident management, and it’s an ongoing, complex challenge for companies big and small.

We want to help teams everywhere improve their incident management. Inspired by teams like Google, we've created this handbook as a summary of Atlassian's incident management process. These are the lessons we've learned responding to incidents for more than a decade. While it’s based on our unique experiences, we hope it can be adapted to suit the needs of your own team.

Our incident values

A process for managing incidents can't cover all possible situations, so we empower our teams with general guidance in the form of values. Similar to Atlassian's company values, our incident values guide how we act at each stage of an incident:

Stage: 1. Detect
Incident value: Atlassian knows before our customers do
Related Atlassian value: Build with Heart and Balance
Rationale: A balanced service includes enough monitoring and alerting to detect incidents before our customers do. The best monitoring alerts us to problems before they even become incidents.

Stage: 2. Respond
Incident value: Escalate, escalate, escalate
Related Atlassian value: Play, as a Team
Rationale: Nobody likes being woken up and we don’t take the responsibility lightly. But people understand that occasionally they will be woken for an incident where it turns out they aren't needed. What’s usually harder is waking up to a major incident and playing catch-up when you should have been alerted earlier. We won't always have all the answers, so "don't hesitate to escalate."

Stage: 3. Recover
Incident value: Shit happens, clean it up quickly
Related Atlassian value: Don't !@#$ the Customer
Rationale: Our customers don't care why their service is down, only that we restore service as quickly as possible. Never hesitate to get an incident resolved quickly, so that we can minimise the impact on our customers.

Stage: 4. Learn
Incident value: Always Blameless
Related Atlassian value: Open Company, No Bullshit
Rationale: Incidents are part of running services. We improve services by holding teams accountable, not by apportioning blame.

Stage: 5. Improve
Incident value: Never have the same incident twice
Related Atlassian value: Be the Change You Seek
Rationale: Identify the root cause and the changes that will prevent the whole class of incident from occurring again. Commit to delivering specific changes by specific dates.

Tooling requirements

The incident management process described here uses several tools that are specific to Atlassian and can be substituted as needed:

Incident tracking

Every incident is tracked as a Jira issue, with a follow-up issue created to track the completion of postmortems. The process in this handbook references our heavily customized version of Jira Software, which inspired the creation of Jira Ops. As such, the process doesn't exactly match the functionality available in Jira Ops today.

Incident issues are typically created by a support engineer in response to a customer ticket or by a developer recognizing a monitoring alert as being an incident. We urge people to create an issue if they're worried about something, rather than wait to escalate it.

In Jira, we have a simple workflow to track incidents through the resolution stage and to record all important actions taken during the incident response.
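For teams substituting their own tooling, the tracking idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the status names (`OPEN`, `FIXING`, `RESOLVED`), the `Incident` class, and the way the follow-up postmortem issue is represented are all hypothetical, and they do not describe our actual Jira workflow or any Jira API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional, Tuple


class Status(Enum):
    # Hypothetical status names; the handbook does not list the real workflow states.
    OPEN = "open"
    FIXING = "fixing"
    RESOLVED = "resolved"


# Allowed forward transitions in this illustrative workflow.
TRANSITIONS = {
    Status.OPEN: {Status.FIXING, Status.RESOLVED},
    Status.FIXING: {Status.RESOLVED},
    Status.RESOLVED: set(),
}


@dataclass
class Incident:
    summary: str
    assignee: str  # the person currently driving the incident
    status: Status = Status.OPEN
    log: List[Tuple[datetime, str]] = field(default_factory=list)
    postmortem_issue: Optional[str] = None  # stands in for the follow-up issue

    def record(self, action: str) -> None:
        """Append a timestamped entry so important actions are not lost."""
        self.log.append((datetime.now(timezone.utc), action))

    def transition(self, new_status: Status) -> None:
        """Move to a new status, rejecting moves the workflow does not allow."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(
                f"cannot move from {self.status.value} to {new_status.value}"
            )
        self.record(f"status: {self.status.value} -> {new_status.value}")
        self.status = new_status
        if new_status is Status.RESOLVED:
            # Resolution triggers a follow-up item for the postmortem.
            self.postmortem_issue = f"Postmortem for: {self.summary}"
```

A caller would create an `Incident`, call `record()` for each notable action, and call `transition()` as the response progresses. Keeping the action log on the incident itself mirrors the goal of having one place that records what happened and when.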

Incident manager

Each incident is driven by the incident manager (IM), who has overall responsibility and authority for the incident. This person is indicated by the assignee on the incident issue. The incident manager is empowered to take any action necessary to resolve the incident, which includes paging anyone in the organization and keeping those involved in an incident focused on restoring service as quickly as possible.

The incident manager is a role, rather than an individual on the incident. The advantage of defining roles during an incident is that it allows people to become interchangeable. As long as a given person knows how to perform a certain role, they can take that role for any incident.
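The separation between a role and the person holding it can be made concrete with a small sketch. Everything here is an assumption made for illustration: the `Person` and `IncidentRoles` classes, the `trained_roles` field, and the role name string are hypothetical and are not part of any Atlassian tool.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Person:
    name: str
    # Roles this person knows how to perform (hypothetical field).
    trained_roles: Set[str] = field(default_factory=set)


@dataclass
class IncidentRoles:
    # Maps a role name to the person currently holding it on this incident.
    holders: Dict[str, Person] = field(default_factory=dict)

    def assign(self, role: str, person: Person) -> None:
        """Give a role to anyone trained for it, replacing the current holder."""
        if role not in person.trained_roles:
            raise ValueError(f"{person.name} is not trained for role {role!r}")
        self.holders[role] = person


# Usage: a handover swaps the holder without changing the role itself.
alice = Person("Alice", {"incident manager"})
bob = Person("Bob", {"incident manager"})
roles = IncidentRoles()
roles.assign("incident manager", alice)
roles.assign("incident manager", bob)  # Bob takes over from Alice
```

Because the incident refers to the role rather than to a named individual, a handover is a single reassignment, and anyone who knows the role can take it for any incident.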