...

  • Guide autonomous decision-making by people and teams in incidents and postmortems. 

  • Build a consistent culture across teams for how we identify, manage, and learn from incidents.

  • Align teams on the attitude they should bring to each part of incident identification, resolution, and reflection.

1. Detect

  • Incident value: Atlassian knows before our customers do
  • Related Atlassian value: Build with Heart and Balance
  • Rationale: A balanced service includes enough monitoring and alerting to detect incidents before our customers do. The best monitoring alerts us to problems before they even become incidents.

2. Respond

  • Incident value: Escalate, escalate, escalate
  • Related Atlassian value: Play, as a team
  • Rationale: Nobody likes being woken up and we don’t take the responsibility lightly. But people understand that occasionally they will be woken for an incident where it turns out they aren't needed. What’s usually harder is waking up to a major incident and playing catch-up when you should have been alerted earlier. We won't always have all the answers, so "don't hesitate to escalate."

3. Recover

  • Incident value: Shit happens, clean it up quickly
  • Related Atlassian value: Don't !@#$ the Customer
  • Rationale: Our customers don't care why their service is down, only that we restore service as quickly as possible. Never hesitate in getting an incident resolved quickly so that we can minimise impact to our customers.

4. Learn

  • Incident value: Always Blameless
  • Related Atlassian value: Open Company, No Bullshit
  • Rationale: Incidents are part of running services. We improve services by holding teams accountable, not by apportioning blame.

5. Improve

  • Incident value: Never have the same incident twice
  • Related Atlassian value: Be the change you seek
  • Rationale: Identify the root cause and the changes that will prevent the whole class of incident from occurring again. Commit to delivering specific changes by specific dates.

Tooling requirements

The incident management process described here uses several tools that are specific to Atlassian and can be substituted as needed:

...