The Postmortem That Changed Nothing

Every team I work with does postmortems. Most of them are written, filed, linked in a Confluence index, and never opened again. Then, eighteen months later, the same incident happens. Different on-call engineer, same root cause, same surprised faces in the retro.

We did not fail to write the document. We failed to change the system. That gap is the most expensive habit in incident response, and almost nobody talks about it honestly.

This post is about why postmortems so often produce paper instead of change, what they are actually for, and the small set of habits that turn an incident review into a system that learns.

The postmortem promise vs. the postmortem reality

The promise is clean. Something broke, we gather the team, we write down what happened, we extract action items, we get better. Each incident is a free lesson the system pays for once.

The reality, in most organisations I have walked into:

  • The document is the deliverable. Once it is written, the work feels done.
  • The action items are vague (“improve monitoring”, “more training”) and have no owner, no deadline, and no place in the next sprint.
  • Nobody reads old postmortems before writing a new one. The pattern across incidents is invisible.
  • “Blameless” gets misread as “consequence-less for the system.” We protect the engineer (good) and accidentally protect the broken process they collided with (bad).
  • A senior person edits the timeline into a clean narrative. The rough edges, the moments of confusion, the bad decisions made with good information - all the parts you would actually learn from - get smoothed away.

The output looks like learning. It is, in fact, a ritual. The team feels better. The system is unchanged. Six quarters later, the same outage is back with a new ticket number.

What a postmortem is actually for

It is worth being honest about the job.

A postmortem is not for assigning blame. We agree on that. The harder thing to admit is that it is also not for the people in the room. The people in the room already lived it. They know what happened.

A postmortem is for the system. Specifically, for three audiences:

  • Future on-call engineers who will hit something nearby and need to know the shape of the failure mode, not just the eventual fix.
  • The next planning cycle, where a real action item competes for engineering time against a backlog full of features. If the action item is not in that competition, it does not exist.
  • The pattern across incidents, which only emerges when someone reads five postmortems in a row and notices that three of them blame “DNS” and two of them blame “the same retry library.” The individual postmortems cannot find this. The meta-review can.

If your postmortem does not serve at least one of those audiences, it is a journal entry. There is nothing wrong with journaling. It just is not how organisations learn.

Why most postmortems change nothing

Five patterns, in roughly the order I see them.

The document is the deliverable. The Confluence page gets written, reviewed, polished, linked. The action items at the bottom sit there in their little table, owned by “TBD” or by a team rather than a person. The doc is filed. The team rotates onto features. Three months later nobody can find it without searching for the date.

Action items are too vague to be done. “Improve monitoring” cannot be completed. “Add an SLO on checkout-service p99 latency, alerting at 500ms over 5m, owned by the payments team, in next sprint” can be completed. The first one is what ends up in the postmortem. The second one is what would actually get done, and it is almost never what gets written down.

Action items live in the wrong system. A list in the postmortem doc is not a list of work the team has agreed to do. Work the team has agreed to do lives in the same place all other work lives. If your action items are not in Jira, Linear, GitHub issues, or whatever your team actually uses, they will not get done, no matter how senior the person who wrote them down.

5-whys becomes performance. Done well, 5-whys is a tool to push past the first plausible explanation. Done badly, it becomes “let us write five lines of plausible-sounding causes to satisfy the template.” The fifth line in most postmortems I have read is some variant of “we did not have a process for this.” That is not a root cause. That is a placeholder.

Nobody is reading the back catalogue. The new incident is treated as if it happened in a clean room. The team that gets paged at 3am does not look at the four previous related postmortems, because nobody has indexed them by failure mode. They reinvent half the diagnosis. They probably write half the same action items, which will also not get done.

The common thread: the postmortem succeeds as a document and fails as an intervention.

What turns postmortems into change

Here is the short version of what I have seen actually work. None of it is novel. Most of it is routinely skipped.

1. A timeline that is rough on purpose

The timeline is the most important part of the document, and the part most often sanitised. Resist the urge to clean it up.

You want the messy version. The 3:14 message where someone says “I think it might be the cache, let me check” and is wrong. The 3:22 confusion about which dashboard is the real dashboard. The 3:37 decision to roll back that, in retrospect, was correct but was made for the wrong reason. Those moments are the artefacts. They show how the system looked to the humans inside it, which is what determines whether the next on-call will recognise the shape of the failure.

A clean timeline is a press release. You are not writing a press release.

2. At most three action items, each one a real ticket

The default in most templates is a table of N action items, where N is “however many sounded reasonable in the meeting.” This is a mistake. The team has a fixed amount of attention. If you produce twelve action items, you have produced zero, because nobody will prioritise them and they will rot.

Pick the three changes that would have meaningfully altered this incident. Maybe two. Sometimes one. Each one becomes a ticket in the same system that holds all other work, with:

  • A named owner (a person, not a team).
  • A target sprint or week.
  • A clear acceptance criterion: not “investigate”, not “consider”, not “explore”. A specific thing that will be true when this is done.

If you cannot find three changes that meet this bar, ship one and be honest that the others are “things to remain aware of,” not work the team is committing to.
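
To make that bar concrete, here is a minimal sketch of a check you could run over action items before the postmortem is accepted. Everything in it is an assumption for illustration: the `ActionItem` fields, the `VAGUE_VERBS` list, and the `problems` helper are hypothetical names, not part of any tracker's API.

```python
from dataclasses import dataclass

# Hypothetical list of verbs that signal an open-ended item rather than a finishable one.
VAGUE_VERBS = ("investigate", "consider", "explore", "improve")


@dataclass
class ActionItem:
    title: str
    owner: str        # a person, not a team
    target: str       # sprint or week, in whatever format the team uses
    acceptance: str   # the specific thing that will be true when this is done


def problems(item: ActionItem, known_teams: set[str]) -> list[str]:
    """Return the reasons this item is not yet a real ticket (empty list means ok)."""
    found = []
    if not item.owner or item.owner.upper() == "TBD":
        found.append("no named owner")
    elif item.owner in known_teams:
        found.append("owner is a team, not a person")
    if not item.target:
        found.append("no target sprint or week")
    words = item.acceptance.lower().split()
    if not words:
        found.append("no acceptance criterion")
    elif words[0] in VAGUE_VERBS:
        found.append(f"acceptance criterion starts with '{words[0]}'")
    return found
```

An item like “Improve monitoring”, owned by “TBD” with no target, comes back with a list of complaints. An item with a named person, a sprint, and a testable criterion comes back empty. The check is crude on purpose: it only catches the obvious failures, and the obvious failures are most of them.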

3. “What would have caught this earlier?” beats “what was the root cause?”

Root-cause framing is seductive and often misleading. Complex systems fail through chains of conditions, not single causes. The more useful question is forward-looking: at what point would a different signal, control, or default have stopped this from becoming an incident?

This question maps cleanly to the production-safety stack I have written about elsewhere. The answers cluster:

  • “An SLO on this signal would have paged us before the customer noticed.”
  • “A canary on the deploy would have caught the bad config in the first 1%.”
  • “A feature flag on this code path would have let us turn it off instead of rolling back.”
  • “A circuit breaker on this dependency would have degraded gracefully instead of cascading.”

Those answers turn into real action items. “We did not understand the system” does not.
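
As an illustration of how small one of these controls can be, here is a rough sketch of the circuit-breaker idea from the list above. It is a teaching sketch under my own assumptions (the `CircuitBreaker` class, its thresholds, and the `fallback` callable are all hypothetical), not a substitute for a hardened library.

```python
import time


class CircuitBreaker:
    """After max_failures consecutive errors, skip the dependency for reset_after seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the behaviour can be tested
        self.failures = 0
        self.opened_at = None       # None means the breaker is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()   # open: do not touch the dependency at all
            self.opened_at = None   # cool-down elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()       # degrade instead of propagating the error
        self.failures = 0           # any success resets the count
        return result
```

The point is not this particular implementation. It is that “degrade gracefully instead of cascading” can be written as an acceptance criterion: when the dependency fails repeatedly, callers get the fallback and the dependency stops receiving traffic.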

4. The meta-postmortem nobody runs

Once a quarter, have somebody read every postmortem from the last three months in one sitting and look for patterns. Not to assign blame. To find the failure modes that are showing up more than once.

This is the highest-leverage incident-review activity I know about, and almost no team does it. The patterns are obvious if you read them as a batch and invisible if you read them one at a time. Three incidents involving the same flaky dependency. Two incidents where the on-call engineer had been pulled in from another team and didn’t know the runbook. Four incidents that happened on Friday afternoons. You will not see those by reading one postmortem at a time.

The output of the meta-postmortem is not another document. It is one or two structural changes proposed to engineering leadership, with the pattern that justifies them.
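
If your postmortems carry even rough failure-mode tags, the first pass of that batch read can be mechanical. A minimal sketch, assuming each postmortem is a dict with a `tags` list (a hypothetical shape; the tag names in the usage note are invented for illustration):

```python
from collections import Counter


def recurring_patterns(postmortems, min_count=2):
    """Return (tag, count) pairs for tags seen in at least min_count postmortems."""
    counts = Counter()
    for pm in postmortems:
        # Count each tag once per postmortem, even if it was listed twice.
        counts.update(set(pm["tags"]))
    return sorted(
        ((tag, n) for tag, n in counts.items() if n >= min_count),
        key=lambda pair: (-pair[1], pair[0]),  # most frequent first, then by name
    )
```

Feed it a quarter's worth of postmortems tagged with things like "dns", "retry-library", or "friday-deploy", and the tags that appear in more than one incident rise to the top. This does not replace the human read. It tells the reader where to look first.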

5. The doc is not the deliverable. The closed ticket is.

A postmortem with all action items closed is a successful postmortem. A postmortem with three open action items in month four is a failed postmortem, no matter how well-written the doc is.

Track this. Some teams I have seen put a small dashboard on the wall: incidents this quarter, action items open vs. closed, oldest unclosed action item. It looks dorky. It also closes the gap between “we wrote a postmortem” and “we changed the system.”
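
The numbers on that dashboard are cheap to compute once action items live in a real tracker. A sketch under stated assumptions: the `Item` record and `wall_stats` function are hypothetical, you would populate the records from an export of whatever system you use, and the incident count here only covers incidents that produced at least one action item.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Item:
    incident: str                  # incident identifier the action item came from
    opened: date
    closed: Optional[date] = None  # None while the action item is still open


def wall_stats(items, today):
    """Summarise action items: incident count, open vs. closed, age of oldest open item."""
    open_items = [i for i in items if i.closed is None]
    oldest = min(open_items, key=lambda i: i.opened, default=None)
    return {
        "incidents": len({i.incident for i in items}),
        "open": len(open_items),
        "closed": len(items) - len(open_items),
        "oldest_open_days": (today - oldest.opened).days if oldest else None,
    }
```

The last number is the one that matters most. An oldest-open age that keeps climbing quarter after quarter is the gap between the doc and the change, stated in days.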

The honest version of “blameless”

Blameless culture exists because shame and fear destroy honest information flow. That is real, and abandoning it is not what I am suggesting.

But the version of blameless I see most often goes one step further than the original idea and lands somewhere unhelpful. It treats every contributing factor as if it had no agent. The deploy “went out.” The alert “didn’t fire.” The runbook “wasn’t followed.”

That is not blameless. That is passive voice as a defence mechanism.

The version that works keeps the protection for the person and removes it from the process. We do not blame the engineer who ran the bad command. We absolutely interrogate the system that let the bad command run without a confirmation prompt, against a production database, at 11pm, with no second pair of eyes. Those are different things. Conflating them is how organisations get stuck.

The short version

If your team writes postmortems that nobody reads and ships action items that nobody closes, you do not have a postmortem culture. You have a postmortem ritual. The ritual is comforting. It is also why the same incident keeps coming back.

The fix is not a better template. It is:

  • A timeline rough enough to be useful.
  • At most three action items, each one a real ticket with a named owner.
  • A forward-looking question, not a root-cause hunt.
  • A quarterly read of the back catalogue to find the patterns nobody notices in isolation.
  • A definition of done that is “the action item shipped,” not “the doc was published.”

That is the boring version. It is also the version that lets the system actually learn.

If your team is running postmortems that are not changing anything - or you can feel an incident pattern in your stack that you cannot quite name - let’s talk.