DORA Metrics Without the Dashboard Theatre

TL;DR - The four DORA metrics are good. What most teams do with them is theatre: a dashboard that trends up and to the right while nothing about how software actually ships changes. The metrics are a thermometer, not a treatment. Measuring deployment frequency doesn’t make you deploy more often any more than weighing yourself makes you thinner. This post is the four metrics read honestly, the four ways teams quietly game them, and the single question that separates a real DORA practice from a screenshot in the quarterly review: what did this number cause us to change?

The dashboard versus the practice

I like the DORA metrics. After years of teams arguing about productivity with no shared language, four numbers that correlate with actually shipping good software is a gift. The platform engineering pillar names them as how you know the work is paying off, and I stand by that.

The problem isn’t the metrics. The problem is what happens to a metric the moment it goes on a dashboard and gets a leadership audience. It stops being a diagnostic and becomes a performance. The line has to go up because the QBR is on Thursday, and the fastest way to make a line go up is rarely the same as the work the line was supposed to measure. This is the same failure mode I wrote about in the postmortem that changed nothing: the artefact gets produced, the ritual gets performed, and the underlying thing the artefact was meant to improve stays exactly where it was.

The four metrics, read honestly

The four metrics split into two pairs, and the whole point is the tension between the pairs. Velocity without stability is just shipping bugs faster. Stability without velocity is a system so frozen nothing can break because nothing moves. A real DORA practice watches the balance, not any single number.

  • Deployment frequency - how often you ship to production. High is good, but only paired with the stability metrics. A team deploying fifty times a day with a 30% change-failure rate is not elite, it’s reckless with good tooling.
  • Lead time for changes - the time from commit to production. This is the one I trust most, because it’s the hardest to fake and it captures the whole pipeline: review, CI, staging, approvals, the lot. Lead time is where your golden path earns its keep, or doesn’t.
  • Change failure rate - the percentage of deploys that cause an incident. This is where deployment frequency gets its conscience. It’s also the number most quietly fudged, because what counts as a failure is a definition you control.
  • Time to restore - how fast you recover when a deploy breaks. Pairs with change failure rate to answer the only question users care about: when it goes wrong, how long am I living with it?
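To make the definitions above concrete, here is a minimal sketch of computing all four from one list of deploy records. The `Deploy` record and its field names are hypothetical, not any real tool's schema, and time to restore is approximated as starting at the deploy itself, which assumes users are affected from the moment the bad change ships.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    # Hypothetical record shape; adapt the fields to whatever your tools export.
    committed_at: datetime                  # first commit of the change
    deployed_at: datetime                   # change reached production
    caused_incident: bool = False           # per your agreed definition of "failure"
    restored_at: Optional[datetime] = None  # when users stopped being affected


def hours(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def dora_summary(deploys: list[Deploy], window_days: int) -> dict:
    """Compute the four metrics over one reporting window."""
    if not deploys:
        raise ValueError("no deploys in window")
    failures = [d for d in deploys if d.caused_incident]
    lead_times = [hours(d.committed_at, d.deployed_at) for d in deploys]
    # Assumption: user impact starts at the deploy that caused the incident.
    restores = [hours(d.deployed_at, d.restored_at) for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "median_restore_hours": median(restores) if restores else None,
    }


t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Deploy(t0, t0 + timedelta(hours=4)),
    Deploy(t0, t0 + timedelta(hours=8), caused_incident=True,
           restored_at=t0 + timedelta(hours=10)),
    Deploy(t0, t0 + timedelta(hours=6)),
    Deploy(t0, t0 + timedelta(hours=2)),
]
summary = dora_summary(sample, window_days=2)
```

The point of returning all four from one function is that nobody gets to look at the velocity pair without the stability pair sitting next to it.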

Notice none of these is “lines of code”, “story points”, or “commits per engineer”. DORA’s quiet genius is that all four are outcomes - they measure the system’s behaviour, not the humans’ activity. The moment you start optimising the humans’ activity to move them, you’ve broken the instrument.

[Figure: the four metrics as two pairs in tension]

The four ways teams game them

Every one of these I’ve watched happen, usually without anyone deciding to be dishonest. That’s the thing about dashboard theatre - it’s rarely fraud. It’s a team responding rationally to “make this number go up” instead of “make this thing better”.

  • Redefining “deploy”. Deployment frequency is low, so the definition of a deploy expands. A config tweak is a deploy. Re-running the pipeline is a deploy. Suddenly you’re “elite” and you ship to actual users exactly as often as before. The number moved; nothing shipped.
  • Redefining “failure”. Change failure rate is embarrassing, so the bar for “failure” rises. A rollback isn’t a failure if we caught it in canary. A 3am page isn’t a failure if there was no customer ticket. Narrow the definition far enough and your change-failure rate is a beautiful, meaningless zero.
  • Restarting the clock. Time to restore looks bad, so “restore” becomes “mitigated” becomes “the alert stopped firing”. You measure how fast the dashboard went green, not how fast the user stopped suffering. These are not the same event, and the gap between them is exactly the part worth measuring.
  • Optimising the measurable slice. Lead time is measured from merge, so all the cost migrates upstream of the merge - into a two-week review queue and a manual change-approval board that the metric can’t see. The visible pipeline is lightning fast. The actual commit-to-customer time is worse than before.
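The last two of these are easy to see in numbers. The sketch below compares the window the dashboard measures with the window the user lives through, for both lead time and time to restore. All function names, field names, and timestamps are made up for illustration.

```python
from datetime import datetime


def hours_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def lead_time_views(first_commit_at: datetime, merged_at: datetime,
                    deployed_at: datetime) -> dict:
    """Lead time as the dashboard sees it (from merge) versus from first commit."""
    measured = hours_between(merged_at, deployed_at)
    actual = hours_between(first_commit_at, deployed_at)
    return {"measured": measured, "actual": actual, "hidden": actual - measured}


def restore_views(impact_start: datetime, alert_fired: datetime,
                  alert_cleared: datetime, impact_end: datetime) -> dict:
    """Restore time by the alert's clock versus the user's clock."""
    measured = hours_between(alert_fired, alert_cleared)
    actual = hours_between(impact_start, impact_end)
    return {"measured": measured, "actual": actual, "hidden": actual - measured}


# A change that waited two weeks in review, then deployed an hour after merge.
lead = lead_time_views(
    first_commit_at=datetime(2024, 3, 1, 10, 0),
    merged_at=datetime(2024, 3, 15, 10, 0),
    deployed_at=datetime(2024, 3, 15, 11, 0),
)

# An incident where users hurt before the alert fired and after it cleared.
restore = restore_views(
    impact_start=datetime(2024, 3, 20, 2, 0),
    alert_fired=datetime(2024, 3, 20, 2, 20),
    alert_cleared=datetime(2024, 3, 20, 2, 50),
    impact_end=datetime(2024, 3, 20, 3, 30),
)
```

With these illustrative timestamps the dashboard reports a one-hour lead time while 336 hours sit out of view, and a thirty-minute restore against ninety minutes of actual user impact. The `hidden` value is the part worth putting on the chart.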

The pattern underneath all four is Goodhart’s law: when a measure becomes a target, it stops being a good measure. A dashboard with a leadership audience and no curiosity behind it is a target factory. It manufactures targets faster than it manufactures improvement.

What a real DORA practice looks like

The difference between theatre and practice is not the dashboard. You can have the exact same four charts in both. The difference is what the numbers are for.

In a theatre, the number is the deliverable. Someone gets the metrics green, screenshots them, and the work is “done” - in precisely the way a developer portal can look done while helping zero developers. In a practice, the number is a question. A bad lead time isn’t a red cell to explain away; it’s a prompt to go find the slowest stage and fix it. The metric is the start of the work, not the end of it.

A few things that, in my experience, tell the two apart:

  • Every metric has an owner who can change it. A change-failure rate nobody can act on is decoration. If the team watching the number has no authority over the pipeline, the review process, or the deploy mechanism, you’re measuring a team’s weather, not its work.
  • You instrument the metric, not the human. Pull deployment frequency from the deploy system, lead time from Git and CI timestamps, restore time from the incident tool. The second a number depends on someone remembering to log their activity, it’s measuring compliance with the logging, not the thing.
  • You read the trend, never the snapshot. A single quarter’s numbers are noise dressed as signal. The question is never “is lead time good?” - it’s “is lead time better than last quarter, and did something we did cause that?” If you can’t name the change, you don’t have a practice, you have a chart.
  • The four improve together or you ask why. Deployment frequency up while change failure rate is also up is not a win, it’s a warning. The pairs are a system of checks. A practice treats a divergence as the interesting signal; theatre celebrates whichever half looks good and crops out the rest.
  • You measure restore from the user’s clock. Start the timer when the user starts hurting, stop it when they stop. Not when you noticed, not when the alert cleared. This is the same discipline as real observability versus a wall of dashboards: the question is what the system did to the people using it, not what your tooling found convenient to record.
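The trend and divergence checks above can be automated in a few lines. This is a rough sketch, assuming you already have quarterly values for the four metrics. The `Quarter` type, the comparison rules, and the sample numbers are my own illustration, not anything DORA prescribes.

```python
from dataclasses import dataclass


@dataclass
class Quarter:
    # Hypothetical snapshot of the four metrics for one quarter.
    deploys_per_day: float
    lead_time_hours: float
    change_failure_rate: float
    restore_hours: float


def divergence_warnings(prev: Quarter, curr: Quarter) -> list[str]:
    """Flag quarters where one pair improved while the other got worse."""
    velocity_up = (curr.deploys_per_day > prev.deploys_per_day
                   or curr.lead_time_hours < prev.lead_time_hours)
    stability_up = (curr.change_failure_rate < prev.change_failure_rate
                    or curr.restore_hours < prev.restore_hours)
    warnings = []
    if velocity_up and curr.change_failure_rate > prev.change_failure_rate:
        warnings.append("velocity improved but change failure rate rose")
    if velocity_up and curr.restore_hours > prev.restore_hours:
        warnings.append("velocity improved but time to restore rose")
    if stability_up and curr.deploys_per_day < prev.deploys_per_day:
        warnings.append("stability improved but deployment frequency fell")
    return warnings


prev = Quarter(deploys_per_day=2.0, lead_time_hours=30.0,
               change_failure_rate=0.10, restore_hours=3.0)
curr = Quarter(deploys_per_day=5.0, lead_time_hours=20.0,
               change_failure_rate=0.18, restore_hours=3.0)
warnings = divergence_warnings(prev, curr)
```

A warning here is not a verdict. It is the prompt to go and find out which change caused the split.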

The one question

If you keep one thing from this, keep the question that ends every honest metrics review:

What did this number cause us to change?

If the answer is “nothing, but it’s green”, you don’t have a DORA practice. You have wallpaper that happens to be made of metrics. Green-and-nothing-changed is not success; it’s the absence of curiosity, and it’s indistinguishable from a team that stopped looking.

If the answer is “we found the two-week review queue hiding behind a fast pipeline, and we killed it” - or “we discovered our restore time was measuring the dashboard, not the user, so we re-instrumented it and it doubled, and that’s the real number” - then the metric did its job. It’s a thermometer that prompted a treatment. That’s the whole game.

DORA didn’t promise you a scoreboard. It gave you four honest questions about how your software reaches the people who use it. Put them on a dashboard if you like - but the dashboard is where the work starts, not where it gets to look finished.


If you’re standing up platform metrics and want them to drive change instead of decorate a slide - or you’ve got four green numbers and a nagging sense that nothing actually got faster - that’s the kind of problem I help teams untangle. Work with me.