TL;DR - I added a “compare two athletes” feature to a side project. The chart was an afternoon. The honesty took the rest of the week. A rank only means something relative to who else was in the field - 1st out of 8 is not 1st out of 30 - so a naive overlay of two athletes’ placings is a confident lie. The feature I shipped refuses to declare a winner unless the two actually competed in the same event and the same category, and when they never met it says so, in a box, on purpose. This is the same discipline I keep writing about for platforms: a number out of context is dashboard theatre, and the honest move is to surface your own uncertainty instead of hiding it behind a clean line.
I run inpedana.com, a live dashboard for Italian rhythmic gymnastics results, and it is my standing boring-stack case study: Flask, SQLite, HTMX, a VM. The single most requested feature was some version of “put two girls side by side.” Coaches want it, parents want it, the athletes themselves want it. The question underneath is always the same: who is doing better?
That question is a trap, and building the feature honestly meant learning exactly where the trap is.
The naive comparison is a confident lie
The obvious build is a line chart: x-axis is time, y-axis is finishing position, one line per athlete, done by lunch. Chart.js, an array of {date, rank} per athlete, ship it.
Here is what that actually produces on inpedana:

A few deliberate choices are already in there. The y-axis is reversed, because in a placing lower is better and a human reading a chart expects “up” to mean “good”; 1st place sits at the top. The x-axis is the union of every athlete’s competition dates, not one athlete’s, so the lines share a real timeline; where an athlete didn’t compete the line spans the gap rather than inventing a zero. Hover a point and you get the placing and the cohort size - 3rd out of 24 - because that second number is the whole story.
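Those choices can be sketched on the server side as follows. This is a minimal illustration, not the code inpedana runs: the input shape, the function names, and the `cohortLabels` key (meant to be read by a tooltip callback) are assumptions made for the example, and the `reverse` and `spanGaps` options assume Chart.js 3 or later. It also assumes at most one result per athlete per date.

```python
from datetime import date


def ordinal(n):
    """1 -> '1st', 2 -> '2nd', 3 -> '3rd', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def build_chart_series(results_by_athlete):
    """results_by_athlete: {name: [(date, rank, cohort_size), ...]} (hypothetical shape)."""
    # Shared timeline: the union of every athlete's competition dates.
    timeline = sorted({d for rows in results_by_athlete.values() for d, _, _ in rows})
    datasets = []
    for name, rows in results_by_athlete.items():
        by_date = {d: (rank, size) for d, rank, size in rows}
        datasets.append({
            "label": name,
            # None marks "did not compete"; never invent a zero.
            "data": [by_date[d][0] if d in by_date else None for d in timeline],
            # The cohort size travels with every point: "3rd out of 24".
            "cohortLabels": [
                f"{ordinal(by_date[d][0])} out of {by_date[d][1]}" if d in by_date else None
                for d in timeline
            ],
            "spanGaps": True,  # draw the line across the missing dates
        })
    return {
        "labels": [d.isoformat() for d in timeline],
        "datasets": datasets,
        # Reversed y-axis: 1st place sits at the top.
        "options": {"scales": {"y": {"reverse": True}}},
    }


series = build_chart_series({
    "Athlete A": [(date(2024, 1, 10), 1, 8)],
    "Athlete B": [(date(2024, 1, 10), 3, 24), (date(2024, 2, 5), 9, 40)],
})
print(series["datasets"][0]["data"])
```

Athlete A has no result on the second date, so that slot is `None` and the line simply spans it.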
And that is where the naive version falls apart. Two lines crossing looks like one athlete overtaking another. But a 1st place in a regional heat of 8 and a 9th place in a national final of 40 are not on the same axis at all, even though the chart cheerfully draws them one grid line apart. The line chart implies a comparison the data does not support. It is the visualization equivalent of a green dashboard over a system that is quietly on fire: it looks like an answer, which is worse than looking like a question.
The honest unit is the head-to-head
The only place a placing is directly comparable to another placing is inside the same field: the same event and the same category (in the sport’s terms, the same fascia). Same start list, same judges, same day. There, finishing ahead of someone means exactly what it looks like it means. Everywhere else, “ahead” is an artefact of who happened to show up.
So the real feature is not the overlay. It is the table of genuine encounters underneath it, and it is one SQL query plus a grouping step in Python doing the actual thinking:
    def head_to_head_events(conn, athlete_ids):
        """Direct encounters: events + fascia where two or more of the
        compared athletes actually competed in the same cohort. A rank is
        only meaningful within its fascia (1st of 8 != 1st of 30), so a
        genuine head-to-head is the same event_id AND the same fascia -
        the only place where placing one above the other means anything."""
        # One "?" per athlete id, so the ids are bound as parameters
        placeholders = ",".join("?" * len(athlete_ids))
        # Rows are read by column name below, e.g. conn.row_factory = sqlite3.Row
        rows = conn.execute(f"""
            SELECT r.event_id, r.fascia, e.name AS event_name, e.event_date,
                   r.athlete_id, r.rank, r.total
            FROM result r JOIN event e ON e.id = r.event_id
            WHERE r.athlete_id IN ({placeholders})
            ORDER BY e.event_date DESC
        """, list(athlete_ids)).fetchall()
        groups = {}
        for row in rows:
            key = (row["event_id"], row["fascia"])  # the honesty is in this key
            # {...}: per-cohort record (fields elided) holding a "results" dict
            groups.setdefault(key, {...})["results"][row["athlete_id"]] = row
        # keep only cohorts where 2+ of them actually lined up together
        return [g for g in groups.values() if len(g["results"]) >= 2]
The entire correctness of the feature lives in one line: the grouping key is (event_id, fascia), not event_id. Group by event alone and you’d count two athletes who were in the same building on the same day but in different categories as having “faced each other.” They didn’t. The fascia in the key is the difference between a fact and a vibe.
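To make the difference concrete, here is a small self-contained sketch against an in-memory SQLite database. The trimmed-down `result` table, the athlete ids, the fascia labels "A" and "B", and the `encounters` helper with its `key_fields` parameter are all invented for the illustration; only the grouping idea comes from the function above.

```python
import sqlite3


def encounters(conn, athlete_ids, key_fields=("event_id", "fascia")):
    """Group results by key_fields; keep groups where 2+ athletes appear."""
    placeholders = ",".join("?" * len(athlete_ids))
    rows = conn.execute(
        f"SELECT event_id, fascia, athlete_id, rank FROM result "
        f"WHERE athlete_id IN ({placeholders})",
        list(athlete_ids),
    ).fetchall()
    groups = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        groups.setdefault(key, {})[row["athlete_id"]] = row["rank"]
    return {k: v for k, v in groups.items() if len(v) >= 2}


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows row["event_id"] style access
conn.execute(
    "CREATE TABLE result (event_id INTEGER, fascia TEXT, athlete_id INTEGER, rank INTEGER)"
)
conn.executemany("INSERT INTO result VALUES (?, ?, ?, ?)", [
    (1, "A", 10, 1),  # event 1: same building, different categories
    (1, "B", 20, 9),
    (2, "A", 10, 3),  # event 2: same category, a real encounter
    (2, "A", 20, 2),
])

honest = encounters(conn, [10, 20])
naive = encounters(conn, [10, 20], key_fields=("event_id",))
print(len(honest), len(naive))
```

Grouping by `(event_id, fascia)` finds one encounter, at event 2. Grouping by `event_id` alone reports two, and the extra one sets a 1st place against a 9th place from a different category.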
The result renders as the only table on the page I fully trust:

One row per real encounter, each athlete’s placing and score, the better placing of the row highlighted. No inference, no interpolation. If you want to argue about who is better, argue here, because this is the only view where the numbers were produced under the same conditions.
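The highlighting rule for a row is small enough to sketch in full. The `{athlete_id: rank}` shape and the function name are assumptions for the example, and highlighting every athlete on a tie is a choice made here, not something the site is known to do.

```python
def better_placing(results):
    """results: {athlete_id: rank} for one encounter. Returns the ids to highlight."""
    best = min(results.values())  # in a placing, lower is better
    return {athlete_id for athlete_id, rank in results.items() if rank == best}


print(better_placing({10: 3, 20: 2}))
```

Because both ranks come from the same event and the same fascia, a plain `min` is all the logic the row needs.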
When they never met, say so
Most of the time, two athletes a user picks have met plenty. Sometimes they never have: different regions, different levels, different years. The naive feature has no concept of this and will happily draw two lines and let you conclude whatever you like. The honest feature has to do the hardest thing a piece of software can do, which is admit it doesn’t know.

The chart still renders, because the individual trajectories are real and interesting. But the head-to-head table is replaced by a plain notice: these two never competed in the same category, so the chart above is indicative only - placings from different fields are not directly comparable. That box is the most important UI on the page. It is the feature refusing to answer the question the user asked, because answering it would require lying.
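The decision behind that box might look like the sketch below. The function name, the context keys, and the exact notice wording are assumptions for the example; on the real site the branch is a conditional in the template.

```python
NEVER_MET_NOTICE = (
    "These two never competed in the same category, so the chart above is "
    "indicative only: placings from different fields are not directly comparable."
)


def comparison_context(head_to_head):
    """Build template context: show the table, or decline to compare, on purpose."""
    if head_to_head:
        return {"head_to_head": head_to_head, "notice": None}
    # No shared (event, fascia) cohort: an explicit notice, not an empty table.
    return {"head_to_head": [], "notice": NEVER_MET_NOTICE}


print(comparison_context([])["notice"])
```

The point of returning an explicit `notice` rather than an empty list alone is that the template can never confuse "no data loaded" with "these athletes never met".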
This is the part that took the week. Not the query, not the chart. Deciding that the correct behaviour of a comparison feature is, sometimes, to decline to compare - and then designing the empty state so that declining reads as rigour rather than a bug.
Why a gymnastics side project is a platform lesson
I keep writing about the same failure mode in different clothes, and this is it again. A DORA dashboard trending up while nothing ships. A staging environment green while production is nothing like it. A comparison chart crossing while the two data points were never on the same axis. In every case a clean visual is standing in for a claim the underlying data cannot support, and the artefact’s polish is exactly what makes it dangerous. The platform work is not producing more dashboards. It is making the systems honest about what they actually know.
The tell of an honest system is that it has a well-designed “I don’t know.” It degrades to a caveat instead of a confident wrong answer. It puts the cohort size next to the rank. It refuses to group two athletes who were never in the same field. Most tools skip this because the honest version is more work and looks less impressive in a demo - a caveat box never won a screenshot in a QBR. But the caveat box is the product. It is the difference between a tool a coach can trust and a toy that generates plausible nonsense.
It helps enormously that this runs on a boring stack. The whole “honesty engine” is one query over a SQLite file, a grouping key in Python, and a conditional in a Jinja template. No pipeline, no feature store, no service to explain why the comparison microservice is returning stale cohorts. When the logic is one query you can hold in your head, you can afford to spend your thinking on whether the answer is true rather than on whether the plumbing is up. That is the entire argument for keeping the stack small: it buys back the attention you then get to spend on being right.
If you are building a platform, an internal tool, or a dashboard anyone makes decisions from, and you want the honest version - the one that surfaces its own uncertainty instead of laundering it into a clean chart - that is the kind of work I do.