Every engineering team I work with answers “yes, we have monitoring” before I’ve finished asking the question. Then I ask the follow-up: the last time a customer hit a bug before you did, why?
If you can’t answer that quickly, you don’t have monitoring. You have dashboards.
This is the distinction worth getting right, because it determines whether you find out about problems on Tuesday afternoon or from a customer email at 2 a.m. on Saturday.
Monitoring vs. observability
These terms get used interchangeably, and they shouldn't be.
- Monitoring answers questions you already knew to ask. CPU above 80%. Error rate above 1%. Disk above 90%. You knew these mattered, so you set alerts. Monitoring catches the known unknowns.
- Observability is the ability to ask new questions of your system without shipping new code. Why are 3% of European users seeing latency spikes on Tuesday afternoons? You didn’t predict that question - but if your system emits rich enough data, you can answer it. Observability catches the unknown unknowns.
Both matter. Most teams have decent monitoring and zero observability. The tell: when something weird happens, the only move is to ship a new metric or a new log statement and wait for the next incident.
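One cheap way to buy observability is wide, structured events: one JSON record per request, carrying every attribute you can attach, shipped to wherever your logs already go. A minimal sketch using only the standard library - the field names and the do_work handler are illustrative, not a prescribed schema:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("events")

def do_work() -> int:
    return 200  # stand-in for your real handler

def handle_request(user_id: str, region: str, endpoint: str) -> None:
    """Serve a request and emit one wide, structured event describing it."""
    start = time.monotonic()
    status = do_work()
    log.info(json.dumps({
        "event": "http_request",
        "request_id": str(uuid.uuid4()),
        "user_id": user_id,       # high-cardinality fields are the point:
        "region": region,         # they let you slice by questions you
        "endpoint": endpoint,     # didn't plan for (EU users, Tuesdays, ...)
        "status": status,
        "duration_ms": round((time.monotonic() - start) * 1000, 2),
    }))

handle_request("u_123", "eu-west-1", "/checkout")
```

The question about European latency spikes on Tuesday afternoons becomes a filter on region, day of week, and duration - no new code required.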
The three pillars (and the layer above them)
You probably know the pillars:
- Metrics - numeric, aggregated, cheap to store. Best for low-cardinality dashboards and alerts on rate/saturation. Prometheus is the open-source default. (A minimal instrumentation sketch follows this list.)
- Logs - high-cardinality, expensive at scale. Best for forensic detail per request or per event. Loki, Elasticsearch, or one of the managed log services.
- Traces - per-request flow through your services. Best for “why was this specific request slow?” Tempo, Jaeger, or a managed APM (Datadog, New Relic, Honeycomb).
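As a sketch of the metrics pillar, this is roughly what instrumenting a service with the Python prometheus_client library looks like; the metric names and the /checkout endpoint are placeholders, not a recommended taxonomy:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metrics: cheap, aggregated, ideal for dashboards and rate/saturation alerts.
REQUESTS = Counter("http_requests_total", "Requests served", ["endpoint", "code"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["endpoint"])

def handle(endpoint: str) -> None:
    with LATENCY.labels(endpoint).time():      # records duration into the histogram
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    REQUESTS.labels(endpoint, "200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle("/checkout")
```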
What gets missed: there’s a fourth layer above the pillars - SLOs. Service Level Objectives are the bridge between operational metrics and business outcomes. They tell you not “what’s broken” but “are we delivering on the promise our customers care about?”
A team with no SLOs is reactive. A team with SLOs has a contract - internally with product, externally with users - and an error budget that turns reliability into a tradeable currency.
The Four Golden Signals
Google’s SRE book made these famous, and they hold up. For every user-facing service, monitor at minimum:
- Latency - how long requests take. Track successful and failed latency separately; failed requests are often suspiciously fast.
- Traffic - requests per second, by endpoint and by status code. The shape of traffic tells you a lot before any individual metric goes red.
- Errors - explicit error rates, but also implicit ones: a 200 OK with "error": true in the body is easy to miss (a sketch of recording these signals follows the list).
- Saturation - how full your resources are. CPU, memory, connection pools, queue depths. Saturation tends to precede everything else.
For infrastructure rather than services, the equivalent is USE (Utilization, Saturation, Errors) per resource. Same idea, different layer.
Anti-patterns that look like monitoring
I see the same five mistakes in nearly every engagement:
1. Vanity dashboards
A wall of green graphs that nobody looks at unless something has already gone wrong. If a dashboard doesn’t drive a decision, delete it. The signal-to-noise ratio of your observability stack is a real engineering concern.
2. Alert fatigue
Every alert that fires and gets ignored trains your team to ignore alerts. The right rule: every alert that pages a human must be actionable, urgent, and tied to a user-visible symptom. Everything else goes to a ticket queue or to nowhere. A noisy on-call rotation is a leading indicator of an incident waiting to happen.
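The rule is mechanical enough to encode. A toy sketch of the routing decision - the Alert fields are illustrative, not a real alerting API:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    actionable: bool     # a human can do something useful about it right now
    urgent: bool         # it cannot wait for business hours
    user_visible: bool   # customers feel it

def route(alert: Alert) -> str:
    """Page only when an alert is actionable, urgent, and user-visible."""
    if alert.actionable and alert.urgent and alert.user_visible:
        return "page"
    if alert.actionable:
        return "ticket"
    return "drop"  # if nobody can act on it, it should not exist
```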
3. Monitoring the wrong thing
CPU is a symptom, not a cause. Pages on CPU are pages on the wrong layer. Page on user-visible symptoms - latency, error rate, request failures - and use the resource metrics for root-cause analysis once a page fires.
4. No SLOs
Without SLOs, you can’t have a productive conversation about reliability investment. Every reliability ask becomes “we need it to be more reliable” without a number. With SLOs, the conversation is “we promised 99.9%, we’re at 99.4%, here’s the error budget, here are the choices.” Different conversation entirely.
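The arithmetic behind that sentence fits in a few lines. A sketch using the figures above, over a 30-day window:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a given SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

budget = error_budget_minutes(0.999)  # 43.2 minutes allowed at 99.9%
spent = error_budget_minutes(0.994)   # 259.2 minutes of downtime at 99.4% actual
print(f"budget={budget:.1f} min, spent={spent:.1f} min, {spent / budget:.1f}x over")
# budget=43.2 min, spent=259.2 min, 6.0x over
```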
5. Trusting the dashboard over the customer
Status page green, customer email red, who’s right? The customer, always. The mismatch is your real problem - your monitoring is measuring the wrong thing. Synthetic checks from the customer’s perspective (real-user monitoring, browser tests, geo-distributed probes) close this gap.
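A single probe is a few lines of code; the value is in running it on a schedule, from outside your own network, from more than one region. A minimal sketch - the URL, the body check, and the use of the requests package are placeholders for whatever your stack provides:

```python
import time

import requests  # any HTTP client works; requests is assumed here

def probe(url: str, timeout: float = 5.0) -> dict:
    """One customer's-eye check: did the request succeed, and how long did it take?"""
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=timeout)
        ok = resp.status_code == 200 and '"error": true' not in resp.text
    except requests.RequestException:
        ok = False
    return {"url": url, "ok": ok, "latency_ms": round((time.monotonic() - start) * 1000, 1)}

# Run from several regions on a schedule and alert on these results,
# not on the internal dashboard.
print(probe("https://example.com/health"))
```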
How to test your monitoring
The only way to know if your monitoring works is to break something on purpose. Run a small chaos exercise on a non-critical service (a minimal fault-injection sketch follows the list):
- Inject latency on a downstream call.
- Drop 5% of requests.
- Saturate a connection pool.
- Take a single replica down.
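You don't need chaos tooling to start; wrapping one downstream call is enough for a first exercise. A minimal sketch of the latency and drop items - the wrapped get_user call and the default rates are hypothetical:

```python
import random
import time

def with_faults(call, drop_rate: float = 0.05, added_latency_s: float = 0.3):
    """Wrap a downstream call so a small fraction fails and the rest get slower."""
    def wrapped(*args, **kwargs):
        if random.random() < drop_rate:
            raise ConnectionError("chaos: injected failure")  # the dropped 5%
        time.sleep(added_latency_s)                           # injected latency
        return call(*args, **kwargs)
    return wrapped

# During the exercise, on a non-critical service only:
# get_user = with_faults(get_user)
```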
Watch what fires, and how fast. If the right alert reached the right person quickly enough to protect your SLO, your monitoring works. If a customer would have noticed first, it doesn't.
Do this quarterly. Make it boring. The teams that do this catch surprising amounts of broken monitoring before the surprise costs them a real incident.
Open-source or managed?
Both work. The right question isn’t “Prometheus vs. Datadog” but “what’s the total cost of running this?”
- Self-hosted (Prometheus, Loki, Tempo, Grafana) - zero per-metric cost, but you operate the stack. At low scale, it’s cheap and pleasant. At high scale, it’s a full-time team.
- Managed (Datadog, New Relic, Honeycomb, Grafana Cloud) - per-host or per-event pricing that can produce surprising bills as you grow. You don't operate the stack. You do still own integration, dashboards, and alert hygiene.
A reasonable default: self-host for metrics if you’re already running Kubernetes (Prometheus is essentially free in that environment). Use a managed APM for traces - the operational cost of running your own distributed-tracing stack is higher than people expect.
A 30-day plan
If your monitoring needs work, here’s a sequence that gives you visible progress quickly:
- Week 1: Audit existing alerts. Delete or downgrade anything not tied to a user-visible symptom. Aim for fewer than 10 paging alerts per service.
- Week 2: Pick your three most important user journeys. Write one SLO per journey (latency, availability, or both). Pick a 28- or 30-day window.
- Week 3: Wire those SLOs into a dashboard. Hook the error budget burn rate to a paging alert (a burn-rate sketch follows this list).
- Week 4: Run a chaos exercise on a non-critical service. Document everything that didn’t fire when it should have. Fix it.
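For week 3, the paging signal to wire up is burn rate: how fast the error budget is being consumed relative to what the SLO allows. A sketch of the decision in plain Python - the window sizes and thresholds echo the common multi-window pattern (roughly 14x over one hour, 6x over six hours) and are starting points, not gospel; the error ratios would come from your metrics backend:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being consumed."""
    budget_ratio = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget_ratio

def should_page(err_1h: float, err_6h: float, slo: float = 0.999) -> bool:
    """Page on a fast burn over a short window or a sustained burn over a longer one."""
    return burn_rate(err_1h, slo) > 14 or burn_rate(err_6h, slo) > 6

print(should_page(err_1h=0.02, err_6h=0.004))  # True: a 20x burn over the last hour
```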
That’s 30 days of work that materially changes how your team experiences production.
If you’d like help running this in your environment, that’s a thing I do.