Every team I talk to is building an LLM platform right now. Some call it that. Some call it “the AI gateway”, “the prompt service”, “the agent stack”. The shape is the same: a layer between application teams and a handful of model providers, plus retrieval, plus evals, plus a slowly growing pile of glue.
I have seen this movie. We made it five years ago and called it the ML platform. Most of the mistakes that made those platforms painful between 2018 and 2022 are quietly being reintroduced now, with bigger bills, more vendors, and less institutional memory.
This post is about the mistakes we already paid for once, the LLM-shaped form they are taking the second time around, and the small set of decisions that prevent the cycle from repeating.
The promise vs. the reality
The promise of an LLM platform is clean. Application teams build features. The platform team handles models, prompts, retrieval, evals, cost, observability, governance. Everyone moves faster.
The reality, in most companies I have walked into in the last twelve months:
- Three or four LLM “platforms” inside the same org, none aware of the others.
- Prompts living in source code, in feature flags, in a Notion page, and in a CMS owned by marketing.
- Token spend doubling every quarter with no way to attribute it to a feature, a team, or a customer.
- Inference latency tuned by guesswork because nobody owns capacity planning.
- A vector database that started life as an experiment and is now load-bearing.
- “Evals” that consist of a senior engineer reading ten outputs every two weeks.
Each of those is a mistake an ML platform made first.
The mistakes, and their LLM costumes
1. The data is the system, but only the code is versioned
The ML platform mistake: every training pipeline ran on a snapshot of the warehouse, and the snapshot was not versioned. Six months later, “rerun this experiment” meant “you can’t”. Reproducibility was a press release, not a property.
The LLM version: the prompt is in Git, but the retrieval corpus is whatever it happened to be at 03:14 on a Tuesday. The function-calling schema lives in a different repo on a different release cycle. An unpinned alias like gpt-4o can be re-pointed by the provider to a newer snapshot without any change on your side. When a regression lands, “what did we actually send the model” has no answer.
You do not need a vector-database time machine to fix this. You need to log the full request envelope (prompt, retrieved chunks, tools, model ID, parameters) for a meaningful sample of production traffic, and you need to be able to replay it. That is one weekend of work and it pays back forever.
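As a rough illustration of the envelope-and-replay idea, here is a minimal Python sketch. Every name in it (RequestEnvelope, EnvelopeLog, replay, the sample_rate parameter) is hypothetical, the log is an in-memory list standing in for whatever storage you actually use, and call_model is a placeholder for your own provider call.

```python
import hashlib
import json
import random
from dataclasses import asdict, dataclass, field
from typing import Callable, Optional


@dataclass
class RequestEnvelope:
    """Everything that was sent to the model for one request (hypothetical shape)."""
    model_id: str
    prompt: str
    retrieved_chunks: list = field(default_factory=list)
    tools: list = field(default_factory=list)
    parameters: dict = field(default_factory=dict)

    def fingerprint(self) -> str:
        # Stable hash of the envelope, so identical requests can be grouped.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


class EnvelopeLog:
    """Keeps a sample of production envelopes as JSON lines (in memory here)."""

    def __init__(self, sample_rate: float, rng: Optional[random.Random] = None):
        self.sample_rate = sample_rate
        self._rng = rng or random.Random()
        self.lines = []

    def record(self, envelope: RequestEnvelope, response: str) -> bool:
        # random() is in [0, 1), so a rate of 1.0 keeps everything and 0.0 nothing.
        if self._rng.random() >= self.sample_rate:
            return False
        self.lines.append(json.dumps({
            "fingerprint": envelope.fingerprint(),
            "envelope": asdict(envelope),
            "response": response,
        }, sort_keys=True))
        return True


def replay(line: str, call_model: Callable[[RequestEnvelope], str]) -> tuple:
    """Re-send a logged envelope and return (original_response, new_response)."""
    record = json.loads(line)
    envelope = RequestEnvelope(**record["envelope"])
    return record["response"], call_model(envelope)
```

Replaying a logged line against a candidate prompt or model gives you a before-and-after pair to compare, which is the question a regression investigation usually starts with.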
2. Notebook-to-prod through tribal knowledge
The ML platform mistake: a data scientist handed a Jupyter notebook to a platform engineer, who reimplemented it in a “production” language with subtle differences nobody could see. Six months later, the model in prod was not the model that was trained.
The LLM version: a prompt engineer or product manager hands over a Python script with three layers of LangChain, a custom retrieval function, and a regex for output parsing. The platform team reimplements it as a “real service”. The reimplementation drops two edge cases nobody documented. Quality regresses. Nobody is sure when.
The fix is the same one it always was. Do not let the production path diverge from the development path. Whatever runs the prompt in the notebook should be the same library, calling the same provider client, with the same parsing, as the thing in production. If you are tempted to “productionize” a script, you have already lost.
3. Cost showed up eighteen months late
The ML platform mistake: GPU spend went vertical, FinOps arrived a year and a half after the spend did, and by then the bills were so entangled across teams that attribution was a forensic project, not a query.
The LLM version is unfolding right now. Token spend doubles, then doubles again. The CFO asks who is spending it. The answer is “the AI team”, which is everyone. There is no per-feature, per-tenant, per-request cost tag because nobody put one in.
The fix is boring and cheap. Every LLM call goes through one client, yours, not the vendor’s. Every call carries a small structured tag: team, feature, tenant, request type. Costs roll up by tag. If you do this on day one, you will be having a different conversation in twelve months than the rest of the industry.
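A minimal sketch of what "one client, with tags" could look like in Python. The names (CallTag, TaggedClient, spend_by) are hypothetical, the prices in PRICE_PER_1K are placeholders rather than any vendor's real rates, and the send callable stands in for whichever provider SDK you wrap.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CallTag:
    """The small structured tag every call carries."""
    team: str
    feature: str
    tenant: str
    request_type: str


# Placeholder prices in USD per 1,000 tokens: (input, output). Not real rates.
PRICE_PER_1K = {"example-model": (0.5, 1.5)}


class TaggedClient:
    """Wraps a provider call so no request goes out without a tag."""

    def __init__(self, send: Callable[[str, str], tuple]):
        # `send(model, prompt)` is assumed to return (text, tokens_in, tokens_out).
        self._send = send
        self._spend = defaultdict(float)

    def complete(self, model: str, prompt: str, tag: CallTag) -> str:
        text, tokens_in, tokens_out = self._send(model, prompt)
        price_in, price_out = PRICE_PER_1K[model]
        cost = tokens_in / 1000 * price_in + tokens_out / 1000 * price_out
        self._spend[tag] += cost
        return text

    def spend_by(self, dimension: str) -> dict:
        """Roll spend up by one tag field, e.g. 'team' or 'feature'."""
        totals = defaultdict(float)
        for tag, cost in self._spend.items():
            totals[getattr(tag, dimension)] += cost
        return dict(totals)
```

Because the tag is a required argument, an untagged call fails at the call site instead of surfacing as an unattributable line on the invoice.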
4. Latency was an afterthought until it was an outage
The ML platform mistake: models worked fine on a laptop, then could not hit p99 in production because nobody had thought about inference servers, batching, or autoscaling until launch week.
The LLM version: a chain-of-three-prompts feature is fine in dev with one user. In production, with concurrent traffic, the third prompt waits behind a rate limit, the retry storm hits the vector DB, and the page that loads in two seconds for the demo loads in twelve for paying customers.
LLM latency is not your model provider’s problem to solve; it is yours. That means measuring p50/p95/p99 of the user-perceived operation (not the model call), deciding how much of that end-to-end budget each hop in a chain is allowed to spend, and treating streaming as a UX tool, not a performance fix.
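One way to make the per-hop budget concrete, sketched in Python under stated assumptions: LatencyBudget, hop, and percentiles are hypothetical names, the budgets are numbers you choose yourself, and the percentile helper uses the standard library's statistics.quantiles rather than a real metrics backend.

```python
import time
from contextlib import contextmanager
from statistics import quantiles


class LatencyBudget:
    """Tracks time spent per hop of one user-facing operation."""

    def __init__(self, total_ms: float, hop_budgets_ms: dict, clock=time.perf_counter):
        if sum(hop_budgets_ms.values()) > total_ms:
            raise ValueError("hop budgets exceed the end-to-end budget")
        self.total_ms = total_ms
        self.hop_budgets_ms = dict(hop_budgets_ms)
        self.spent_ms = {}
        self._clock = clock  # injectable so the class can be tested without sleeping

    @contextmanager
    def hop(self, name: str):
        start = self._clock()
        try:
            yield
        finally:
            elapsed_ms = (self._clock() - start) * 1000
            self.spent_ms[name] = self.spent_ms.get(name, 0.0) + elapsed_ms

    def over_budget(self) -> list:
        """Names of hops that spent more than they were allotted."""
        return [name for name, spent in self.spent_ms.items()
                if spent > self.hop_budgets_ms.get(name, 0.0)]


def percentiles(samples_ms: list) -> dict:
    """p50/p95/p99 of end-to-end samples (needs at least two samples)."""
    cuts = quantiles(samples_ms, n=100)  # 99 cut points; cuts[k - 1] is the k-th percentile
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

In use, each step of a chain would run inside `with budget.hop("retrieve"):` or similar, and the samples fed to percentiles would be the full user-perceived duration rather than the model call alone.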
5. Observability bolted on after the fact
The ML platform mistake: inference logging was a footnote. Drift went undetected for months because nobody had logged the inputs the model was actually seeing in production, and nobody had decided what “the same” meant for them.
The LLM version: the only thing logged is the final string the user saw. Not the system prompt version. Not the retrieved chunks. Not the tool calls. When a customer says “this got worse last week”, you have no way to know whether the prompt changed, the retrieval index changed, the model rolled, or the user’s question changed.
This is a monitoring post in disguise. The principle does not change because the workload is “AI”: pick the small set of signals that would actually tell you something is wrong, log them on purpose, and look at them regularly. The novelty is in what you log. The discipline is the same one we already failed at twice.
6. The feature store became the feature swamp
The ML platform mistake: feature stores were built on the promise of reuse. In practice they became ungoverned dumping grounds where no one could tell which features were live, which were stale, and which were one experiment from 2019 that nobody had the authority to delete.
The LLM version is the prompt library, the agent registry, and the RAG corpus, all sliding into the same fate. Three teams have a “summarise this document” prompt. None of them know about the others. Two of them are slightly wrong in slightly different ways. The corpus has documents from a customer who churned in 2024.
You cannot fix this with tooling alone. You need an owner. A platform that has no human accountable for cleaning the swamp is going to grow a swamp. Pick someone. Give them the authority to delete things. The boring part is the whole job.
7. Two parallel CI/CDs that never converge
The ML platform mistake: ML had its own deployment pipeline that did not talk to the rest of engineering. Code review, security scanning, change management, rollback, all reinvented, badly, in a separate stack.
The LLM version: prompts deploy through a UI. Agent configs deploy through a different UI. Model selections live in a third place. The rest of the application has a CI pipeline that does none of these. Nobody can answer “what changed in production today”.
If your engineering culture is “everything ships through Git”, do not make an exception for prompts. The convenience of editing prompts in a vendor’s web UI is not worth the auditability you give up. Put them in code, version them, deploy them through the same pipeline as everything else, even if the pipeline has to gain a new step.
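The "new step" in the pipeline can be small. The Python sketch below assumes prompts are plain format strings kept in a module; PromptTemplate, check_prompts, and the example prompt are all hypothetical, and the check shown is only one of many a team might want.

```python
import hashlib
from dataclasses import dataclass
from string import Formatter


@dataclass(frozen=True)
class PromptTemplate:
    """A prompt kept in source control next to the code that uses it."""
    name: str
    template: str
    variables: tuple

    @property
    def version(self) -> str:
        # Content hash: changes whenever the template text changes.
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()[:12]

    def placeholders(self) -> set:
        # Formatter.parse yields (literal, field_name, format_spec, conversion).
        return {name for _, name, _, _ in Formatter().parse(self.template) if name}

    def render(self, **values) -> str:
        return self.template.format(**values)


def check_prompts(prompts: list) -> list:
    """CI-style check: unique names, declared variables match the template."""
    errors = []
    seen = set()
    for prompt in prompts:
        if prompt.name in seen:
            errors.append(f"duplicate prompt name: {prompt.name}")
        seen.add(prompt.name)
        if prompt.placeholders() != set(prompt.variables):
            errors.append(f"{prompt.name}: declared variables do not match template")
    return errors


SUMMARISE = PromptTemplate(
    name="summarise",
    template="Summarise the following document in {max_sentences} sentences:\n{document}",
    variables=("document", "max_sentences"),
)
```

Logging the version hash with each request is what later lets you answer whether the prompt changed between two dates.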
8. Governance was a registry, not a lifecycle
The ML platform mistake: there was a model registry. There was no model retirement. Models stayed in production long after the team that built them had moved on, becoming the most expensive kind of legacy code: the kind nobody admits is legacy.
The LLM version is already here. Prompts that nobody owns. Function tools wired to deprecated internal APIs. Model versions pinned to a snapshot the provider has announced they are sunsetting. A registry full of entries, a lifecycle nobody runs.
Governance is a process, not a table. Decide, in writing, who is allowed to deprecate a prompt, on what cadence you review what is in production, and what happens when a model is sunset. If you do not, your LLM platform will accumulate the same dead weight your ML platform did.
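The written policy still needs something that runs on a schedule. A minimal Python sketch of such a check follows; RegistryEntry, lifecycle_findings, and the 90-day and 30-day thresholds are assumptions for illustration, not a recommended policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class RegistryEntry:
    """One prompt, tool, or pinned model version in the registry."""
    name: str
    kind: str                      # "prompt", "tool", or "model"
    owner: Optional[str]
    last_reviewed: date
    sunset_on: Optional[date] = None  # provider- or team-announced end date


def lifecycle_findings(
    entries: list,
    today: date,
    review_every: timedelta = timedelta(days=90),    # assumed cadence
    sunset_warning: timedelta = timedelta(days=30),  # assumed lead time
) -> list:
    """Return human-readable findings for entries that need a decision."""
    findings = []
    for entry in entries:
        if not entry.owner:
            findings.append(f"{entry.name}: no owner")
        if today - entry.last_reviewed > review_every:
            findings.append(f"{entry.name}: review overdue")
        if entry.sunset_on is not None and entry.sunset_on - today <= sunset_warning:
            findings.append(f"{entry.name}: sunset on {entry.sunset_on.isoformat()}")
    return findings
```

The output is only useful if it lands in front of someone with the authority to act on it; the script is the reminder, not the process.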
The mistake under the mistakes
Look at that list. Versioning, handover, cost, latency, observability, registries, pipelines, governance. None of these are AI problems. They are platform engineering problems. We solved them, imperfectly, the first time around. We are about to discover we forgot the answers.
The reason is structural. LLM work, in most companies, is sitting outside the platform team. It is owned by an “AI team” or a product team with a budget, building on top of provider SDKs, sometimes in a language the platform team does not even use. The platform group is asked to help six months in, after the shape has hardened and the bills have started to scare someone.
That gap is the mistake under the mistakes. Every specific failure in the list above gets worse the longer the LLM stack lives outside the platform.
What doing it differently looks like
Not a green-field rebuild. The lessons are smaller than that. In the engagements where I have seen LLM work go well, the same four habits show up:
- One client, one envelope. All LLM traffic goes through one internal client. Every call carries the same structured metadata. Cost, latency, and tracing fall out of this for free.
- The prompt is code. Prompts live in source control, are reviewed like code, deploy like code, and roll back like code. No separate UI lifecycle.
- Log the full request, sample the full response. Enough to replay any production interaction within reason. Storage is cheap; the day you need it, this is the only thing that matters.
- One person who can say no. A named owner for the LLM platform with the authority to deprecate prompts, retire tools, and sunset models. Not a committee. A person.
None of this is novel. Three of the four are things a competent platform team already does for every other workload. The novelty is admitting that LLMs are not special enough to deserve an exception, and not boring enough to be ignored.
The shorter version
The LLM platform is a platform. It will reward the same habits every other platform rewards, and punish the same neglect. The companies that come out of the next two years with a working LLM stack will not be the ones that adopted the most agent frameworks or the trendiest vector database. They will be the ones that treated this workload like every other one they have ever shipped.
We have done this before. We know how it ends if we do not.
If your team is building on top of LLMs and the platform underneath them is starting to look more like an experiment than a service, get in touch. Most of the engagements I do here start with a 30-minute conversation about what is actually broken, not what is on the slide.