I watched a team demo an AI system to a room full of executives last year. It was genuinely impressive. Natural language queries over internal documents, accurate answers, clean interface. The room was buzzing. A senior leader turned to me and said, “How quickly can we roll this out to 5,000 users?”
My answer was not what they wanted to hear.
That demo was running on a curated dataset of 200 documents. It had been tested by the people who built it. It was hitting an API with no rate limits because there was one user. And the “production timeline” the team quoted assumed nothing would go wrong.
Six months later, the project was still in development. Not because the team was bad. Because production is a fundamentally different problem from a demo.
The Demo Trap
Demos are persuasive because they show the best case. You pick the inputs you know work well. You skip the edge cases. You don’t show what happens when the WiFi drops, when someone asks a question in a format you didn’t anticipate, or when the API provider changes their pricing.
I’ve fallen into this trap myself. Early in my career I demoed a fraud detection model to the business. It performed beautifully on test data. In production, it flagged 40% of legitimate transactions as suspicious. We hadn’t accounted for seasonal purchasing patterns that weren’t in our training data. Embarrassing, but instructive.
The gap between demo and production isn’t a small gap. It’s a canyon.
What Breaks in Production
After shipping maybe two dozen AI systems to production across financial services and healthcare, here’s what reliably goes wrong.
Data quality hits you first. Your training data was clean because someone cleaned it. Production data is messy. Missing fields, inconsistent formats, values that technically pass validation but are clearly wrong. I’ve seen a model fail because a date field sometimes contained the string “TBD.” Nobody tested for that.
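The “TBD” bug above is the kind of thing a ten-line guard catches. Here’s a minimal sketch, assuming Python; the sentinel list and date formats are illustrative, not exhaustive:

```python
from datetime import datetime
from typing import Optional

# Sentinel strings we have seen in "date" fields in the wild.
SENTINELS = {"", "tbd", "n/a", "na", "none", "null", "unknown"}

def parse_date_safe(raw: object) -> Optional[datetime]:
    """Return a datetime, or None for missing or garbage values.

    Values that "technically pass validation" still need this kind
    of guard; the caller decides what to do with None.
    """
    if raw is None:
        return None
    text = str(raw).strip()
    if text.lower() in SENTINELS:
        return None
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%Y%m%d"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None  # unparseable: surface as missing, log upstream
```

The point isn’t this particular function. It’s that the failure mode was never tested for, because nobody imagined a date field containing prose.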
Scale reveals cost problems. That API call costs $0.03 per request. Fine for 100 users. At 50,000 requests per day, you’re spending $1,500 daily, $45,000 a month, on a single feature. CFOs notice. We’ve had to redesign systems from scratch because the unit economics only worked at demo scale.
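That arithmetic is worth encoding before anyone asks about rollout. A sketch of the projection from the paragraph above (the helper name and the 30-day month are my assumptions):

```python
def monthly_api_cost(cost_per_request: float,
                     requests_per_day: int,
                     days_per_month: int = 30) -> float:
    """Project monthly spend for a per-request API."""
    return cost_per_request * requests_per_day * days_per_month

# The numbers from the text: $0.03 per request at 50,000 requests/day.
daily = 0.03 * 50_000                      # roughly $1,500 per day
monthly = monthly_api_cost(0.03, 50_000)   # roughly $45,000 per month
```

Run it at the projected user count before the demo, not after.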
Users are creative in ways you can’t predict. They’ll paste entire spreadsheets into a text input. They’ll ask questions in languages you didn’t plan for. They’ll find the one combination of inputs that causes your system to return confidently wrong answers. Testing with your own team is not the same as testing with real users.
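A cheap first defence is a guard in front of the text endpoint, so a pasted spreadsheet degrades into a clear error instead of a confident wrong answer. A minimal sketch, assuming Python; the limit and error type are illustrative:

```python
MAX_CHARS = 4_000  # hypothetical limit; tune to your model's context window

class InputRejected(ValueError):
    """Raised when a query should be refused rather than answered."""

def sanitize_query(text: str) -> str:
    """Reject or clean the inputs a demo never sees: empty strings,
    control characters, and pasted walls of text."""
    if not isinstance(text, str) or not text.strip():
        raise InputRejected("empty query")
    # Strip control characters that break downstream logging and parsing.
    cleaned = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    if len(cleaned) > MAX_CHARS:
        raise InputRejected(f"query too long ({len(cleaned)} chars)")
    return cleaned.strip()
```

It won’t stop every creative user, but it turns a class of silent failures into loud, loggable ones.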
Integration is where projects go to die. The model works. The API works. But connecting it to your authentication system, your data warehouse, your logging infrastructure, your monitoring stack, and your deployment pipeline takes longer than building the model. This is consistently the most underestimated work. At Westpac, we had a model that was “ready” for five months while the integration work dragged on.
flowchart LR
    subgraph Demo["Demo Environment"]
        D1[Curated data]
        D2[Single user]
        D3[No rate limits]
        D4[Happy path only]
    end
    subgraph Prod["Production Reality"]
        P1[Messy data at scale]
        P2[Auth + security]
        P3[Cost management]
        P4[Monitoring + alerting]
        P5[Integration + deployment]
        P6[Edge cases + errors]
    end
    Demo -->|"Timeline x 3"| Prod
Why the Hard Part Is Never the Model
Here’s something that surprises people outside the field. The model itself, whether classical machine learning or an LLM, is typically 10-20% of the total effort. The rest is everything around it.
Data pipelines that reliably deliver clean data. Feature stores that serve consistent features at low latency. Monitoring that tells you when something drifts. Deployment pipelines that let you roll back quickly. Documentation so the person who inherits the system in 18 months can understand it.
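“Monitoring that tells you when something drifts” can start as one function. Here’s a sketch of the Population Stability Index, a common drift check on a feature’s distribution; the binning scheme and the thresholds in the docstring are widespread conventions, not guarantees:

```python
import math
from typing import Sequence

def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10) -> float:
    """Population Stability Index between a baseline and a live sample.

    Rule of thumb often used: < 0.1 stable, 0.1-0.25 drifting,
    > 0.25 investigate.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def bin_fractions(sample: Sequence[float]) -> list:
        counts = [0] * bins
        for x in sample:
            i = min(int((x - lo) / width), bins - 1)
            counts[i] += 1
        n = len(sample)
        # Floor at a tiny value so the log is defined for empty bins.
        return [max(c / n, 1e-6) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Computed daily against the training baseline and wired to an alert, this single number catches the seasonal-pattern failure I described earlier before customers do.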
At Cochlear, we spent roughly three months on the model for a recent project and nine months on everything else. That ratio isn’t unusual. It’s normal.
The industry focuses on model benchmarks and architecture papers. The real competitive advantage is in the plumbing.
The Timeline Multiplier
Take whatever timeline your team estimates for going from prototype to production. Multiply it by three. You’ll be closer to the truth.
This isn’t because your team is bad at estimating. It’s because prototypes hide complexity. You discover requirements you didn’t know about. Security review takes longer than expected. The data you assumed existed doesn’t, or it exists but in a format that requires significant transformation.
I’ve tracked this across about 15 projects over the past four years. The average actual timeline was 2.8x the original estimate. One project came in at 5x. None came in under 1.5x.
What Actually Helps
A few things have reliably shortened the gap between demo and production for my teams.
Build the deployment pipeline before the model. Start with a simple model, even a rule-based one, and get it deployed end-to-end. Prove that you can get predictions from A to B reliably. Then improve the model. This approach is less exciting but far more effective.
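In practice that means defining the serving contract first and making day one’s “model” a handful of rules behind it. A sketch, assuming Python; the field names and thresholds are illustrative:

```python
from typing import Protocol

class Scorer(Protocol):
    """The contract the pipeline deploys; the model behind it is swappable."""
    def score(self, amount: float, country: str) -> float: ...

class RuleBaseline:
    """Day-one stand-in: hand-written rules, same interface.

    Lets you wire auth, logging, and deployment end-to-end before
    any training happens.
    """
    def score(self, amount: float, country: str) -> float:
        risk = 0.0
        if amount > 10_000:
            risk += 0.5
        if country not in {"AU", "NZ"}:
            risk += 0.3
        return min(risk, 1.0)

def serve(scorer: Scorer, amount: float, country: str) -> dict:
    """The deployment path, identical whether the scorer is rules or ML."""
    return {"risk": scorer.score(amount, country),
            "model": type(scorer).__name__}
```

When the trained model is ready, it implements `score` and slips into a pipeline that has already survived security review, load, and real traffic.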
Test with ugly data early. Don’t wait until production to discover your data quality issues. Get a sample of real, messy, production data and test against it in week one. If you can’t get production data, synthesize the worst cases you can imagine and test against those.
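When production data isn’t available in week one, the worst cases can be written down by hand. A sketch of a hand-rolled nasty-record suite; the records and the helper are hypothetical:

```python
# Worst plausible records, synthesized by hand: sentinel dates,
# formatted numbers, formula injection, control characters.
UGLY_RECORDS = [
    {"amount": "1,000.00", "date": "TBD", "name": "O'Brien"},
    {"amount": "", "date": "31/02/2023", "name": "a" * 10_000},
    {"amount": None, "date": "2023-13-45", "name": "=SUM(A1:A9)"},
    {"amount": "-0", "date": "0000-00-00", "name": "\u0000\u202e"},
]

def survives_ugly_data(process) -> bool:
    """True if `process` handles every record without raising.

    `process` is whatever ingestion function you're hardening.
    This only asserts graceful degradation, not correctness.
    """
    for record in UGLY_RECORDS:
        try:
            process(record)
        except Exception:
            return False
    return True
```

A pipeline that passes this in week one will still be surprised in production, but far less often.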
Budget for the boring stuff. When you write the project plan, allocate 60% of the time and budget to integration, monitoring, documentation, and deployment. If leadership pushes back, show them the data on project timelines. The “boring stuff” is where projects succeed or fail.
Set expectations early with stakeholders. When someone sees a demo and asks for a production timeline, be honest. Walk them through what production requires. Most senior leaders appreciate honesty far more than an optimistic timeline that slips repeatedly.
The Honest Conversation
I’ve started being much more direct with stakeholders about this gap. When a team demos something impressive, I’ll say: “This is great work. Now let’s talk about what production looks like.” Then we walk through data quality, scale, cost, security, monitoring, and support.
It’s a less exciting conversation. But it’s the one that actually matters. The organisations that ship reliable AI systems are the ones having this conversation early, not the ones chasing the high of the demo.