
The Pilot-to-Production Chasm

The overwhelming majority of AI models that 'work' in a pilot never reach production. The standard explanation is technical debt. The actual explanation is an execution gap that follows predictable organisational patterns. This paper maps the chasm between AI pilot and production deployment, identifies the failure layers that pilots are designed to avoid, and provides a diagnostic for organisations stuck in perpetual pilot mode.

Mal Wanstall · 11 min read

The Observation

Every large organisation I talk to has AI pilots. Most of them are described as successful. And yet the conversion rate from successful pilot to production system remains stubbornly low. The commonly cited figure is that somewhere around 85-90% of AI models never make it to production. The exact number varies depending on who’s doing the citing, but the pattern is consistent enough that it should be setting off alarms.

It isn’t.

Instead, organisations keep launching more pilots. The logic seems to be that if you run enough experiments, some will stick. This is the data equivalent of buying lottery tickets as a retirement strategy. It mistakes volume for capability.

The conventional explanation for the pilot-to-production failure rate is technical. The model needs re-engineering for scale. The data pipeline isn’t production-grade. The infrastructure can’t handle real-time inference. These explanations are real, but they’re the surface layer of a deeper problem. The deeper problem is organisational, and it follows the same diagnostic patterns I’ve documented in the strategy-execution gap.

AI pilots succeed because they sidestep the hard organisational problems. Production AI fails because it can’t.

Why Pilots Succeed (and Why It Doesn’t Matter)

A well-run AI pilot has several structural advantages that have nothing to do with the quality of the model:

Curated data. The data science team hand-selects and prepares the training data. They know its quirks and compensate for them. In production, the model needs to consume from whatever Tier 1 foundational data products the organisation actually has. Not from a dataset that took a specialist three weeks to clean by hand.

Dedicated attention. Pilots get the best people, working full-time on a single problem. Production systems need to be maintained by operational teams who are also responsible for dozens of other things and who weren’t part of building the model in the first place.

Forgiving success criteria. A pilot that improves on the baseline by any measurable amount is declared a success. A production system needs to be reliable, explainable, monitored, and maintained indefinitely. The gap between “it beat the baseline in a controlled test” and “it operates reliably at scale under real-world conditions” is enormous.

Absent governance. Pilots typically run outside normal governance, procurement, and security processes. This gets presented as agility. What it actually means is that the pilot never had to navigate the organisational machinery that production deployment requires.

The result is a form of selection bias that the organisation mistakes for capability. The pilot proved the model works. What it didn’t prove, and what it was never designed to prove, is that the organisation can operate it.

The Chasm

The gap between pilot and production is not something you bridge with better MLOps tooling, although that helps. It’s a chasm with distinct layers, each representing a category of organisational problem that pilots are specifically designed to avoid.

flowchart TB
    P["Pilot Success\nCurated data, dedicated team,\nforgiving success criteria"]
    L1["Layer 1: Data Foundation Gap\nProduction data ≠ pilot data"]
    L2["Layer 2: Operational Ownership Gap\nNo one owns it in production"]
    L3["Layer 3: Governance and Risk Gap\nCompliance wasn't part of the pilot"]
    L4["Layer 4: Process Integration Gap\nBusiness processes haven't changed\nto use the model's output"]
    PR["Production Operation\nReliable, governed, integrated\ninto business process"]
    P --> L1 --> L2 --> L3 --> L4 --> PR

Layer 1: The Data Foundation Gap

This is the most common failure layer and the most predictable. The pilot used a curated, static dataset. Production requires live data flowing through organisational data products that may not exist, may not be at the required quality, or may not have the governance needed to be trusted at scale.

I’ve seen this repeatedly: a model that performs brilliantly on a hand-prepared training set degrades significantly when it meets real production data. Duplicates, missing fields, inconsistent formats, stale records. The things a data scientist can work around in a notebook become showstoppers in a pipeline.
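The difference can be made concrete with a data-quality gate at the pipeline boundary. The sketch below is illustrative, not drawn from any specific deployment: the field names, the staleness cutoff, and the failure categories are assumptions, but they correspond to the duplicate, missing-field, and stale-record problems described above, which a notebook quietly absorbs and a production pipeline must count and act on.

```python
# Hypothetical sketch: a minimal data-quality gate at the pipeline boundary.
# Field names, thresholds, and categories are illustrative assumptions.

REQUIRED_FIELDS = {"customer_id", "tenure_months", "last_active"}

def quality_report(records):
    """Count the failure modes a data scientist works around by hand."""
    seen_ids = set()
    report = {"missing_fields": 0, "duplicates": 0, "stale": 0, "ok": 0}
    for r in records:
        if not REQUIRED_FIELDS.issubset(r) or any(r.get(f) is None for f in REQUIRED_FIELDS):
            report["missing_fields"] += 1
        elif r["customer_id"] in seen_ids:
            report["duplicates"] += 1
        elif r["last_active"] < "2024-01-01":  # stale-record cutoff (illustrative)
            report["stale"] += 1
        else:
            report["ok"] += 1
        seen_ids.add(r.get("customer_id"))
    return report

records = [
    {"customer_id": 1, "tenure_months": 12, "last_active": "2024-06-01"},
    {"customer_id": 1, "tenure_months": 12, "last_active": "2024-06-01"},    # duplicate
    {"customer_id": 2, "tenure_months": None, "last_active": "2024-05-10"},  # missing value
    {"customer_id": 3, "tenure_months": 8, "last_active": "2023-02-14"},     # stale
]
print(quality_report(records))
# → {'missing_fields': 1, 'duplicates': 1, 'stale': 1, 'ok': 1}
```

The point is not the code but the contract: in production, each of these counts needs an agreed threshold and an agreed owner, which is exactly what the curated pilot dataset never forced anyone to decide.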

This is the data product taxonomy problem in sharp focus. Organisations that have underinvested in Tier 1 foundational products will hit this wall with every AI initiative they try to productionise. The pilot masks the problem. Production exposes it.

Layer 2: The Operational Ownership Gap

Pilots are owned by the team that built them. Everyone knows who to call when something breaks. Production systems need a different kind of ownership: SLAs, on-call rotations, monitoring, incident response, retraining schedules.
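To make "monitoring and retraining schedules" concrete: one common drift check that an operational owner would schedule is the population stability index (PSI), which compares the live score distribution against the training-time baseline. This is a generic technique, not something the pilots described here necessarily used; the retrain threshold of 0.2 is a widely used rule of thumb, assumed here for illustration.

```python
# Hypothetical sketch: a PSI drift check of the kind an operational owner
# would run on a schedule. The 0.2 retrain trigger is a common convention.
import math

def psi(expected, actual, bins=10):
    """Population stability index between two score distributions."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0
    def bucket_shares(xs):
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / width), bins - 1)
            counts[i] += 1
        return [max(c / len(xs), 1e-6) for c in counts]  # avoid log(0)
    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]                 # training-time scores
live = [min(i / 100 + 0.3, 0.999) for i in range(100)]   # shifted production scores
score = psi(baseline, live)
print(f"PSI = {score:.2f} -> {'retrain' if score > 0.2 else 'ok'}")
```

The check itself is trivial. The hard part, and the subject of this layer, is that someone has to own running it, acting on it, and paying for the retrain it triggers.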

In most organisations, there is no natural home for a production ML model. The data science team built it but doesn’t want to operate it. The engineering team can operate systems but doesn’t understand the model. The business team sponsored it but has no technical capability to maintain it. The model falls into the space between these three teams, and that’s where it dies.

This is the same pattern I’ve described in the strategy-execution gap: failure that lives in the spaces between teams, not within them. Each team is doing its job. No single team is failing. But the model isn’t in production.

Layer 3: The Governance and Risk Gap

Pilots run in sandboxes. Nobody asks about model risk, bias monitoring, regulatory compliance, or explainability requirements during a pilot because the model isn’t making real decisions about real people.

The moment you move to production, every one of those questions becomes urgent. In financial services, model risk frameworks require documentation, validation, and approval processes that can take months. In healthcare, the regulatory bar is higher again. These aren’t bureaucratic obstacles. They’re legitimate requirements that exist for good reasons. But pilots that were scoped without considering them face a rude awakening when the governance team gets involved at deployment time.

The organisations that handle this well treat governance as a design input, not a deployment gate. They involve risk and compliance early, so the model is built to be governable. The ones that treat governance as something to deal with later spend six months in approval purgatory after the pilot “succeeds.”

Layer 4: The Process Integration Gap

This is the least discussed layer and possibly the most important. Even when a model reaches production, it only creates value if the business processes around it have been redesigned to use its output.

A churn prediction model that produces a daily list of at-risk customers is worthless if the retention team’s workflow doesn’t incorporate that list. A demand forecasting model is worthless if the supply chain planning process still runs on spreadsheets and gut feel. The model isn’t the product. The changed business process is the product.
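The distinction between a process that receives model output and one that is structurally required to act on it can be sketched in a few lines. Everything below is hypothetical: the threshold, the agent names, and the round-robin routing are illustrative stand-ins for whatever the real retention workflow would use. The point is that a score only becomes work when it is converted into an owned, trackable task.

```python
# Hypothetical sketch: churn scores become value only when routed into an
# owned work queue. Names, threshold, and routing rule are illustrative.
from dataclasses import dataclass

@dataclass
class RetentionTask:
    customer_id: int
    risk: float
    owner: str          # a named person, so the task can't fall between teams
    status: str = "open"

def tasks_from_scores(scores, agents, threshold=0.7):
    """Turn at-risk (customer_id, score) pairs into assigned tasks, round-robin."""
    at_risk = sorted((c for c in scores if c[1] >= threshold),
                     key=lambda c: -c[1])
    return [RetentionTask(cid, risk, agents[i % len(agents)])
            for i, (cid, risk) in enumerate(at_risk)]

scores = [(101, 0.91), (102, 0.40), (103, 0.85), (104, 0.72)]
for task in tasks_from_scores(scores, agents=["alice", "bob"]):
    print(task)
```

A daily list emailed to a shared inbox is the "receives output" pattern. Tasks with named owners and a status field that someone reviews is the "changed process" pattern, and only the second one shows up in retention numbers.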

Most pilot business cases assume this integration will happen naturally. It doesn’t. Business process change requires the same leadership attention, change management, and incentive alignment as any other organisational transformation. Bolting an AI model onto an unchanged process is like fitting a jet engine to a bicycle. The engineering is impressive. The outcome is useless.

The Diagnostic

The following quadrant maps where most organisations sit. The horizontal axis is organisational readiness (data foundations, ownership models, governance maturity). The vertical axis is pilot activity volume. Most large enterprises are in the top-left: lots of pilots, very little readiness to productionise them.

quadrantChart
    title Pilot-to-Production Maturity
    x-axis Low Organisational Readiness --> High Organisational Readiness
    y-axis Few AI Initiatives --> Many AI Initiatives
    quadrant-1 "Scaled AI"
    quadrant-2 "Pilot Purgatory"
    quadrant-3 "Early Stage"
    quadrant-4 "Disciplined Foundation"
    "Most Large Enterprises": [0.25, 0.78]
    "Target State": [0.78, 0.65]
    "Cautious Starters": [0.40, 0.18]

The path from Pilot Purgatory to Scaled AI is horizontal, not vertical. You don’t get there by running more pilots. You get there by building the organisational readiness that makes production deployment possible.

If you suspect your organisation is stuck in that top-left quadrant, there are four questions that will tell you quickly:

1. How many of your AI pilots used production data sources? If the answer is “none” or “very few,” your pilots aren’t testing what matters. They’re testing whether a model can learn patterns in clean data, which is a solved problem.

2. Who owns your AI models after the data science team moves on? If there’s no clear answer, or if the answer is “the data science team, I guess,” you have an ownership gap that will kill production deployment.

3. At what point in the pilot lifecycle does governance get involved? If the answer is “at deployment,” your governance process will become a bottleneck that frustrates everyone. If the answer is “what governance?”, you have a different kind of problem entirely.

4. Can you name a business process that has been redesigned to incorporate AI model output? Not a process that receives model output, but one that has fundamentally changed how it operates because of it. If you can’t, your AI investment is producing interesting information that nobody is structurally required to act on.

What Actually Works

The organisations I’ve seen cross the chasm successfully share a few common characteristics. None of them are revolutionary. All of them require discipline.

They fund Tier 1 data products before they fund AI pilots, or at minimum in parallel. They accept that the foundational investment is prerequisite, not optional.

They establish production ownership models before the pilot begins. The question “who will operate this in production?” gets answered in the pilot proposal, not after the demo.

They embed governance from the start. Not as a gate, but as a design constraint that shapes how the model is built, documented, and monitored.

They measure pilot success on production-readiness criteria, not just model performance. A pilot that produces a great model but has no path to production hasn’t succeeded. It’s produced a demo.

And they limit the number of concurrent pilots. Running twenty pilots in parallel feels like progress. It’s not. It’s spreading scarce organisational capability across too many initiatives, guaranteeing that none of them get the depth of investment needed to cross the chasm. Five pilots with proper resourcing will produce more production AI than fifty that are fighting for the same data engineers.


This analysis draws on experience scaling AI from pilot to production at Westpac (where the pilot-to-production conversion rate was a persistent leadership challenge across the data and AI organisation) and at Cochlear (where regulated healthcare AI demands production-grade rigour from the outset, making the chasm visible earlier and forcing the organisation to confront it honestly).

ai strategy · machine learning · data leadership · enterprise ai

Interested in applying this research?

I work with a small number of organisations on strategy execution and data/AI capability challenges.