The Hidden Challenges of Taking GenAI to Production

Everyone's building GenAI prototypes. Almost no one is successfully running them at enterprise scale. Here's why.

Mal Wanstall

The GenAI hype cycle has entered a fascinating phase: the demo-to-production valley of death.

Every organisation I talk to has GenAI prototypes. Chatbots. Document summarisers. Code assistants. The demos look incredible. But when I ask “How many are in production, serving real customers?”, the answer is usually silence.

This isn’t a technology problem. It’s an engineering problem. And it’s solvable.

The Four Horsemen of GenAI Production Failures

1. Cost Explosions

That prototype that costs $2 per user interaction? Congratulations, you’ve just discovered why no one puts LLMs in production without serious optimisation.

Reality check: At scale, token costs add up fast. A chatbot handling 10,000 conversations per day with GPT-4 can easily cost $50K-100K per month. Most organisations haven’t budgeted for this.
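
To see why, run the numbers. Here’s a back-of-envelope sketch; every figure in it is an illustrative assumption, not a measurement:

```python
# Back-of-envelope cost model. All numbers are illustrative assumptions:
# three turns per conversation, ~2K input tokens per turn (prompt plus
# resent context), 500 output tokens per turn, at roughly GPT-4's original
# list price of $0.03 per 1K input tokens and $0.06 per 1K output tokens.
conversations_per_day = 10_000
turns = 3
input_tokens_per_turn = 2_000
output_tokens_per_turn = 500

cost_per_conversation = turns * (
    input_tokens_per_turn / 1000 * 0.03
    + output_tokens_per_turn / 1000 * 0.06
)
monthly_cost = cost_per_conversation * conversations_per_day * 30
print(f"${cost_per_conversation:.2f}/conversation, ${monthly_cost:,.0f}/month")
# -> $0.27/conversation, $81,000/month
```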

What works:

  • Model tiering: Use smaller models for simple queries, big models for complex ones (a routing sketch follows this list)
  • Caching strategies: Don’t reprocess the same context repeatedly
  • Fine-tuning: A smaller, fine-tuned model often outperforms a large generic one at 1/10th the cost
  • Prompt optimisation: Every token counts, so ruthlessly minimise prompt length
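
Here’s a rough sketch of what tiering plus caching can look like. The complexity heuristic, the model names, and the call_llm helper are all illustrative placeholders, not any particular provider’s API:

```python
# A rough sketch of model tiering plus response caching.
import hashlib

CACHE: dict[str, str] = {}

def is_complex(query: str) -> bool:
    # Naive routing heuristic: long or analytical queries go to the big model.
    return len(query.split()) > 50 or any(
        kw in query.lower() for kw in ("explain", "compare", "analyse")
    )

def answer(query: str, call_llm) -> str:
    # `call_llm(model, prompt)` is an assumed wrapper around your provider SDK.
    model = "big-model" if is_complex(query) else "small-model"
    key = hashlib.sha256(f"{model}:{query}".encode()).hexdigest()
    if key not in CACHE:  # only pay for tokens on a cache miss
        CACHE[key] = call_llm(model, query)
    return CACHE[key]
```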

At Cochlear, we reduced GenAI costs by 70% through strategic caching and model tiering. The user experience didn’t suffer—it actually improved because responses were faster.

2. Latency and Reliability

Users will tolerate 2-3 seconds for a complex query. They won’t tolerate 10 seconds. They definitely won’t tolerate timeouts and errors.

The reality of LLM APIs:

  • Variable latency: Sometimes 1 second, sometimes 30 seconds
  • Rate limited: Hit the limits and requests fail
  • Not 99.9% reliable: Even the best providers have outages

What works:

  • Streaming responses: Start showing results immediately
  • Fallback strategies: Multiple providers, degraded experiences, cached responses
  • Circuit breakers: Fail fast and gracefully when the API is down (see the sketch after this list)
  • User expectations: Set clear expectations about response times
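
A minimal circuit-breaker sketch. The thresholds and cooldown are illustrative, and ask_llm and serve_cached_answer in the usage comment are assumed helpers:

```python
# A minimal circuit breaker: fail fast after repeated provider errors,
# then retry after a cooldown instead of hammering a dead API.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = 0.0

    def call(self, primary, fallback, *args):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return fallback(*args)  # breaker open: skip the provider
            self.failures = 0           # half-open: let one request through
        try:
            result = primary(*args)
            self.failures = 0           # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback(*args)      # degrade gracefully, don't error out

# Usage: breaker.call(ask_llm, serve_cached_answer, prompt)
```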

3. Quality and Consistency

LLMs are probabilistic. The same prompt can produce different outputs. Sometimes they’re amazing. Sometimes they’re embarrassingly wrong.

In production, “sometimes embarrassing” is unacceptable.

What works:

  • Evaluation frameworks: Automated testing with thousands of real-world examples (a toy harness follows this list)
  • Human review loops: Critical outputs get human verification
  • Guardrails: Block inappropriate outputs before they reach users
  • Confidence scoring: Know when the model isn’t sure and handle it gracefully
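
To make the evaluation idea concrete, here’s a toy harness. The substring check and the generate wrapper are stand-ins; real suites use semantic similarity, task-specific assertions, or LLM-as-judge scoring:

```python
# A toy evaluation harness: score the model against a golden set and
# gate deployments on the pass rate.
def run_evals(examples, generate):
    """`examples` is a list of (prompt, required_phrase) pairs;
    `generate(prompt)` is an assumed wrapper around the model under test."""
    failures = []
    for prompt, required in examples:
        output = generate(prompt)
        if required.lower() not in output.lower():
            failures.append((prompt, output))
    pass_rate = 1 - len(failures) / len(examples)
    return pass_rate, failures

# Gate the release: refuse to ship a model update that regresses quality.
# rate, failures = run_evals(golden_set, generate)
# assert rate >= 0.98, f"Eval regression: {len(failures)} failures"
```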

We run 5,000+ automated evaluations on every model update; the suite catches problems before users see them.

4. Data Privacy and Security

Sending customer data to external LLM APIs creates regulatory and security nightmares:

  • GDPR compliance: Can you send EU customer data to OpenAI?
  • Data retention: What happens to the data after the API call?
  • Sensitive information: What if users paste confidential information into prompts?

What works:

  • Data sanitisation: Strip PII before sending to external APIs (a redaction sketch follows this list)
  • On-premise models: For highly sensitive use cases, deploy models internally
  • Contractual protections: Ensure API providers have proper data handling agreements
  • User education: Clear warnings about what not to paste into AI tools
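
A minimal redaction sketch. These regex patterns are illustrative only; production systems typically lean on dedicated PII-detection or NER services rather than hand-rolled patterns:

```python
# A minimal input-redaction sketch: scrub likely PII before the text
# ever leaves your network for an external API.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
    "CARD":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def sanitise(text: str) -> str:
    # Replace anything that matches a known PII shape with a labelled token.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}_REDACTED]", text)
    return text

print(sanitise("Contact me at jane@example.com or +61 400 123 456"))
# -> Contact me at [EMAIL_REDACTED] or [PHONE_REDACTED]
```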

The Production Checklist

Before you take a GenAI application to production, honestly answer these questions:

Cost & Scale

  • Have we calculated cost per interaction at 10x current volume?
  • Do we have budget approval for these costs?
  • Have we optimised prompt and context length?

Performance

  • Are 95% of responses under 5 seconds?
  • Do we have fallback strategies for API failures?
  • Can we handle a 10x traffic spike?

Quality

  • Have we tested with 1,000+ real-world examples?
  • Do we have automated quality monitoring?
  • Can we explain failures when they occur?

Security & Compliance

  • Have we completed a privacy impact assessment?
  • Are we sanitising PII from inputs?
  • Do we have audit logging for all interactions?

Operations

  • Can we deploy updates without downtime?
  • Do we have observability into model performance?
  • Is there a clear incident response plan?

The Organisations Getting It Right

I’ve seen a few teams successfully navigate this. What do they have in common?

  1. They treat GenAI as infrastructure, not magic
  2. They invest in evaluation frameworks early
  3. They optimise ruthlessly before scaling
  4. They have clear ownership and accountability
  5. They’re willing to say “not ready yet”

My Recommendation

If you’re in the prototype phase, resist the pressure to rush to production. Take the time to:

  1. Build your evaluation framework - you’ll need it forever
  2. Stress test at 10x scale - costs and performance issues appear here
  3. Get legal and security buy-in early - otherwise they’ll slow you down later
  4. Plan for failure modes - what happens when the LLM is down?

The race isn’t to be first to production. It’s to be first to sustainable, reliable, cost-effective production.

That’s how you turn GenAI from a cool demo into a competitive advantage.


Want to talk through your GenAI production strategy? I’m always up for a technical deep-dive conversation.
