Every major technology conference in the last 18 months has featured a demo of an AI agent doing something impressive. Booking a flight. Writing and executing code. Managing a customer enquiry from start to finish.
Most of those demos are not production systems.
I've built agentic AI in production at enterprise scale, for a European insurance brokerage, with a 67% autonomous resolution rate on real customer service cases. I've also watched a significant number of agentic AI projects fail in ways that were entirely predictable. This is an honest account of what the difference looks like.
What "agentic AI" actually means
An AI agent, in the technical sense, is a system that can take actions, not just generate text. It can retrieve information, execute tools, make decisions across multiple steps, and produce outcomes in the world rather than just producing responses.
The demo version of this: an agent that can look up flight prices, check calendar availability, book a seat, and confirm the booking in a single conversational interaction.
The production version of this: an agent that can do the above 10,000 times per day, on inputs that don't match the training distribution, with graceful handling of edge cases, a full audit trail of every decision, and escalation logic that surfaces the cases the agent shouldn't handle autonomously.
The gap between those two descriptions is where most agentic AI projects fail.
Why most agents don't work in production
Problem 1: Retrieval is the weakest link
Agents are only as good as the information they can access. And in almost every real-world deployment, retrieving the right information, from the right source, in the right format, at the right time, is harder than building the agent logic.
Vector databases are not magic. Embedding quality degrades with domain-specific content. Retrieval precision drops sharply when the knowledge base is large, inconsistent, or poorly structured. The agent that works beautifully on clean, curated documentation fails silently on the messy reality of actual enterprise data.
The solution is to treat retrieval as an engineering problem of its own, with the same rigour as any other part of the system. Chunking strategies, embedding models, hybrid retrieval combining semantic and keyword search, metadata filtering, reranking. These are not implementation details; they are load-bearing engineering decisions.
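As a minimal sketch of what "hybrid retrieval combining semantic and keyword search" means in practice: blend a vector-similarity score with a keyword-overlap score per document, then rank by the weighted sum. The document shape, the `alpha` weight, and the toy keyword scorer (standing in for something like BM25) are all assumptions for illustration, not a specific production design.

```python
import math
from collections import Counter

def keyword_score(query_terms, doc_terms):
    # Toy term-overlap score standing in for a proper BM25 implementation.
    doc_counts = Counter(doc_terms)
    return sum(doc_counts[t] for t in query_terms) / (len(doc_terms) or 1)

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_rank(query_terms, query_vec, docs, alpha=0.6):
    """Blend semantic and keyword scores; alpha weights the semantic side."""
    scored = []
    for doc in docs:
        semantic = cosine(query_vec, doc["embedding"])
        keyword = keyword_score(query_terms, doc["terms"])
        scored.append((alpha * semantic + (1 - alpha) * keyword, doc["id"]))
    return sorted(scored, reverse=True)
```

In a real system the two score distributions would need calibration before blending (or a fusion method like reciprocal rank fusion), and a reranker would typically sit on top — the point is that each of these is a tested, tunable component, not a default setting.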
Problem 2: Error propagation in multi-step pipelines
Single-step AI is relatively forgiving. If the model makes an error, you see it, you correct it, you move on.
Multi-step agentic pipelines are not forgiving. Each step's output becomes the next step's input. Errors compound. A misclassification in step 1 propagates through the entire pipeline and produces a confident, coherent, wrong answer at the end.
Production agents need explicit error detection at each step. They need to know when they're uncertain. They need to know when the input doesn't match the patterns they were built for. And they need escalation logic that fires reliably when these conditions occur. Not optimistically, but conservatively.
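One way to make "conservative escalation" concrete: have every pipeline step report a confidence alongside its output, and abort to a human at the first step that falls below a floor, rather than letting an uncertain intermediate result propagate. The threshold value and step shape here are assumptions for illustration.

```python
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.8  # assumed threshold; tune against escalation-rate targets

@dataclass
class StepResult:
    output: object
    confidence: float  # model- or heuristic-derived certainty for this step

class EscalateToHuman(Exception):
    """Raised when any step falls below the confidence floor."""

def run_pipeline(steps, payload):
    trail = []  # audit trail: every step's name and confidence
    for step in steps:
        result = step(payload)
        trail.append((step.__name__, result.confidence))
        if result.confidence < CONFIDENCE_FLOOR:
            # Conservative: stop at the first uncertain step instead of
            # producing a confident, coherent, wrong answer at the end.
            raise EscalateToHuman(f"{step.__name__} at {result.confidence:.2f}", trail)
        payload = result.output
    return payload, trail
```

The audit trail doubles as the escalation context: the human sees exactly which step bailed out and how confident every earlier step was.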
Problem 3: The state management problem
Conversational agents need to maintain context across a session. Orchestration agents need to manage state across multiple tool calls. Long-running agents need to persist state across interruptions and restarts.
State management in agentic systems is genuinely hard. The naive approach, passing the entire conversation history as context on every API call, doesn't scale and degrades performance. The sophisticated approach (explicit state machines, persistent state stores, careful context window management) requires engineering that most early-stage AI projects don't budget for.
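A sketch of the "sophisticated approach": an explicit phase enum plus a small persisted state record, so a session can be resumed after a restart without replaying the full conversation history. The phase names and JSON-file store are illustrative assumptions; a real deployment would use a proper database.

```python
import json
from enum import Enum
from pathlib import Path

class Phase(str, Enum):
    CLASSIFYING = "classifying"
    RETRIEVING = "retrieving"
    DRAFTING = "drafting"
    DONE = "done"

class SessionState:
    """Explicit, persistable session state instead of replaying full history."""
    def __init__(self, session_id, store_dir="."):
        self.path = Path(store_dir) / f"{session_id}.json"
        self.data = {"phase": Phase.CLASSIFYING, "facts": {}}

    def advance(self, phase, **facts):
        # Record the new phase plus any facts established at this step,
        # and persist immediately so the session survives interruptions.
        self.data["phase"] = phase
        self.data["facts"].update(facts)
        self.path.write_text(json.dumps(self.data))

    @classmethod
    def resume(cls, session_id, store_dir="."):
        state = cls(session_id, store_dir)
        if state.path.exists():
            state.data = json.loads(state.path.read_text())
        return state
```

The design choice worth noting: only *distilled facts* are carried forward, not raw transcript, which is what keeps the context window bounded as sessions grow.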
Problem 4: Testing is much harder than for traditional software
You can write unit tests for functions. You can write integration tests for APIs. Testing an AI agent that takes non-deterministic multi-step actions is a different problem.
How do you test a system where the same input can produce different outputs? How do you build a test suite that catches regressions in natural language reasoning? How do you know your agent will handle the edge cases it hasn't seen during development?
The answer involves a combination of golden-set evaluation (specific inputs with expected outputs), adversarial testing (deliberately trying to break the agent), and production monitoring (catching failures when they happen and feeding them back into the development process). None of this is impossible, but all of it requires deliberate engineering investment.
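The golden-set part of that combination can be as simple as a harness that runs the agent over fixed cases and reports pass rate plus the concrete failures, gated in CI. The case schema here is an assumption for illustration; for free-text outputs the equality check would be replaced with a semantic or rubric-based comparison.

```python
def evaluate_golden_set(agent_fn, golden_cases):
    """Run the agent over a fixed golden set; report pass rate and failures."""
    failures = []
    for case in golden_cases:
        actual = agent_fn(case["input"])
        if actual != case["expected"]:
            # Keep full detail so regressions are debuggable, not just counted.
            failures.append({"input": case["input"],
                             "expected": case["expected"],
                             "actual": actual})
    passed = len(golden_cases) - len(failures)
    return {"pass_rate": passed / len(golden_cases), "failures": failures}
```

Used as a regression gate, a release is blocked when `pass_rate` drops below an agreed floor, and production failures get folded back into the golden set over time.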
Problem 5: Trust and adoption
Even when the agent works technically, getting people to trust it, and therefore use it, is its own challenge.
A customer service agent that resolves 67% of cases autonomously is only valuable if the remaining 33% that require human handling are escalated smoothly, with full context, to human agents who trust the system enough to take the handoff efficiently.
Building this trust requires transparency: the human needs to understand why the agent is escalating, what it's already determined, and what information it retrieved. Black-box escalation ("here's a case the AI couldn't handle, good luck") destroys the value of the automated component.
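The opposite of black-box escalation is a structured handoff record: why the agent escalated, what it has already determined, and what it retrieved. This dataclass is a hypothetical shape for such a record, not a specific production schema.

```python
from dataclasses import dataclass, field

@dataclass
class EscalationHandoff:
    """Everything a human needs to pick up an escalated case efficiently."""
    case_id: str
    reason: str                              # why the agent escalated, in plain language
    determined: dict                         # what the agent has already established
    retrieved_sources: list = field(default_factory=list)  # material it consulted
    suggested_next_step: str = ""

    def summary(self):
        # Render a compact briefing for the human taking the handoff.
        lines = [f"Case {self.case_id} escalated: {self.reason}"]
        lines += [f"  known: {k} = {v}" for k, v in self.determined.items()]
        lines += [f"  source: {s}" for s in self.retrieved_sources]
        if self.suggested_next_step:
            lines.append(f"  suggested: {self.suggested_next_step}")
        return "\n".join(lines)
```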
What makes agentic AI actually work in production
Based on the systems I've built, the patterns that distinguish production-ready agents from demo-ready agents are:
Retrieval engineering as a first-class concern. Not an afterthought. Not a vector database with default settings. A carefully designed retrieval architecture with tested precision and recall at real-world data scale.
Conservative escalation logic. Better to escalate unnecessarily than to resolve incorrectly. The agent should know what it doesn't know, and be honest about it.
Explicit state management. Every step of the pipeline has observable state. Every decision is logged. Every error is caught and categorised.
Staged rollout with real metrics. Start at 10% of production volume. Measure resolution quality, escalation rate, user satisfaction, and business outcomes. Scale when the numbers are good.
Human-in-the-loop design that respects human time. The escalation handoff should give the human everything they need to pick up efficiently, not a blank canvas.
Production monitoring that catches drift. AI system performance degrades over time as input distributions change. The system needs to monitor for this and surface it proactively.
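One standard way to surface input drift proactively is the population stability index (PSI) between a baseline sample and a recent sample of some scalar input feature. This is a minimal pure-Python sketch; the bin count and the usual rule-of-thumb thresholds (below 0.1 stable, 0.1 to 0.25 drifting, above 0.25 act) are conventions to be validated against your own data, not guarantees.

```python
import math

def population_stability_index(expected, observed, bins=10):
    """PSI between a baseline and a recent sample of a scalar input feature."""
    lo = min(min(expected), min(observed))
    hi = max(max(expected), max(observed))
    width = (hi - lo) / bins or 1.0

    def frac(values, i):
        # Fraction of values in bin i, floored to avoid log(0).
        count = sum(1 for v in values
                    if lo + i * width <= v < lo + (i + 1) * width)
        return max(count / len(values), 1e-6)

    psi = 0.0
    for i in range(bins):
        e, o = frac(expected, i), frac(observed, i)
        psi += (o - e) * math.log(o / e)
    return psi
```

Run per feature on a schedule, this turns "performance degrades over time" from a surprise into an alert with a named cause.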
What this means if you're deploying agents
If you're building agentic AI in 2026, the question "can our agent do X?" is the wrong starting point. The question is "can our agent do X reliably, in production, at the scale we need, with the error handling and auditability our customers require?"
Most demos say yes to the first question. Very few production systems actually deliver on the second.
The organisations I work with that get this right are the ones that treat the production engineering question as seriously as the model selection question. From day one, not after launch.
Building an agentic AI product and want a second opinion on your architecture approach? Let's talk.
Related: Why 80% of AI projects fail to deliver ROI · Why your AI spend isn't showing up in the numbers · How I approach production AI deployment