AI Agents: Perfect Demo, Hidden Failures in Production

Key Takeaways

1. AI agents can perform perfectly in demos but fail in production, often without warning.

2. At 85% accuracy per step, a 10-step workflow succeeds end to end only about 19.7% of the time.

3. Production failures include context degradation, silent failures, and tool schema drift.

💡 Why it matters: Companies must anticipate and manage these failures to avoid unexpected costs and maintain the reliability of their AI systems.

The Perfect Demo, the Nightmare in Production

When an AI agent is tested in a demonstration, it can seem to function flawlessly. Repeated tests and presentations to the team, even to technical leaders, often show impeccable results. Every query appears to produce the expected response. Once the agent is deployed in production, however, the reality can be quite different.

A few days after deployment, a client may report that the agent has provided completely incorrect information, and with disconcerting confidence. Activity logs show HTTP 200 responses, so everything appears to be working correctly. Yet the agent may have been hallucinating responses for several days without the infrastructure detecting it.

The problem lies not in the quality of the model itself but in the architecture built around it. The issue is often overlooked because it only becomes apparent once the agent is in production.

The Alarming Numbers

Before exploring the different types of failures, it is crucial to understand a key figure. If an AI agent achieves an accuracy of 85% per step, which is considered a good result, and the workflow consists of 10 steps, the probability of completing the whole workflow successfully is only 19.7% (0.85^10 ≈ 0.197).

This simplified model assumes that each step is independent and that success is binary. Even if each step is "pretty good," about eight workflows out of ten will fail. Real failures are often more complex, and the steps are not always independent, but the architectural point holds: multi-step workflows compound failures. The solution lies in handling failure at each step, not just at the end.
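The arithmetic above can be checked with a short sketch. It assumes the same simplified model (independent steps, binary success); the function names are illustrative and not taken from any library.

```python
def workflow_success_probability(step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_accuracy ** steps


def required_step_accuracy(target_success: float, steps: int) -> float:
    """Per-step accuracy needed to reach a target end-to-end success rate."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    return target_success ** (1.0 / steps)


if __name__ == "__main__":
    overall = workflow_success_probability(0.85, 10)
    print(f"10 steps at 85% per step: {overall:.1%}")  # 19.7%
    needed = required_step_accuracy(0.90, 10)
    print(f"Per-step accuracy for 90% overall: {needed:.1%}")
```

Inverting the formula shows why per-step handling matters: under the same assumptions, a 10-step workflow needs roughly 99% accuracy at every step to succeed nine times out of ten.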

The Six Modes of Failure

Failure 1: Context Degradation

In a multi-step agent workflow, the model can lose track of earlier instructions even though each API call includes the complete history of the conversation, which lengthens with each step.

Engineers often miss that the context does not just grow; it degrades. Datadog's "State of AI Engineering 2026" report shows that the average number of tokens in production agent workflows has doubled year over year for average users, and quadrupled for heavy users. As the context grows, the original instruction fades, crowded out by new outputs and summaries, and the agent acts on increasingly corrupted signals.

The agent gives no signal that this is happening. Outputs drift gradually, and the errors are almost impossible to detect without evaluation tools. Engineers often build agents that pass outputs between steps in the form of plain text summaries. Each summary is a lossy compression, and by step 8, the agent acts on a summary of a summary of a summary of the original instruction.

The solution is to preserve structured outputs between steps, using typed data contracts rather than natural language handoffs.
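One way to picture a typed data contract is the sketch below, which uses Python's standard dataclasses. The class names and fields (TaskContext, StepResult, customer_id, and so on) are hypothetical; the point is only that the original instruction travels verbatim with every step instead of being re-summarized.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TaskContext:
    """Carried unchanged through every step (fields are hypothetical)."""
    original_instruction: str
    customer_id: str


@dataclass(frozen=True)
class StepResult:
    """Typed output of one step; the next step receives this, not a summary."""
    step_name: str
    facts: dict[str, str]
    context: TaskContext

    def __post_init__(self) -> None:
        # Fail at the step boundary rather than letting bad data flow onward.
        if not self.step_name:
            raise ValueError("step_name must not be empty")
        for key, value in self.facts.items():
            if not isinstance(key, str) or not isinstance(value, str):
                raise TypeError("facts must map str to str")


def next_step_input(result: StepResult) -> str:
    """Serialize the contract for the next model call without paraphrasing it."""
    return json.dumps(asdict(result), sort_keys=True)


if __name__ == "__main__":
    ctx = TaskContext("Refund order 1182 if it shipped late", "cust-42")
    step1 = StepResult("lookup_order", {"shipped_late": "true"}, ctx)
    print(next_step_input(step1))
```

Because the instruction is a field rather than prose, step 8 sees exactly the same text as step 1.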

Failure 2: Silent Failures

This type of failure is particularly feared because it often goes unnoticed. Traditional monitoring does not detect agent failures. An agent that hallucinates a confident but incorrect response still returns HTTP 200. Latency and error rates appear normal, and dashboards remain green.

Latitude's research on observability in production shows that the incorrect use of tools is the most common and insidious mode of failure in production. A single malformed argument can corrupt every subsequent dependent step.

A classic example is a customer support agent responding to questions about account status. In testing, queries are in structured English. In production, they are messy, multilingual, and emotionally charged. The agent returns plausible but incorrect responses, and the only signal is a customer escalation, often well after the degradation has begun.

The solution is to add a lightweight LLM evaluation layer that assesses each agent output before it reaches the user. This fast model checks whether the response is relevant to the query, whether it is consistent with the source data, and whether its confident wording is justified.
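The evaluation layer described above can be sketched as a gate that sits between the agent and the user. Everything here is an assumption made for illustration: the Verdict shape, the 0.7 threshold, the fallback message, and especially stub_judge, which stands in for the call to a small, fast model and does nothing more than a substring check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    """Scores in [0, 1] returned by the evaluator (hypothetical shape)."""
    relevance: float             # does the answer address the query?
    grounded: float              # is it supported by the source data?
    confidence_justified: float  # is the confident wording warranted?


# A real judge would wrap a call to a small model; it is injected here
# so the gate itself can be tested without any network access.
Judge = Callable[[str, str, str], Verdict]

FALLBACK = "I could not verify this answer. Routing you to a human agent."


def gate_output(query: str, answer: str, source_data: str,
                judge: Judge, threshold: float = 0.7) -> tuple[str, bool]:
    """Return (text_to_send, passed); block the answer if any score is low."""
    verdict = judge(query, answer, source_data)
    worst = min(verdict.relevance, verdict.grounded, verdict.confidence_justified)
    passed = worst >= threshold
    return (answer if passed else FALLBACK), passed


def stub_judge(query: str, answer: str, source_data: str) -> Verdict:
    """Stand-in for the model call: checks only that the answer quotes the source."""
    grounded = 1.0 if answer.lower() in source_data.lower() else 0.0
    return Verdict(relevance=1.0, grounded=grounded, confidence_justified=grounded)


if __name__ == "__main__":
    source = "Account is suspended since May."
    print(gate_output("Is my account active?", "account is active", source, stub_judge))
```

Taking the minimum of the three scores means a single weak dimension is enough to block the response, which suits a failure mode where everything else looks healthy.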

Failure 3: Tool Execution Schema Drift

Agents call various tools like APIs, database queries, or internal services. These tools evolve, but the agent is not informed.

This is an API versioning problem. Agents typically do not validate that a tool's response schema matches what they expect. When a third-party API updates its response format, or an internal service modifies a required field, the agent may receive a malformed or empty response and then either hallucinate a plausible answer or enter a retry loop.

Datadog's "State of AI Engineering 2026" report indicates that rate limit errors accounted for nearly 8.4 million errors in their LLM observability dataset. These failures are often handled by retry logic, leading to the next mode of failure.

The solution is to explicitly validate the response schemas of tools before passing them to the model, treating tool outputs as unreliable external data.
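A minimal sketch of that validation step follows, using only the standard library. The account-status schema and its field names are hypothetical; a production system might prefer a dedicated schema library, but the principle is the same: raise before the model ever sees a drifted response.

```python
from typing import Any


class ToolSchemaError(Exception):
    """Raised when a tool response does not match the expected contract."""


# Hypothetical contract for an account-status tool.
ACCOUNT_STATUS_SCHEMA: dict[str, type] = {
    "account_id": str,
    "status": str,
    "balance_cents": int,
}


def validate_tool_response(payload: Any, schema: dict[str, type]) -> dict[str, Any]:
    """Return only the expected fields, or raise before the model sees the data."""
    if not isinstance(payload, dict):
        raise ToolSchemaError(f"expected an object, got {type(payload).__name__}")
    missing = [name for name in schema if name not in payload]
    if missing:
        raise ToolSchemaError(f"missing fields: {missing}")
    for name, expected in schema.items():
        value = payload[name]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        if not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
            raise ToolSchemaError(
                f"field {name!r}: expected {expected.__name__}, got {type(value).__name__}"
            )
    # Drop unexpected fields so new, unreviewed data never reaches the prompt.
    return {name: payload[name] for name in schema}


if __name__ == "__main__":
    drifted = {"id": "a1", "status": "active", "balance_cents": 1200}  # field renamed upstream
    try:
        validate_tool_response(drifted, ACCOUNT_STATUS_SCHEMA)
    except ToolSchemaError as err:
        print(f"tool call rejected: {err}")
```

A raised error gives the orchestrator something concrete to act on (alert, fall back, stop) instead of leaving the model to improvise around an empty field.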

Failure 4: Uncontrolled Execution and Cost Cascades

This failure has direct financial implications.

In November 2025, four LangChain agents entered an infinite clarification loop. An Analyzer and a Verifier exchanged requests without an orchestrator deciding to stop. Without step limits, conversation budgets, or circuit breakers, the loop lasted 11 days, costing $47,000. The first week cost $127, but the fourth week reached $18,400. No one noticed until the bill arrived.

Another case in 2026 saw an agent make 14,000 repeated calls before the account was suspended. Each LLM API call is stateless, so agents send the complete conversation history with each call. A loop at step 20 with accumulated tool outputs can exceed 50,000 input tokens per call. At frontier-model prices, a single late loop step costs pennies, but those pennies add up quickly.

The solution is to impose strict budget ceilings at the agent level, circuit breakers on retry loops, and limits on the number of steps.
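The three controls named above (step limit, budget ceiling, circuit breaker) can be combined in one small guard object. This is a sketch under stated assumptions: the class names, the limits, and the per-step costs are all made up for illustration, and real cost tracking would read token usage from the provider's response.

```python
from dataclasses import dataclass


class RunHalted(RuntimeError):
    """Raised when a run hits a step, cost, or failure limit."""


@dataclass
class RunGuard:
    max_steps: int
    max_cost_usd: float
    max_consecutive_failures: int
    steps: int = 0
    cost_usd: float = 0.0
    consecutive_failures: int = 0

    def before_step(self) -> None:
        """Call before every model or tool call; raises once any limit is reached."""
        if self.steps >= self.max_steps:
            raise RunHalted(f"step limit reached ({self.max_steps})")
        if self.cost_usd >= self.max_cost_usd:
            raise RunHalted(f"budget ceiling reached (${self.cost_usd:.2f})")
        if self.consecutive_failures >= self.max_consecutive_failures:
            raise RunHalted("circuit breaker open: too many consecutive failures")

    def after_step(self, cost_usd: float, succeeded: bool) -> None:
        """Record what the step cost and whether it succeeded."""
        self.steps += 1
        self.cost_usd += cost_usd
        self.consecutive_failures = 0 if succeeded else self.consecutive_failures + 1


def run_loop(guard: RunGuard, step_cost_usd: float, succeeds: bool) -> str:
    """Simulate an agent stuck in a loop; returns the reason it was stopped."""
    while True:
        try:
            guard.before_step()
        except RunHalted as halted:
            return str(halted)
        guard.after_step(step_cost_usd, succeeds)


if __name__ == "__main__":
    guard = RunGuard(max_steps=20, max_cost_usd=5.0, max_consecutive_failures=3)
    print(run_loop(guard, step_cost_usd=0.05, succeeds=True))
```

Checking the limits before each call, rather than after the bill arrives, means a runaway loop stops at the configured ceiling no matter which agent should have ended the conversation.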

Failure 5: Permission Explosion

Agents need access to resources in order to function, but this access can accumulate uncontrollably.

The typical pattern is that IAM roles are reused across multiple agent deployments rather than being specifically tailored to each agent. This can lead to excessive permissions, increasing security risk.

The solution is to define a specific IAM role for each agent, limit its access to only the resources necessary for its task, and regularly reassess these permissions to avoid dangerous accumulation.
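The periodic reassessment can be partly automated. The sketch below is a hypothetical audit helper: it flags roles shared by several agents and permissions granted beyond what an agent's task requires. The agent names, role names, and action strings are invented for illustration and do not call any cloud provider's API.

```python
from collections import defaultdict


def shared_roles(agent_roles: dict[str, str]) -> dict[str, list[str]]:
    """Map each role used by more than one agent to the agents sharing it."""
    by_role: dict[str, list[str]] = defaultdict(list)
    for agent, role in agent_roles.items():
        by_role[role].append(agent)
    return {role: sorted(agents) for role, agents in by_role.items() if len(agents) > 1}


def excess_permissions(granted: set[str], required: set[str]) -> set[str]:
    """Permissions the role holds that the agent's task does not need."""
    return granted - required


if __name__ == "__main__":
    # Hypothetical inventory: which role each deployed agent runs under.
    agent_roles = {
        "support-agent": "role-agents-shared",
        "billing-agent": "role-agents-shared",
        "search-agent": "role-search",
    }
    print(shared_roles(agent_roles))

    granted = {"orders:read", "orders:refund", "customers:delete"}
    required = {"orders:read"}
    print(sorted(excess_permissions(granted, required)))
```

Running a check like this on a schedule turns "regularly reassess" into a report that can fail a deployment when a role is reused or over-scoped.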