ReAct Agents: 90.8% of Retries Are Wasted, but Structural Fixes Exist

Key Takeaways

1. In a 200-task benchmark, 90.8% of the retries made by ReAct agents were wasted on errors that could never succeed.

2. The root cause is architectural: the model chooses the tool name at runtime, so a hallucinated name produces attempts that are doomed from the start.

3. Three structural fixes eliminate this waste: classifying errors before retrying, per-tool circuit breakers, and tool routing in code.

💡 Why it matters: Wasted retries in ReAct agents add unnecessary cost and cause avoidable failures, which hurts the performance of AI systems in production.

An alarming finding for ReAct agents

In a recent benchmark of 200 tasks, ReAct agents wasted the overwhelming majority of their retries. Of 513 retries, 466 (90.8%) were spent not on errors that could resolve themselves, but on repeated calls to tools that do not exist. Such a retry does not merely have a low probability of success; it is guaranteed to fail.

The technical context of ReAct agents

This article is aimed primarily at machine learning engineers and AI developers who run LLM agents in production. ReAct systems, whether built with frameworks such as LangChain, LangGraph, or AutoGen or with a custom tool loop, rely on a prompting pattern in which the LLM alternates between Thought, Action, and Observation steps to accomplish a task. Wasted retries are a major problem in this setup, because the model keeps trying to call tools that do not exist.

An unexpected discovery

The problem was not found through a simple prompt adjustment, but through an in-depth analysis of every attempt. Once each error was instrumented and classified, it became clear that the main cause was an architectural assumption: letting the model choose the tool name at runtime. The problem is particularly insidious because it does not show up on standard monitoring dashboards, which typically report a healthy-looking success rate, acceptable latency, and an attempt count within expected limits. What they lack is visibility into attempts that were impossible from the outset.

Simulation and hallucination rates

The results reported here come from a deterministic simulation with calibrated parameters, not from live API calls. The hallucination rate, estimated at 28%, is based on an analysis of failure modes in GPT-4 class benchmarks. The exact percentages may vary in production, but the structural conclusions hold because they follow from the architecture rather than from the specific rates.

Consequences for production

In production, this means companies pay for attempts that can never succeed, taking budget away from attempts that could. String-based tool routing passes the model's output directly to TOOLS.get(). A hallucinated tool name returns None, which consumes the attempt budget without any error taxonomy and fails quietly. Deterministic routing, in contrast, resolves tool names from a Python dictionary at planning time and classifies errors before retrying, which makes hallucinated tool names structurally impossible at the routing level.
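The contrast can be sketched as follows. This is a minimal illustration, not the benchmark's code: the registry contents and the names `Step`, `UnknownToolError`, `build_plan`, and `run_plan` are assumptions made for the example. The idea is that every tool name is checked against the dictionary when the plan is built, so an unknown name fails once, loudly, before any attempt budget is spent.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical registry; the benchmark's real tools are not shown in the article.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda query: f"results for {query}",
    "calculator": lambda expr: str(sum(int(x) for x in expr.split("+"))),
}


class UnknownToolError(ValueError):
    """Raised at planning time; retrying cannot change the outcome."""


@dataclass(frozen=True)
class Step:
    tool: Callable[[str], str]  # already resolved to a function, not a string
    arg: str


def build_plan(requested: list[tuple[str, str]]) -> list[Step]:
    """Resolve every tool name up front, before anything is executed."""
    plan = []
    for name, arg in requested:
        if name not in TOOLS:
            raise UnknownToolError(f"unknown tool: {name!r}")
        plan.append(Step(TOOLS[name], arg))
    return plan


def run_plan(plan: list[Step]) -> list[str]:
    """Execute resolved steps; no string lookup happens at this stage."""
    return [step.tool(step.arg) for step in plan]
```

Under these assumptions, `build_plan([("web_browser", "x")])` raises `UnknownToolError` immediately, while a plan made only of registered tools reaches `run_plan` with nothing left to look up.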

Solutions to eliminate waste

Three structural solutions have been identified to eliminate this problem:

Error classification before retrying: Only errors whose outcome can change on a later attempt are retried.
Circuit breakers per tool: A tool that keeps failing is cut off, so the system stops spending attempts on it.
Tool routing in code: Because routing is done in code rather than by the model, hallucination at the routing level becomes impossible.

With these fixes in place, the rate of wasted attempts drops to 0%, step variance falls by a factor of three, and execution becomes more predictable.
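As an illustration of the second fix, here is a minimal sketch of a per-tool circuit breaker. The class and function names, the consecutive-failure threshold, and the absence of a cooldown or half-open state are all simplifying assumptions for the example, not details taken from the benchmark.

```python
from collections import defaultdict
from typing import Callable


class CircuitOpenError(RuntimeError):
    """Raised when a tool's circuit is open and the call is skipped."""


class ToolCircuitBreaker:
    """Tracks consecutive failures per tool name (hypothetical helper)."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures: defaultdict[str, int] = defaultdict(int)

    def allow(self, tool_name: str) -> bool:
        return self.failures[tool_name] < self.threshold

    def record_success(self, tool_name: str) -> None:
        self.failures[tool_name] = 0  # a success closes the circuit again

    def record_failure(self, tool_name: str) -> None:
        self.failures[tool_name] += 1


def call_with_breaker(
    breaker: ToolCircuitBreaker,
    tool_name: str,
    fn: Callable[[str], str],
    arg: str,
) -> str:
    """Call fn unless this tool has already failed `threshold` times in a row."""
    if not breaker.allow(tool_name):
        raise CircuitOpenError(f"circuit open for {tool_name!r}")
    try:
        result = fn(arg)
    except Exception:
        breaker.record_failure(tool_name)
        raise
    breaker.record_success(tool_name)
    return result
```

Keeping the failure count per tool rather than per task means one broken tool cannot drain the budget that other tools still need.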

Fundamental principle

The underlying principle of these solutions is simple: retrying only makes sense for errors whose outcome can change. A hallucinated tool name will not change, so retrying it is a guaranteed waste. This is not a matter of how often hallucinations occur, but a logical property: a call to TOOLS.get("web_browser") returns None on every attempt, because the tool does not exist.
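This principle can be expressed as a classification step in front of the retry loop. The sketch below is one possible shape, assuming that timeouts and connection errors are transient and that everything else is permanent; the names `ErrorKind`, `classify`, and `run_with_retries` are hypothetical.

```python
import enum
from typing import Callable, TypeVar

T = TypeVar("T")


class ErrorKind(enum.Enum):
    RETRYABLE = "retryable"   # the outcome may differ on the next attempt
    PERMANENT = "permanent"   # the outcome cannot change; retrying is waste


def classify(exc: Exception) -> ErrorKind:
    # Assumption for this sketch: only timeouts and connection errors are transient.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorKind.RETRYABLE
    return ErrorKind.PERMANENT


def run_with_retries(fn: Callable[[], T], max_retries: int = 3) -> tuple[T, int]:
    """Return (result, retries_used); permanent errors are re-raised at once."""
    retries = 0
    while True:
        try:
            return fn(), retries
        except Exception as exc:
            if classify(exc) is ErrorKind.PERMANENT or retries >= max_retries:
                raise
            retries += 1
```

With this in place, a missing tool surfaced as a `KeyError` or `ValueError` costs exactly one call, while a flaky network call still gets its retries.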

A problematic line of code

The following line of code is often present in ReAct tutorials and is responsible for this waste:

tool_fn = TOOLS.get(tool_name)   # ◄─ THE LINE
if tool_fn is None:
    ...  # handled like any other failure: one retry is consumed

This lookup does not distinguish between error types: a missing tool is handled the same way as a transient network failure. The global retry counter then spends its budget on a tool that does not exist and records each attempt as an ordinary failure.
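To make the effect concrete, here is a small sketch of that pattern inside a retry loop. It assumes a one-tool registry and that the same hallucinated name is submitted on every attempt; `naive_react_step` is a hypothetical function, not code from any framework.

```python
from typing import Callable, Optional

# Hypothetical registry containing a single real tool.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda query: f"results for {query}",
}


def naive_react_step(
    tool_name: str, arg: str, max_retries: int = 3
) -> tuple[Optional[str], int]:
    """Mimic the tutorial pattern: every failure costs one attempt.

    Returns (result, attempts_used); result is None if the budget ran out.
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        tool_fn = TOOLS.get(tool_name)
        if tool_fn is None:
            # A missing tool is indistinguishable from a transient failure,
            # so the loop simply tries the same lookup again.
            continue
        return tool_fn(arg), attempts
    return None, attempts
```

Under these assumptions, `naive_react_step("web_browser", "x")` uses the initial attempt plus all three retries and still returns no result, whereas a registered tool succeeds on the first attempt.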

Benchmark results analysis

Although ReAct's overall success rate looks satisfactory, with 179 of 200 tasks succeeding, the real issue lies in the hallucination events. ReAct recorded 155 hallucinations, while the deterministic workflow recorded none, which reveals a reliability gap that the success rate hides.

Managing the attempt budget

ReAct agents spend most of their retry budget on non-retryable errors. In the workflow, by contrast, every retry targets a recoverable failure. The comparison exposes a major inefficiency in the standard retry logic of agents:

Total retries: 513 for ReAct versus 80 for the workflow
Useful retries (retryable errors): 47 for ReAct versus 80 for the workflow
Wasted retries (non-retryable errors): 466 for ReAct versus 0 for the workflow
Waste rate: 90.8% for ReAct versus 0.0% for the workflow
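The waste rate follows directly from the counts above. A short check, using only the figures reported in this section (the dictionary layout and the `waste_rate` helper are illustrative):

```python
# Retry counts as reported in the benchmark summary above.
react = {"total": 513, "useful": 47, "wasted": 466}
workflow = {"total": 80, "useful": 80, "wasted": 0}


def waste_rate(counts: dict[str, int]) -> float:
    """Percentage of retries spent on non-retryable errors, to one decimal."""
    # Sanity check: useful and wasted retries must add up to the total.
    assert counts["useful"] + counts["wasted"] == counts["total"]
    return round(100 * counts["wasted"] / counts["total"], 1)


print(waste_rate(react))     # 90.8
print(waste_rate(workflow))  # 0.0
```

The 90.8% figure therefore describes retries (466 of 513), not tasks: it is compatible with 179 of 200 tasks eventually succeeding.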

Conclusion

Most of ReAct's failures stem from hallucinated tool names, which exhaust the global retry budget and cause tasks to fail. The deterministic workflow, by contrast, completed all 200 tasks without a single failure. Success-rate dashboards do not surface these hidden failures, which is why the architecture of agents itself needs to be revisited to make them more efficient.