Autonomous Agents: AI's Promises Against Real-World Challenges

Key Takeaways

1. Autonomous AI agents promise to perform complex tasks, but their actual performance is often disappointing.

2. Anthropic's study reveals that over 80% of internal code is generated by AI, but it calls for caution.

3. Salesforce's CRMArena-Pro benchmark shows limited success rates for AI agents in realistic business environments.

💡 Why it matters: Companies are investing heavily in AI, but the results highlight the need for safeguards and strict regulations.

The Promise of Autonomous Agents

In recent years, enthusiasm around generative AI has crystallized around one idea: beyond simply answering questions, AI could soon execute complex tasks autonomously. Imagine an agent capable of booking a flight, reorganizing code, or compiling a quarterly report, all by breaking the task down into steps, executing them, and verifying the achievement of the goal. This vision has sparked colossal investments, reaching several tens of billions of dollars between 2024 and 2026.
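The loop described above (break the task into steps, execute them, verify the goal) can be sketched in a few lines of Python. This is only an illustrative sketch: `run_agent`, `AgentRun`, and the three toy functions are hypothetical names invented for this example, and the toy planner, executor, and verifier stand in for what would be model calls and real tools in an actual agent.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentRun:
    """Record of one agent run (hypothetical structure for illustration)."""
    goal: str
    log: List[str] = field(default_factory=list)
    succeeded: bool = False


def run_agent(
    goal: str,
    plan: Callable[[str], List[str]],
    execute: Callable[[str], str],
    verify: Callable[[str, List[str]], bool],
    max_attempts: int = 3,
) -> AgentRun:
    """Plan, execute, and verify, retrying until the goal is met or attempts run out."""
    run = AgentRun(goal=goal)
    for attempt in range(1, max_attempts + 1):
        steps = plan(goal)                           # 1. break the task into steps
        results = [execute(step) for step in steps]  # 2. execute each step
        run.log.extend(
            f"attempt {attempt}: {step} -> {result}"
            for step, result in zip(steps, results)
        )
        if verify(goal, results):                    # 3. check the goal was achieved
            run.succeeded = True
            break
    return run


# Toy stand-ins (assumptions): a real agent would call a model and external tools here.
def toy_plan(goal: str) -> List[str]:
    return ["search flights", "pick cheapest", "book seat"]


def toy_execute(step: str) -> str:
    return f"done({step})"


def toy_verify(goal: str, results: List[str]) -> bool:
    return len(results) == 3 and all(r.startswith("done(") for r in results)


demo = run_agent("book a flight", toy_plan, toy_execute, toy_verify)
print(demo.succeeded, len(demo.log))
```

The verification step is where the design gets hard in practice: a run only counts as successful if the verifier itself is reliable, which is part of what the benchmarks discussed later in this article try to measure from the outside.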

However, this promising vision often diverges from reality. The actual performance of agents, measured by rigorous benchmarks, does not meet expectations. The gap is felt most sharply in companies that have bet on this AI revolution. For example, the Anthropic Institute published a report titled "When AI Builds Itself" in June 2026, revealing that in May of the same year, over 80% of the code merged into Anthropic's internal repository had been generated by its own AI, Claude. Before the launch of Claude Code in February 2025, that share was in the single digits. By the second quarter of 2026, the average engineer was merging 8 times more code per day than in 2024, a leap the report attributes to the point at which models began to operate autonomously over longer time horizons.

Paradoxically, despite these advances, the founders and safety teams at Anthropic have issued public warnings about the potential dangers of advanced autonomy without strict regulation. They emphasize that the rapid adoption of these technologies often outpaces the theoretical safety measures in place.

CRMArena-Pro: A Revealing Benchmark

In June 2025, Salesforce AI Research launched CRMArena-Pro, a benchmark designed to assess the capabilities of AI agents in realistic enterprise environments. Unlike previous benchmarks, CRMArena-Pro does not merely check whether the AI can correctly answer a single question. It tests the AI's ability to manage a complete CRM task over multiple interactions, simulating an ongoing conversation with a human user. The benchmark draws on over 83,000 records that are synthetic but structurally representative of real data.

The results obtained with the most advanced models of the time, such as Gemini 2.5 Pro, were mixed:

For a single interaction, the success rate was 58%.
For multiple interactions, mimicking a realistic workflow, the success rate dropped to 35%.
When executing structured workflows, the success rate reached 83% for a single interaction, demonstrating that clarity of steps improves performance.
Regarding confidentiality, the AI showed significant gaps unless it was explicitly prompted to protect sensitive data, and that prompting in turn reduced the task success rate.

In summary, a high-performing agent, left to its own devices in a realistic and complex work environment, fails roughly two-thirds of the time: a 35% success rate over multiple interactions means a 65% failure rate.
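The arithmetic behind that summary can be checked directly from the figures quoted above. The short sketch below is only an illustration: the dictionary keys are shorthand labels chosen for this example, not names used by the benchmark, and the only inputs are the three success rates reported in this article.

```python
# Success rates as reported in this article for CRMArena-Pro.
# The labels are shorthand for this example, not the benchmark's own terms.
reported_success = {
    "single interaction": 0.58,
    "multiple interactions": 0.35,
    "structured workflow, single interaction": 0.83,
}


def failure_rate(success_rate: float) -> float:
    """Return the complement of a success rate given as a fraction in [0, 1]."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return 1.0 - success_rate


failure_rates = {
    setting: round(failure_rate(rate), 2)
    for setting, rate in reported_success.items()
}

for setting, rate in failure_rates.items():
    print(f"{setting}: fails {rate:.0%} of the time")
```

The multiple-interaction setting is the one closest to the "two-thirds" figure; the single-interaction and structured-workflow settings fail noticeably less often.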

DELEGATE-52: The Risks of Silent Corruption

So far, assessments have focused on the success rate of agents. However, a study conducted by Microsoft Research in April 2026, titled DELEGATE-52, explored a more concerning aspect: the ability of agents to cause harm when operating autonomously over extended periods. This study revealed that...


Brief IA: AI news in French