Brief IA

Microsoft Reveals the Limitations of AI Agents in 52 Professions

🤖 Models & LLM · Tom Levy

Key Takeaways
1. Microsoft Research tested 19 AI models across 52 domains and found that about 25% of document content is lost after 20 interactions.
2. The DELEGATE-52 benchmark shows that even the most advanced AI models, such as GPT-5.4, fail to maintain document integrity.
3. AI agents that perform well on simple tasks often struggle with complex tasks, requiring constant human supervision.
💡 Why it matters: These results highlight the current limitations of AI in complex professional environments, which affects its adoption in businesses.

Full Analysis

Microsoft recently highlighted the shortcomings of artificial intelligence agents across a wide range of professional scenarios. This revelation comes from a study conducted by Microsoft Research, which demonstrated that even the most advanced AI models struggle to maintain document integrity after multiple exchanges.

Entrusting a working document to an AI for twenty interactions can cost it a quarter of its content. Many users had already noticed this, and Microsoft itself now confirms it. The study shows that AI assistants asked to correct tables or modify paragraphs gradually lose crucial information as the interactions accumulate.

DELEGATE-52: an uncompromising benchmark

The study, led by Philippe Laban, Tobias Schnabel, and Jennifer Neville, is named DELEGATE-52. It is based on a simple principle but yields complex results. The researchers designed 310 work environments covering 52 professional fields. These fields include a variety of tasks ranging from Python coding to accounting, as well as musical notation, crystallography, and financial statements.

Each environment consists of approximately 15,000 tokens and includes five to ten complex editing tasks. The testing protocol is based on a back-and-forth process: the model modifies a document and then must undo its own modification. After ten cycles, or twenty interactions, the goal is for the document to return to its original state. Nineteen models were tested, including current leaders such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT-5.4. The study's data is publicly available on GitHub and Hugging Face, allowing anyone to verify the results.
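The back-and-forth protocol can be illustrated with a toy sketch. This is not the DELEGATE-52 code: the `Editor` signature, the `append:`/`strip:` instruction format, the two toy editors, and the use of `difflib` similarity as the integrity score are all assumptions made for illustration. A real run would call a language model where the toy editors are used here.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical interface: an editor receives (document, instruction)
# and returns the edited document. A real harness would call an LLM here.
Editor = Callable[[str, str], str]


def round_trip_similarity(document: str,
                          tasks: list[tuple[str, str]],
                          edit: Editor) -> float:
    """Run each (forward, inverse) instruction pair, then compare the
    final document with the original. 1.0 means a perfect round trip."""
    current = document
    for forward, inverse in tasks:
        current = edit(current, forward)  # first interaction of the cycle
        current = edit(current, inverse)  # second interaction: undo it
    return SequenceMatcher(None, document, current).ratio()


def perfect_editor(doc: str, instruction: str) -> str:
    """Toy editor that applies 'append:<text>' and 'strip:<text>' exactly."""
    op, _, arg = instruction.partition(":")
    if op == "append":
        return doc + arg
    if op == "strip" and arg and doc.endswith(arg):
        return doc[: -len(arg)]
    return doc


def lossy_editor(doc: str, instruction: str) -> str:
    """Toy editor that does the edit but silently drops the first line
    on every call, mimicking gradual content loss."""
    lines = perfect_editor(doc, instruction).split("\n")
    return "\n".join(lines[1:]) if len(lines) > 1 else lines[0]


# Ten cycles, i.e. twenty interactions, on a 40-line toy document.
doc = "\n".join(f"row {i}" for i in range(40))
tasks = [("append:\nTOTAL", "strip:\nTOTAL")] * 10

print(round_trip_similarity(doc, tasks, perfect_editor))  # 1.0
print(round_trip_similarity(doc, tasks, lossy_editor))    # below 1.0
```

The appeal of this design is that no reference answer is needed: the original document is the expected result, so any drift after the final undo can be attributed to the editor.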

Results: significant content loss

The results are clear: after twenty interactions, even the best-performing models corrupt an average of 25% of the document content. The problem affects premium and open-source models alike. The most advanced models delay the onset of errors but do not avoid them. For Microsoft, whose Copilot is already struggling to win customers over, with only 3.3% paid adoption, publishing these results shows a degree of transparency.
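To get a feel for the 25% figure, here is a rough back-of-the-envelope calculation. It assumes the loss compounds at the same rate on every interaction, which is an assumption of this sketch and not a finding of the study (real degradation need not be uniform). The function name `per_step_retention` is hypothetical.

```python
def per_step_retention(final_retention: float, steps: int) -> float:
    """Fraction of content kept per interaction, ASSUMING the loss
    compounds uniformly: retention_per_step ** steps == final_retention."""
    return final_retention ** (1 / steps)


# 25% corrupted after 20 interactions means 75% retained overall.
r = per_step_retention(0.75, 20)
print(f"retained per interaction: {r:.4f}")   # 0.9857
print(f"lost per interaction: {1 - r:.2%}")   # 1.43%
```

Under that assumption, a loss of under 1.5% per interaction, small enough to go unnoticed in any single exchange, is enough to add up to a quarter of the document over twenty.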

Analysis of model performance

Two main conclusions emerge from this study. First, a model's performance after two interactions does not predict its behavior after twenty. A model may appear effective during the initial exchanges but collapse over subsequent interactions. Second, integrating the model into an autonomous agent like Copilot Cowork does not improve its performance. Data corruption stems from the model itself, not from its execution framework.

The DELEGATE-52 study is not an isolated case. Other recent benchmarks, such as YC-Bench, UltraHorizon, and Terminal-Bench, converge on the same finding: AI agents lose track beyond a few dozen exchanges. However, Microsoft's study stands out for its breadth, covering 52 fields instead of just one, and for the transparency of its protocol.

AI performs best in areas governed by strict rules, such as Python, SQL, and databases. Conversely, it often fails in domains that mix format, semantics, and human conventions, such as financial statements, musical scores, and textile patterns. Yet documents of this kind are handled in offices every day. For the millions of employees using Copilot, ChatGPT, or Claude at work, the message is clear: AI remains reliable for short tasks, but long editing chains require human supervision. Companies like Meta and Cloudflare could learn this the hard way.
