Brief IA

MirrorCode: Claude Opus 4.7 Excels but Struggles with Complexity

🤖 Models & LLM · Tom Levy

Key Takeaways
1. The MirrorCode benchmark from Epoch AI assesses the ability of AI models to recreate programs without access to the source code.
2. Claude Opus 4.7 achieved a success rate of 56% and recreated 16,000 lines of code in 14 hours.
3. Every model still fails on the most complex tasks, even with one run lasting 19 days and costing $2,600.
💡 Why it matters: The ability of AI models to recreate programs without access to the source code could transform software development, but the most complex tasks remain a major obstacle.

📄 Full Analysis

An AI model was run continuously for 19 days on a single MirrorCode task, costing $2,600 to execute.

Epoch AI and METR have developed a new benchmark called "MirrorCode" that requires AI models to recreate complete programs in various areas of computer science from scratch, without access to the original source code.

Claude Opus 4.7 stands out in this benchmark with a 56% success rate, and it reimplemented a set of bioinformatics tools consisting of 16,000 lines of code in just 14 hours.

While all tested models can reliably handle smaller programs, none has yet solved the most complex tasks.

MirrorCode, the new coding benchmark from Epoch AI and METR, tests whether AI models can autonomously recreate entire programs. Claude Opus 4.7 leads with 56%, but every model still fails on the most complex tasks.

In the MirrorCode benchmark, AI models must reimplement complete programs from scratch without access to the original source code. The 25 target programs cover Unix utilities, data serialization, bioinformatics, interpreters, static analysis, cryptography, and compression. Each AI-generated solution must exactly reproduce the output of the original program, including on hidden end-to-end tests that the model never sees during development.
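The exact-match criterion described above can be illustrated with a minimal sketch. This is not the MirrorCode harness: programs are modeled here as hypothetical Python functions from input text to output text, whereas a real harness would run compiled or interpreted programs end to end. All names and test inputs are invented for illustration.

```python
from typing import Callable, Iterable

# Assumption: a "program" is modeled as a function from input text to output text.
Program = Callable[[str], str]


def reproduces_exactly(reference: Program, candidate: Program, inputs: Iterable[str]) -> bool:
    """Return True only if the candidate's output equals the reference's on every input."""
    return all(candidate(text) == reference(text) for text in inputs)


def reference_upper(text: str) -> str:
    """Hypothetical original program: uppercases its input."""
    return text.upper()


def candidate_ok(text: str) -> str:
    """Hypothetical faithful reimplementation."""
    return "".join(ch.upper() for ch in text)


def candidate_buggy(text: str) -> str:
    """Hypothetical reimplementation that wrongly strips trailing whitespace."""
    return text.upper().rstrip()


# Invented test inputs: the candidate's author sees only the visible ones.
visible_tests = ["abc", "Hello"]
hidden_tests = ["trailing space ", ""]

print(reproduces_exactly(reference_upper, candidate_buggy, visible_tests))  # visible tests only
print(reproduces_exactly(reference_upper, candidate_buggy, hidden_tests))   # hidden tests
```

In this toy case the buggy candidate matches the reference on every visible input and diverges only on a hidden one, which is the kind of gap that held-out tests are meant to expose.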

Another difference from many other benchmarks is the inference budget. According to the benchmark's developers, existing software engineering benchmarks often cap costs at $1 to $10 per task, even when a human would need weeks to complete the same work.

According to Epoch AI, one of the largest tasks in MirrorCode cost $2,600 for a single execution. The AI worked continuously for 19 days without any human intervention.

Claude Opus 4.7 Rebuilds a Set of Bioinformatics Tools in 14 Hours

Epoch AI claims that AI models can already handle demanding long-horizon programming tasks. A notable example comes from Claude Opus 4.7, which reimplemented gotree, a set of bioinformatics tools comprising about 16,000 lines of Go code and more than 40 commands. A human engineer working without AI assistance would need 2 to 17 weeks for the same work, according to the researchers. Opus 4.7 completed it in 14 hours for $251.
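To put these figures in perspective, the arithmetic can be worked through in a few lines. The 40-hour working week is an assumption made here for illustration and is not stated in the article; the 19-day, $2,600 run is the one cited earlier, converted using wall-clock hours.

```python
HOURS_PER_WEEK = 40  # assumption: standard working week, not stated in the article

# Figures reported for the gotree reimplementation.
ai_hours = 14
ai_cost_usd = 251
human_weeks_low, human_weeks_high = 2, 17

# Figures reported for the longest single run.
long_run_days = 19
long_run_cost_usd = 2600

cost_per_hour = ai_cost_usd / ai_hours
speedup_low = human_weeks_low * HOURS_PER_WEEK / ai_hours
speedup_high = human_weeks_high * HOURS_PER_WEEK / ai_hours
long_run_cost_per_hour = long_run_cost_usd / (long_run_days * 24)

print(f"gotree run: ${cost_per_hour:.2f} per hour")
print(f"speedup vs. human estimate: {speedup_low:.1f}x to {speedup_high:.1f}x")
print(f"19-day run: ${long_run_cost_per_hour:.2f} per wall-clock hour")
```

Under these assumptions the gotree run works out to roughly $18 per hour and somewhere between about 6 and 49 times faster than the human estimate, measured in working hours.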

Claude Opus 4.7 leads the MirrorCode benchmark with a success rate of 56%, followed by GPT-5.5 at 44% and Gemini 3.1 Pro Preview at 32%. Even when models fail to fully reimplement a program, they generally pass 90% or more of the tests.
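The gap between a high test pass rate and a lower success rate is easier to see with a small sketch. It assumes that a task counts as solved only when every test passes, which fits the exact-reproduction requirement but is not spelled out in the article. The task names and numbers are invented, not benchmark data.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    name: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        """Fraction of tests passed on this task."""
        return self.passed / self.total

    @property
    def solved(self) -> bool:
        """Assumption: a task is solved only if all of its tests pass."""
        return self.passed == self.total


def success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks that are fully solved."""
    return sum(r.solved for r in results) / len(results)


# Invented example results, for illustration only.
results = [
    TaskResult("task_a", 50, 50),
    TaskResult("task_b", 47, 50),
    TaskResult("task_c", 20, 20),
    TaskResult("task_d", 95, 100),
]

print(f"success rate: {success_rate(results):.0%}")
for r in results:
    if not r.solved:
        print(f"{r.name}: unsolved, pass rate {r.pass_rate:.0%}")
```

With these made-up numbers, half of the tasks count as solved even though the unsolved ones pass 94% and 95% of their tests.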

The Most Difficult Tasks Remain a Challenge for Every Model

Despite the progress, MirrorCode is far from solved. The tasks are divided into three categories: small, medium, and large. Small programs like uuid or parseqsv are reliably reimplemented by all tested models. The largest tasks remain beyond every tested model.

The researchers nonetheless observe rapid gains. According to Epoch AI, the leading models from a year ago would have achieved only about 30% and would have been limited to simpler programs such as a calendar utility.

Cost trends do not follow a clear pattern. GPT-5.5 costs three times as much as GPT-5 for the same tasks, while Claude Opus 4.7 costs about a third as much as Claude Opus 4.1.

Epoch AI has open-sourced the framework and 22 of the 25 target programs, covering 132 task instances in six programming languages. Three programs are kept private for testing.

The researchers flag an important caveat: since MirrorCode uses open-source programs as targets, the models may have already seen the original code during training. Initial tests suggest that "the results were not dominated by memorization, but we cannot rule out the possibility that memorization contributes to AI performance."
