Brief IA

LLM: Six Months of Disruptions and Technological Innovations

AI Tools · Tom Levy

Key Takeaways
1. In November 2025, Claude Opus 4.5 overtook its competitors and became the model most practitioners regarded as the best.
2. The coding agents from OpenAI and Anthropic crossed a quality threshold, becoming reliable tools for daily work.
3. OpenClaw, a personal AI assistant, took off in February 2026 and set off a run on Mac Minis.
💡 Why it matters: These advances are changing how LLMs are used, making AI more accessible and useful for developers and the general public.

Full Analysis

A Retrospective of the Last Six Months of LLMs

At PyCon US 2026, a lightning talk set out to summarize the last six months of developments in large language models (LLMs). The period includes a turning point in November 2025 that mattered most for programming.

Six months is a convenient window to cover, because it takes in what is referred to as the November 2025 turning point. November was a critical month for LLMs, particularly for programming. To start, the title of "best" model (judged by general impressions) changed hands five times among the three major providers.

The Evolution of Language Models

Over the past six months, the model considered the "best" changed hands five times between the three main providers. In November, Claude Sonnet 4.5, released on September 29, was at the top. However, it was quickly surpassed by GPT-5.1, then by Gemini 3, followed by GPT-5.1 Codex Max, before Anthropic reclaimed the crown with Claude Opus 4.5. A test generating an SVG of a pelican riding a bicycle was used to illustrate the differences between these models, as drawing a pelican on a bike is a complex and unconventional task.

Why this test? Because pelicans are hard to draw, bicycles are hard to draw, pelicans can't ride bikes... and there's no chance an AI lab would train a model for such a ridiculous task.

At the beginning of November, the model widely recognized as the "best" was Claude Sonnet 4.5, released on September 29. It drew me this pelican. In November, it was surpassed by GPT-5.1, then Gemini 3, followed by GPT-5.1 Codex Max, and finally Anthropic reclaimed the crown with Claude Opus 4.5.

I think Gemini 3 drew the best pelican of the bunch, but pelicans aren't everything. Most practitioners agree that Opus 4.5 held the crown in the following months.

Improvement of Coding Agents

The real news in November was the improvement of coding agents. OpenAI and Anthropic dedicated much of 2025 to reinforcement learning from verifiable rewards to enhance the quality of code generated by their models, particularly with the Codex and Claude Code agents. In November, these agents crossed a quality threshold, moving from "often functional" to "mostly functional," making them usable for everyday tasks without requiring constant corrections.

It took some time for this to become clear, but the real news in November was that coding agents had improved. OpenAI and Anthropic had spent most of 2025 running reinforcement learning from verifiable rewards to raise the quality of code written by their models, especially when paired with their Codex and Claude Code agents.

In November, the results of this work became evident. Coding agents moved from "often functional" to "mostly functional," crossing a quality barrier that allowed them to be used as everyday tools for real work, without having to spend most of one’s time correcting their silly mistakes.

Personal Projects and Innovations

During the holiday season, many developers explored these new models and coding agents. Some, like the presenter, launched ambitious projects to test the limits of these technologies. A notable example is micro-javascript, an implementation of JavaScript in Python that runs in a browser via Pyodide and WebAssembly. While the project was an impressive technical demonstration, it was neither practical nor secure.

Also in November, the first commit landed in a then-obscure repository called "Warelay" by a certain Pete. Over the holiday period, from December to January, many of us took advantage of the break to explore these new models and coding agents and see what they could do.

They could do a lot! Some of us got a bit too carried away. I myself had a brief period of LLM-related psychosis, starting to launch very ambitious projects to see how far I could push them.

One of my projects was a vibe-coded implementation of JavaScript in Python (a loose port of MicroQuickJS), which I called micro-javascript. You can try it in your browser in this playground.

This demo in the playground shows JavaScript code executed using my micro-javascript library, in Python, running inside Pyodide, running in WebAssembly, executing in JavaScript, in a browser!

It's pretty cool! But did anyone need a buggy, slow, and insecure implementation of JavaScript in Python? No, not at all. I have several other projects from this holiday period that I have since quietly removed!

The Rise of OpenClaw

In November, an obscure repository called "Warelay" saw its first commit. This project evolved into OpenClaw, a personal AI assistant that gained popularity in February 2026. OpenClaw, along with its variants called Claws, captured attention, particularly in Silicon Valley, where Mac Minis were used to run these assistants. A popular metaphor for the Claws is that of the intelligent claws of the Doc Ock character in the movie Spider-Man 2.

Fast forward to February. Do you remember that Warelay project that had its first commit at the end of November? In December and January, it underwent quite a few name changes... and in February, it took the world by storm under its final name, OpenClaw.

The amount of attention it received is quite astonishing for a project less than three months old. OpenClaw is a "personal AI assistant," and we now have a generic term for these, derived from names like NanoClaw and ZeroClaw: they are called Claws.

Mac Minis began selling like hotcakes around Silicon Valley, as people bought them to run their Claws. Drew Breunig joked with me that it's because they are the new digital pets, and a Mac Mini is the perfect aquarium for your Claw.

My favorite metaphor for the Claws is Alfred Molina's Doc Ock in the 2004 film Spider-Man 2. His claws were powered by AI and were perfectly safe as long as nothing damaged his inhibition chip... after which they became malevolent and took control.

New Models and Demonstrations

In February, Gemini 3.1 Pro was launched, producing a remarkable pelican. Jeff Dean from Google also shared an animated video showing a pelican riding a bicycle, a frog on a tall bike, a giraffe driving a small car, an ostrich on roller skates, a turtle doing a kickflip on a skateboard, and a dachshund driving a stretched limousine. These demonstrations suggest that AI labs may finally be paying attention to this kind of test.

Also in February: Gemini 3.1 Pro was released and drew me a very good pelican riding a bicycle. Look at that! It even has a fish in its basket.

And then Jeff Dean from Google tweeted this video of an animated pelican riding a bicycle, plus a frog on a tall bike, a giraffe driving a small car, an ostrich on roller skates, a turtle doing a kickflip on a skateboard, and a dachshund driving a stretched limousine.

So maybe AI labs have finally paid attention!

A lot has happened just last month. Google released the Gemma 4 model series, which are the best-performing open-weight models I've seen from an American company.

Also last month, the Chinese AI lab GLM launched GLM-5.1 — a 1.5 trillion parameter open-weight monster! It's a very efficient model... if you can afford the hardware to run it.

GLM-5.1 drew me this very competent pelican on a bicycle... although when it tried to animate it, the bike bounced up and got distorted.

Charles on Bluesky suggested I try a Virginia opossum on an e-scooter. And it did that! I tried this on other models and they didn't even come close. "Cruising the commonwealth since dusk" is perfect. It's animated too.

The other interesting open-weight Chinese models in April came from Qwen. Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7. It's a 20.9 GB open-weight model that runs on my laptop!

I think this mostly demonstrates that the pelican on a bicycle has well and truly outlived its usefulness as a benchmark.

Here’s that pelican from Claude Sonnet 4.5 from September for comparison.

So the two main themes of the last six months are these: coding agents have truly improved... and the models that run on laptops, although much weaker than frontier models, have started to exceed expectations spectacularly.
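As a rough illustration of why a model like Qwen3.6-35B-A3B fits on a laptop, the 20.9 GB figure quoted above can be turned into an average number of bits stored per parameter. This is a back-of-the-envelope sketch, not something from the talk. It assumes that "GB" means 10^9 bytes and that the "35B" in the model name is the total parameter count (with "A3B" presumably indicating about 3 billion active parameters). The function names are hypothetical.

```python
# Back-of-the-envelope estimate of how many bits each weight occupies on disk.
# Assumptions (not from the talk): "GB" means 10**9 bytes, and the "35B" in
# the model name is the total parameter count.


def bits_per_parameter(file_size_gb: float, params_billions: float) -> float:
    """Average number of bits stored per parameter for a given file size."""
    size_bits = file_size_gb * 1e9 * 8
    return size_bits / (params_billions * 1e9)


def size_gb_at(bits: float, params_billions: float) -> float:
    """File size in GB if every parameter took `bits` bits."""
    return params_billions * 1e9 * bits / 8 / 1e9


# Figures quoted in the article: a 20.9 GB file for a 35B-parameter model.
quantized_bits = bits_per_parameter(20.9, 35)

# For comparison: the same parameter count stored at 16 bits per weight.
full_precision_gb = size_gb_at(16, 35)

print(f"{quantized_bits:.2f} bits per parameter")
print(f"{full_precision_gb:.0f} GB at 16 bits per parameter")
```

Under these assumptions the file works out to a little under 5 bits per parameter, against roughly 70 GB for the same model at 16 bits per weight. That is consistent with a quantized build, although the talk does not say which format was used.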
