Brief IA

OpenAI, Google, Anthropic: Prioritizing AI Speed

🤖 Models & LLM·Tom Levy·

OpenAI, Google, Anthropic: Prioritizing AI Speed

OpenAI, Google, Anthropic: Prioritizing AI Speed
Key Takeaways
1OpenAI, Google, and Anthropic are focusing on the generation speed of their AI models.
2The emphasis is no longer solely on the intelligence of the models, but on their execution speed.
3AI publishers are redefining their priorities to meet new market expectations.
💡Why it mattersThe focus on AI speed could change competitive dynamics in the tech industry.
Le brief IA que lisent les pros

Le brief IA que les pros lisent chaque soir

Les 7 actus IA du jour, décryptées en 5 min. Gratuit.

Inclus dès l'inscription : notre sélection des meilleurs guides & comparatifs IA.

Choisis ton rythme

Gratuit · Pas de spam · Désabonnement en 1 clic

📄
Full Analysis

OpenAI, Google, Anthropic: Speed of AI Takes Priority

The generation speed of models is becoming a priority for AI publishers. OpenAI, Anthropic, Google... The guiding principle for these publishers is no longer solely the intelligence of the models.

What if the raw power of models is no longer the only criterion for the success of a good LLM? With the era of agentic AI, publishers of generative models are engaged in a new battle: the battle for speed. The goal is no longer to deliver the best intelligence at all costs, but rather to provide fast intelligence. The aim: to produce more, faster, across all domains, from coding to research and scientific discovery, as well as process automation. A race within the race that all model publishers are engaged in.

Intelligence is Becoming a Commodity

In May 2026, Anthropic will release Claude Opus 4.8, its most powerful model. Price: $5 per million tokens in input, $25 for output. Exactly the same price as GPT-5.5, OpenAI's flagship model. Eighteen months ago, the high-end model from one publisher often cost several times more than that of a competitor. That gap has disappeared. At the top, prices have aligned, and benchmark scores have too: in reasoning, coding, or agentic tasks, Opus 4.8, GPT-5.5, and Gemini are within a few points of each other. “We are starting to move towards a common intelligence gap,” confirms Hamidou Dia, VP of Applied AI Engineering at Google Cloud.

Since the advent of agents, primarily for coding, users are no longer just conversing with models; they are making them work in the real world. The model does not respond in one go: it plans, executes, verifies, corrects, and restarts as many times as necessary. Each step triggers a call to the model, and each call waits for the previous one to finish. As a result, latencies accumulate.

Publishers have understood this, and since the beginning of the year, we have seen a series of announcements focused on speed:

  • January 14: OpenAI unveils a partnership with Cerebras to add 750 MW of “ultra-low latency” computing to its platform.
  • February: Anthropic launches its fast mode: the same intelligence of Opus, but 2.5 times more tokens per second, at an additional cost.
  • March 5: OpenAI's Codex rolls out its fast mode. With /fast, GPT-5.4 operates 1.5 times faster, with identical intelligence and reasoning.
  • May 19: at I/O, Google releases Gemini 3.5 Flash, marketed as “four times faster than other frontier models” in its category while outperforming its own Gemini 3.1 Pro on agentic benchmarks.

An Absolute Priority for Engineering Teams

At OpenAI, the priority is acknowledged without hesitation. “We have been focused on speed for a while,” asserts Thibault Sottiaux, Head of Core Product & Platform at OpenAI, who works on Codex. The publisher even admits to a less-than-flattering starting point. “Six months ago, everyone said Codex was slow and unusable. We said okay, let’s tackle that by going back to the fundamentals,” he confides. And the work is far from over: “There are a large number of new techniques that we are currently scaling and on which we have not yet communicated.”

The same position is held at Google, where latency is now a “criterion that teams are hyper-focused on,” emphasizes Hamidou Dia. The publisher has also adjusted its hardware accordingly: “We now have a TPU dedicated to training and a TPU dedicated to inference,” he recalls. The directive even extends to development teams. “Within Google, in some of our development teams, we have a latency budget. For example, you are asked to optimize a process to 10 seconds, and if you manage to do it in 5, you have 5 seconds that you can spend elsewhere,” explains Hamidou Dia.

Anthropic is also considering speed at multiple levels: a range that goes from the lightweight and fast Haiku model to Opus and its paid fast mode. Because the demand comes from clients, whose perspective has changed. “Many users are looking for the best intelligence per dollar and per second,” reminds Katelyn Lesse, Head of Engineering at Claude Platform. But the issue goes beyond just adjusting a slider. For certain capabilities, speed fundamentally conditions usage. “There are areas, particularly computer use, where an acceleration compared to current speeds is really necessary,” illustrates Angela Jiang, Head of Product at Claude Platform.

New Metrics Are Emerging

A similar movement is sweeping across the entire industry: it is no longer just the model that is being optimized, but the entire chain. And the boundaries between teams are shifting. Where model development and infrastructure engineering used to progress in parallel, they now work hand in hand. At OpenAI and Anthropic, the teams designing the harness, the software layer, collaborate closely with those training the models. Google pushes the envelope even further: its silicon teams design the TPUs in close collaboration with those at DeepMind.

However, speed is just one metric among those that publishers are now scrutinizing. Because the challenge in the coming years will not only be to produce quickly but to produce without burning through all their cash. Dario Amodei, the head of Anthropic, recently reminded us: if revenue growth slips even by a year against the amounts spent on compute, a publisher can find itself on the brink of bankruptcy. In this context, a new ratio is becoming prominent in discussions: intelligence per watt. A new race is on the horizon.

Brief IA — L'actualité IA en français

L'essentiel de l'actualité de l'intelligence artificielle, décrypté et expliqué chaque jour.