Brief IA

Kimi K2.6 Surpasses GPT-5.4 and Claude: An AI Turning Point?

🤖 Models & LLM·Tom Levy·

Kimi K2.6 Surpasses GPT-5.4 and Claude: An AI Turning Point?

Kimi K2.6 Surpasses GPT-5.4 and Claude: An AI Turning Point?
Key Takeaways
1Moonshot AI has launched Kimi K2.6, an open-weight model with 1 trillion parameters, surpassing GPT-5.4 and Claude Opus 4.6 on coding benchmarks.
2K2.6 utilizes a Mixture-of-Experts architecture, with a context window of 256,000 tokens, and offers faster inference through INT4 quantization.
3The weights of K2.6 are available under a Modified MIT license, but training remains closed, limiting independent reproduction.
💡Why it mattersKimi K2.6 could redefine the competitiveness of open models against closed giants, influencing the future of AI in China and beyond.
Le brief IA que lisent les pros

Le brief IA que les pros lisent chaque soir

Les 7 actus IA du jour, décryptées en 5 min. Gratuit.

Inclus dès l'inscription : notre sélection des meilleurs guides & comparatifs IA.

Choisis ton rythme

Gratuit · Pas de spam · Désabonnement en 1 clic

📄
Full Analysis

Moonshot AI Establishes Itself with Kimi K2.6, a Revolutionary Model

Moonshot AI recently unveiled Kimi K2.6, a language model impressive for its 1 trillion parameters. This model is specifically designed to excel in coding and managing autonomous agents. In terms of performance, Kimi K2.6 outperforms renowned models such as Claude Opus 4.6 and GPT-5.4 across several coding benchmarks. Notably, its weights are made available to the community on the Hugging Face platform under a Modified MIT license, facilitating access and experimentation.

Kimi K2.6 was launched just three months after its predecessor, K2.5, and shows significant improvements across all evaluated categories. By 2026, Moonshot AI is recognized as the leader among Chinese labs in the field of open models, according to the Artificial Analysis report. On the SWE-Bench Pro benchmark, which assesses the ability to solve real-world problems sourced from GitHub, K2.6 achieved a score of 58.6 points, surpassing GPT-5.4, which scored 57.7, and Claude Opus 4.6 with 53.4. Additionally, on DeepSearchQA, K2.6 reached 83.0, while Claude Opus 4.6 and GPT-5.4 scored 80.6 and 63.7, respectively. Finally, on Terminal-Bench 2.0, K2.6 scored 66.7, placing it ahead of its closed competitors.

An Innovative Architecture for Optimized Performance

Technically, K2.6 stands out with its Mixture-of-Experts architecture, which comprises 1 trillion parameters in total, but only 32 billion are activated per token. This allows for a computational cost per token comparable to that of a medium-sized dense model. The model's context window reaches 256,000 tokens, significantly more than what most current models offer. Thanks to INT4 quantization, which was integrated from the initial training, K2.6 provides inference speeds approximately twice that achieved with FP16 precision. According to AllThings.how, K2.6's performance is very close to that of a fully precise model, with a margin of only 1-2%.

K2.6 is compatible from its release with several frameworks and platforms, including vLLM, SGLang, KTransformers, and OpenRouter. It can be integrated via an endpoint compatible with the SDKs of OpenAI and Anthropic, thus facilitating its adoption by developers.

Internal Evaluations and Lack of External Validation

Most evaluations of K2.6 come directly from Moonshot AI, which used its own internal framework derived from SWE-agent. The tests were conducted with a temperature set to 1.0, an average calculated over ten runs, and a context of 262,144 tokens. Moonshot AI also developed internal benchmarks, such as "Kimi Code Bench" and "Claw Bench," to assess K2.6's performance. The scores of GPT-5.4 and Claude Opus 4.6, marked with an asterisk in the official results, were re-evaluated by Moonshot under the same conditions due to the lack of comparable public data.

For the DeepSearchQA benchmark, the scores of the models from Anthropic and OpenAI come from the official System Card of Anthropic, conducted in a different experimental framework than that used for K2.6. To date, no independent reproduction has confirmed all these results. However, in Moonshot's official post, several companies such as Vercel, Augment Code, Baseten, and Ollama reported improvements over K2.5 in their environments, although these claims are not directly compared to closed models.

A Partial Openness That Raises Questions

Although Moonshot AI has made the weights of K2.6 accessible under a Modified MIT license on Hugging Face, the training of the model remains closed. The THIRD_PARTY_NOTICES file indicates that the architecture uses modeling code from DeepSeek-V3, also under an MIT license. However, the training data, as well as the complete recipe and evaluation pipeline, are not published. This prevents any independent verification or reproduction of the training, distancing K2.6 from the strict definition of "open-source" according to the Open Source Initiative.

In practice, while the weights are available, the autonomous deployment of K2.6 remains costly. In INT4 version, the model weighs approximately 594 GB and requires at least four H100 80 GB GPUs to operate. In FP16 version, its size exceeds two terabytes. According to AllThings.how, the cost of cloud infrastructure for an INT4 node ranges between $8,000 and $12,000 per month, making the Moonshot API more economical for usages below five billion tokens per month.

Brief IA — L'actualité IA en français

L'essentiel de l'actualité de l'intelligence artificielle, décrypté et expliqué chaque jour.