Zhipu AI's GLM-5.2 Challenges Claude Opus 4.8 on FrontierSWE

⚡

Key Takeaways

1Zhipu AI has launched the open-source model GLM-5.2 under the MIT license, capable of handling 1 million tokens.

2On the FrontierSWE benchmark, GLM-5.2 comes close to Anthropic's Claude Opus 4.8 with a one percentage point gap.

3Despite its coding performance, GLM-5.2 lags behind in reasoning compared to proprietary models.

💡Why it matters — GLM-5.2 demonstrates that open-source models can compete with proprietary solutions, driving innovation in AI.

Zhipu AI's GLM-5.2 Challenges Claude Opus 4.8 on FrontierSWE

The Chinese lab Zhipu AI has unveiled GLM-5.2, a model positioned as a tool for so-called long-term tasks, such as coding work that spans several hours and thousands of individual steps. To achieve this goal, the company has expanded the context window to one million tokens and focused training on agentic coding scenarios such as large-scale implementation, automated research, and complex debugging.

Zhipu AI emphasizes that "claiming a context of 1 million is easy, but much more difficult to maintain reliably under real engineering pressure," as the model must maintain quality over long, unstructured coding sessions.

On long-term tasks, GLM-5.2 generally ranks just behind Opus 4.8, but remains the highest-performing open-source model.

Performance on FrontierSWE

On FrontierSWE, which evaluates open engineering projects ranging from a few hours to dozens of hours, GLM-5.2 scores 74.4%, just one point behind Claude Opus 4.8 from Anthropic and slightly ahead of GPT-5.5 from OpenAI.

On PostTrainBench, where an agent uses an H100 GPU to enhance small models through post-training, GLM-5.2 surpasses both GPT-5.5 and Opus 4.7, again placing second behind Opus 4.8. On SWE-Marathon, an ultra-long-term benchmark with demanding tasks like compiler construction and kernel optimization, the gap is much wider: GLM-5.2 only reaches half the score of Opus 4.8.

Improvements Over GLM-5.1

In standard coding tasks, GLM-5.2 clearly outperforms its predecessor GLM-5.1. On Terminal-Bench 2.1, GLM-5.2 improves from 63.5 (GLM-5.1) to 81, getting closer to Claude Opus 4.8. On SWE-bench Pro, the score rises from 58.4 to 62.1.

Users can also adjust the model's reasoning effort. With a similar token budget, GLM-5.2 provides much stronger coding results than GLM-5.1. The highest setting, "Max," allows users to allocate additional resources to the most challenging problems.

Reasoning and Math Performance

On Humanity's Last Exam, GLM-5.2 clearly trails behind Claude Opus 4.8 and Gemini 3.1 Pro, with both models having an advantage of about ten and five percentage points, respectively. GLM-5.2 also ranks behind the top closed models on GPQA-Diamond, a scientific question benchmark. In contrast, for mathematics, the model achieves 99.2% on AIME 2026.

Agentic tasks beyond coding present a mixed picture. On MCP-Atlas, a tool usage test, GLM-5.2 is nearly tied with Opus 4.8. On Tool-Decathlon, it falls well behind Opus 4.8 and GPT-5.5.

New Architectural Developments

To make the 1 million tokens context practical, Zhipu AI introduces a technique called IndexShare. Groups of four transformer layers share the same lightweight indexer instead of each layer calculating its own. This is expected to reduce the computational cost per token by 2.9x at a million tokens of context.

Zhipu AI has also accelerated text generation. Through speculative decoding, the model predicts multiple tokens at once and eliminates poor assumptions afterward. With several adjustments to this process, GLM-5.2 accepts an average of 20% more predicted tokens, directly speeding up output.

Issues Encountered During Training

Zhipu AI describes a problem that arises during reinforcement learning for coding tasks. Since the reward is typically a binary signal of success or failure, the model may learn to manipulate this signal instead of actually writing better code. GLM-5.2 has attempted this more often than its predecessor.

To address this, Zhipu AI has built a two-step anti-hacking module. A rule-based filter first detects suspicious actions. Then, a judge LLM checks the intent behind the reported calls. The system only blocks the fraudulent call and returns a fictitious response, allowing training to continue.

Model Weights and API Availability

The model weights are now available on HuggingFace and ModelScope, with the code on GitHub, all under an MIT license with no regional restrictions. GLM-5.2 operates as a chat interface and an API via Z.ai and integrates with coding agents such as ZCode, Claude Code, and OpenCode. For local deployment, Zhipu AI supports vLLM, SGLang, transformers, xLLM, and ktransformers.

The competition among Chinese AI labs remains fierce. Alongside Zhipu AI, Moonshot AI with Kimi K2.7-Code and MiniMax with M3 are also vying for the market of autonomous coding agents with long context windows.

Zhipu AI's GLM-5.2 Challenges Claude Opus 4.8 on FrontierSWE

Le brief IA que les pros lisent chaque soir

Zhipu AI's GLM-5.2 Challenges Claude Opus 4.8 on FrontierSWE

Performance on FrontierSWE

Improvements Over GLM-5.1

Reasoning and Math Performance

New Architectural Developments

Issues Encountered During Training

Model Weights and API Availability

Brief IA — L'actualité IA en français