OpenAI Unveils GPT-5.5: A More Autonomous and Powerful AI

Key Takeaways

1. OpenAI launches GPT-5.5, an advanced AI model capable of planning and verifying its own results.

2. GPT-5.5 outperforms its predecessors with an 82.7% success rate on Terminal-Bench 2.0 and 58.6% on SWE-Bench Pro.

3. The model excels in various domains, achieving 84.9% on GDPval and 98% on Tau2-bench Telecom.

Why it matters: GPT-5.5's autonomy and breadth could reshape work in many sectors, but the same capabilities raise serious security concerns.

OpenAI Unveils GPT-5.5, a More Autonomous and Efficient AI

OpenAI has unveiled GPT-5.5, its most advanced artificial intelligence model to date. The model stands out for its greater autonomy and efficiency. Announced alongside the launch of ChatGPT Images 2.0, GPT-5.5 is rolling out gradually and is more than a cosmetic update: it can plan, use tools, verify its own results, and keep working through ambiguity without constant assistance.

GPT-5.5 targets a wide range of fields, from agentic coding to scientific research. OpenAI appears to be taking a bolder approach than before, with a model that could change how AI is used day to day.

Impressive Performance of GPT-5.5

In software development, GPT-5.5 posts strong results. On Terminal-Bench 2.0, a benchmark that simulates complex sequences of tasks driven from the command line, GPT-5.5 achieves a success rate of 82.7%. On SWE-Bench Pro, which gives the model real bug-fixing requests from actual GitHub projects, it resolves 58.6% of cases. These results surpass those of GPT-5.4 while using fewer resources.

User testimonials are also revealing. Dan Shipper, founder of the company Every, tested GPT-5.5 on a serious bug that appeared after the launch of an application. He and an experienced engineer had spent several days diagnosing the problem before rewriting part of the code. GPT-5.4 was unable to handle it, but GPT-5.5 proposed exactly the same rewrite that the engineer ultimately arrived at. Shipper describes GPT-5.5 as "the first coding model I've used that has real conceptual clarity."

Engineers note that GPT-5.5 does not merely execute instructions. It understands the overall structure of a project, spots problems early, suggests what needs to be tested, and keeps changes consistent throughout the project without requiring step-by-step guidance. An engineer from NVIDIA who had early access to GPT-5.5 says that "losing access to GPT-5.5 feels like losing a limb."

GPT-5.5 Beyond Software Development

GPT-5.5 is not limited to software development. On GDPval, a benchmark that evaluates the quality of work produced in 44 different professions, it scores 84.9%. On OSWorld-Verified, which tests whether a model can autonomously operate a real computer, it achieves a success rate of 78.7%. Finally, on Tau2-bench Telecom, which simulates complex customer service exchanges, it reaches an accuracy of 98% without any prior tuning of its prompt.

For OpenAI, the most compelling evidence does not come from rankings but from real internal usage. The company's finance team entrusted Codex with processing nearly 25,000 U.S. tax forms, totaling over 71,000 pages of documents, and finished two weeks ahead of the usual schedule. The sales team has automated its weekly reports and saves up to ten hours of work per week. In total, over 85% of OpenAI employees now use Codex every week.

Scientific Advances with GPT-5.5

GPT-5.5 shows significant progress on advanced scientific benchmarks such as GeneBench, which assesses its capabilities in genetics, and BixBench, which focuses on biological data analysis. An internal version of the model also contributed to proving a long-unsolved mathematical conjecture related to Ramsey numbers. The proof was then formally verified in Lean, a proof assistant that mathematicians use to check that every step of an argument is rigorously correct. AI is no longer just assisting researchers; it is contributing to the advancement of knowledge.

Risks and Security Surrounding GPT-5.5

OpenAI is clearly taking the risks seriously. GPT-5.5 is officially classified at the high risk level for cybersecurity under the Preparedness Framework, the company's internal assessment grid. This means that the model is capable enough to potentially be misused for malicious purposes. OpenAI has therefore deployed stricter detection filters, reserved expanded access for verified cybersecurity professionals, and opened discussions with governments about protecting critical infrastructure.

Pricing and Accessibility of GPT-5.5

For developers who integrate GPT-5.5 into their applications via the API, billing is usage-based: $5 per million input tokens and $30 per million output tokens. The Pro version is significantly more expensive, at $30 and $180 per million tokens, respectively. These prices may seem high at first glance, but OpenAI says that GPT-5.5 needs fewer tokens than GPT-5.4 to accomplish the same tasks.
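To make these rates concrete, here is a minimal sketch of the arithmetic in Python. Only the four per-million-token prices come from the figures above; the function name, the tier labels, and the sample workload of 200,000 input tokens and 50,000 output tokens are illustrative assumptions, not part of any OpenAI API.

```python
# Prices quoted in the article, in US dollars per million tokens.
# The tier labels "standard" and "pro" are illustrative names only.
PRICES_PER_MILLION = {
    "standard": {"input": 5.0, "output": 30.0},
    "pro": {"input": 30.0, "output": 180.0},
}


def estimate_cost(input_tokens: int, output_tokens: int, tier: str = "standard") -> float:
    """Return the estimated cost in dollars for one workload."""
    rates = PRICES_PER_MILLION[tier]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


# Hypothetical workload: 200,000 input tokens and 50,000 output tokens.
standard_cost = estimate_cost(200_000, 50_000)
pro_cost = estimate_cost(200_000, 50_000, tier="pro")
print(f"standard: ${standard_cost:.2f}, pro: ${pro_cost:.2f}")
# prints "standard: $2.50, pro: $15.00"
```

Because output tokens cost six times as much as input tokens at both tiers, a model that produces shorter answers for the same task can lower the bill even when the per-token price is higher.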

GPT-5.5 is available today for Plus, Pro, Business, and Enterprise subscribers, both in ChatGPT and Codex. In Codex, the model can handle very long documents and exchanges thanks to a context window of 400,000 tokens. The even more powerful Pro version is reserved for Pro, Business, and Enterprise plans. As for developers who wish to integrate the model into their own applications, API access is expected to be announced very soon.

One concern remains that benchmarks cannot address: as OpenAI rolls out more powerful, more autonomous models to a wider audience, its promise of safeguards "keeping pace" will be difficult to uphold and, above all, hard to verify from the outside. GPT-5.5 may be the most impressive model OpenAI has ever released, but the company ultimately remains the sole judge of what its own model is capable of doing wrong.