GPT-5.5 Boosted by 23 Points with a Simple Markdown File

Key Takeaways

1. A 1,400-token Markdown file improved GPT-5.5 by 23.5 points without adjusting the model weights.

2. SkillOpt optimizes a skill file in Markdown, enhancing the model's performance across various benchmarks.

3. The gains are notable on procedural tasks like SpreadsheetBench, demonstrating the effectiveness of SkillOpt.

💡 Why it matters: Because only a document in the context window is optimized and no model weights are updated, this method offers a cost-effective alternative to traditional fine-tuning of AI models.

A Markdown File Revolutionizes GPT-5.5

In a surprising experiment, a Markdown file of just 1,400 tokens substantially improved the benchmark performance of GPT-5.5. With the model weights left untouched, placing this file in the context window was enough to raise the model's average across six benchmarks from 58.8 to 82.3, a gain of 23.5 points. The artifact itself is a plain text file that can be opened in any standard editor.

The Concept Behind SkillOpt

The article then explores the concept behind SkillOpt: a skills document written in Markdown is treated as trainable state, while the target model stays unchanged. During training, a more powerful optimizer model proposes bounded modifications that add, delete, or replace content. A modification is accepted only if it improves a predefined validation score, an acceptance rule that borrows the stability principles of gradient descent and applies them in text space.
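
The propose-and-accept loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SkillOpt's actual implementation: the names `Edit`, `apply_edit`, `train_skill`, `toy_score`, and `demo` are hypothetical, the proposer is a fixed list of edits standing in for the optimizer model, and the validation score is a made-up toy that counts how many required rules the document contains.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Edit:
    """One bounded modification to the skill document (hypothetical type)."""
    op: str                      # "add", "delete", or "replace"
    index: int                   # line position the edit applies to
    text: Optional[str] = None   # new line for "add" and "replace"


def apply_edit(lines: List[str], edit: Edit) -> List[str]:
    """Return a new document with the edit applied; the input is not mutated."""
    out = list(lines)
    if edit.op == "add":
        out.insert(edit.index, edit.text or "")
    elif edit.op == "delete":
        del out[edit.index]
    elif edit.op == "replace":
        out[edit.index] = edit.text or ""
    else:
        raise ValueError(f"unknown op: {edit.op}")
    return out


def train_skill(
    lines: List[str],
    propose: Callable[[List[str]], Optional[Edit]],
    score: Callable[[List[str]], float],
    steps: int,
) -> Tuple[List[str], float, int]:
    """Keep a candidate only when it strictly improves the validation score."""
    best, best_score, accepted = list(lines), score(lines), 0
    for _ in range(steps):
        edit = propose(best)          # stands in for the optimizer model
        if edit is None:              # proposer has nothing more to suggest
            break
        candidate = apply_edit(best, edit)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
            accepted += 1
    return best, best_score, accepted


# Toy validation score (an assumption): count required rules that are present.
REQUIRED = {"check the sheet structure first", "write values, not formulas"}


def toy_score(lines: List[str]) -> float:
    return float(len(REQUIRED & set(lines)))


def demo() -> Tuple[List[str], float, int]:
    """Run the loop with three scripted proposals; the second does not help."""
    proposals = iter([
        Edit("add", 1, "check the sheet structure first"),
        Edit("replace", 0, "be very helpful"),
        Edit("add", 1, "write values, not formulas"),
    ])
    return train_skill(["be helpful"], lambda _: next(proposals, None), toy_score, steps=10)
```

In this toy run the second proposal leaves the score unchanged, so it is rejected and the document keeps its earlier wording. Requiring a strict improvement is one simple way to keep the document from drifting; a real validation score would come from running the target model on held-out tasks.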

Impressive Results Across Various Benchmarks

The study reports results for 52 different combinations of models, benchmarks, and harnesses, and SkillOpt was the best or tied for the best in every one of them. In particular, GPT-5.5 saw its direct chat performance improve from 58.8 to 82.3 (+23.5 points), with the largest gains on procedural and format-verified tasks such as SpreadsheetBench.

Trained Skills in Action

The author describes the "trained" skills that result from this process. These include rules for checking structure, writing explicitly evaluated values, tracking state in embodied navigation, and anchoring responses to the correct row in a table. Notably, these improvements can come from only a few accepted modifications and a relatively small artifact.

Reproducing the Process with SkillOpt

For those looking to reproduce this workflow, the article provides a practical setup. This includes installing SkillOpt, configuring the backends, running the training loop, and deploying by prefixing the learned Markdown to the model's context.
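
The deployment step, prefixing the learned Markdown to the model's context, can be illustrated with a short Python sketch. It does not use SkillOpt's own tooling, whose commands are not shown here; `load_skill` and `build_context` are hypothetical helper names, and the `---` separator between the skill and the task is an assumption made for readability.

```python
from pathlib import Path


def load_skill(path: str) -> str:
    """Read the learned skill file from disk (the path is chosen by the caller)."""
    return Path(path).read_text(encoding="utf-8")


def build_context(skill_markdown: str, task: str) -> str:
    """Place the skill document ahead of the task so the model reads it first."""
    # The separator line is an illustrative choice, not a documented format.
    return f"{skill_markdown.strip()}\n\n---\n\n{task.strip()}"
```

The resulting string would then be sent to the unchanged target model through whichever backend is configured. For example, `build_context(load_skill("skill.md"), "Fill in column C.")` yields the skill text followed by the task, where `skill.md` is a placeholder file name.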

SkillOpt-Sleep: An Innovative Extension

The article also mentions SkillOpt-Sleep, a plugin-style extension that learns from a user's past transcripts. It runs an offline consolidation loop whose proposed changes are validated through review and adoption before they take effect.

Limitations and Perspectives

Finally, two limitations are addressed: the reliance on automatic scoring judges and the fact that optimization focuses on one document at a time. The article concludes by emphasizing that for procedural and verifiable agent tasks, training the document rather than the model proves to be a more reliable and cost-effective optimization method than traditional fine-tuning.