Brief IA

Bridgewater: An Open-Source AI Model Outperforms GPT and Claude

💻 Code & Dev · Tom Levy

Key Takeaways
1. Bridgewater and Thinking Machines Lab tested AIs on financial document analysis.
2. A fine-tuned open-source model outperformed GPT and Claude at a lower cost.
3. The results show that proprietary AIs do not always dominate specialized tasks.
💡 Why it matters: This could influence the technology choices of companies seeking cost-effective and efficient AI solutions.

📄 Full Analysis

Bridgewater: An Open-Source AI Model Surpasses GPT and Claude

Bridgewater and Thinking Machines Lab fine-tuned the open-weight Qwen3-235B model to analyze financial documents, and they report that it outperforms leading commercial models. Refined with internal expert knowledge, the model reaches nearly 85% accuracy in their tests and costs nearly 14 times less to operate.

This suggests that companies can build powerful AI solutions on their own data without having to share sensitive information with large providers.

The hedge fund Bridgewater and Thinking Machines Lab assert that a fine-tuned open-weight model surpasses the most advanced AI models in evaluating financial documents, and this is achieved at a fraction of the cost. The figures come from their own internal assessment.

Investors are inundated with news, analyses, corporate filings, and emails every day. According to a report from Bridgewater's AIA Labs and Thinking Machines Lab, reading is not the real work. The real work is a constant stream of small, repeated decisions about what truly matters. It is this task that the researchers set out to automate.

They defined six tasks drawn from an investor's daily routine. One example: deciding whether a financial article is relevant to an executive. Another: determining whether a central bank document signals the direction of future rate changes. Investors make these calls easily but struggle to articulate their reasoning. The report provides a telling example: a headline about Trump's claim regarding Greenland is deemed irrelevant, while Trump's threat to impose new tariffs on China is considered highly relevant. Both touch on geopolitics and finance, so topic alone does not separate them.

The leading models failed in the authors' tests. Variants of Gemini, Claude, and GPT achieved only about 50% accuracy with a basic prompt. Expert-written instructions and a three-tiered rating system ("relevant and interesting," "relevant but not interesting," "not relevant") raised accuracy to the 70% range. That still fell short of the 80% threshold the authors had set for reliable deployment.

When experts write the prompt, performance significantly improves compared to a naive prompt.
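The comparison against the 80% bar can be sketched in a few lines of Python. This is only an illustration of how such a check might look, not the authors' evaluation code: the three labels come from the report, while the function names `accuracy` and `ready_to_deploy` are hypothetical.

```python
# The three rating tiers named in the report.
LABELS = (
    "relevant and interesting",
    "relevant but not interesting",
    "not relevant",
)

# Deployment bar the authors set for themselves (80% accuracy).
DEPLOY_THRESHOLD = 0.80


def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the expert label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one labeled example")
    for label in list(predictions) + list(gold):
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def ready_to_deploy(predictions, gold, threshold=DEPLOY_THRESHOLD):
    """True when measured accuracy meets or exceeds the deployment bar."""
    return accuracy(predictions, gold) >= threshold
```

Under this check, a model scoring in the 70% range on expert-labeled documents would be rejected, which matches the outcome the report describes for the prompted commercial models.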

Newer models barely improve cost-effectiveness, according to the report. GPT-5.4 costs 43% more than GPT-5.2, yet is only marginally more accurate.

The True Value Lies in the Minds of Investors

The solution was fine-tuning: further training an open-weight model on proprietary examples. The key ingredient was the judgment of Bridgewater's investors. Initially, low-cost external contractors labeled the documents, but many of these labels were incorrect. To avoid having expensive professionals re-check everything, the researchers used a workaround. An initial model learned from the faulty labels and then re-evaluated the same documents. Wherever the model and the original label disagreed, an error was likely. Only these disputed cases were sent to investors for correction.
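The disagreement-based triage described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the researchers' pipeline: `select_for_expert_review`, `apply_expert_corrections`, and the keyword-based `toy_model` are hypothetical stand-ins, and the headlines are invented examples modeled on the ones the report mentions.

```python
def select_for_expert_review(documents, contractor_labels, model_predict):
    """Return indices where the model's prediction differs from the contractor label."""
    disputed = []
    for i, (doc, label) in enumerate(zip(documents, contractor_labels)):
        if model_predict(doc) != label:
            disputed.append(i)
    return disputed


def apply_expert_corrections(contractor_labels, disputed, expert_label):
    """Copy the labels, replacing only the disputed ones with the expert's verdict."""
    cleaned = list(contractor_labels)
    for i in disputed:
        cleaned[i] = expert_label(i)
    return cleaned


def toy_model(text):
    """Stand-in for the initial model trained on noisy labels (assumption: a keyword rule)."""
    lowered = text.lower()
    if "tariff" in lowered or "rate" in lowered:
        return "relevant"
    return "not relevant"


# Invented headlines and contractor labels, for illustration only.
documents = [
    "Trump repeats claim on Greenland",
    "Trump threatens new tariffs on China",
    "Central bank signals direction of rates",
]
contractor_labels = ["relevant", "relevant", "relevant"]

disputed = select_for_expert_review(documents, contractor_labels, toy_model)
# Experts see only the disputed items; here the expert verdict is hard-coded.
cleaned = apply_expert_corrections(contractor_labels, disputed, lambda i: "not relevant")
print(disputed, cleaned)
```

The point of the design is cost: expert time is spent only on the subset of documents where the cheap label and the model disagree, rather than on the full dataset.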

Training ran on Thinking Machines Lab's Tinker platform, starting from the open-weight Qwen3-235B model. In the team's evaluation, the fine-tuned model achieved 84.7% accuracy, compared to 78.2% for the best leading model tested. It also cost nearly 14 times less to operate. This is not a truly independent comparison, of course: both companies have a clear interest in selling their products.
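To put the reported figures in perspective, the short calculation below derives the accuracy gap and the relative reduction in errors. Only the inputs (84.7%, 78.2%, and the roughly 14x cost ratio) come from the article; the derived values and the helper names are an illustration, not figures from the report.

```python
FINE_TUNED_ACC = 0.847   # reported accuracy of the fine-tuned Qwen3-235B
BEST_FRONTIER_ACC = 0.782  # reported accuracy of the best leading model tested
COST_RATIO = 14          # "nearly 14 times" cheaper, treated here as exactly 14


def accuracy_gap_points(new_acc, old_acc):
    """Absolute difference in accuracy, in percentage points."""
    return (new_acc - old_acc) * 100


def relative_error_reduction(new_acc, old_acc):
    """Share of the old model's errors that the new model eliminates."""
    old_error = 1 - old_acc
    if old_error <= 0:
        raise ValueError("old model has no errors to reduce")
    return (new_acc - old_acc) / old_error


gap = accuracy_gap_points(FINE_TUNED_ACC, BEST_FRONTIER_ACC)
reduction = relative_error_reduction(FINE_TUNED_ACC, BEST_FRONTIER_ACC)
relative_cost = 1 / COST_RATIO

print(f"accuracy gap: {gap:.1f} points")
print(f"relative error reduction: {reduction:.1%}")
print(f"operating cost relative to the frontier model: {relative_cost:.1%}")
```

On these inputs the gap is 6.5 percentage points, which corresponds to roughly 30% fewer errors at about 7% of the operating cost, assuming the cost ratio is taken at face value.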

Still, the finding behind the numbers is worth noting. It shows once again that large labs like OpenAI have not absorbed all available data. Huge pools of proprietary corporate data, and human expertise that no model has been trained on, still exist and offer real potential for improvement. This is particularly true when companies deliberately keep their most valuable data private. Anyone who hands this data over to a leading lab risks competing with a product built on that foundation.

Fine-tuning open models using tools like Tinker offers companies an alternative. They retain the weights, the data, and, depending on the setup, the GPUs themselves.
