GPT-5.4 and Claude Opus 4.6 Fail the Bankers' Test

Key Takeaways

1. A study by Handshake AI and McGill University shows that current AI models cannot yet produce client-ready work in investment banking.

2. GPT-5.4, although the best of the tested models, fails nearly half of the benchmark's criteria.

3. BankerToolBench reveals subtle errors in AI deliverables that undermine their use in finance.

Why it matters: The results highlight the current limitations of AI in critical sectors like finance and show that significant improvements are still needed.

Evaluation of AI Models by Investment Bankers

A recent study by Handshake AI and McGill University tested how well AI models handle the typical tasks of junior investment bankers. The benchmark, called BankerToolBench, evaluated cutting-edge models such as GPT-5.4 and Claude Opus 4.6 on these tasks. The results are clear: no model produced deliverables that were ready for clients.

According to the study, while more than half of the bankers stated they would use AI results as a starting point, 41% of the results require major revision and 27% are deemed unusable. Only 13% of the results could be used with minor modifications, and none are ready to be sent as is.

BankerToolBench: A Rigorous Assessment

BankerToolBench evaluates actual deliverables, such as Excel financial models, PowerPoint presentations, PDF reports, and Word memos. AI agents must navigate data rooms and extract information from platforms like FactSet and Capital IQ. A single task may require up to 539 calls to the language model, with 97% related to tool usage or code execution.

Each deliverable is checked against a grid averaging 150 criteria that cover areas such as technical accuracy, client readiness, compliance, auditability, and consistency. An AI verifier called Gandalf, based on Gemini 3 Flash Preview, performed the assessment and agreed with human evaluators 88.2% of the time.

Performance of Tested Models

Among the tested models, GPT-5.4 achieved the best results but failed to meet nearly half of the criteria. Only 16% of its results were deemed useful as a starting point. No model produced results ready for submission without modification. For GPT-5.4, only 2% of tasks met all critical criteria.

The results of Claude Opus 4.6 initially seem promising, but the Excel models reveal major flaws, such as hard-coded key figures. This prevents any scenario analysis, a crucial element in investment banking.

Subtle Errors and Implications

The identified subtle errors include inconsistencies in revenue figures and color choices that do not comply with style guides. In one case, an agent fabricated clinical trial data after failing to find information in the SEC database.

A Training Tool and Its Limitations

BankerToolBench can also be used for reinforcement learning. In the authors' experiments, the Dr. GRPO and DPO methods improved benchmark performance, although starting from a very low baseline.

The study highlights the current limitations of AI in critical sectors like finance. The results align with other recent research indicating that AI agents are not yet ready for complex production tasks. Labs like Anthropic are working on these weaknesses, for example by letting Claude switch between Excel and PowerPoint on its own.

Real Excel Models, Not Just Text Responses

BankerToolBench evaluates actual deliverables that a junior banker would submit to a supervisor: Excel financial models with functional formulas, PowerPoint presentations for client meetings, PDF reports, and Word memos.

Agents must explore data rooms, extract information from market data platforms like FactSet and Capital IQ, and analyze SEC filings. According to the paper, a single task can trigger up to 539 calls to the language model, 97% of them related to tool usage or code execution.

Each deliverable is checked against a grid designed by bankers, with an average of 150 individual criteria. The criteria cover six areas, including technical accuracy, client readiness, compliance, auditability, and consistency across files.

The evaluation is conducted by an AI verifier that the authors built, called Gandalf, based on Gemini 3 Flash Preview. It agrees with human evaluators 88.2% of the time, slightly above the agreement rate of 84.6% between two human evaluators.
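How such an agreement rate can be computed is easy to sketch. The example below assumes simple percent agreement over pass/fail verdicts on individual criteria; the study's exact metric is not described here, and the function name and sample verdicts are hypothetical.

```python
def agreement_rate(labels_a, labels_b):
    """Share of criteria on which two graders give the same pass/fail verdict."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both graders must score the same criteria")
    if not labels_a:
        raise ValueError("need at least one criterion")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical verdicts on eight criteria (True = criterion met).
verifier = [True, True, False, True, False, True, True, False]
human = [True, True, False, False, False, True, True, False]

print(f"{agreement_rate(verifier, human):.1%}")  # prints 87.5%
```

The same function applied to two human graders would give the human-to-human baseline that the verifier is compared against.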

GPT-5.4 on Top, but Far from Ready

The team tested GPT-5.2, GPT-5.4, Claude Opus 4.5 and 4.6, Gemini 2.5 Pro, Gemini 3.1 Pro Preview, Grok 4, and the open-source models Qwen-3.5-397B and GLM-5. GPT-5.4 achieved the best results but still failed to meet nearly half of the criteria. Only 16% of its results were deemed a useful starting point. When consistency across three runs is required, this figure drops to 13%.

No model result was deemed ready for submission as is. With GPT-5.4, only 2% of tasks met all critical criteria. For Gemini 2.5 Pro, this figure was zero.
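The gap between a respectable overall score and a near-zero pass rate on critical criteria can be illustrated with a small sketch. It assumes each criterion is a pass/fail check flagged as critical or not; the `Criterion` class, the `score` function, and the sample criteria are hypothetical and not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    critical: bool  # assumption: critical criteria act as a hard gate
    met: bool


def score(criteria):
    """Return (share of criteria met, whether every critical criterion is met)."""
    share = sum(c.met for c in criteria) / len(criteria)
    all_critical = all(c.met for c in criteria if c.critical)
    return share, all_critical


# Hypothetical deliverable: most checks pass, but one critical check fails.
deliverable = [
    Criterion("balance sheet balances", critical=True, met=False),
    Criterion("sources cited", critical=False, met=True),
    Criterion("consistent fonts", critical=False, met=True),
    Criterion("page numbers present", critical=False, met=True),
]

print(score(deliverable))  # prints (0.75, False)
```

Under this kind of gate, a deliverable can satisfy three quarters of the checks and still fail outright, which is one way a model can meet about half of all criteria while almost never clearing every critical one.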

Beautiful on the Outside, Broken on the Inside

The results of Claude Opus 4.6 appear polished at first glance, according to the researchers. However, the Excel models reveal a fundamental flaw: most key figures are hard-coded as fixed values rather than calculated by formulas. This poses a major problem in investment banking, as it makes scenario analysis impossible. Changing the purchase price in the model does not update anything. Claude Opus 4.5 had the same issue.
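A rough sketch of how this flaw could be detected: treat the worksheet as a mapping from cell references to contents and flag every non-input cell that holds a literal number instead of a formula. The sheet layout, cell references, and function name are hypothetical, and this is not the benchmark's actual check.

```python
def hardcoded_cells(sheet, input_cells):
    """Return non-input cells that hold a literal value instead of a formula."""
    return sorted(
        ref
        for ref, content in sheet.items()
        if ref not in input_cells and not str(content).startswith("=")
    )


# Hypothetical mini model: B1 and B2 are inputs, B3 and B4 are outputs.
sheet = {
    "B1": 500.0,     # purchase price (input)
    "B2": 320.0,     # net assets acquired (input)
    "B3": "=B1-B2",  # goodwill as a formula: updates when B1 changes
    "B4": 180.0,     # the same result pasted as a number: never updates
}

print(hardcoded_cells(sheet, input_cells={"B1", "B2"}))  # prints ['B4']
```

With a real workbook, a library such as openpyxl returns formula cells as strings beginning with "=" when the file is loaded with its default settings, so the same test would apply cell by cell.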

GPT-5.4 scored 58.1 out of 100 overall and outperformed GPT-5.2 in 70% of head-to-head task comparisons. Claude Opus 4.6 and Gemini 3.1 Pro are nearly tied, while Grok 4 and Gemini 2.5 Pro lag far behind.

Subtle Errors That Go Unnoticed

The examples in the paper illustrate how subtle these failures can be. In one generated presentation, the verifier flags a revenue figure of $189.5 billion on one slide and $201.0 billion on the next, both covering the same period.
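A consistency check of this kind can be sketched by grouping every reported figure by metric and period and flagging any group with more than one distinct value. The tuple layout, slide numbers, period label, and EBITDA value are hypothetical; only the two revenue figures come from the example above.

```python
from collections import defaultdict


def find_conflicts(figures):
    """Return (metric, period) keys that are reported with more than one value."""
    seen = defaultdict(set)
    for _slide, metric, period, value in figures:
        seen[(metric, period)].add(value)
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}


# (slide number, metric, period, value in $ billions)
figures = [
    (3, "revenue", "same period", 189.5),
    (4, "revenue", "same period", 201.0),
    (4, "ebitda", "same period", 50.0),  # hypothetical, reported only once
]

print(find_conflicts(figures))
```

The hard part in practice is extracting the figures and matching their periods reliably; the comparison itself is trivial once they are in a table.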

In another case, the agent uses Netflix red as an accent color even though the bank's style guide mandates a uniform blue. In a competitive analysis for a pharmaceutical deal, an agent fabricated specific clinical trial data after failing to find information in the SEC database.

A Training Tool as Well

BankerToolBench can also be used for reinforcement learning, according to the authors. In experiments with Qwen-3-4B and 32B, the Dr. GRPO and DPO methods improved benchmark performance by a factor of five to thirteen, although starting from a very low base.

The team highlights several limitations: the benchmark is US-focused, lacks information on confidential transactions, and does not capture iterative teamwork within a real bank. Nevertheless, the authors describe it as one of the most detailed tests to date for assessing whether AI agents can handle demanding knowledge work. For now, the answer is no. The complete benchmark, including data, grids, and the verifier, is publicly available.

The results align with other recent research. A study by Vals.ai conducted with a global systemically important bank found that OpenAI's o3 achieved only 48.3% accuracy on financial analysis tasks. Research from UC Berkeley concluded that teams that manage to get agents working in production rely on simple, tightly controlled setups with few steps. An analysis from Carnegie Mellon and Stanford argues that agent development has focused too narrowly on coding tasks, leaving economically important areas like management, law, and finance largely absent from benchmarks.

Meanwhile, AI labs like Anthropic are working on the weaknesses that BankerToolBench exposes. Anthropic recently introduced a feature allowing Claude to switch between Excel and PowerPoint on its own, and the Cowork plugins now directly integrate market data services like FactSet, MSCI, and LSEG into the workflow.