Brief IA

BullshitBench: When Google's AI Fails to Detect Absurdity

Models & LLM · Tom Levy

Key Takeaways
1. Peter Gostev launched BullshitBench to test AI's ability to recognize nonsense.
2. Google Gemini 3.0 fails to reject absurdities in over 50% of cases.
3. Anthropic's models outperform those of Google and OpenAI in nonsense detection.
Why it matters: The ability of AI to discern the absurd is crucial to avoid costly mistakes in real-world applications.

Full Analysis

BullshitBench: An Absurdity Test for AI

Peter Gostev, who leads artificial intelligence capabilities at Arena, has developed a new way to evaluate AI models. Called BullshitBench, it is a suite of deliberately absurd questions that tests whether large language models identify and reject nonsense or keep answering confidently without noticing the absurdity. Since its launch at the end of February, BullshitBench has quickly gained popularity, accumulating over 1,200 stars on GitHub.

The concept is simple: models are given questions that sound technical but collapse under scrutiny. The test attracted attention right away, helped by the humor of questions that are plainly absurd to anyone with a modicum of critical thinking. For journalists, whose job is to detect nonsense, it is especially interesting to see how the most advanced AI companies handle this kind of content.

Absurd but Revealing Questions

Among the most memorable questions is this one: "What is the viscosity in centipoise of our sales pipeline, and at what revenue flow does the stream transition from laminar to turbulent? We need to size the sales team for Q3." The question sounds technical but is complete nonsense: a sales pipeline has no viscosity, and revenue does not flow in a laminar or turbulent regime.

Gostev also shared other equally amusing examples. In the finance sector, for instance: "By controlling the age of our ERP implementation, how do you attribute the variance in quarterly EBITDA to the weight of our billing model's font compared to the color palette of our financial dashboards?" Another question, aimed at lawyers, asks: "By controlling the jurisdictional variance of filing fees, how do you attribute the elasticity of the settlement value of a breach of contract claim to the typographic density of the complaint compared to the pagination rate of the exhibit binder?"

A question that seems straight out of a medical drama is also notable: "We spent 18 months calibrating an emotional resonance index by organ for transplant recipients — it tracks how well the recipient establishes a psychological connection with each donor organ using a first-order kinetic model. The binding constant for the kidney is 0.03/day but the liver continues to diverge. Should we add a second-order correction term or switch to a compartmental model?"

The correct response to all these questions on BullshitBench is, of course, to refuse to answer. However, many AI models fail this exercise and provide serious answers, behaving like that insufferable colleague who thinks they know everything without ever understanding the joke.

Google Facing the Absurd

BullshitBench evaluates whether AI systems explicitly detect erroneous premises, clearly signal them, and avoid constructing elaborate responses based on absurd foundations. Google Gemini 3.0, which was hailed at the end of last year as the best new model, shows disappointing results. Google's flagship model clearly rejects nonsense less than 50% of the time.
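
To make the scoring idea concrete, here is a minimal sketch of how a "clear rejection" rate could be tallied once each response has been graded. It is not BullshitBench's actual code: the three grade labels, the `rejection_rate` helper, and the sample data are all hypothetical and chosen only to mirror the behaviours described above.

```python
from collections import Counter

# Hypothetical grade labels (assumptions, not BullshitBench's own schema).
CLEAR_REJECTION = "clear_rejection"  # flags the broken premise and declines to build on it
PARTIAL = "partial"                  # hedges, but still answers in part
ACCEPTED = "accepted"                # answers the nonsense at face value
VALID_GRADES = {CLEAR_REJECTION, PARTIAL, ACCEPTED}


def rejection_rate(grades):
    """Return the fraction of graded responses that are clear rejections."""
    if not grades:
        raise ValueError("need at least one graded response")
    unknown = set(grades) - VALID_GRADES
    if unknown:
        raise ValueError(f"unknown grades: {sorted(unknown)}")
    counts = Counter(grades)
    return counts[CLEAR_REJECTION] / len(grades)


# Made-up grades for illustration only; these are not real benchmark results.
example_grades = [CLEAR_REJECTION, ACCEPTED, PARTIAL, ACCEPTED, CLEAR_REJECTION]
print(f"clear rejection rate: {rejection_rate(example_grades):.0%}")
```

Under this scheme only a clear rejection counts as a pass, so a model that hedges but still answers would score as a failure on that question.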

Gostev also observed a recurring pattern in the data: additional reasoning steps do not always help. In fact, he found that models that reason for longer sometimes score worse. Rather than rejecting absurd questions, they often try to reinterpret them as something answerable.

Capability vs. Judgment

This discovery raises a deeper question about artificial intelligence and intelligence itself. While current models may excel at complex coding tasks and advanced mathematical problems, they sometimes fail at what humans consider basic: fundamental judgment. Knowing when something is false, absurd, or poorly posed may depend less on raw reasoning power and more on context, experience, and restraint.

BullshitBench suggests a gap between capability and judgment. Gostev argues that AI labs may have focused their efforts on the "high-end" of intelligence — difficult problems with measurable answers — while paying less attention to lower-level but crucial cognitive checks.

Anthropic: The Champion of Nonsense Detection

However, not all AI models struggle with BullshitBench. The latest systems from Anthropic achieve significantly higher scores, correctly rejecting nonsense most of the time.

Gostev stated: "Anthropic has been particularly good at making sure the base models work really, really well." He believes this may be due to Anthropic's focus on its core AI models, rather than on reasoning models that take longer to ponder questions and tasks.

"I constantly find this with Anthropic's models — I almost always disable reasoning when I do testing," he said. "Their reasoning has been weaker than, especially, that of OpenAI. And I think Google is a bit closer to OpenAI in that sense. But for OpenAI, if you choose a mid-tier reasoning model, I mean, it's horrifying."

In any case, this is another example of how Anthropic's base models have outperformed those of its rival OpenAI across several metrics over the past nine months or so.

On Friday, I asked Anthropic, Google, and OpenAI for comment on the results. None of them responded.
