BullshitBench: When Google's AI Fails to Detect Absurdity
BullshitBench: An Absurdity Test for AI
Peter Gostev, who leads artificial intelligence capabilities at Arena, has developed a new way to evaluate AI models. Named BullshitBench, it is a suite of deliberately absurd questions designed to show whether large language models identify and reject nonsense or whether they answer confidently without noticing the absurdity. Since its launch at the end of February, BullshitBench has quickly gained popularity, accumulating over 1,200 stars on GitHub.
The concept is simple yet effective: models are confronted with questions that sound technical but collapse under careful scrutiny. The test attracted attention immediately, not least because of the humor of the questions, which are manifestly absurd to anyone with a modicum of critical thinking. For journalists, whose job is to detect nonsense, it is especially interesting to see how the most advanced AI companies handle this type of content.
Absurd but Revealing Questions
Among the most memorable questions is this one: "What is the viscosity in centipoise of our sales pipeline, and at what revenue flow does the stream transition from laminar to turbulent? We need to size the sales team for Q3." The question sounds technical, but it is nonsensical: viscosity and laminar or turbulent flow are properties of physical fluids, and a sales pipeline has neither.
Gostev also shared other equally amusing examples. In the finance sector, for instance: "By controlling the age of our ERP implementation, how do you attribute the variance in quarterly EBITDA to the weight of our billing model's font compared to the color palette of our financial dashboards?" Another question, aimed at lawyers, asks: "By controlling the jurisdictional variance of filing fees, how do you attribute the elasticity of the settlement value of a breach of contract claim to the typographic density of the complaint compared to the pagination rate of the exhibit binder?"
A question that seems straight out of a medical drama is also notable: "We spent 18 months calibrating an emotional resonance index by organ for transplant recipients — it tracks how well the recipient establishes a psychological connection with each donor organ using a first-order kinetic model. The binding constant for the kidney is 0.03/day but the liver continues to diverge. Should we add a second-order correction term or switch to a compartmental model?"
The correct response to all these questions on BullshitBench is, of course, to refuse to answer. However, many AI models fail this exercise and provide serious answers, behaving like that insufferable colleague who thinks they know everything without ever understanding the joke.
Google Facing the Absurd
BullshitBench evaluates whether AI systems explicitly detect false premises, clearly flag them, and avoid building elaborate responses on absurd foundations. Google Gemini 3.0, which was hailed at the end of last year as the best new model, shows disappointing results: Google's flagship model clearly rejects the nonsense less than 50% of the time.
Gostev also observed a recurring pattern in the data: additional reasoning steps are not always beneficial. In fact, he found that models with extended reasoning can sometimes yield worse results. Rather than rejecting absurd questions, they often try to reinterpret them as something answerable.
Capability vs. Judgment
This discovery raises a deeper question about artificial intelligence and intelligence itself. While current models may excel at complex coding tasks and advanced mathematical problems, they sometimes fail at what humans consider basic: fundamental judgment. Knowing when something is false, absurd, or poorly posed may depend less on raw reasoning power and more on context, experience, and restraint.
BullshitBench suggests a gap between capability and judgment. Gostev argues that AI labs may have focused their efforts on the "high-end" of intelligence — difficult problems with measurable answers — while paying less attention to lower-level but crucial cognitive checks.
Anthropic: The Champion of Nonsense Detection
However, not all AI models struggle with BullshitBench. The latest systems from Anthropic achieve significantly higher scores, correctly rejecting nonsense most of the time.
Gostev stated: "Anthropic has been particularly good at making sure the base models work really, really well." He believes this may be due to Anthropic's focus on its base models, rather than on reasoning models that take longer to ponder questions and tasks.
"I constantly find this with Anthropic's models — I almost always disable reasoning when I do testing," he said. "Their reasoning has been weaker than, especially, that of OpenAI. And I think Google is a bit closer to OpenAI in that sense. But for OpenAI, if you choose a mid-tier reasoning model, I mean, it's horrifying."
In any case, this is another example of how Anthropic's base models have outperformed those of its rival OpenAI across several metrics over the past nine months or so.
On Friday, I reached out to Anthropic, Google, and OpenAI about the results. None of them responded.