LLM: When Misleading Evaluations Threaten Production

Key Takeaways

1. Evaluations of LLMs often rely on impressions, making it difficult to detect subtle yet critical errors.

2. A minor modification of the prompt revealed flaws in the scoring system, illustrating the complexity of evaluations.

3. The lack of distinction between attribution and specificity in scores can lead to undetected hallucinations.

💡 Why it matters: Evaluation errors in LLMs can result in the deployment of incorrect responses, potentially affecting the reliability of automated systems.

Introduction to Language Model Evaluation

In artificial intelligence, the evaluation of large language models (LLMs) often relies on subjective methods: teams simply read the responses and guess at their accuracy. This approach breaks down at scale. The real challenge lies not only in the models' hallucinations but also in the lack of mechanisms to detect answers that seem correct but are not. For example, a response with a score of 0.525 clears a threshold of 0.5, giving a false impression of reliability.

To address this issue, a new scoring layer has been developed, dividing the notion of fidelity into two distinct signals: attribution and specificity. High specificity combined with low attribution is often a sign of hallucination, something that a single score cannot effectively capture.

The Impact of a Simple Line of Code

A revealing incident highlighted the flaws in the evaluation system. Four words added to a system prompt, "be specific and detailed," were enough to expose them. After this change was introduced on a Tuesday afternoon, the next batch of tests produced an erroneous response: "Contextual engineering was invented at MIT in 1987 and is primarily used for optimizing hardware caches in CPUs." This response, while specific, was entirely fabricated by the model.

The scoring system assigned this response a score of 0.525, exceeding the 0.5 threshold and giving it a green light. It was only by chance that the error was detected, as "1987" seemed incorrect. Upon verification, it turned out that every specific detail of this sentence was invented. The score had increased due to the heightened specificity, but the quality had dropped as the model became more confident in its fabrications. This discovery underscored that reliance on a single score to evaluate fidelity was insufficient.
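The arithmetic behind this kind of failure can be sketched as follows. The source does not say how the single score was computed, so the equal-weight average and the two component values below are assumptions chosen only to show how a blend can land on 0.525 while one signal is very low.

```python
THRESHOLD = 0.5  # pass threshold mentioned in the text


def blended_score(attribution: float, specificity: float, w_attr: float = 0.5) -> float:
    """Weighted average of the two signals (hypothetical weighting)."""
    return w_attr * attribution + (1.0 - w_attr) * specificity


# Hypothetical component values: weak grounding, very concrete wording.
score = blended_score(attribution=0.10, specificity=0.95)

print(f"{score:.3f}", score > THRESHOLD)  # 0.525 True
```

Under these assumptions, raising specificity alone pushes the blend over the threshold even though attribution stays near zero, which is the pattern the incident describes.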

The Limitations of Current Evaluation Systems

LLM evaluation systems often fail in three main ways, typically before anyone notices. First, a response that seems correct is not always accurate: fluency and structure can mask factual errors. Second, the most problematic hallucinations are not the ones that are easy to spot. No one deploys a model that claims the Eiffel Tower is in Berlin. It is the domain-specific assertions that seem correct to non-experts that pose the greatest problems. Finally, the fundamental issue is that a single score does not constitute a decision. A threshold set at 0.5 passes a response scoring 0.51 just as it passes one scoring 0.95, even though the former may warrant human review.
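One way to turn a score into a decision rather than a bare pass/fail is to add a review band above the threshold. The sketch below is illustrative only: the `decide` function and the 0.8 upper bound are assumptions, not values from the source.

```python
def decide(score: float, reject_below: float = 0.5, accept_from: float = 0.8) -> str:
    """Map a score to one of three outcomes instead of a binary pass/fail.

    reject_below matches the 0.5 threshold in the text; accept_from is a
    hypothetical upper bound for the human-review band.
    """
    if score < reject_below:
        return "reject"
    if score < accept_from:
        return "review"  # passes the threshold, but only barely
    return "accept"


for s in (0.43, 0.51, 0.95):
    print(s, decide(s))
```

With these bounds, 0.51 and 0.95 no longer receive the same treatment: the first is routed to review, the second is accepted.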

The Requirements for an Effective Evaluation System

Before developing a solution, five strict constraints were established. The system had to operate in milliseconds, as slowing down user responses is not feasible. There should be no API calls on the standard path, and the LLM judge had to be a fallback solution, as the costs per evaluation call are not viable at scale. Additionally, the system had to ensure that the same input produced the same score every time; otherwise, regression tests would be meaningless.
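The determinism and latency constraints can be checked mechanically. The following is a minimal sketch; `toy_scorer` is a hypothetical stand-in for a real scorer, and the helper names are assumptions made for illustration.

```python
import statistics
import time
from typing import Callable

Scorer = Callable[[str], float]


def is_deterministic(scorer: Scorer, text: str, runs: int = 5) -> bool:
    """True if repeated calls on the same input return exactly the same score."""
    return len({scorer(text) for _ in range(runs)}) == 1


def median_latency_ms(scorer: Scorer, text: str, runs: int = 21) -> float:
    """Median wall-clock time of one scoring call, in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        scorer(text)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def toy_scorer(text: str) -> float:
    """Hypothetical pure function: no randomness, no network calls."""
    return min(1.0, len(text.split()) / 50.0)


sample = "Context engineering selects what information a model sees."
print(is_deterministic(toy_scorer, sample))
print(f"{median_latency_ms(toy_scorer, sample):.4f} ms")
```

A scorer that calls a sampling LLM on the standard path would typically fail the first check, which is one reason to keep the LLM judge as a fallback.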

The other two constraints concerned explainability. Every rejection had to be accompanied by a plain-language explanation, not just a numerical score, as "score: 0.43" does not indicate what needs to be corrected.
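A verdict that carries its own explanation might look like the sketch below. The `Verdict` type, the `explain` function, and both cut-off values are hypothetical; they only illustrate attaching reasons to a rejection using the attribution and specificity signals described earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    passed: bool
    reasons: tuple[str, ...]  # empty when the response passes


def explain(
    attribution: float,
    specificity: float,
    min_attribution: float = 0.5,   # hypothetical cut-off
    high_specificity: float = 0.7,  # hypothetical cut-off
) -> Verdict:
    """Return a pass/fail verdict together with plain-language reasons."""
    reasons = []
    if attribution < min_attribution:
        reasons.append(
            f"Attribution {attribution:.2f} is below {min_attribution:.2f}: "
            "claims are not supported by the provided context."
        )
        if specificity >= high_specificity:
            reasons.append(
                f"Specificity {specificity:.2f} is high while attribution is low: "
                "concrete details may be fabricated."
            )
    return Verdict(passed=not reasons, reasons=tuple(reasons))


verdict = explain(attribution=0.10, specificity=0.95)
print(verdict.passed)
for reason in verdict.reasons:
    print("-", reason)
```

Unlike "score: 0.43", each reason names the signal that failed and what that failure suggests.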

The Architecture of the System

The evaluation system is structured in three layers, each with a specific role. The first layer produces numbers, the second converts these numbers into a verdict with a comprehensive explanation. This latter part is often overlooked by existing systems, but it is crucial for understanding why a response fails in production.

The Fundamental Dimensions of Evaluation

Fidelity, initially measured by a single score, has been divided into two distinct checks: attribution and specificity. Attribution checks whether the response is supported by the context, while specificity assesses how detailed and concrete the response is. This distinction is essential for identifying hallucinations, where a response seems detailed but is not actually based on the provided context.
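The two checks can be approximated with cheap, deterministic heuristics. The sketch below is not the system's actual scoring method, which the text does not specify: it assumes token overlap as a proxy for attribution and the share of numbers and capitalized tokens as a proxy for specificity. The context string is also invented for the example.

```python
import re

_WORD = re.compile(r"[A-Za-z0-9]+")
_STOPWORDS = frozenset(
    {"a", "an", "the", "is", "was", "are", "and", "or", "of", "in",
     "at", "for", "to", "by", "on", "with", "it"}
)


def _content_tokens(text: str) -> list[str]:
    """Alphanumeric tokens with common function words removed."""
    return [t for t in _WORD.findall(text) if t.lower() not in _STOPWORDS]


def attribution(response: str, context: str) -> float:
    """Share of the response's content tokens that also occur in the context."""
    tokens = [t.lower() for t in _content_tokens(response)]
    if not tokens:
        return 0.0
    vocab = {t.lower() for t in _content_tokens(context)}
    return sum(t in vocab for t in tokens) / len(tokens)


def specificity(response: str) -> float:
    """Share of content tokens that look concrete: digits or a leading capital.

    Crude on purpose: a sentence-initial capital also counts as concrete.
    """
    tokens = _content_tokens(response)
    if not tokens:
        return 0.0
    concrete = [t for t in tokens if t[0].isupper() or any(c.isdigit() for c in t)]
    return len(concrete) / len(tokens)


# Hypothetical context; the response is the fabricated answer quoted in the text.
context = "Context engineering is the practice of selecting what information a model sees."
fabricated = ("Contextual engineering was invented at MIT in 1987 and is primarily "
              "used for optimizing hardware caches in CPUs.")

print(f"attribution={attribution(fabricated, context):.2f}")
print(f"specificity={specificity(fabricated):.2f}")
```

Keeping the two numbers separate is the point: the fabricated answer shares almost no content with the context while still carrying names and dates, a combination that a single blended score would hide.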