Generative AI: From Fascination to Reliability, a Colossal Challenge


Key Takeaways

1. Generative AI sparks fascination and curiosity, but its reliability remains a major challenge.

2. Hallucinations in AI models illustrate the limitations of current probabilistic systems.

3. The confidence that AI models report is often overstated, masking their actual uncertainty.

💡 Why it matters: The reliability of AI is crucial for its integration into critical decision-making processes.

The Challenge of Making Generative AI Reliable

For several years, conversations around generative AI have proliferated, captivating both the general public and technical experts. These discussions often focus on the impressive capabilities of AI models, whether it's writing complex programs or composing songs on various themes, such as love for a pet. The fascination with what these models can achieve is undeniable.

However, what matters is whether these accomplishments are of high quality. That a task is possible does not guarantee that it is done well. Those who have studied probability or statistics know that with a sufficiently large sample space, almost anything can happen. The real challenge lies in predicting how likely these outcomes are and in being able to rely on them repeatedly.

This distinction is crucial when building AI systems for production applications, where consistency is paramount. Demonstrations, by contrast, tend to showcase spectacular edge cases. As AI becomes integrated into decision-making processes, it is vital to revisit the foundations of probability theory to understand where reliability assumptions begin to crack.

1. The Complexity of Possibility Spaces

Building reliable AI systems is far more challenging than discussing them. To understand why, it helps to consider sample spaces. Take the simple example of a coin toss, where the possible outcomes are limited to heads or tails. In contrast, a language model generating a sequence of 512 tokens from a vocabulary of 50,000 tokens has a sample space of size 50,000^512, a number with more than 2,400 digits.
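To make the scale concrete, here is a minimal sketch that works with the logarithm of the sample-space size instead of the number itself. The function name `sample_space_log10` is illustrative, and the figures are the ones used in the paragraph above.

```python
import math

def sample_space_log10(vocab_size: int, seq_len: int) -> float:
    """Return log10 of vocab_size ** seq_len, the number of possible sequences."""
    return seq_len * math.log10(vocab_size)

coin_outcomes = 2  # heads or tails
llm_log10 = sample_space_log10(50_000, 512)
digits = math.floor(llm_log10) + 1  # number of decimal digits in 50,000^512

print(f"Coin toss: {coin_outcomes} outcomes")
print(f"512 tokens, 50,000-token vocabulary: a {digits}-digit number of sequences")
```

Working in log space avoids building the huge integer, although Python could also compute `50_000 ** 512` exactly.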

In such a context, the proportion of useful and coherent results is minuscule compared to the countless plausible alternatives. When the model produces a response that is possible but unlikely, it is referred to as a hallucination. These hallucinations are not necessarily software errors, but rather the result of sampling from low-probability regions.

It would be tempting to think that adding more data could eliminate these hallucinations. However, in probabilistic systems, sampling always carries the risk of falling into low-probability areas.

2. Frequentism and Bayesianism in AI Evaluation

The evaluation of AI systems often relies on two distinct approaches. The first, inspired by frequentism, involves executing 1,000 benchmark tasks and measuring performance. If a model successfully completes 850 of these tasks, it is considered to have an accuracy of 85%.
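As a rough illustration of the frequentist view, the sketch below computes the 85% figure together with a normal-approximation 95% interval. The interval formula assumes the 1,000 tasks are independent trials with a fixed success rate, which is a simplifying assumption and not something the benchmark guarantees.

```python
import math

def accuracy_with_interval(successes: int, trials: int, z: float = 1.96):
    """Point estimate and normal-approximation interval for a pass rate.

    Assumes independent trials with a fixed success probability.
    """
    p_hat = successes / trials
    std_err = math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat, p_hat - z * std_err, p_hat + z * std_err

p_hat, low, high = accuracy_with_interval(850, 1000)
print(f"accuracy = {p_hat:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

If the tasks are correlated, the real uncertainty is wider than this interval suggests.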

The second approach, Bayesian, starts with expectations about the behavior of an intelligent system and adjusts these beliefs in the face of unexpected failures. This distinction is crucial because prompts are generally not independent events. For example, if a model answers nine math questions correctly, that does not guarantee the same accuracy for the tenth question.
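One simple way to picture the Bayesian view is a Beta-Bernoulli update, sketched below. The prior `Beta(9, 1)` and the observed outcomes are hypothetical values chosen for illustration, and the model still treats outcomes as exchangeable, so it is a simplification of the dependence between prompts described above.

```python
def update_beta(alpha: float, beta: float, outcomes):
    """Update a Beta(alpha, beta) belief about the success rate.

    Each outcome is 1 (success) or 0 (failure).
    """
    for ok in outcomes:
        if ok:
            alpha += 1
        else:
            beta += 1
    return alpha, beta

def beta_mean(alpha: float, beta: float) -> float:
    """Expected success rate under a Beta(alpha, beta) belief."""
    return alpha / (alpha + beta)

prior = (9.0, 1.0)          # hypothetical prior: we expect roughly 90% success
observed = [1, 1, 1, 0, 0]  # hypothetical run with two unexpected failures
posterior = update_beta(*prior, observed)

print(f"prior mean = {beta_mean(*prior):.2f}")
print(f"posterior mean = {beta_mean(*posterior):.2f}")
```

The two failures pull the expected success rate down from the prior, which is the belief adjustment the paragraph describes.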

Language models do not operate like a series of independent Bernoulli trials. Their outputs depend on prior context, hidden representations, and the density of related examples within the training distribution, making their accuracy often conditional rather than fixed.

3. The Confidence of AI Models: An Illusion?

In the field of machine learning, the softmax function is frequently used to interpret model outputs as confidence scores. For instance, a model that assigns a 90% probability to a prediction seems confident. However, this interpretation can be misleading.

The softmax function amplifies small differences between logits due to its exponential term, and its outputs always sum to 1, so it expresses relative preference among the options rather than absolute evidence. Thus, a model may appear confident not because it "knows" something, but because a slight difference has been amplified. This leads to what is known as the "overconfident fool" problem, where a system confidently asserts something incorrect without expressing uncertainty.
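The effect can be seen in a small sketch using only the standard library. The logit values below are hypothetical, and the two-class setup is a simplification of a real vocabulary-sized output.

```python
import math

def softmax(logits):
    """Numerically stable softmax: shift by the max logit before exponentiating."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two-class case: only the gap between the logits matters.
for gap in (0.5, 1.0, 2.2, 5.0):
    top = softmax([gap, 0.0])[0]
    print(f"logit gap {gap}: top probability = {top:.3f}")

# Shifting every logit by the same amount leaves the output unchanged,
# so a high score reflects the gap, not the absolute size of the logits.
print(softmax([2.2, 0.0])[0], softmax([102.2, 100.0])[0])
```

A gap of about 2.2 between two logits is already enough to produce a score of roughly 90%.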

4. The Law of Large Numbers and Its Limits

The law of large numbers states that as the sample size grows, observed averages converge toward their expected values. This idea motivates the use of massive datasets for training models. One might think that by seeing enough examples, a model would eventually learn the truth.

However, this assumption relies on the stability of the underlying distribution. Human language does not offer that stability: it is constantly evolving and contains contradictions, biases, and inaccuracies. Therefore, the model does not necessarily converge toward the "truth," but rather toward dominant patterns. A common misconception in the data may thus be learned by the model as a likely continuation.
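A toy simulation illustrates the point. The 70% share of documents repeating a false claim is a hypothetical figure, and `estimate_frequency` is an illustrative helper, not a model of real training.

```python
import random

def estimate_frequency(n_samples: int, p_misconception: float, seed: int = 0) -> float:
    """Fraction of sampled documents that repeat the misconception."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = sum(rng.random() < p_misconception for _ in range(n_samples))
    return hits / n_samples

P_MISCONCEPTION = 0.7  # hypothetical share of the corpus repeating a false claim

for n in (100, 10_000, 100_000):
    print(n, round(estimate_frequency(n, P_MISCONCEPTION), 4))
```

More samples make the estimate more stable, but what it stabilizes around is the frequency of the claim in the data, not whether the claim is true.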

Spoken language also varies from region to region. Even within the same city, people use the same expressions and words differently, which further complicates the task for AI models.

5. Stochasticity and Creativity: A Common Confusion

AI systems are often labeled as "creative" when they produce unexpected results. Yet, from a probabilistic standpoint, there is often a simpler explanation. Temperature sampling rescales the logits before softmax, which changes the probability of selecting less likely tokens. A low temperature produces predictable results, while a high temperature increases diversity and the risk of hallucinations.

Increasing the temperature flattens the probability distribution, so low-probability tokens are sampled more often. What we perceive as creativity may actually be the model exploring improbable regions of the distribution.
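The flattening can be shown with a short sketch that divides the logits by a temperature before applying softmax. The logits are hypothetical next-token scores, and `tail_mass` is an illustrative helper that measures how much probability lies outside the top token.

```python
import math

def softmax_with_temperature(logits, temperature: float):
    """Softmax over logits divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def tail_mass(logits, temperature: float) -> float:
    """Probability assigned to every token except the most likely one."""
    probs = softmax_with_temperature(logits, temperature)
    return 1.0 - max(probs)

logits = [4.0, 2.0, 1.0, 0.0]  # hypothetical next-token scores

for t in (0.5, 1.0, 2.0):
    print(f"temperature {t}: tail mass = {tail_mass(logits, t):.3f}")
```

As the temperature rises, more probability moves to the less likely tokens, which is where both the diversity and the extra risk come from.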

6. Towards Reliable AI Systems

To build AI systems that operate consistently in real-world environments, it is crucial to focus on reliability rather than mere possibility. Several approaches can help achieve this goal:

Use techniques like Platt scaling and isotonic regression to align confidence scores with observed performance.
Employ methods such as Bayesian neural networks or Monte Carlo dropout to quantify a model's uncertainty.
Implement external validation to enforce output structure and requirements, rather than assuming the model will naturally follow the rules.
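The third approach, external validation, might look like the following minimal sketch. The schema in `REQUIRED_FIELDS`, the retry policy, and the `generate` callable are all hypothetical stand-ins for illustration; a real system would call an actual model and would likely use a fuller schema.

```python
import json

# Hypothetical schema: field name -> required Python type after JSON parsing.
REQUIRED_FIELDS = {"answer": str, "confidence": float}

def validate_output(raw: str) -> dict:
    """Parse model output and enforce the schema; raise ValueError on any violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        # Strict check: a JSON integer such as 1 is rejected for a float field.
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return data

def generate_validated(generate, prompt: str, max_attempts: int = 3) -> dict:
    """Call generate(prompt) until the output passes validation or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return validate_output(generate(prompt))
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error

# Canned responses standing in for a real model: the first one breaks the rules.
responses = iter(["Sure! The answer is 4.", '{"answer": "4", "confidence": 0.93}'])
result = generate_validated(lambda prompt: next(responses), "What is 2 + 2?")
print(result)
```

The design choice here is that the rules live outside the model: malformed output is rejected and retried instead of being trusted.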

A few years ago, the mere ability of AI systems to predict the next word was impressive. Today, the real challenge is to predict the right word consistently and reliably, especially as promising new models keep emerging. The next time you are impressed by an AI demonstration, ask yourself: "Is this representative of the model, or is it an exceptionally lucky sample?"

In a world where almost anything can happen, engineering focuses on what is reliable and reproducible.