LLM: Why Rephrasing Your Prompts Won't Solve Your Problems

The Illusion of Prompt Rephrasing
When a large language model (LLM) feature fails in production, the first reflex is often to rewrite the prompt. The reflex is common but rarely effective. During development of a production assistant for financial advisors, every LLM-related failure was logged along with the fix that actually worked. The log showed that most problems could not be fixed by adjusting prompts; the fixes that held were architectural. The one attempt to resolve a complex issue by rephrasing the prompt made things worse and had to be rolled back.
The Challenges of Routing
Question routing was a major problem: it was unstable in a way that prompt changes could not fix. Across runs, with no code changes, the same question could be routed differently. On ambiguous cases, routing accuracy ranged from 56% to 64% and was unpredictable from one run to the next. For example, a request to classify households by assets under management (AUM) could be treated as needing clarification in one run and receive a confident answer in the next.
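One way to make this kind of instability visible is to send the same question through the router several times and measure how often the runs agree. The sketch below is only an illustration: `route_stability`, `flaky_router`, and the two outcome names are hypothetical, and the router is a deterministic stand-in that alternates between two decisions rather than a real model call.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def route_stability(route: Callable[[str], str], question: str, runs: int = 10) -> float:
    """Return the fraction of runs that agree with the most common decision."""
    decisions = Counter(route(question) for _ in range(runs))
    most_common_count = decisions.most_common(1)[0][1]
    return most_common_count / runs


# Stand-in for a nondeterministic LLM router (an assumption for this sketch):
# it alternates between asking for clarification and answering directly.
_fake_outputs = cycle(["ask_clarification", "households_by_aum"])


def flaky_router(question: str) -> str:
    return next(_fake_outputs)


stability = route_stability(flaky_router, "classify households by AUM", runs=10)
print(f"agreement across runs: {stability:.0%}")
```

A score well below 100% on an unchanged question points at the routing structure rather than at any single wrong answer.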
The first idea was to improve the routing prompt with detailed instructions for telling the categories apart. This made the classifier even less stable, and the additions were removed. The problem ran deeper than random error: the instability came from a flawed structure that got worse with each added domain. Routing first guessed an abstract category and then mapped that category to a concrete tool. When the intermediate guess was wrong, the right tool became unreachable, with no opportunity to correct course.
An Effective Architectural Solution
The fix was to eliminate the abstract category. Routing became a single step that picks a concrete tool directly from the catalog, with the scope determined by the selected tool rather than guessed in advance. Each decision was restricted to discriminating among a small number of tools rather than the entire set. Example utterances were anchored as structured data rather than as text in the prompt. With this redesign, accuracy went from an unstable 98% under the old design, through a temporary regression to 72% right after the rewrite, to 100% on both evaluation suites once the anchoring was in place.
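The single-step design described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the `Tool` fields, the catalog entries, and the word-overlap shortlist are all hypothetical, and `choose` stands in for the model call, which here only ever sees a short list of concrete tools.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    scope: str                 # scope is a property of the tool, not a guess
    examples: tuple[str, ...]  # anchor utterances stored as data, not prompt text


# Hypothetical catalog; names and examples are illustrative only.
CATALOG = [
    Tool("households_by_aum", "household", ("classify households by AUM",)),
    Tool("search_holdings", "holding", ("find holdings in AAPL",)),
    Tool("list_accounts", "account", ("show the first account",)),
]


def _tokens(text: str) -> set[str]:
    return set(text.lower().split())


def shortlist(question: str, catalog: list[Tool], k: int = 2) -> list[Tool]:
    """Deterministically narrow the catalog to at most k candidates,
    ranked by word overlap with each tool's anchored examples."""
    q = _tokens(question)

    def score(tool: Tool) -> int:
        return max((len(q & _tokens(ex)) for ex in tool.examples), default=0)

    return sorted(catalog, key=score, reverse=True)[:k]


def route(question: str, catalog: list[Tool],
          choose: Callable[[str, list[Tool]], str]) -> Tool:
    """Single-step routing: the model picks one concrete tool by name."""
    candidates = {tool.name: tool for tool in shortlist(question, catalog)}
    name = choose(question, list(candidates.values()))
    if name not in candidates:
        # Guardrail: reject any answer outside the shortlist.
        raise ValueError(f"model chose a tool outside the shortlist: {name!r}")
    return candidates[name]


# Usage with a stub in place of the model: always pick the top candidate.
tool = route("classify households by AUM", CATALOG, lambda q, c: c[0].name)
print(tool.name, tool.scope)
```

Word overlap is used here only because it is easy to read; any deterministic ranking would serve the same purpose of shrinking the decision before the model sees it.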
The improvement came from reducing the number of decisions the model had to make. This set a pattern for later work: offload work from the model whenever code can handle it, and wrap what remains in deterministic guardrails.
Model Errors and Missing Options
Another failure looked like a wrong decision by the model but was in fact a missing option. When asked to "show the first account," the model chose the closest available tool, a holdings search, because no account-listing tool existed. It then invented an ordinal to make the response fit the request. The solution was to build the missing tool, not to edit the prompt so the model would apologize for its absence.
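A sketch of what adding the missing option might look like is shown below. Everything in it is hypothetical: the `Account` type, the in-memory data, and the idea of resolving "first" in code through a small ordinal table are assumptions made for illustration, in the spirit of moving work out of the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    account_id: str
    name: str


# Hypothetical in-memory data standing in for a real account store.
ACCOUNTS = [Account("A-2", "Retirement"), Account("A-1", "Joint Brokerage")]


def list_accounts() -> list[Account]:
    """The previously missing tool: return accounts in a stable order."""
    return sorted(ACCOUNTS, key=lambda account: account.account_id)


# Ordinals are resolved by code, so the model never has to invent one.
ORDINALS = {"first": 0, "second": 1, "third": 2, "last": -1}


def pick_account(ordinal: str) -> Account:
    """Map an ordinal word to an account, failing loudly when it cannot."""
    if ordinal not in ORDINALS:
        raise ValueError(f"unsupported ordinal: {ordinal!r}")
    accounts = list_accounts()
    try:
        return accounts[ORDINALS[ordinal]]
    except IndexError:
        raise ValueError(f"there is no {ordinal} account") from None


print(pick_account("first"))
```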
Distrust of Invented Values
Language models tend to fill gaps confidently, which can be dangerous. For example, a request to "create a task for 2 PM" led the model to put the literal string "2 PM" into a field where the code expected a computed timestamp. The parser could not interpret "2 PM" as an ISO instant, and the user saw a generic server error. The bug appeared only with real model output: every offline test passed because it used an empty argument map, which never exercised the failing path.
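A guard for this case could look like the sketch below. It assumes the field is meant to hold a timezone-aware ISO 8601 timestamp; the field name `due_at`, the `InvalidArgument` exception, and the function name are hypothetical. The point is that a model-supplied value is parsed and rejected with a specific error before it reaches the rest of the code.

```python
from datetime import datetime, timezone


class InvalidArgument(ValueError):
    """Raised when a model-supplied argument fails validation."""


def parse_instant(raw: object) -> datetime:
    """Accept only a timezone-aware ISO 8601 timestamp; reject anything else."""
    if not isinstance(raw, str):
        raise InvalidArgument(f"due_at must be a string, got {type(raw).__name__}")
    text = raw.strip()
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11 on,
    # so normalise it to an explicit UTC offset first.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError as exc:
        raise InvalidArgument(f"due_at is not an ISO 8601 instant: {raw!r}") from exc
    if parsed.tzinfo is None:
        raise InvalidArgument(f"due_at has no timezone: {raw!r}")
    return parsed.astimezone(timezone.utc)


# Usage: a well-formed value passes, the model's literal "2 PM" is rejected.
print(parse_instant("2024-05-01T14:00:00Z"))
try:
    parse_instant("2 PM")
except InvalidArgument as err:
    print(f"rejected: {err}")
```

Feeding captured model output such as "2 PM" into this function in a regression test would cover the path that an empty argument map skips.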
The Importance of Validation
The fundamental principle is to never trust a value generated by the model as if it had been entered by a user or computed by code. Every response containing numbers must pass deterministic validation checks that compare the numbers in the response with those actually returned by the tool. If a figure is cited without support, the response is withheld and replaced with a message saying that it could not be fully verified.
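The check described above can be sketched as a set comparison between the numbers in the response and the numbers in the tool result. This is a simplified illustration with assumed names (`extract_numbers`, `numbers_in_payload`, `guard`, and the withheld message are all hypothetical), and it uses strict equality, so a rounded figure counts as unsupported.

```python
import re

# A number not glued to a preceding word character or dot, with optional
# thousands separators and decimals, e.g. 3, -12.5, 1,250,000.
NUMBER = re.compile(r"(?<![\w.])-?\d[\d,]*(?:\.\d+)?")

WITHHELD = "This answer could not be fully verified."


def extract_numbers(text: str) -> set[float]:
    """Collect every number that appears in a piece of text."""
    return {float(match.replace(",", "")) for match in NUMBER.findall(text)}


def numbers_in_payload(payload: object) -> set[float]:
    """Recursively collect numeric values from a tool result."""
    if isinstance(payload, bool):  # bool is an int subclass; ignore it
        return set()
    if isinstance(payload, (int, float)):
        return {float(payload)}
    if isinstance(payload, str):
        return extract_numbers(payload)
    if isinstance(payload, dict):
        return set().union(*(numbers_in_payload(v) for v in payload.values()))
    if isinstance(payload, (list, tuple)):
        return set().union(*(numbers_in_payload(v) for v in payload))
    return set()


def guard(response: str, tool_result: object) -> str:
    """Return the response only if every number in it came from the tool."""
    unsupported = extract_numbers(response) - numbers_in_payload(tool_result)
    return WITHHELD if unsupported else response


tool_result = {"household": "Smith", "aum": 1250000, "accounts": 3}
print(guard("The Smith household holds 3 accounts with $1,250,000 in AUM.", tool_result))
print(guard("The Smith household holds $1,300,000 in AUM.", tool_result))
```

The strictness is deliberate in this sketch: a response that rounds or recomputes a figure is withheld rather than trusted, and any tolerance would need to be added explicitly.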
Brief IA: the essentials of artificial intelligence news in French, decoded and explained every day.