LLM: Why Rephrasing Your Prompts Won't Solve Your Problems

Key Takeaways

1. LLM failures in production usually cannot be fixed by rephrasing prompts.

2. Unstable question routing shows that the problem is often structural rather than a matter of prompt wording.

3. An architectural approach, rather than prompt adjustments, is what significantly improves LLM accuracy.

💡 Why it matters: Companies need to reevaluate their LLM integration strategies to avoid costly and inefficient mistakes.

The Illusion of Prompt Rephrasing

When a large language model (LLM) feature fails in production, the temptation is to modify the prompt and hope the issue goes away. The reflex is common, but it is often ineffective. During the development of a production assistant for financial advisors, the team kept a log of every LLM-related failure and of the fix that actually worked. The log made it clear that most problems could not be fixed by adjusting prompts; the fixes that worked were architectural. In fact, the only attempt to resolve a complex issue by rephrasing a prompt made things worse and had to be rolled back.

The Challenges of Routing

Question routing was a major problem, and it was unstable in a way that prompt changes could not fix. Across runs, the same question could be handled in different ways with no code changes in between. Routing accuracy on ambiguous cases varied between 56% and 64%, and the results were unpredictable from one run to the next. For example, a request to classify households by assets under management (AUM) could be treated as needing clarification in one run and receive a confident answer in the next.
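
One way to put a number on this kind of run-to-run instability is to replay the same question several times and measure how often the most common routing decision wins. The sketch below is only an illustration: `route` is a hypothetical callable standing in for the LLM router, and `make_flaky_router` is a stub that replays a fixed sequence of decisions so the example runs without a model.

```python
from collections import Counter
from typing import Callable, Iterable


def route_stability(route: Callable[[str], str], question: str, runs: int = 10) -> float:
    """Return the share of runs that agree with the most common routing decision."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    decisions = Counter(route(question) for _ in range(runs))
    most_common_count = decisions.most_common(1)[0][1]
    return most_common_count / runs


def make_flaky_router(outputs: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM router: replays a fixed sequence of decisions."""
    it = iter(outputs)
    return lambda question: next(it)
```

A score of 1.0 means every run agreed. Anything lower on an unchanged question and unchanged code points at instability in the router itself rather than at a one-off error.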

The initial idea was to improve the routing prompt by adding detailed instructions for distinguishing categories. This made the classifier even more unstable, and the additions were removed. The problem ran deeper than random errors: the instability came from a flawed structure that got worse with every domain added. Routing first guessed an abstract category and then mapped that category to a concrete tool. The intermediate step was error-prone, and when the guess was wrong the appropriate tool became unreachable, with no chance of correction.

An Effective Architectural Solution

To address this, the abstract category was eliminated. Routing was reduced to a single step that picks a concrete tool directly from the catalog, with the scope determined by the selected tool rather than guessed in advance. Each decision was restricted to discriminating among a limited number of tools rather than the entire set. Example statements were anchored as structured data rather than as text in the prompt. Accuracy moved from an unstable 98% with the old design, through a temporary regression to 72% right after the rewrite, to 100% on both evaluation suites once the anchoring was in place.
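
The shape of such a single-step router can be sketched as follows. This is a minimal illustration under stated assumptions, not the system described here: the tool names, the catalog contents, and the token-overlap heuristic used to narrow the candidates are all hypothetical, and the model call itself is left out. The sketch shows only the parts that code can own: examples stored as data, a small candidate set per decision, and a check that the model's pick is one of the offered tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    examples: tuple[str, ...] = ()  # anchored example requests, kept as data


# Hypothetical catalog; tool names are illustrative, not from the original system.
CATALOG = (
    Tool("rank_households_by_aum", "Rank households by assets under management",
         ("rank households by aum", "which households have the most assets")),
    Tool("search_holdings", "Search holdings by ticker or name",
         ("find holdings in apple", "search holdings for bonds")),
    Tool("create_task", "Create a follow-up task",
         ("create a task for tomorrow", "remind me to call the client")),
)


def _tokens(text: str) -> set[str]:
    return set(text.lower().split())


def shortlist(question: str, catalog=CATALOG, limit: int = 2) -> list[Tool]:
    """Narrow the decision to a few candidate tools via example overlap (assumed heuristic)."""
    q = _tokens(question)

    def score(tool: Tool) -> int:
        return max((len(q & _tokens(ex)) for ex in tool.examples), default=0)

    # sorted() is stable, so ties keep their catalog order.
    return sorted(catalog, key=score, reverse=True)[:limit]


def validate_choice(choice: str, candidates: list[Tool]) -> Tool:
    """Accept the model's pick only if it names one of the offered tools."""
    by_name = {t.name: t for t in candidates}
    if choice not in by_name:
        raise ValueError(f"model chose unknown tool: {choice!r}")
    return by_name[choice]
```

In use, the model would be shown only the names and descriptions returned by `shortlist`, and its answer would pass through `validate_choice` before anything is executed.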

The improvement came from reducing the decisions the model had to make. This established a model for the future: offload work from the model whenever the code can handle it, and capture what remains within deterministic guardrails.

Model Errors and Missing Options

Another failure looked like a wrong decision by the model but was in fact a missing option. When asked to "show the first account," the model chose the closest available tool, a holdings search, because no account-listing tool existed. It then invented an ordinal so that the response appeared to match the request. The fix was to build the missing tool, not to modify a prompt so the model would apologize for its absence.
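
A missing option of this kind might be filled with something like the sketch below. Everything in it is hypothetical: the `Account` fields, the sort order, and the idea of resolving the position in code are assumptions made for illustration, since the original tool is not described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    account_id: str
    name: str


def list_accounts(accounts: list[Account]) -> list[Account]:
    """Hypothetical account-listing tool: returns accounts in a stable order (by name)."""
    return sorted(accounts, key=lambda a: a.name.lower())


def pick_by_position(items: list[Account], position: int) -> Account:
    """Resolve a 1-based position ("first" = 1) in code instead of letting the model guess."""
    if not 1 <= position <= len(items):
        raise LookupError(f"no item at position {position}; {len(items)} available")
    return items[position - 1]
```

With a real listing tool and a defined order, "first" has a meaning the code can check, so the model has nothing to invent.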

Distrust of Invented Values

Language models tend to fill gaps confidently, which can be dangerous. For example, given a request to "create a task for 2 PM," the model inserted the literal string "2 PM" into a field where the code expected a computed timestamp. The parser could not interpret "2 PM" as an ISO instant, and the user received a generic server error. The issue only showed up with the model's actual output: all offline tests passed because they used an empty argument map, which never triggered the bug.
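
A guard for this case can be sketched as a strict parser that turns a bad model-supplied value into a specific, recoverable error instead of a generic server failure. The field name `due_at`, the `ArgumentError` class, and the rule that values must carry a timezone offset are assumptions made for the example.

```python
from datetime import datetime, timezone


class ArgumentError(ValueError):
    """Raised when a model-supplied argument fails validation."""


def parse_due_at(raw: object) -> datetime:
    """Parse a model-supplied due time, accepting only timezone-aware ISO 8601 values."""
    if not isinstance(raw, str) or not raw.strip():
        raise ArgumentError("due_at must be a non-empty ISO 8601 string")
    text = raw.strip()
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError as exc:
        raise ArgumentError(f"due_at is not an ISO 8601 timestamp: {raw!r}") from exc
    if parsed.tzinfo is None:
        raise ArgumentError(f"due_at has no timezone offset: {raw!r}")
    return parsed.astimezone(timezone.utc)
```

Offline tests for a function like this are most useful when they include values a model would realistically produce, such as "2 PM", in addition to empty or well-formed arguments.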

The Importance of Validation

The fundamental principle is to never trust a value generated by the model as if it had been entered by a user or calculated by code. Every response that contains numbers must pass deterministic validation checks, which compare the numbers in the response with those the tool actually returned. If a figure is cited without support, the response is withheld and replaced with a message saying that it could not be fully verified.
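
The check described above can be sketched in a few lines. This is a simplified illustration, not the production validator: the regular expression, the exact-match comparison, and the wording of the withheld message are assumptions, and the sketch does not handle rounding, percentages derived from tool values, or numbers written as words.

```python
import re

# Matches integers and decimals with optional thousands separators, e.g. 1,250,000.50
_NUMBER = re.compile(r"(?<!\w)-?\d[\d,]*(?:\.\d+)?")

WITHHELD_MESSAGE = "This response could not be fully verified."  # illustrative wording


def extract_numbers(text: str) -> set[float]:
    """Pull numeric literals out of free text, ignoring thousands separators."""
    return {float(m.group().replace(",", "")) for m in _NUMBER.finditer(text)}


def unsupported_numbers(response: str, tool_values: set[float]) -> set[float]:
    """Return numbers cited in the response that the tool never returned."""
    return extract_numbers(response) - tool_values


def guard_response(response: str, tool_values) -> str:
    """Pass the response through only if every cited number is backed by tool output."""
    if unsupported_numbers(response, {float(v) for v in tool_values}):
        return WITHHELD_MESSAGE
    return response
```

For example, a response citing "12%" when the tool returned only a balance and an account count would be withheld, because 12 does not appear among the tool's values.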