RAG: AI Optimizes Information Retrieval

Key Takeaways

1. Information retrieval is carried out in three steps to filter structured data.

2. The steps include the use of keywords, the table of contents, and embeddings.

3. This method enhances information search in complex documents through AI.

💡 Why it matters: This approach enables companies to manage complex documents effectively by optimizing access to relevant information.

Anchor detection for RAG: parallel detectors, followed by a final LLM call.

Enterprise Document Intelligence [Vol.1 #7B] – Retrieval involves filtering through structured tables: keywords first, table of contents next, embeddings last.

Retrieval in an enterprise RAG system consists of filtering two structured tables (line_df and toc_df). Each candidate carries an anchor (where the match is found) plus context (what is expanded around the anchor for generation). This mental model is the subject of article 7A (retrieval as filtering). This article focuses on how anchors are produced: a three-step pipeline that runs keyword detection and embeddings in parallel, aggregates the results by structural unit, and ends with a single LLM call that ranks the candidates and gives a reason for each.
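To make the two tables and the anchor-plus-context idea concrete, here is a minimal sketch. The column names (page, line_no, text, section_id, title, level, start_page, end_page), the Candidate class, the expand_context rule, and the sample rows are all assumptions made for illustration; the article does not specify the schema. Plain lists of dicts stand in for the DataFrames to keep the example dependency-free.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for line_df and toc_df (schema and rows are invented).
line_df = [
    {"page": 4, "line_no": 1, "text": "We compute the dot products of the query with all keys."},
    {"page": 4, "line_no": 2, "text": "We divide each by the square root of d_k and apply a softmax."},
    {"page": 5, "line_no": 1, "text": "Several attention layers run in parallel."},
]
toc_df = [
    {"section_id": "s1", "title": "Attention via Dot Product", "level": 3, "start_page": 4, "end_page": 4},
    {"section_id": "s2", "title": "Multi-Head Attention", "level": 3, "start_page": 5, "end_page": 5},
]


@dataclass
class Candidate:
    anchor: dict                                  # where the match is found (one row of line_df)
    context: list = field(default_factory=list)   # rows expanded around the anchor for generation


def expand_context(anchor, lines):
    """One possible expansion rule: take every line on the anchor's page."""
    return [row for row in lines if row["page"] == anchor["page"]]


anchor = line_df[1]
cand = Candidate(anchor=anchor, context=expand_context(anchor, line_df))
```

The anchor stays small and precise (one line), while the context can be widened to a page or a section without changing how the match was found.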

The user types "How is attention calculated?" against the Transformer document. Six candidate pages match "attention." The correct one mentions softmax, query, key, and d_k together, and sits in the section that the table of contents calls "Attention via Dot Product." The two retrievers, keywords and embeddings, both surface the set of candidates. Neither can determine on its own which page actually answers the question. A third step must read the candidates side by side, along with the section in which each is located, and choose the correct one with a reason that an auditor can read months later.

The three-step pipeline that follows is based on three principles:

Keywords are always active. Keyword detection is free. There is no scenario where you wouldn't want its signal. It operates on line_df and toc_df from the very first millisecond.
Embeddings work in parallel and are optional. When vocabulary inconsistencies are expected or when the question is conceptual, embeddings capture what keywords miss. With pre-calculated indices, the cost at query time is just a few microseconds. They can be ignored when the keyword signal is already clear.
A final LLM call. No "TOC reasoning" LLM step in the middle of the pipeline. The arbiter in phase 3 sees the TOC, the keyword results, the embedding results, and the structural attachment of each candidate, all in a single call. It performs reasoning on the TOC implicitly as part of the ranking.

This article examines the detectors on each table (Section 2 on toc_df, Section 3 on line_df), then the combinations between the two tables (Section 4). The arbiter's call itself, the decision tree, and the JSON output are found in article 7C (the LLM arbiter and the JSON output of retrieval).

Throughout this article, we work on a single document, Attention Is All You Need (Vaswani et al. 2017, 15 pages; arXiv non-exclusive distribution license, as declared on the arXiv abstract page). It has a clean native TOC embedded in the PDF (22 entries, 3 levels deep), and the content is familiar territory for any engineer working with RAG: encoder, decoder, attention, queries, keys, values. This keeps the focus on retrieval methods rather than on the analysis of a domain-specific corpus. This article also assumes that the document ships with its own TOC; recovering a TOC from plain text is left for future work.

The Anchor Detection Pipeline

Anchor detection occurs in three steps. The first step executes keyword detection and embedding similarity in parallel on line_df and toc_df. The second step aggregates the results into a structural unit (section via toc_df if available, otherwise page or chunk). The third step passes the aggregated units to a single LLM call that ranks them and writes its reasoning for each choice.
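The three steps above can be sketched end to end as follows. This is an outline under stated assumptions, not the article's implementation: the detector functions are toy placeholders, the hit format ({"page", "source"}) is invented, and the arbiter is an injected callable so that the sketch runs without any LLM.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def keyword_detector(question, lines):
    # Toy detector: a line matches when it shares a word with the question.
    words = set(question.lower().replace("?", "").split())
    return [{"page": r["page"], "source": "keyword"}
            for r in lines if words & set(r["text"].lower().split())]


def embedding_detector(question, lines):
    # Placeholder: a real detector would compare precomputed vectors.
    return []


def section_for_page(page, toc):
    for entry in toc:
        if entry["start_page"] <= page <= entry["end_page"]:
            return entry["title"]
    return None


def aggregate(hits, toc):
    # Step 2: group hits by section when the TOC covers the page, else by page.
    units = defaultdict(list)
    for h in hits:
        unit = section_for_page(h["page"], toc) or f"page {h['page']}"
        units[unit].append(h["source"])
    return dict(units)


def detect_anchors(question, lines, toc, arbiter):
    # Step 1: both detectors run in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        kw = pool.submit(keyword_detector, question, lines)
        emb = pool.submit(embedding_detector, question, lines)
        hits = kw.result() + emb.result()
    # Step 3: one call to the arbiter (an LLM in the real pipeline).
    return arbiter(question, aggregate(hits, toc))


def stub_arbiter(question, units):
    # Stand-in for the LLM: rank units by number of supporting hits.
    return sorted(units, key=lambda u: -len(units[u]))


lines = [
    {"page": 4, "text": "attention weights use a softmax"},
    {"page": 12, "text": "training used attention dropout"},
]
toc = [{"title": "Attention via Dot Product", "start_page": 4, "end_page": 4}]
ranked = detect_anchors("How is attention calculated?", lines, toc, stub_arbiter)
```

Page 12 falls outside every TOC entry in this toy data, so it is aggregated as a page-level unit, which mirrors the "section if available, otherwise page or chunk" fallback.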

Keyword detection is the always-active foundation. It matches lines whose text contains the keywords from the question, with co-occurrence boosts when multiple keywords are found in the same line or page. It is low-cost, deterministic, and auditable. There is no reason not to execute it: it costs nothing, and when it works well, it provides a strong signal to the LLM in phase 3.
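A minimal sketch of keyword detection with co-occurrence boosts might look like this. The scoring formula, the boost weights (line_boost, page_boost), and the row format are assumptions chosen for illustration; the article does not give the actual weights.

```python
import re
from collections import defaultdict


def tokenize(text):
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def keyword_hits(keywords, lines, line_boost=0.5, page_boost=0.25):
    kws = {k.lower() for k in keywords}
    page_kws = defaultdict(set)   # which keywords appear anywhere on each page
    matched = []
    for row in lines:
        found = kws & tokenize(row["text"])
        if found:
            matched.append((row, found))
            page_kws[row["page"]] |= found

    hits = []
    for row, found in matched:
        score = float(len(found))
        # Boost when several keywords share the same line.
        score += line_boost * (len(found) - 1)
        # Smaller boost for keywords found elsewhere on the same page.
        score += page_boost * (len(page_kws[row["page"]]) - len(found))
        hits.append({"page": row["page"], "line_no": row["line_no"],
                     "keywords": sorted(found), "score": score})
    return sorted(hits, key=lambda h: (-h["score"], h["page"], h["line_no"]))


lines = [
    {"page": 4, "line_no": 1, "text": "The softmax of the query and key products is scaled by d_k."},
    {"page": 4, "line_no": 2, "text": "Values are weighted by the softmax output."},
    {"page": 9, "line_no": 1, "text": "Attention is used in the decoder."},
]
hits = keyword_hits(["softmax", "query", "key", "d_k"], lines)
```

Every score is a deterministic function of the matched keywords, so each hit can be traced back to the exact words that produced it, which is what makes the signal auditable.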

Embeddings work in parallel as a second optional signal. They are useful when vocabulary inconsistencies are expected (the question says "premium," the document says "annual amount"), or when the question is conceptual rather than lexical. If you have pre-calculated the embeddings, the marginal cost is just a few microseconds at query time. Otherwise, you can completely ignore embeddings on questions where the keyword signal is already clear.
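The embedding signal can be sketched as a cosine-similarity lookup over precomputed vectors. The three-dimensional toy vectors, the line identifiers, and the embedding_hits function are hypothetical; a real index would hold vectors produced by an embedding model, and the query vector would come from the same model.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_hits(query_vec, index, top_k=3, min_score=0.0):
    """Score every precomputed line vector against the query and keep the best."""
    scored = [{"line_id": lid, "score": cosine(query_vec, vec)}
              for lid, vec in index.items()]
    scored = [s for s in scored if s["score"] > min_score]
    return sorted(scored, key=lambda s: -s["score"])[:top_k]


# Toy precomputed index: line id -> vector (values are invented).
index = {
    "p4_l2": [0.9, 0.1, 0.0],
    "p5_l1": [0.2, 0.9, 0.1],
    "p9_l1": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.2, 0.0]
top = embedding_hits(query_vec, index, top_k=2)
```

Because the line vectors are computed once at indexing time, the query-time work reduces to the similarity loop shown here plus embedding the question itself.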

The LLM at the end sees everything: keyword results, embedding results, the structural unit to which each candidate belongs. It ranks the units once, with reasons. Two design consequences of placing the LLM at the end rather than in the middle of the pipeline:

The LLM performs reasoning on the TOC implicitly. If asked "what happens if we exit early?" regarding a document whose TOC contains "Termination" and "Penalties" (and no "Exit" section), the LLM chooses both at ranking time. There is no separate LLM "TOC reasoning" step earlier in the pipeline; the arbiter does this work as part of its single call.
The LLM resolves subtle title matches. If the question is about "the premium" but the relevant section is titled "Contract Summary," no keyword will match the title. The LLM, given the keyword results in the body lines + the structural attachment to that section, will still choose it.
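As an illustration of what "sees everything" could mean in practice, the sketch below assembles a single input for the arbiter: the TOC titles plus keyword and embedding hits grouped by the section each hit is attached to. The build_arbiter_input function, the hit format, and the sample contract line are assumptions; the actual arbiter call and its output format are covered in article 7C.

```python
import json


def build_arbiter_input(question, toc, keyword_hits, embedding_hits):
    units = {}
    for source, hits in (("keyword", keyword_hits), ("embedding", embedding_hits)):
        for h in hits:
            unit = units.setdefault(
                h["section"],
                {"section": h["section"], "keyword": [], "embedding": []},
            )
            unit[source].append(h["text"])
    return {
        "question": question,
        "toc": [entry["title"] for entry in toc],
        "candidates": list(units.values()),
    }


toc = [{"title": "Contract Summary"}, {"title": "Termination"}]
keyword_hits = [
    # Invented body line: the keyword matches the body, not the section title.
    {"section": "Contract Summary", "text": "The annual premium is due in January."},
]
payload = build_arbiter_input("What is the premium?", toc, keyword_hits, [])
prompt_body = json.dumps(payload, indent=2)
```

In this toy case the title "Contract Summary" shares no word with the question, yet the section still reaches the arbiter because a body line attached to it matched.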

Filtering on toc_df

Two detectors operate on the TOC: keyword matching (always, free) and embedding matching (optional, parallel). Both are pure scores, with no LLM at this stage. The cognitive work (choosing the right sections from a question like "what happens if we exit early?" when the relevant section is titled "Termination") happens later, in the arbiter's call. The arbiter sees the TOC and the keyword/embedding results in a single LLM call.
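A pure keyword score on TOC titles can be sketched as below. The scoring rule (fraction of question keywords found in the title) and the function name toc_keyword_scores are assumptions for illustration. The example also shows why the cognitive work has to happen later: "exit early" scores zero against every title.

```python
import re


def toc_keyword_scores(question_keywords, toc):
    kws = {k.lower() for k in question_keywords}
    scores = []
    for entry in toc:
        title_words = set(re.findall(r"[a-z0-9]+", entry["title"].lower()))
        overlap = kws & title_words
        score = len(overlap) / len(kws) if kws else 0.0
        scores.append({"title": entry["title"], "score": score})
    return scores


toc = [{"title": "Termination"}, {"title": "Penalties"}, {"title": "Fee Schedule"}]

exit_scores = toc_keyword_scores(["exit", "early"], toc)   # no title shares a word
fee_scores = toc_keyword_scores(["fee"], toc)              # exact word match on one title
```

The detector is cheap and deterministic, but it is blind to implication, which is the gap the arbiter fills.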

Below, we show a standalone reason_on_toc function as a pedagogical aside: it isolates what the arbiter does internally when reasoning about the TOC. In production, you can either run it as a separate call (an additional LLM cost, useful for debugging) or fold it into the arbiter (one LLM call in total, the preferred default).

What the Arbiter Reasons About

The toc_df is small enough to be transmitted in its entirety to an LLM. The arbiter (developed in article 7C, the LLM arbiter) leverages this: it reads the entire TOC and reasons about the sections that answer the question. The standalone reason_on_toc function below isolates the same logic as a separate call, useful when you want to inspect or debug the reasoning step on the TOC by itself.
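A minimal sketch of what a standalone reason_on_toc could look like follows. Only the function name comes from the article; the signature, the prompt wording, the JSON reply format, and the injected llm callable are assumptions. The fake_llm stub returns a canned answer so the sketch runs without any model.

```python
import json


def build_toc_prompt(question, toc):
    # Indent each entry by its depth so the model sees the hierarchy.
    lines = [f"{'  ' * (e['level'] - 1)}[{e['section_id']}] {e['title']}" for e in toc]
    return (
        "Given the table of contents below, list the sections that answer the question.\n"
        'Reply with JSON: [{"section_id": "...", "reason": "..."}]\n\n'
        f"Question: {question}\n\nTable of contents:\n" + "\n".join(lines)
    )


def reason_on_toc(question, toc, llm):
    """Send the whole TOC to the model and keep only picks that exist in the TOC."""
    raw = llm(build_toc_prompt(question, toc))
    known = {e["section_id"] for e in toc}
    picks = json.loads(raw)
    return [p for p in picks if p.get("section_id") in known]


def fake_llm(prompt):
    # Canned reply standing in for a real model call (hypothetical output).
    return json.dumps([
        {"section_id": "7", "reason": "Exiting early is a termination."},
        {"section_id": "8", "reason": "Early exit may trigger penalties."},
        {"section_id": "99", "reason": "Not in the TOC."},
    ])


toc = [
    {"section_id": "7", "title": "Termination", "level": 1},
    {"section_id": "8", "title": "Penalties", "level": 1},
]
picks = reason_on_toc("What happens if we exit early?", toc, fake_llm)
```

Filtering the reply against the known section ids is a cheap guard against a model citing a section that does not exist; the stored reasons are what a reviewer can read later.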

Why this is important. The LLM understands semantics, but more importantly, it understands implications. "What happens if we exit early?" shares no vocabulary with "Termination," but the LLM recognizes that exiting a contract early is a termination. "How does the insurer handle a flood?" shares no vocabulary with "Claims Procedure," but the LLM recognizes that handling damage is what the claims process covers. "Are there fees for changing coverage?" may match both "Coverage Modification" and "Fee Schedule," and the LLM chooses both, with reasoning explaining why. A subtle case in production: a question about "the premium" lands on a section titled "Contract Summary." No keyword matches the title, but the LLM, given the body lines mentioning premium amounts attached to that section, will still choose it.

An embedding model captures "exit early ≈ termination" by similarity, but it cannot capture "exiting early implies penalties." That's reasoning, not similarity.

The cost is one mid-sized LLM call (a few thousand tokens for a typical TOC), with a few hundred milliseconds of latency. When folded into the arbiter, it costs nothing extra: the arbiter would see the TOC anyway. The method is infeasible on line_df: sending 12,000 lines of content to an LLM and asking it to "choose the relevant ones" is too costly, too slow, and too unreliable. The small size of the TOC is what makes this method possible.
