Arbiter: the LLM that optimizes RAG page selection

Key Takeaways

1. Arbiter uses an LLM to classify and rank candidate RAG pages.

2. Each choice Arbiter makes comes with a clear justification.

3. Results are returned as typed objects that an auditor can defend.

Why it matters: Arbiter improves accuracy and transparency in enterprise document management.

Let an LLM choose the right RAG page: the Arbiter model at the end of retrieval.

This article is part of the Enterprise Document Intelligence series, which builds an enterprise RAG system from four components: parsing, question parsing, retrieval, and generation. It is the last of the three parts of the retrieval component. The previous part, Article 7B (anchor detection), produced ranked candidates; this one arbitrates: an LLM call ranks the candidates with reasons, and the output is a typed JSON object that an auditor can defend.

Retrieval

Retrieval is filtering on line_df and toc_df, anchor against context: that is the mental model from Article 7A (retrieval as filtering). The anchors themselves come from a three-step detection pipeline (Article 7B, anchor detection): keywords and embeddings in parallel, aggregation into a structural unit, and then a single LLM call at the end.

This article deals with that final call: the arbiter. It sees the keyword hits, the embedding hits, and the section in which each candidate is located, all in a structured brief, and writes a verdict per candidate with a reason. The one-liner model: detectors propose, the arbiter decides. In a single call.

The article also covers what surrounds the arbiter: the deterministic dispatcher that chooses which detectors to run per question, the "not found" path (a reliable system must be able to say no), and the unified JSON contract (RetrievalResult) that retrieval hands off to generation.

Throughout this article, we work on a single document, Attention Is All You Need (Vaswani et al. 2017, 15 pages; arXiv non-exclusive distribution license, declared on the arXiv abstract page). It contains a clean native table of contents in the PDF outline (22 entries, 3 levels deep), and the content is familiar territory for any engineer touching RAG: encoder, decoder, attention, queries, keys, values. This keeps the focus on retrieval methods rather than on parsing a domain-specific corpus. This article also assumes that the document carries its own table of contents; recovering a table of contents from plain text is left for future work.

1. The LLM Arbiter: the single LLM call at the end

This is the LLM call that Article 7B (anchor detection) placed at step 3. That article produced candidates from each detector; this section deals with what happens to them. A candidate is a unique passage that a detector returned with its anchor (where it was matched), its aggregated unit (section, page, or chunk), and an excerpt of surrounding context. The arbiter sees them all in a single call, ranks them with reasons, and produces the final list.

The three points covered in this section:

Why score fusion (RRF and others) loses the signal that the detectors have already made available.
What structured brief to hand to the arbiter so it can rank with reasons.
What to record per candidate so that an auditor can reconstruct the decision.

The one-liner model: detectors propose, the arbiter decides, in a single call.

1.1 Score fusion is the wrong instinct

When multiple methods return candidates, the reflex is to combine their scores. The methods return numbers on different scales: cosine similarity, unbounded BM25, and integer occurrence counts. Adding them makes no sense. Normalizing the score of each method doesn’t help either, as a 0.9 cosine and a normalized 0.9 BM25 do not mean the same thing about the same candidate.

The classic answer is Reciprocal Rank Fusion (RRF). It sidesteps the calibration problem by ignoring scores and using ranks:

RRF: rank-based fusion that ignores raw scores; k = 60.

Candidates ranked by multiple methods accumulate contributions; candidates seen by a single method get a small unique term. RRF is the default in many vector databases (Pinecone, Weaviate, Elastic) and works on common cases without tuning.
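For reference, the usual formulation of RRF gives each candidate d the score sum over methods m of 1 / (k + rank_m(d)), with ranks starting at 1. The sketch below is a minimal illustration of that formula with k = 60; the function name rrf_fuse and the candidate ids are hypothetical and are not part of the pipeline described here.

```python
from collections import defaultdict


def rrf_fuse(rankings: dict[str, list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse per-method rankings (best candidate first) using ranks only."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        # Ranks start at 1; raw scores from the method are never looked at.
        for rank, candidate_id in enumerate(ranked_ids, start=1):
            scores[candidate_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical candidate ids, one ranked list per retrieval method.
rankings = {
    "toc": ["p6", "p3"],
    "keywords": ["p6", "p9", "p3"],
    "embeddings": ["p2", "p6"],
}
fused = rrf_fuse(rankings)
for candidate_id, score in fused:
    print(f"{candidate_id}: {score:.4f}")
```

The candidate seen by all three methods comes out on top, but the output carries no trace of which methods agreed or why.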

But RRF leaves a real signal on the floor. Why a method ranked a candidate is important. The TOC ranked it because a section title matched. The keywords ranked it because "premium" and "€" co-occurred on the same line. The embeddings ranked it because something near the page was vaguely close in vector space. RRF compresses all of this into a rank index. The agreement between methods becomes a number, and the reason for that agreement disappears.

That lost information is exactly what an expert reads on the screen: the section title, the matched keywords, the lines around the match. We argue that a small LLM call, given the same information in a structured form, ranks better than any score fusion method.

We still log RRF when the surrounding tool uses it by default, or use it as a cheap pre-filter when the candidate pool is too large to fit in an LLM call (top-200 by RRF, then LLM on the survivors). But the ranking decision belongs to the LLM, not to a score formula.

1.2 Handing a structured brief to the LLM

The LLM is the layer that ranks. Its input is not "here are five passages, choose the best." It’s a structured brief, one line per candidate, listing what each retrieval method found:

candidate_id: stable reference (page + line range, or section + line offset).
methods: which retrieval methods showed this candidate (TOC, keywords, embeddings).
section: the TOC section in which the candidate is located, extracted from toc_df.
matched_keywords: the keywords from the analyzed question that landed on this candidate.
snippet: three to five lines of surrounding context, extracted from line_df.
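The fields above can be assembled by merging raw detector hits that point at the same candidate. The sketch below is one possible way to do it, under assumptions: the Hit shape, the build_briefs helper, the candidate ids, and the placeholder snippets are hypothetical, and plain dicts stand in for the brief object.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """One raw detector hit (hypothetical shape, for illustration only)."""
    candidate_id: str
    method: str  # "toc", "keywords" or "embeddings"
    section: str
    snippet: str
    matched_keywords: list[str] = field(default_factory=list)


def build_briefs(hits: list[Hit]) -> list[dict]:
    """Merge hits on the same candidate into one brief per candidate."""
    briefs: dict[str, dict] = {}
    for hit in hits:
        # The first hit for a candidate fixes its section and snippet.
        brief = briefs.setdefault(hit.candidate_id, {
            "candidate_id": hit.candidate_id,
            "methods": [],
            "section": hit.section,
            "matched_keywords": [],
            "snippet": hit.snippet,
        })
        if hit.method not in brief["methods"]:
            brief["methods"].append(hit.method)
        for keyword in hit.matched_keywords:
            if keyword not in brief["matched_keywords"]:
                brief["matched_keywords"].append(keyword)
    return list(briefs.values())


# Hypothetical hits; snippets are placeholders, not text from the paper.
hits = [
    Hit("p6:L12-16", "toc", "3.5 Positional Encoding", "(snippet placeholder)"),
    Hit("p6:L12-16", "keywords", "3.5 Positional Encoding", "(snippet placeholder)",
        ["positional", "sinusoidal"]),
    Hit("p2:L40-44", "embeddings", "1 Introduction", "(snippet placeholder)"),
]
briefs = build_briefs(hits)
```

Keeping the list of methods per candidate is the point: agreement between detectors stays visible instead of being folded into a number.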

The brief is what the LLM reads. It resembles what an expert sees on their screen: a section title at the top, the matched keywords in the body, the lines around the match. Much closer to that than a ranked list of cosine scores. The LLM ranks the candidates and writes a reason in one line per retained candidate, which goes directly into the audit trail. Each candidate gets one of four roles:

primary: carries the answer.
supporting: provides the context that the answer needs to make sense.
tangential: related but kept at low priority.
discarded: the LLM's rejection, with a reason recorded for the audit.

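from typing import Literal

from pydantic import BaseModel

class CandidateBrief(BaseModel):
    candidate_id: str            # stable reference (page + line range, ...)
    methods: list[str]           # which retrieval methods returned this candidate
    section: str                 # TOC section, from toc_df
    matched_keywords: list[str]  # question keywords that landed on the candidate
    snippet: str                 # surrounding context, from line_df

class CandidateRanking(BaseModel):
    candidate_id: str
    role: Literal["primary", "supporting", "tangential", "discarded"]
    reason: str                  # one-line justification, kept for the audit trail

class ArbiterOutput(BaseModel):
    rankings: list[CandidateRanking]

def llm_rank(question: str, briefs: list[CandidateBrief]) -> list[CandidateRanking]:
    """Read the structured briefs, return a ranking per candidate."""
    ...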

A minimal functional arbiter. One LLM call, the entire list of candidates at once, structured output in return.

def llm_rank(question: str, briefs: list[CandidateBrief], client) -> list[CandidateRanking]:
    """Hand the structured briefs to the LLM, get a role + reason per candidate."""
    # One line per candidate: everything the detectors found, in plain text.
    briefs_text = "\n".join(
        f"[{b.candidate_id}] section={b.section!r}, methods={b.methods}, "
        f"matched={b.matched_keywords}, snippet={b.snippet!r}"
        for b in briefs
    )
    prompt = (
        f"Question: {question}\n\nCandidates:\n{briefs_text}\n\n"
        "For each candidate, assign a role (primary, supporting, tangential, "
        "discarded) and a one-line reason. Use the candidate_id as is."
    )
    # model_chat is the model name, assumed to be defined at module level.
    # The response is parsed directly into the ArbiterOutput schema.
    return client.responses.parse(
        model=model_chat,
        input=prompt,
        text_format=ArbiterOutput,
    ).output_parsed.rankings

Why this outperforms score fusion:

The LLM sees why each method ranked the candidate. A high cosine similarity match without keyword overlap is likely thematic noise. A TOC + keyword agreement is a real structural signal. RRF turns both into the same rank number.
The LLM can label candidates beyond keep/reject: primary, supporting, tangential. Useful when the answer has a main part and supporting context.
The LLM can signal contradictions: two passages saying different things about the same point. Common in contracts with amendments, in regulatory filings with revisions.
The justification is in plain text and goes directly into the audit trail. There is no rrf_score = 0.0327 line to explain to compliance.
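Because the verdicts feed the audit trail, it can help to check them before recording them. The sketch below is an assumption about how such a guard might look, not part of the arbiter itself: check_rankings and audit_lines are hypothetical helpers, rankings are shown as plain dicts, and the example ids and reason are made up for illustration.

```python
import json

ROLES = {"primary", "supporting", "tangential", "discarded"}


def check_rankings(brief_ids: list[str], rankings: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the output is usable."""
    problems: list[str] = []
    seen: set[str] = set()
    for ranking in rankings:
        cid = ranking["candidate_id"]
        # The prompt asks for ids "as is": anything else was invented.
        if cid not in brief_ids:
            problems.append(f"unknown candidate_id: {cid}")
        if cid in seen:
            problems.append(f"duplicate candidate_id: {cid}")
        seen.add(cid)
        if ranking["role"] not in ROLES:
            problems.append(f"invalid role for {cid}: {ranking['role']}")
        if not ranking["reason"].strip():
            problems.append(f"missing reason for {cid}")
    # Every candidate in the brief should receive a verdict.
    for cid in brief_ids:
        if cid not in seen:
            problems.append(f"no verdict for {cid}")
    return problems


def audit_lines(question: str, rankings: list[dict]) -> list[str]:
    """One JSON line per verdict, ready to append to an audit log."""
    return [json.dumps({"question": question, **r}, ensure_ascii=False) for r in rankings]


# Hypothetical arbiter output with two defects: an invented id and an empty reason.
brief_ids = ["c1", "c2"]
rankings = [
    {"candidate_id": "c1", "role": "primary", "reason": "Section title matches the question."},
    {"candidate_id": "c3", "role": "discarded", "reason": ""},
]
problems = check_rankings(brief_ids, rankings)
```

If the list of problems is not empty, a reasonable policy is to retry the call or fall back, rather than write an incomplete verdict into the audit trail.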

Cost: one LLM call (about a second for a top-10 candidate pool). Cheaper than running embeddings on the entire document. Much cheaper than a wrong answer in production.

1.3 Conflicts and the audit trail

When methods disagree, the LLM must choose. Three basic rules help:

Trust the TOC on exact title matches. The author wrote that title. When a title such as 3.5 Positional Encoding directly matches the keywords of the question being asked, no statistical method should override it.
Trust co-occurring keywords with a strong signal. A line where primary and secondary keywords appear together (positional + sinusoidal, or positional + learned) is almost certainly relevant.
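The two rules shown above can be computed deterministically and passed to the arbiter as extra fields in the brief. The sketch below is one possible reading, under assumptions: the helper names are hypothetical, "exact title match" is interpreted as every question keyword appearing as a word in the section title, and the example lines are placeholders rather than text from the paper.

```python
def exact_title_match(section_title: str, question_keywords: list[str]) -> bool:
    """True when every question keyword appears as a word in the section title."""
    title_words = {word.strip(".,:;()").lower() for word in section_title.split()}
    return bool(question_keywords) and all(
        keyword.lower() in title_words for keyword in question_keywords
    )


def keywords_cooccur(line: str, primary: list[str], secondary: list[str]) -> bool:
    """True when at least one primary and one secondary keyword share the line."""
    text = line.lower()
    has_primary = any(keyword.lower() in text for keyword in primary)
    has_secondary = any(keyword.lower() in text for keyword in secondary)
    return has_primary and has_secondary


title_hit = exact_title_match("3.5 Positional Encoding", ["positional", "encoding"])
title_miss = exact_title_match("3.5 Positional Encoding", ["positional", "sinusoidal"])

# Placeholder lines, written for this example only.
strong = keywords_cooccur(
    "placeholder line with positional and sinusoidal terms",
    primary=["positional"],
    secondary=["sinusoidal", "learned"],
)
weak = keywords_cooccur(
    "placeholder line with positional only",
    primary=["positional"],
    secondary=["sinusoidal", "learned"],
)
```

These flags do not decide anything by themselves; they give the LLM an explicit signal to weigh when methods disagree.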
