LLM-as-a-Judge: AI Revolutionizes Invoice Assessment
The Importance of Evaluation in AI Systems
As part of our ongoing exploration of agentic AI systems, we have already addressed various aspects such as reasoning, tool usage, and managing complex workflows. However, as these systems gain autonomy and capability, a crucial question arises: how can we ensure that AI is functioning correctly?
Whether we are discussing a single model, an AI pipeline, or a multi-agent system, outcomes must be measured against an objective standard. Capability without evaluation is incomplete: a system whose outputs are never measured can be neither trusted nor improved.
Imagine you have set up an AI pipeline to read invoices from your suppliers and extract three key pieces of information: the Invoice ID, the Total Amount, and the Supplier Name. Once the extraction is complete, this data is stored in your database. But how can you be sure that this information is accurate?
Manually verifying thousands of documents is not viable at scale. Rule-based validation is brittle, and simple string comparisons fail on format variations such as "$1,250.00" versus "1250.00".
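To make the fragility concrete, here is a minimal Python sketch. The function names and sample values are illustrative assumptions, not part of any real pipeline. It compares an exact string check with a hand-written normalization rule, and shows that the rule still breaks on a format it was not written for.

```python
from decimal import Decimal, InvalidOperation


def exact_match(extracted: str, truth: str) -> bool:
    """Naive check: the two strings must be identical."""
    return extracted == truth


def amount_match(extracted: str, truth: str) -> bool:
    """Rule-based check: drop '$' and ',' and compare as decimals."""

    def to_decimal(text: str):
        cleaned = text.replace("$", "").replace(",", "").strip()
        try:
            return Decimal(cleaned)
        except InvalidOperation:
            return None  # the rule could not parse this format

    left, right = to_decimal(extracted), to_decimal(truth)
    return left is not None and left == right


print(exact_match("$1,250.00", "1250.00"))      # False
print(amount_match("$1,250.00", "1250.00"))     # True
print(amount_match("1.250,00 EUR", "1250.00"))  # False
```

Every new format needs another rule, which is the maintenance burden a judge model is meant to remove.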
LLM-as-a-Judge: An Innovative Solution
This is where the concept of LLM-as-a-Judge comes into play. Rather than relying on fragile validation logic or conducting manual audits, a language model can be used to evaluate the results. This model compares the data extracted by the AI pipeline with a ground truth, meaning values verified by humans, and produces a structured assessment that includes:
- An accuracy score
- A match classification
- A concise explanation of the decision
What is LLM-as-a-Judge?
The concept of LLM-as-a-Judge relies on using a large language model to evaluate, rather than perform, the primary task. This approach has become popular in production AI systems for several reasons:
- Scalability: it can evaluate thousands of records without a human reviewer checking each one.
- Flexibility: it handles fuzzy matches, format differences, and partial responses, where a simple string comparison would fail.
- Auditability: it provides a score and a human-readable explanation for each decision.
Without a ground truth, LLM-as-a-Judge can only verify the plausibility of the extracted data. However, with known reference values, it becomes a true precision measurement tool.
In this article, we will detail a complete implementation of this concept, from creating evaluation tables to analyzing results, including generating synthetic data and building the LLM-as-a-Judge function in Snowflake Cortex.
Step-by-Step Implementation
Initial Setup
To begin, we will create a dedicated database and schema for this tutorial.
Step 1: Creating the Tables
Three tables are needed. The extraction table contains the data extracted by your AI pipeline. The ground truth table contains the correct answers, verified by humans. Finally, the results table is where the judge records its scores.
- Extraction Table: this represents the output of your existing invoice extraction pipeline. For this tutorial, it will be filled with synthetic data featuring a mix of correct, partially correct, and incorrect extractions.
- Ground Truth Table: this contains the known correct values. In a real project, a team of reviewers would annotate a representative sample; even 50 to 100 verified invoices are sufficient for a meaningful reference.
- Evaluation Results Table: the judge records one row per field per invoice here, capturing the extracted value, the ground truth, the score (from 0.0 to 1.0), a match type category, and a plain language explanation.
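The three-table layout can be sketched as follows. This is a portable illustration using Python's standard-library sqlite3 module, not Snowflake SQL. Only INVOICE_EVAL_RESULTS is named in this article; the other two table names and every column name are assumptions chosen for the example.

```python
import sqlite3

# Hypothetical schema: one row per document in the first two tables,
# one row per (document, field) pair in the results table.
SCHEMA = """
CREATE TABLE INVOICE_EXTRACTIONS (
    doc_id        TEXT PRIMARY KEY,
    invoice_id    TEXT,
    total_amount  TEXT,
    supplier_name TEXT
);
CREATE TABLE INVOICE_GROUND_TRUTH (
    doc_id        TEXT PRIMARY KEY,
    invoice_id    TEXT,
    total_amount  TEXT,
    supplier_name TEXT
);
CREATE TABLE INVOICE_EVAL_RESULTS (
    doc_id             TEXT,
    field_name         TEXT,
    extracted_value    TEXT,
    ground_truth_value TEXT,
    score              REAL CHECK (score BETWEEN 0.0 AND 1.0),
    match_type         TEXT,
    reasoning          TEXT,
    PRIMARY KEY (doc_id, field_name)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
```

Storing amounts as text keeps the value exactly as extracted, so the judge sees the same formatting the pipeline produced.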
Step 2: Inserting Synthetic Data
Rather than waiting for a real batch of invoices, we will create 10 synthetic invoice documents with a variety of extraction outcomes. This allows us to visualize the judge's scores in a single run. For simplicity, let’s assume we have already processed the invoices and stored the data in a structured format.
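One way to produce such a batch is sketched below in Python. The supplier names, identifier patterns, and amounts are all invented for illustration, and the `Invoice` class and `make_invoices` helper are hypothetical names rather than part of any library.

```python
from dataclasses import dataclass

# Made-up supplier names, used only to generate sample data.
SUPPLIERS = ["Acme Industrial Supply", "Northwind Traders", "Globex Logistics"]


@dataclass(frozen=True)
class Invoice:
    doc_id: str
    invoice_id: str
    total_amount: str
    supplier_name: str

    def to_text(self) -> str:
        """Render a tiny plain-text invoice to give the judge some context."""
        return (
            f"INVOICE {self.invoice_id}\n"
            f"Supplier: {self.supplier_name}\n"
            f"Total due: USD {self.total_amount}\n"
        )


def make_invoices(count: int = 10) -> list:
    """Build `count` deterministic synthetic invoices (the ground truth)."""
    invoices = []
    for i in range(1, count + 1):
        supplier = SUPPLIERS[(i - 1) % len(SUPPLIERS)]
        amount = f"{1000 + 137.5 * i:.2f}"
        invoices.append(
            Invoice(f"DOC-{i:03d}", f"INV-2024-{i:04d}", amount, supplier)
        )
    return invoices


invoices = make_invoices()
print(len(invoices))
print(invoices[0].to_text())
```

Generating the data deterministically means every run of the evaluation sees the same ten documents, which makes judge scores comparable between runs.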
Step 3: Inserting the Extractions
Next, we insert the values extracted by the AI. In a real pipeline, these values would typically be extracted into structured tables using AI_EXTRACT capabilities. For this tutorial, we assume the extraction step is already complete and that the results are available in a structured table. The focus of this article is on the LLM-as-a-Judge evaluation, not on the extraction mechanisms.
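Since the extraction step is assumed rather than shown, a simple way to obtain a mix of outcomes is to perturb the ground truth on purpose. The Python sketch below does this. The three variant labels and the `simulate_extraction` helper are assumptions made for this example, not output from a real extractor.

```python
VARIANTS = ["exact", "format", "wrong"]


def simulate_extraction(truth: dict, variant: str) -> dict:
    """Return a fake extraction derived from one ground-truth record."""
    extracted = dict(truth)
    if variant == "exact":
        return extracted
    if variant == "format":
        # Same information, different surface form.
        extracted["total_amount"] = f"${float(truth['total_amount']):,.2f}"
        extracted["supplier_name"] = truth["supplier_name"].upper()
        return extracted
    if variant == "wrong":
        # Genuinely incorrect values.
        extracted["invoice_id"] = truth["invoice_id"][:-1] + "X"
        extracted["total_amount"] = f"{float(truth['total_amount']) * 10:.2f}"
        return extracted
    raise ValueError(f"unknown variant: {variant}")


truth = {
    "invoice_id": "INV-2024-0001",
    "total_amount": "1137.50",
    "supplier_name": "Acme Industrial Supply",
}
for variant in VARIANTS:
    print(variant, simulate_extraction(truth, variant))
```

Cycling through the variants across the batch yields correct, partially correct, and incorrect rows in known proportions, so the judge's scores can be checked against what was injected.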
Step 4: Creating the LLM-as-a-Judge Function
This is the heart of the pipeline. We create a UDF (User-Defined Function) in Snowflake that calls the COMPLETE function of Snowflake Cortex, in effect an API call to a hosted LLM, and asks it to evaluate one field at a time.
The function takes four arguments: the field name, the extracted value, the ground truth value, and the original document text for context. It returns a JSON object with a score, a match type label, and a one-sentence explanation.
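The shape of such a judge can be sketched in Python. This is not the Snowflake UDF itself. The prompt wording, the match type labels (EXACT, PARTIAL, MISMATCH), and the extra `complete` parameter are assumptions. `complete` stands in for the hosted LLM call so that the example runs without any service.

```python
import json
from typing import Callable

# Hypothetical prompt; the wording and labels are illustrative only.
JUDGE_PROMPT = """You are evaluating one field extracted from an invoice.
Field: {field_name}
Extracted value: {extracted}
Ground truth value: {truth}
Document text:
{document}

Reply with JSON only, using the keys "score" (0.0 to 1.0), "match_type"
(one of EXACT, PARTIAL, MISMATCH) and "reasoning" (one sentence)."""


def judge_field(
    field_name: str,
    extracted: str,
    truth: str,
    document: str,
    complete: Callable[[str], str],
) -> dict:
    """Ask the model to grade one field and return a validated result."""
    prompt = JUDGE_PROMPT.format(
        field_name=field_name, extracted=extracted, truth=truth, document=document
    )
    raw = complete(prompt)
    try:
        parsed = json.loads(raw)
        score = float(parsed["score"])
        match_type = str(parsed["match_type"])
        reasoning = str(parsed["reasoning"])
    except (ValueError, KeyError, TypeError):
        # The model did not follow the output contract.
        return {
            "score": 0.0,
            "match_type": "PARSE_ERROR",
            "reasoning": "Judge reply was not valid JSON.",
        }
    # Clamp the score to the documented 0.0 to 1.0 range.
    return {
        "score": min(1.0, max(0.0, score)),
        "match_type": match_type,
        "reasoning": reasoning,
    }


def fake_complete(prompt: str) -> str:
    """Canned reply used in place of a real model call."""
    return json.dumps(
        {
            "score": 0.9,
            "match_type": "PARTIAL",
            "reasoning": "Same amount, different formatting.",
        }
    )


print(judge_field("total_amount", "$1,137.50", "1137.50", "Total due: USD 1137.50", fake_complete))
```

Passing the model call in as a parameter keeps the parsing and validation logic testable on its own, independent of which hosted model sits behind it.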
Step 5: Running the Evaluation
With the function in place, we join the extraction and ground truth tables, unpivot the three fields into separate rows (so that each field is evaluated independently), call the judge on each row, and insert the results into INVOICE_EVAL_RESULTS.
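The join, unpivot, judge, and insert sequence can be sketched with Python's sqlite3 module as a stand-in for Snowflake. The unpivot is written as a UNION ALL of one SELECT per field. The input table names, the column names, and `stub_judge` are assumptions. The stub does an exact comparison only, so the example runs without an LLM.

```python
import sqlite3


def stub_judge(field_name: str, extracted: str, truth: str):
    """Stand-in for the LLM judge: exact comparison only."""
    same = extracted == truth
    return (1.0 if same else 0.0, "EXACT" if same else "MISMATCH")


# One SELECT per field turns each wide row into three narrow rows.
UNPIVOT_SQL = """
SELECT e.doc_id, 'invoice_id' AS field_name, e.invoice_id AS extracted, g.invoice_id AS truth
FROM INVOICE_EXTRACTIONS e JOIN INVOICE_GROUND_TRUTH g ON e.doc_id = g.doc_id
UNION ALL
SELECT e.doc_id, 'total_amount', e.total_amount, g.total_amount
FROM INVOICE_EXTRACTIONS e JOIN INVOICE_GROUND_TRUTH g ON e.doc_id = g.doc_id
UNION ALL
SELECT e.doc_id, 'supplier_name', e.supplier_name, g.supplier_name
FROM INVOICE_EXTRACTIONS e JOIN INVOICE_GROUND_TRUTH g ON e.doc_id = g.doc_id
"""

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE INVOICE_EXTRACTIONS (doc_id TEXT, invoice_id TEXT, total_amount TEXT, supplier_name TEXT);
    CREATE TABLE INVOICE_GROUND_TRUTH (doc_id TEXT, invoice_id TEXT, total_amount TEXT, supplier_name TEXT);
    CREATE TABLE INVOICE_EVAL_RESULTS (
        doc_id TEXT, field_name TEXT, extracted_value TEXT,
        ground_truth_value TEXT, score REAL, match_type TEXT
    );
    """
)
conn.execute(
    "INSERT INTO INVOICE_EXTRACTIONS VALUES (?,?,?,?)",
    ("DOC-001", "INV-0001", "$1,137.50", "Acme Industrial Supply"),
)
conn.execute(
    "INSERT INTO INVOICE_GROUND_TRUTH VALUES (?,?,?,?)",
    ("DOC-001", "INV-0001", "1137.50", "Acme Industrial Supply"),
)

# Judge every (document, field) pair and store one result row for each.
for doc_id, field_name, extracted, truth in conn.execute(UNPIVOT_SQL).fetchall():
    score, match_type = stub_judge(field_name, extracted, truth)
    conn.execute(
        "INSERT INTO INVOICE_EVAL_RESULTS VALUES (?,?,?,?,?,?)",
        (doc_id, field_name, extracted, truth, score, match_type),
    )

results = conn.execute(
    "SELECT field_name, score, match_type FROM INVOICE_EVAL_RESULTS ORDER BY field_name"
).fetchall()
print(results)
```

The stub scores the reformatted amount as a mismatch, which is exactly the case where an LLM judge would be expected to do better than a string comparison.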
A note on the reasoning column: why use a normalized CTE instead of passing the reasoning string through? The key architectural decision here is the normalized CTE. Rather than passing eval_result:reasoning::STRING to the output, we extract only match_type and score from the LLM result and immediately normalize match_type by stripping quotes and embedded spaces. From that point on, the raw text from the LLM is no longer referenced.
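The normalization step itself is small. Below is a Python sketch of the same idea, outside of SQL. The helper names are hypothetical, and the JSON keys mirror the score and match_type fields described above.

```python
import json


def normalize_match_type(raw) -> str:
    """Strip quote characters and all whitespace from a match type label."""
    text = str(raw).replace('"', "").replace("'", "")
    return "".join(text.split())


def extract_score_and_type(eval_result: str):
    """Keep only score and normalized match_type; drop the free text."""
    parsed = json.loads(eval_result)
    return float(parsed["score"]), normalize_match_type(parsed["match_type"])


# A reply where the model wrapped the label in extra quotes and spaces.
eval_result = json.dumps(
    {"score": 0.5, "match_type": '" PARTIAL MATCH "', "reasoning": "free text"}
)
print(extract_score_and_type(eval_result))
```

Because only two typed values leave this step, anything downstream that filters or groups by match type works on a clean, predictable label instead of raw model output.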
A Closed-Loop Evaluation Framework
The result of this implementation is a closed-loop evaluation framework in which AI outputs are continuously measured, monitored, and improved. That loop matters more as agentic AI systems become embedded in business workflows, because it is what keeps the data they process reliable over time.
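As one possible monitoring step, judge scores can be averaged per field and compared with a threshold. The Python sketch below assumes results arrive as (field name, score) pairs. The helper names and the 0.9 threshold are illustrative choices, not recommendations from this article.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_field(results) -> dict:
    """Average the judge score for each field name."""
    grouped = defaultdict(list)
    for field_name, score in results:
        grouped[field_name].append(score)
    return {name: mean(scores) for name, scores in grouped.items()}


def fields_below(summary: dict, threshold: float) -> list:
    """List the fields whose average score falls under the threshold."""
    return sorted(name for name, avg in summary.items() if avg < threshold)


results = [
    ("invoice_id", 1.0),
    ("invoice_id", 1.0),
    ("total_amount", 1.0),
    ("total_amount", 0.5),
]
summary = mean_score_by_field(results)
print(summary)
print(fields_below(summary, 0.9))
```

Tracking these averages per run shows which field is degrading, which is where extraction prompts or rules should be revisited first.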