Google Stax: Evaluating AI with Custom Criteria
Google Stax: A Tool for Accurate AI Evaluations
When building applications on large language models (LLMs), a recurring challenge is determining whether a change to a prompt actually improves outcomes. Lacking objective metrics, developers often fall back on intuition, a practice known as the "gut test." This subjective approach is problematic because LLMs can produce different outputs for the same or similar inputs, which makes traditional exact-match unit tests ineffective.
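To make the limitation concrete, the following minimal Python sketch contrasts an exact-match assertion with a simple criteria check. The function names, sample outputs, and criteria are illustrative assumptions and are not part of Stax.

```python
def exact_match(output: str, expected: str) -> bool:
    """Traditional unit-test style check: passes only on identical text."""
    return output == expected


def meets_criteria(output: str, required_terms: list[str], max_words: int) -> bool:
    """Criteria-based check: required facts are present and the answer is short."""
    text = output.lower()
    has_terms = all(term.lower() in text for term in required_terms)
    return has_terms and len(output.split()) <= max_words


# Two hypothetical answers to the same question, worded differently.
outputs = [
    "Refunds are available within 30 days of purchase.",
    "You can request a refund up to 30 days after buying.",
]
expected = outputs[0]

exact_results = [exact_match(o, expected) for o in outputs]
criteria_results = [meets_criteria(o, ["refund", "30 days"], 20) for o in outputs]

print(exact_results)
print(criteria_results)
```

Both answers are acceptable, yet only the first survives the exact-match check. A criteria-based check tolerates rewording, which is the general idea behind evaluating against explicit success criteria.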
Google Stax is aimed at this problem. Developed by Google DeepMind and Google Labs, this experimental tool aims to provide a precise and customized evaluation method for AI models and prompts, replacing subjective judgments with data-driven decisions.
Features of Google Stax
Google Stax stands out for its ability to simplify the evaluation of generative AI models and applications. It functions as a testing framework specifically designed to address the unique challenges posed by LLMs.
The tool allows users to define their own success criteria, going beyond generic metrics such as fluency or safety. It offers the ability to test different prompts across various models simultaneously and make informed decisions through the visualization of performance metrics, including quality, latency, and token usage. Additionally, Stax can conduct large-scale evaluations using user-specific datasets.
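As a rough illustration of what comparing quality, latency, and token usage involves, the sketch below averages those three metrics per model. The `RunResult` record, the `summarize` helper, and the sample numbers are hypothetical; they do not describe how Stax stores or computes its metrics.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    """One evaluated output (hypothetical record, not a Stax type)."""
    model: str
    quality: float    # evaluator score between 0 and 1
    latency_s: float  # response time in seconds
    tokens: int       # tokens consumed by the call


def summarize(results: list[RunResult]) -> dict[str, dict[str, float]]:
    """Average each metric per model so variants can be compared."""
    by_model: dict[str, list[RunResult]] = {}
    for r in results:
        by_model.setdefault(r.model, []).append(r)
    return {
        model: {
            "quality": mean([r.quality for r in runs]),
            "latency_s": mean([r.latency_s for r in runs]),
            "tokens": mean([r.tokens for r in runs]),
        }
        for model, runs in by_model.items()
    }


# Made-up sample data for two placeholder models.
results = [
    RunResult("model-a", 0.75, 1.5, 100),
    RunResult("model-a", 0.25, 2.5, 200),
    RunResult("model-b", 0.5, 1.0, 80),
]
summary = summarize(results)
for model, metrics in summary.items():
    print(model, metrics)
```

Keeping the three metrics side by side matters because a prompt that raises quality may also raise latency or token cost.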
Stax is designed to be flexible: through API integrations it supports not only Google’s Gemini models but also OpenAI's GPT models, Anthropic's Claude, Mistral, and others.
Beyond Standard Benchmarks
General AI benchmarks play a crucial role in tracking model progress at a global level. However, they often fail to reflect the specific requirements of certain domains. For instance, a model that performs well in general reasoning may struggle with specialized tasks such as compliance-focused summarization, legal document analysis, or company-specific question answering.
Stax is aimed at this gap between general benchmarks and real-world applications. It allows AI systems to be evaluated against user-specific data and criteria, rather than relying on abstract global scores.
Getting Started with Google Stax
Step 1: Adding an API Key
An API key is required to generate model outputs and run evaluations. Stax recommends starting with a Gemini API key, as the built-in evaluators use it by default. However, the tool can be configured to use other models. Users can add their first key during onboarding or later in the settings.
To compare multiple providers, it is advisable to add keys for each model to be tested, allowing for parallel comparisons without switching tools.
Step 2: Creating an Evaluation Project
Projects are the core of the workspace in Stax. Each project represents a unique evaluation experiment, such as testing a new system prompt or comparing two models.
Stax offers two types of projects:
- One for establishing a baseline or testing an iteration of a single model or system prompt.
- One for directly comparing two different models or prompts on the same dataset.
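The second project type amounts to scoring two variants on identical test cases and tallying which one comes out ahead. The sketch below shows that tally under stated assumptions: the `compare` helper and the 1-to-5 scores are hypothetical and do not reflect Stax's internal scoring.

```python
def compare(scores_a: list[float], scores_b: list[float]) -> dict[str, int]:
    """Count per-test-case wins for two variants scored on the same dataset."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Both variants must be scored on the same test cases")
    pairs = list(zip(scores_a, scores_b))
    return {
        "a_wins": sum(a > b for a, b in pairs),
        "b_wins": sum(b > a for a, b in pairs),
        "ties": sum(a == b for a, b in pairs),
    }


# Hypothetical 1-5 ratings for four shared test cases.
prompt_a_scores = [4, 3, 5, 2]
prompt_b_scores = [3, 3, 4, 4]

tally = compare(prompt_a_scores, prompt_b_scores)
print(tally)
```

Comparing case by case, rather than comparing two averages, makes it easier to spot the specific inputs where one variant regresses.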
Step 3: Building Your Dataset
An effective evaluation relies on representative data that reflects real use cases. Stax offers two main methods for constructing this dataset:
Option A: Manually Adding Data in the Prompt Playground
For those without an existing dataset, it is possible to build one from scratch:
- Select the model(s) to test.
- Define a system prompt (optional) to specify the role of the AI.
- Add user prompts representing real user inputs.
- Provide human evaluations (optional) to create baseline quality scores.
Each input, output, and evaluation is automatically recorded as a test case.
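One way to picture such a test case is as a small record whose optional fields mirror the optional steps above. The `EvalCase` class and its field names are assumptions made for illustration; Stax's actual storage format is not described in this article.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class EvalCase:
    """Hypothetical shape of one recorded test case (not a Stax type)."""
    model: str
    user_prompt: str
    output: str
    system_prompt: Optional[str] = None  # optional: role given to the AI
    human_rating: Optional[int] = None   # optional: baseline quality score


# A case recorded without a system prompt or human rating.
case = EvalCase(
    model="model-a",
    user_prompt="How do I reset my password?",
    output="Go to Settings and choose Reset Password.",
)

# A reviewer adds a rating later.
case.human_rating = 4

record = asdict(case)
print(record)
```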
Option B: Uploading an Existing Dataset
For teams with production data, it is possible to directly upload CSV files. If the dataset does not include model outputs, simply click "Generate Outputs" and select a model to generate them.
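Before uploading, it can help to check which rows of a CSV still lack a model output. The sketch below does this with Python's standard `csv` module. The column names `user_prompt` and `model_output` are assumptions chosen for the example, not a schema documented by Stax.

```python
import csv
import io

# Hypothetical dataset: the second row has no model output yet.
CSV_TEXT = """user_prompt,model_output
How do I reset my password?,Go to Settings and choose Reset Password.
What is your refund policy?,
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Collect prompts whose output column is empty or whitespace only.
missing = [r["user_prompt"] for r in rows if not r["model_output"].strip()]

print(f"{len(rows)} rows, {len(missing)} without outputs")
for prompt in missing:
    print("needs output:", prompt)
```

With a real file, `io.StringIO(CSV_TEXT)` would be replaced by `open(path, newline="")`.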
For thorough testing, include edge cases and adversarial examples in the dataset.
Evaluating AI Outputs
Conducting a Manual Evaluation
Human evaluations can be provided on individual outputs directly in the playground or on the project benchmark. While human evaluation is considered the "gold standard," it has drawbacks: it is slow, costly, and difficult to scale.