Google Stax: Evaluating AI with Custom Criteria

Key Takeaways

1. Google Stax allows developers to test AI models with custom criteria, replacing intuitive judgment with concrete data.

2. The tool supports various models, including those from Google, OpenAI, and Anthropic, facilitating accurate comparisons.

3. Stax bridges the gap between general benchmarks and specific needs, providing an assessment tailored to real-world use cases.

💡 Why it matters: Stax lets teams measure and tune AI model performance against their own requirements, which is crucial for specialized applications.

Google Stax: A Tool for Accurate AI Evaluations

When building applications on large language models (LLMs), a recurring challenge is determining whether a change to a prompt actually improves outcomes. Lacking objective metrics, developers often fall back on intuition, a practice known as the "gut test." This subjective approach is problematic because LLMs can produce different outputs for the same or similar inputs, so traditional unit tests that assert one exact expected value are ineffective.
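The contrast between an exact-match unit test and a criterion-based check can be illustrated with a minimal Python sketch. The two responses and the `meets_criterion` rule are hypothetical examples chosen for illustration; they are not part of Stax.

```python
# Two plausible responses to the same prompt; both are acceptable answers.
# (Hypothetical data, for illustration only.)
responses = [
    "Your refund will arrive within 5 business days.",
    "Refunds are processed in 5 business days.",
]

expected = "Your refund will arrive within 5 business days."


def exact_match(output: str) -> bool:
    """Traditional unit-test style check: brittle for free-form text."""
    return output == expected


def meets_criterion(output: str) -> bool:
    """Hypothetical criterion: mentions a refund and the 5-day window."""
    text = output.lower()
    return "refund" in text and "5 business days" in text


exact_results = [exact_match(r) for r in responses]
criterion_results = [meets_criterion(r) for r in responses]

print("exact match:", exact_results)
print("criterion:  ", criterion_results)
```

The exact-match check rejects the second response even though it conveys the same information, while the criterion-based check accepts both.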

Google Stax is designed to address this uncertainty. Developed by Google DeepMind and Google Labs, this experimental tool aims to provide a precise and customized evaluation method for AI models and prompts, replacing subjective judgments with data-driven decisions.

Features of Google Stax

Google Stax stands out for its ability to simplify the evaluation of generative AI models and applications. It functions as a testing framework specifically designed to address the unique challenges posed by LLMs.

The tool allows users to define their own success criteria, going beyond generic metrics such as fluency or safety. It offers the ability to test different prompts across various models simultaneously and make informed decisions through the visualization of performance metrics, including quality, latency, and token usage. Additionally, Stax can conduct large-scale evaluations using user-specific datasets.
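To make the idea of comparing models on quality, latency, and token usage concrete, here is a minimal Python sketch that averages those three metrics per model. The `EvalRecord` structure, the model names, and all numbers are assumptions made for illustration; Stax presents such metrics in its own interface, and this is not its API.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalRecord:
    """One evaluated output (hypothetical structure, not a Stax type)."""
    model: str
    quality: float     # evaluator score, assumed to lie in [0, 1]
    latency_ms: float  # time to produce the output
    tokens: int        # tokens consumed by the call


# Made-up records for two placeholder models.
records = [
    EvalRecord("model-a", 0.9, 800.0, 120),
    EvalRecord("model-a", 0.7, 1000.0, 100),
    EvalRecord("model-b", 0.6, 400.0, 80),
    EvalRecord("model-b", 0.8, 600.0, 60),
]


def summarize(rows):
    """Group records by model and average each metric."""
    by_model = {}
    for row in rows:
        by_model.setdefault(row.model, []).append(row)
    return {
        model: {
            "quality": mean(r.quality for r in group),
            "latency_ms": mean(r.latency_ms for r in group),
            "tokens": mean(r.tokens for r in group),
        }
        for model, group in by_model.items()
    }


summary = summarize(records)
for model, metrics in summary.items():
    print(model, metrics)
```

Viewing the three averages side by side makes trade-offs visible, for example a model that scores higher on quality but is slower and uses more tokens.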

Stax is designed to be flexible, supporting not only Google’s Gemini models but also GPT from OpenAI, Claude from Anthropic, Mistral, and others through API integrations.

Beyond Standard Benchmarks

General AI benchmarks play a crucial role in tracking model progress at a global level. However, they often fail to reflect the specific requirements of certain domains. For instance, a model that performs well in general reasoning may struggle with specialized tasks such as compliance-focused summarization, legal document analysis, or company-specific question answering.

Stax targets this gap between general benchmarks and real-world applications. It allows AI systems to be evaluated against user-specific data and criteria, rather than relying on abstract global scores.
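As a rough illustration of what domain-specific criteria might look like, the Python sketch below scores a summary against a small weighted rubric. The criteria, weights, and sample texts are hypothetical and much simpler than a real evaluator; they only show how custom criteria differ from a single global score.

```python
# Hypothetical rubric for a compliance-focused summary.
# Each entry maps a criterion name to (check function, weight).
CRITERIA = {
    "mentions_policy": (lambda s: "policy" in s.lower(), 0.5),
    "under_50_words": (lambda s: len(s.split()) <= 50, 0.3),
    "no_speculation": (lambda s: "probably" not in s.lower(), 0.2),
}


def rubric_score(summary: str) -> float:
    """Sum the weights of all criteria the summary satisfies."""
    return sum(weight for check, weight in CRITERIA.values() if check(summary))


def failed_criteria(summary: str) -> list:
    """List the names of criteria the summary does not satisfy."""
    return [name for name, (check, _) in CRITERIA.items() if not check(summary)]


good = "Policy 12 requires annual audits of all vendor contracts."
weak = "It will probably be fine."

print(rubric_score(good), failed_criteria(good))
print(rubric_score(weak), failed_criteria(weak))
```

Reporting which criteria failed, not just the total, shows where a prompt or model needs work.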

Getting Started with Google Stax

Step 1: Integrating the API Key

An API key is required to generate model outputs and run evaluations. Stax recommends starting with a Gemini API key, as the built-in evaluators use it by default. However, the tool can be configured to use other models. Users can add their first key during onboarding or later in the settings.

To compare multiple providers, it is advisable to add keys for each model to be tested, allowing for parallel comparisons without switching tools.

Step 2: Creating an Evaluation Project

Projects are the core of the workspace in Stax. Each project represents a unique evaluation experiment, such as testing a new system prompt or comparing two models.

Two types of projects are offered:

Single-model projects, for establishing a baseline performance or testing an iteration of a model or system prompt.
Side-by-side projects, for directly comparing two different models or prompts on the same dataset.
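One simple way to summarize a direct comparison of two models on the same dataset is to count, example by example, which one scored higher. The Python sketch below assumes each model already has one score per example; the function name and the scores are hypothetical, and this is not how Stax itself computes results.

```python
def side_by_side(scores_a, scores_b):
    """Count per-example wins and ties for two aligned score lists."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same examples")
    a_wins = sum(a > b for a, b in zip(scores_a, scores_b))
    b_wins = sum(b > a for a, b in zip(scores_a, scores_b))
    ties = len(scores_a) - a_wins - b_wins
    return {"a_wins": a_wins, "b_wins": b_wins, "ties": ties}


# Made-up ratings on a 1-5 scale for four shared examples.
scores_a = [4, 3, 5, 2]
scores_b = [3, 3, 4, 4]

comparison = side_by_side(scores_a, scores_b)
print(comparison)
```

Scoring both models on identical examples is what makes the counts comparable. The length check guards against accidentally mixing datasets.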

Step 3: Building Your Dataset

An effective evaluation relies on accurate data reflecting real use cases. Stax offers two main methods for constructing this dataset:

Option A: Manually Adding Data in the Prompt Playground

For those without an existing dataset, it is possible to build one from scratch:

Select the model(s) to test.
Define a system prompt (optional) to specify the role of the AI.
Add user prompts representing real user inputs.
Provide human evaluations (optional) to create baseline quality scores.

Each input, output, and evaluation is automatically recorded as a test case.
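Conceptually, each recorded test case bundles the fields from the steps above. The Python sketch below models that as a small data structure. The class and field names are assumptions made for illustration and do not reflect how Stax stores data internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlaygroundCase:
    """One recorded test case (hypothetical structure)."""
    model: str
    user_prompt: str
    output: str
    system_prompt: Optional[str] = None   # optional, as in the steps above
    human_rating: Optional[float] = None  # optional baseline quality score


dataset = []


def record(model, user_prompt, output, system_prompt=None, human_rating=None):
    """Append a new test case to the in-memory dataset and return it."""
    case = PlaygroundCase(model, user_prompt, output, system_prompt, human_rating)
    dataset.append(case)
    return case


# Placeholder model name and texts.
record(
    model="model-a",
    user_prompt="Where is my refund?",
    output="Refunds are processed in 5 business days.",
    system_prompt="You are a support agent.",
)
print(dataset[0])
```

Keeping the optional fields explicit means a case without a system prompt or human rating is still a valid record.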

Option B: Uploading an Existing Dataset

For teams with production data, it is possible to directly upload CSV files. If the dataset does not include model outputs, simply click "Generate Outputs" and select a model to generate them.

It is recommended to include edge cases and adversarial examples in the dataset to ensure thorough testing.
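A dataset file of this kind can be prepared with Python's standard `csv` module. The sketch below writes a few rows, including an edge case with an empty user prompt, and then finds the rows that still lack a model output. The column names are assumptions made for illustration; the columns Stax actually expects should be taken from its upload dialog or documentation.

```python
import csv
import io

# Hypothetical column names; check the tool's upload requirements.
FIELDNAMES = ["system_prompt", "user_prompt", "model_output"]

rows = [
    {
        "system_prompt": "You are a support agent.",
        "user_prompt": "Where is my refund?",
        "model_output": "Refunds are processed in 5 business days.",
    },
    {
        "system_prompt": "You are a support agent.",
        "user_prompt": "Cancel my order, then un-cancel it.",
        "model_output": "",
    },
    # Edge case: empty user input.
    {"system_prompt": "You are a support agent.", "user_prompt": "", "model_output": ""},
]


def to_csv(records):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def rows_missing_output(csv_text):
    """Return zero-based indexes of data rows with no model output."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [i for i, row in enumerate(reader) if not row["model_output"].strip()]


csv_text = to_csv(rows)
missing = rows_missing_output(csv_text)
print(csv_text)
print("rows still needing outputs:", missing)
```

Rows flagged as missing are the ones for which outputs would still need to be generated before evaluation.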

Evaluating AI Outputs

Conducting a Manual Evaluation

Human evaluations can be provided on individual outputs directly in the playground or on the project benchmark. While human evaluation is considered the "gold standard," it has drawbacks: it is slow, costly, and difficult to scale.
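The scaling problem can be made concrete with back-of-the-envelope arithmetic. The Python sketch below estimates reviewer hours from assumed figures; the dataset sizes, the one-minute review time, and the number of reviewers are illustrative assumptions, not measurements.

```python
def review_hours(num_outputs, seconds_per_output, reviewers_per_output=1):
    """Estimate total human review time in hours."""
    total_seconds = num_outputs * seconds_per_output * reviewers_per_output
    return total_seconds / 3600


# Assumed figures: one minute per output.
small = review_hours(50, 60)                                # a playground-sized set
large = review_hours(10_000, 60, reviewers_per_output=2)    # a production-sized set

print(f"small set: {small:.1f} hours")
print(f"large set: {large:.1f} hours")
```

Under these assumptions, review effort grows linearly with dataset size and with the number of reviewers per output, which is why manual evaluation alone becomes impractical for large datasets.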