Google Secures AI Agents with New Roadmap

Key Takeaways

1. Google predicts that AI agents will generate $2.9 trillion in economic value in the United States by 2030.

2. Google's AI Control Roadmap introduces a multilayered safety approach to protect advanced AI agents.

3. Google uses an asynchronous monitoring system that has analyzed one million coding agent tasks, and the findings inform its real-time safety monitoring.

💡 Why it matters: The safety of AI agents is crucial to preventing misaligned behavior and ensuring trust in emerging technologies.

The Economic Impact of Artificial Intelligence Agents

Artificial intelligence (AI) agents are redefining how we interact with technology by automating complex tasks ranging from cybersecurity to scientific innovation and product development. Google projects that in the United States these systems could generate $2.9 trillion in economic value by 2030. This projection underscores the growing importance of AI agents in the economy.

However, as these agents become more capable and autonomous, the need for sophisticated protections also increases. To address this need, Google has developed the AI Control Roadmap, a framework for building and managing advanced AI systems. This approach, described as "defense in depth," aims to go beyond traditional model alignment by adding a layer of security at the system level, providing assurances even if the alignment is imperfect.

The Foundations of AI Control

Google's security strategy builds on traditional protections such as sandboxing, endpoint security, and command injection resistance. The AI Control Roadmap uses model alignment as the first line of defense, training the AI to be intrinsically safe and useful. It then treats internal agents as potentially misaligned anyway, so that system-level controls still hold if that training falls short.

This method can be likened to a driving instructor with dual controls. The instructor trusts the student but is ready to intervene in case of a mistake. Similarly, Google's AI control system grants agents permissions based on their verified behavior, allowing trust to be built through controlled and gradual access.

The roadmap is based on a cautious assumption: what if a highly capable AI agent is not aligned with our goals and acts unexpectedly? To address this, the plan focuses on three key areas.

First, following cybersecurity best practices, Google has developed a threat modeling framework for AI. The roadmap treats untrusted AI agents as potential "insider threats," akin to a rogue employee with access to the premises. Drawing on the industry-standard MITRE ATT&CK framework, a global knowledge base of adversary tactics and techniques, Google breaks potential attacks down into individual tactics and techniques. This allows it to track risks methodically, spot undesirable behaviors before they cause harm, and run exercises that test the effectiveness of defenses.

Second, Google is developing a plan to deploy AI risk mitigation measures:
- Problem Identification (Detection): Monitoring is essential for AI control. Google uses other trusted AI systems as "supervisors" that continuously examine an agent's reasoning, actions, and plans to check whether it is deviating from its goal.
- Problem Management (Prevention and Response): If a supervisor detects that the agent is attempting something harmful, it can intervene to block the action before any damage occurs.
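
The detect-then-block pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's implementation: the names `ProposedAction`, `Verdict`, `keyword_supervisor`, and `run_with_supervision` are hypothetical, and the supervisor here is a toy rule set standing in for a trusted AI model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    """An action the agent wants to take, with the reasoning it gave (hypothetical shape)."""
    command: str
    reasoning: str


@dataclass
class Verdict:
    """The supervisor's decision about one proposed action."""
    allowed: bool
    reason: str


def keyword_supervisor(action: ProposedAction) -> Verdict:
    """Toy stand-in for a trusted supervisor model.

    Assumption: a real supervisor would reason over the agent's plans and
    behavior; this rule list exists only to make the sketch runnable.
    """
    risky_markers = ("rm -rf", "drop table", "disable logging")
    lowered = action.command.lower()
    for marker in risky_markers:
        if marker in lowered:
            return Verdict(False, f"matched risky pattern: {marker!r}")
    return Verdict(True, "no risky pattern found")


def run_with_supervision(
    action: ProposedAction,
    supervisor: Callable[[ProposedAction], Verdict],
    execute: Callable[[ProposedAction], str],
) -> str:
    """Ask the supervisor first; execute only if the action is allowed."""
    verdict = supervisor(action)
    if not verdict.allowed:
        # Prevention: the action is never passed to the executor.
        return f"BLOCKED: {verdict.reason}"
    return execute(action)
```

The key design point is that the executor is only reachable through the supervisor check, so a blocked action cannot cause side effects.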

Finally, to ensure that systems operate effectively, Google constantly measures performance across three critical indicators:

- Coverage (the fraction of traffic that is monitored)
- Recall (the fraction of misaligned behaviors that are detected)
- Response time
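
As a rough illustration of how these three indicators could be computed from logged data, here is a minimal Python sketch. The record layout (`TrajectoryRecord`) and the choice of the median as the response-time summary are assumptions made for the example; the roadmap summary above does not specify either.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class TrajectoryRecord:
    """One agent trajectory with audit labels (hypothetical schema)."""
    monitored: bool    # was this trajectory seen by a monitor?
    misaligned: bool   # ground-truth label, e.g. from a later audit
    flagged: bool      # did the monitor raise an alert?
    response_seconds: Optional[float] = None  # delay between behavior and response


def coverage(records: List[TrajectoryRecord]) -> float:
    """Fraction of all traffic that was monitored."""
    if not records:
        raise ValueError("no records to measure")
    return sum(r.monitored for r in records) / len(records)


def recall(records: List[TrajectoryRecord]) -> Optional[float]:
    """Fraction of misaligned behaviors that the monitor detected."""
    positives = [r for r in records if r.misaligned]
    if not positives:
        return None  # recall is undefined without any misaligned cases
    return sum(r.flagged for r in positives) / len(positives)


def median_response_time(records: List[TrajectoryRecord]) -> Optional[float]:
    """Median delay for flagged events (the median is an assumed summary statistic)."""
    times = [
        r.response_seconds
        for r in records
        if r.flagged and r.response_seconds is not None
    ]
    return median(times) if times else None
```

Note that recall needs ground-truth labels for misaligned behavior, which in practice would have to come from audits or red-team exercises rather than from the monitor itself.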

Strengthening Security as AI Advances

As AI models continue to progress, defenses must also strengthen. In the AI Control Roadmap, Google maps security protocols to measurable milestones of AI capabilities on two critical fronts:

- Ability to Evade Detection: Currently, Google monitors a model's visible chain of thought. However, as models learn to conceal their reasoning, whether because they are aware of monitoring or because their reasoning becomes opaque, simply reading their stated reasoning will no longer suffice. It will be necessary to analyze their behavior and potentially inspect the internal workings of the model.
- Potential to Cause Harm: For low-risk and reversible actions, Google can rely on reviewing transcripts with a delay and quickly remediating issues after they occur (asynchronous response). For high-risk actions, such as major cyberattacks, it is crucial to shift to real-time prevention, blocking harmful actions before they can be executed (synchronous response). In other words, the response strategy must adapt to the severity of the threat.
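
The split between asynchronous and synchronous response can be expressed as a small routing rule. The sketch below is hypothetical: `ActionProfile`, `ResponseMode`, and `choose_response_mode` are illustrative names, and the handling of actions that are low-risk but irreversible (a case the text above does not cover) is a conservative assumption.

```python
from dataclasses import dataclass
from enum import Enum


class ResponseMode(Enum):
    ASYNC_REVIEW = "review the transcript later and remediate after the fact"
    SYNC_BLOCK = "check before execution and block if harmful"


@dataclass
class ActionProfile:
    """Risk attributes of a class of agent actions (hypothetical schema)."""
    name: str
    high_risk: bool
    reversible: bool


def choose_response_mode(profile: ActionProfile) -> ResponseMode:
    """Pick a response strategy that matches the severity of the action."""
    # Only actions that are both low-risk and reversible are reviewed after the fact.
    if not profile.high_risk and profile.reversible:
        return ResponseMode.ASYNC_REVIEW
    # Assumption: anything high-risk OR irreversible is gated in real time.
    return ResponseMode.SYNC_BLOCK
```

For example, editing a scratch file would route to `ASYNC_REVIEW`, while an action with major, hard-to-undo consequences would route to `SYNC_BLOCK`.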

The AI Control Roadmap recommends security measures based on the capabilities of the deployed model. The model's evasion capability determines the detection level (D1-D4), and the model's attack execution capability determines the prevention and response level (R1-R3).

Analyzing Agent Trajectories for Real-Time Monitoring

To build these systems, data is necessary. Google's team has developed an internal prototype of an asynchronous monitoring system that examines the trajectories of coding agents and escalates results for human review. Each reported event is classified according to the threat taxonomy of the roadmap.

Google has already analyzed one million coding agent tasks, providing valuable insights to refine security protocols and enabling the identification of high-signal behavior patterns beyond simple keyword filtering. For instance, this research has been crucial in building a live monitor for the Gemini Spark agent, allowing for real-time responses to emerging issues, such as the inadvertent deletion of data.

It is important to note that the majority of reported events do not stem from malicious intent; they often result from a misunderstanding on the agent's part or from excessive eagerness to achieve a user's goal. Understanding these nuances is essential for refining security protocols.

A Shared Responsibility for Agent Security

The AI Control Roadmap is a key element of Google's multilayered approach to agent security. By integrating these protocols into the broader infrastructure, Google strengthens its internal security to safely manage the most advanced AI models.

Moreover, this multilayered approach to agent security should be a collaborative priority for the industry, policymakers, and the academic world. By aligning the ecosystem around best practices and standards, it is possible to empower cybersecurity defenders and build societal resilience. This is why Google is also publishing a technical framework for policymakers, titled "Three Layers of Agent Security." This document details how to enhance security at three levels: individual agents, multi-agent systems, and the broader ecosystem, where the goal is to empower cybersecurity defenders and strengthen resilience.

Google intends to leverage these frameworks to confidently deploy capable AIs today while continuing to build a secure foundation for the future.