Gemini Robotics-ER 1.6: Advances in Robotic Reasoning


Key Takeaways

1. Gemini Robotics-ER 1.6 enhances spatial reasoning and multi-view understanding, increasing the autonomy of robots.

2. The model enables the reading of complex instruments, a crucial advancement for industrial inspections with Boston Dynamics.

3. Available through the Gemini API and Google AI Studio, it offers enhanced safety and success detection capabilities.

💡 Why it matters: This update strengthens the efficiency and safety of robots in complex industrial environments, paving the way for smarter automation.

Introduction to Embodied Reasoning

For robots to be truly useful in our daily lives and industries, they must go beyond executing simple instructions. They need to be capable of reasoning about the physical world around them. Whether it's navigating a complex environment or interpreting the needle of a pressure gauge, embodied reasoning allows robots to bridge the gap between digital intelligence and physical action.

Introducing Gemini Robotics-ER 1.6

Today, we introduce Gemini Robotics-ER 1.6, a significant update to our reasoning-focused model. This model enables robots to understand their environment with unprecedented accuracy. By enhancing spatial reasoning and multi-view understanding, we are bringing a new level of autonomy to the next generation of physical agents.

Advanced Reasoning Capabilities

This model specializes in reasoning capabilities that are critical for robotics, including visual and spatial understanding, task planning, and success detection. It acts as the high-level reasoning model for a robot and can execute tasks by natively calling tools: Google Search to find information, vision-language-action (VLA) models, or any other user-defined function.
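
To make the orchestration role concrete, here is a minimal sketch of a tool dispatcher, assuming the reasoning model's tool requests reach your code as a tool name plus keyword arguments. The `ToolCall` type, the `ToolRegistry` class, and both example tools are hypothetical stand-ins for illustration; they are not part of the Gemini API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ToolCall:
    """Hypothetical shape of a tool request emitted by the reasoning model."""
    name: str
    args: Dict[str, Any] = field(default_factory=dict)


class ToolRegistry:
    """Maps tool names to plain Python callables."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def dispatch(self, call: ToolCall) -> Any:
        # Reject tools the robot does not expose instead of guessing.
        if call.name not in self._tools:
            raise KeyError(f"unknown tool: {call.name}")
        return self._tools[call.name](**call.args)


def run_vla(instruction: str) -> str:
    """Stand-in for a VLA policy that would actually move the robot."""
    return f"executed: {instruction}"


def get_battery_level() -> float:
    """Stand-in for a user-defined function."""
    return 0.82


registry = ToolRegistry()
registry.register("run_vla", run_vla)
registry.register("get_battery_level", get_battery_level)

result = registry.dispatch(
    ToolCall("run_vla", {"instruction": "pick up the blue pen"})
)
print(result)
```

Keeping the registry explicit means the reasoning model can only trigger functions the developer has deliberately exposed.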

Improvements Over Previous Versions

Gemini Robotics-ER 1.6 shows a significant improvement over Gemini Robotics-ER 1.5 and Gemini 3.0 Flash, specifically enhancing spatial and physical reasoning capabilities such as pointing, counting, and success detection. We are also unlocking a new capability: instrument reading, allowing robots to read complex gauges and level glasses—a use case we discovered through close collaboration with our partner, Boston Dynamics.

Availability for Developers

Starting today, Gemini Robotics-ER 1.6 is available for developers via the Gemini API and Google AI Studio. To help you get started, we are sharing a developer Colab containing examples of model setup and its use for embodied reasoning tasks.

Pointing: The Foundation of Spatial Reasoning

Pointing is a fundamental capability for an embodied reasoning model, and it has improved with each model generation. Points can be used to express many concepts, including:

Spatial reasoning: Accurate detection of objects and counting
Relational logic: Making comparisons, such as identifying the smallest item in a set; defining "from-to" relationships (e.g., moving X to location Y)
Movement reasoning: Mapping trajectories and identifying optimal grasp points
Constraint compliance: Reasoning through complex instructions like "point to each object small enough to fit in the blue cup"

Gemini Robotics-ER 1.6 can use points as intermediate steps to reason about more complex tasks. For example, it can use points to count items in an image, or it can identify salient points and then run mathematical operations on them to improve its metric estimates.
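
As a rough illustration of how points support counting, the sketch below parses a point list and tallies it by label. It assumes the model returns a JSON list of `{"point": [y, x], "label": ...}` entries with coordinates normalized to 0-1000, which is the format documented for Gemini Robotics-ER 1.5; check the current API documentation before relying on it for 1.6. The response text is made up for the example and is not real model output.

```python
import json
from collections import Counter
from typing import Dict, List, Tuple

# Made-up response text for illustration; not real model output.
RESPONSE = """
[
  {"point": [250, 500], "label": "hammer"},
  {"point": [400, 120], "label": "hammer"},
  {"point": [800, 900], "label": "scissors"}
]
"""


def parse_points(text: str) -> List[dict]:
    """Decode the JSON list of labeled points."""
    return json.loads(text)


def to_pixels(point: List[float], width: int, height: int) -> Tuple[int, int]:
    """Convert an assumed [y, x] point on a 0-1000 scale to (x, y) pixels."""
    y_norm, x_norm = point
    return round(x_norm / 1000 * width), round(y_norm / 1000 * height)


def count_by_label(points: List[dict]) -> Dict[str, int]:
    """Count how many points were returned for each label."""
    return dict(Counter(p["label"] for p in points))


points = parse_points(RESPONSE)
counts = count_by_label(points)
pixels = [to_pixels(p["point"], width=640, height=480) for p in points]

print(counts)
print(pixels)
```

A label that never appears in the list simply has no count, which mirrors the desired behavior of not pointing to objects that are absent from the image.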

Pointing Example

The example below demonstrates the strengths of Gemini Robotics-ER 1.6 in pointing to multiple items, knowing when to point and when not to point. Gemini Robotics-ER 1.6 correctly identifies the number of hammers (2), scissors (1), brushes (1), pliers (6), and a collection of gardening tools that can be interpreted as either a single group or multiple points. It does not point to requested items that are not present in the image—a wheelbarrow and a Ryobi drill. In comparison, Gemini Robotics-ER 1.5 fails to identify the correct number of hammers or brushes, completely misses the scissors, hallucinates a wheelbarrow, and lacks precision in pointing to the pliers. Gemini 3.0 Flash is close to Gemini Robotics-ER 1.6 but does not handle the pliers as well.

Success Detection: The Engine of Autonomy

In robotics, knowing when a task is complete is just as important as knowing how to start it. Success detection is a cornerstone of autonomy, serving as a critical decision-making engine that enables an agent to intelligently choose between retrying a failed attempt or progressing to the next step of a plan.
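
The retry-or-advance decision can be sketched as a small control loop. This is an illustrative sketch only: `execute` and `detect_success` are hypothetical callables standing in for the robot's action policy and for a success-detection query to the model, and the retry limit of three is an arbitrary assumption.

```python
from typing import Callable, Dict, List


def run_plan(
    steps: List[str],
    execute: Callable[[str], None],
    detect_success: Callable[[str], bool],
    max_attempts: int = 3,
) -> List[str]:
    """Run each step, retrying on failure and stopping if a step never succeeds."""
    log: List[str] = []
    for step in steps:
        for attempt in range(1, max_attempts + 1):
            execute(step)
            if detect_success(step):
                log.append(f"{step}: ok after {attempt} attempt(s)")
                break  # success: move on to the next step
        else:
            # The inner loop finished without a success verdict.
            log.append(f"{step}: failed after {max_attempts} attempts")
            break  # abandon the rest of the plan
    return log


# Fake robot for the demo: "place pen" only succeeds on the second try.
attempts: Dict[str, int] = {}


def fake_execute(step: str) -> None:
    attempts[step] = attempts.get(step, 0) + 1


def fake_detect(step: str) -> bool:
    needed = 2 if step == "place pen" else 1
    return attempts[step] >= needed


log = run_plan(["grasp pen", "place pen"], fake_execute, fake_detect)
print(log)
```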

Multi-View Reasoning

Achieving visual understanding in robotics is a challenge, requiring sophisticated perception and reasoning capabilities combined with extensive world knowledge to manage complicating factors such as occlusions, poor lighting, and ambiguous instructions. Additionally, most modern robotic setups include multiple camera views, such as an overhead view and a wrist-mounted view. This means a system must understand how different perspectives combine to form a coherent image at any moment and over time.

Advances in Gemini Robotics-ER 1.6

Gemini Robotics-ER 1.6 advances multi-view reasoning, allowing the system to better understand multiple camera streams and their relationship to one another, even in dynamic or occluded environments, as demonstrated in the typical multi-view scenario below. Gemini Robotics-ER 1.6 takes cues from multiple camera views to determine when the task "put the blue pen in the black pen holder" is complete.

Instrument Reading: Visual Reasoning in the Real World

To understand a key strength of Gemini Robotics-ER 1.6, we must examine how it combines capabilities such as spatial reasoning and world knowledge to solve complex real-world problems. A perfect example is instrument reading.

Collaboration with Boston Dynamics

This task arises from the needs of facility inspection, a critical focus area for our partners at Boston Dynamics. Industrial facilities contain many instruments—thermometers, pressure gauges, chemical level glasses, and more—that require constant monitoring. Spot, a robotic product from Boston Dynamics, is capable of visiting instruments throughout the facility and capturing images of them.

Instrument Reading Capability

Gemini Robotics-ER 1.6 enables robots to interpret a variety of instruments, including circular gauges, vertical level indicators, and modern digital displays.

Complex Visual Reasoning

Reading instruments requires complex visual reasoning. It involves accurately perceiving a variety of inputs, including needles, liquid levels, container boundaries, and graduation marks, and understanding how they relate to one another. In the case of level glasses, this involves estimating how much liquid fills the glass while accounting for distortion due to the camera's perspective. Pressure gauges typically have text describing the unit, which must be read and interpreted, and some have multiple needles that refer to different decimal places and must be combined into a single reading.
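
The arithmetic behind two of these cases can be sketched as follows. Both functions are illustrative assumptions rather than the model's actual method: the fill estimate assumes the pixel rows of the glass top, glass bottom, and liquid surface are already known and ignores perspective correction, and the needle combination assumes each needle reads one digit at a given decimal place, ordered from coarsest to finest.

```python
import math
from typing import List, Tuple


def fill_fraction(top_y: float, bottom_y: float, liquid_y: float) -> float:
    """Fraction of a level glass that is filled.

    Image y grows downward, so bottom_y must be larger than top_y.
    """
    if bottom_y <= top_y:
        raise ValueError("bottom_y must be below top_y in image coordinates")
    fraction = (bottom_y - liquid_y) / (bottom_y - top_y)
    return min(1.0, max(0.0, fraction))  # clamp to the physical range


def combine_needles(readings: List[Tuple[float, float]]) -> float:
    """Combine (needle_value, place_multiplier) pairs, coarsest first.

    Coarser needles contribute only their whole digit; the finest
    needle contributes its full value.
    """
    total = 0.0
    for index, (value, multiplier) in enumerate(readings):
        is_finest = index == len(readings) - 1
        digit = value if is_finest else math.floor(value)
        total += digit * multiplier
    return total


level = fill_fraction(top_y=100, bottom_y=500, liquid_y=200)
reading = combine_needles([(3.47, 100), (4.7, 10), (7.0, 1)])
print(level, reading)
```

In the second call the hundreds needle sits between 3 and 4 and the tens needle between 4 and 5, so only their whole digits count and the finest needle supplies the remainder.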

Agentic Vision

Capabilities like instrument reading and more reliable task reasoning will enable Spot to see, understand, and react to real-world challenges in a fully autonomous manner. Gemini Robotics-ER 1.6 achieves its highly accurate instrument readings by using agentic vision, which combines visual reasoning with code execution. The model takes intermediate steps: first zooming in on an image to get a better read of small details on a gauge, then using pointing and code execution to estimate proportions and intervals for an accurate reading, and finally applying its world knowledge to interpret the meaning.
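
To give a sense of the proportion step, here is a sketch of the kind of calculation that code execution could perform once points are available for the gauge center, the needle tip, and two labeled graduation marks. The function names and the example coordinates are hypothetical, and the sketch assumes a scale that increases clockwise; it is not the code the model actually generates.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x, y) in image pixels, y grows downward


def angle_deg(center: Point, point: Point) -> float:
    """Clockwise angle in degrees from straight up, as seen in the image."""
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    return math.degrees(math.atan2(dx, -dy)) % 360


def read_gauge(
    center: Point,
    needle_tip: Point,
    mark_lo: Tuple[Point, float],
    mark_hi: Tuple[Point, float],
) -> float:
    """Linearly interpolate the needle between two labeled graduation marks."""
    a_lo = angle_deg(center, mark_lo[0])
    a_hi = angle_deg(center, mark_hi[0])
    a_needle = angle_deg(center, needle_tip)
    span = (a_hi - a_lo) % 360        # clockwise sweep between the marks
    offset = (a_needle - a_lo) % 360  # clockwise sweep up to the needle
    if span == 0:
        raise ValueError("graduation marks must be at different angles")
    return mark_lo[1] + (offset / span) * (mark_hi[1] - mark_lo[1])


# Example gauge: 0 at lower left, 100 at lower right, needle straight up.
value = read_gauge(
    center=(100, 100),
    needle_tip=(100, 40),
    mark_lo=((50, 150), 0.0),
    mark_hi=((150, 150), 100.0),
)
print(round(value, 2))
```

With the needle pointing straight up it sits halfway through the 270-degree sweep, so the interpolated value comes out at about 50.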

Our Safest Robotic Model Yet

Safety is integrated at every level of our embodied reasoning models. Gemini Robotics-ER 1.6 is our safest robotic model to date, demonstrating superior compliance with Gemini safety policies on adversarial spatial reasoning tasks compared to all previous generations.

Enhanced Safety

The model also shows a substantially improved ability to adhere to physical safety constraints. For example, when producing spatial outputs such as points, it makes safer decisions about which objects can be manipulated under grasping or material constraints (e.g., "do not handle liquids," "do not lift objects weighing more than 20 kg").
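
As an illustration of what such constraints mean downstream, the sketch below filters candidate objects against the two example rules quoted above. The `Candidate` fields are hypothetical: in practice the mass estimate and the liquid flag would come from the model's own reasoning about the scene, not from hand-entered values.

```python
from dataclasses import dataclass
from typing import List

MAX_LIFT_KG = 20.0  # from the example constraint "do not lift more than 20 kg"


@dataclass
class Candidate:
    """Hypothetical description of an object the robot might manipulate."""
    label: str
    est_mass_kg: float
    contains_liquid: bool


def safe_to_manipulate(obj: Candidate) -> bool:
    """Apply the two example constraints: no liquids, nothing over 20 kg."""
    if obj.contains_liquid:
        return False
    if obj.est_mass_kg > MAX_LIFT_KG:
        return False
    return True


candidates: List[Candidate] = [
    Candidate("wrench", est_mass_kg=0.6, contains_liquid=False),
    Candidate("water jug", est_mass_kg=4.0, contains_liquid=True),
    Candidate("engine block", est_mass_kg=85.0, contains_liquid=False),
]

allowed = [c.label for c in candidates if safe_to_manipulate(c)]
print(allowed)
```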

Safety Testing

We have also tested how well the model identifies safety hazards in text and video scenarios based on real accident reports. On these tasks, our Gemini Robotics-ER models improve on the baseline performance of Gemini 3.0 Flash (+6% in text and +10% in video) by accurately perceiving injury risks.

Let's Collaborate to Enhance Embodied Reasoning for Robotics

We are committed to ensuring that Gemini Robotics-ER delivers maximum value to the robotics community. If the current capabilities fall short for your specialized application, we invite you to submit this form with 10 to 50 labeled images illustrating specific failure modes to help us build more robust reasoning features. We look forward to collaborating with you to enhance these capabilities in our upcoming releases.