NVIDIA and Google Optimize Large-Scale AI Inference

Key Takeaways

1. At Google Cloud Next, NVIDIA and Google unveiled A5X instances, which promise to cut the cost per token of AI inference by up to ten times.

2. The new systems can scale to 960,000 GPUs in a multi-site deployment, which requires sophisticated load management to keep processors from sitting idle.

3. Google and NVIDIA are introducing solutions for data sovereignty, a crucial requirement for regulated sectors like finance and healthcare.

💡 Why it matters: These innovations could improve the efficiency and security of large-scale AI deployments across a range of industrial sectors.

NVIDIA and Google: A Major Breakthrough in AI Inference

At Google Cloud Next, NVIDIA and Google unveiled a joint strategy to reshape AI inference. The two companies presented a hardware roadmap aimed at significantly reducing the cost of large-scale AI inference. The initiative centers on new A5X bare-metal instances, which run on NVIDIA Vera Rubin NVL72 systems. Thanks to close collaboration between hardware and software teams, the architecture promises to cut the cost per token by up to ten times compared to previous generations, while delivering up to ten times more token throughput per megawatt.
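
To make the two headline metrics concrete, here is a minimal arithmetic sketch. The hourly cost, throughput, and power figures are purely hypothetical placeholders, not published A5X numbers; only the "up to ten times" ratio comes from the announcement.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Cost of serving one million tokens at a steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


def tokens_per_second_per_megawatt(tokens_per_second: float, power_kw: float) -> float:
    """Throughput normalized by power draw, in tokens/s per megawatt."""
    return tokens_per_second / (power_kw / 1000)


# Hypothetical baseline system (illustrative values only).
baseline_cost = cost_per_million_tokens(hourly_cost_usd=100.0, tokens_per_second=50_000)
baseline_eff = tokens_per_second_per_megawatt(tokens_per_second=50_000, power_kw=120.0)

# Hypothetical successor: same hourly cost and power, ten times the throughput.
new_cost = cost_per_million_tokens(hourly_cost_usd=100.0, tokens_per_second=500_000)
new_eff = tokens_per_second_per_megawatt(tokens_per_second=500_000, power_kw=120.0)

print(f"cost per 1M tokens: {baseline_cost:.4f} -> {new_cost:.4f} USD")
print(f"cost reduction: {baseline_cost / new_cost:.1f}x")
print(f"throughput per MW gain: {new_eff / baseline_eff:.1f}x")
```

In this simplified model, holding cost and power constant means a tenfold throughput gain shows up as a tenfold improvement in both metrics. Real deployments would also have to account for pricing and utilization.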

An Infrastructure Designed for Scale

Reaching this level of performance requires connecting thousands of processors with enough bandwidth to avoid processing delays. The A5X instances address this challenge by pairing NVIDIA ConnectX-9 SuperNICs with Google’s Virgo networking technology. This configuration scales to as many as 80,000 NVIDIA Rubin GPUs in a single cluster and up to 960,000 GPUs in a multi-site deployment. At that scale, workload management becomes critical: routing data across so many parallel processors demands precise synchronization to keep them from sitting idle.
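
A quick back-of-the-envelope sketch puts those figures in perspective. The two GPU counts come from the announcement. The value of 72 GPUs per rack is an assumption inferred from the NVL72 name, and the function names are illustrative.

```python
import math

GPUS_PER_CLUSTER = 80_000    # single-cluster maximum cited in the announcement
GPUS_MULTI_SITE = 960_000    # multi-site maximum cited in the announcement
GPUS_PER_RACK = 72           # assumption: one NVL72 rack holds 72 GPUs


def full_clusters(total_gpus: int, gpus_per_cluster: int) -> int:
    """Number of maximum-size clusters that make up a deployment."""
    return total_gpus // gpus_per_cluster


def racks_needed(gpus: int, gpus_per_rack: int = GPUS_PER_RACK) -> int:
    """Racks required to host the given GPU count, rounding up."""
    return math.ceil(gpus / gpus_per_rack)


print("clusters in a full multi-site deployment:",
      full_clusters(GPUS_MULTI_SITE, GPUS_PER_CLUSTER))
print("racks per 80,000-GPU cluster:", racks_needed(GPUS_PER_CLUSTER))
print("racks across 960,000 GPUs:", racks_needed(GPUS_MULTI_SITE))
```

Under these assumptions, the multi-site figure corresponds to twelve maximum-size clusters, each spanning more than a thousand racks.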

Mark Lohmeyer, Vice President and General Manager of AI and Compute Infrastructure at Google Cloud, shared his vision for the future of AI: “At Google Cloud, we believe that the next decade of AI will be shaped by customers' ability to run their most demanding workloads on a truly integrated and optimized AI infrastructure stack.” By combining Google Cloud's scalable infrastructure with NVIDIA's advanced platforms and systems, customers can train, fine-tune, and serve a variety of models while optimizing performance, cost, and sustainability.

Data Governance and Security in the Cloud

Beyond raw processing power, data governance remains a crucial issue for businesses, especially in highly regulated sectors like finance and healthcare. Organizations in these industries often slow or stall their machine learning initiatives because of data sovereignty requirements and the risk of exposing sensitive information. To address these concerns, Google Gemini models, running on NVIDIA Blackwell and Blackwell Ultra GPUs, are currently in preview on Google Distributed Cloud. This deployment mode allows organizations to keep cutting-edge models in controlled environments, close to their most sensitive data.

The architecture also integrates NVIDIA Confidential Computing, a hardware-level security technology that ensures models run in a protected environment. Data and prompts remain encrypted, preventing any unauthorized party, including cloud infrastructure operators, from accessing or modifying the underlying data. For multi-tenant public cloud environments, a preview of Confidential G4 VMs, equipped with NVIDIA RTX PRO 6000 Blackwell GPUs, brings the same cryptographic protections, giving regulated sectors access to high-performance hardware without compromising data privacy.

Simplifying Training for Agentic AI

Building complex agentic systems requires connecting large language models to sophisticated application programming interfaces while keeping vector databases continuously synchronized and mitigating hallucinations. To simplify these engineering requirements, NVIDIA Nemotron 3 Super is now available on the Gemini Enterprise Agent Platform. This platform provides developers with tools to customize and deploy reasoning and multimodal models specifically designed for agentic tasks.

Training these models at scale introduces significant operational overhead, particularly when managing cluster sizes and hardware failures during long reinforcement learning cycles. To address these challenges, Google Cloud and NVIDIA have introduced Managed Training Clusters on the Gemini Enterprise Agent Platform. This system includes a managed reinforcement learning API, built with NVIDIA NeMo RL, which automates cluster sizing, failure recovery, and task execution. This allows data science teams to focus on model quality rather than managing low-level infrastructure.
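
The failure-recovery idea can be illustrated with a generic checkpoint-and-resume loop. This is a toy simulation under stated assumptions, not the NeMo RL or Managed Training Clusters API: every class and function name here is hypothetical, and hardware faults are simulated by a set of step numbers.

```python
from dataclasses import dataclass, field


class SimulatedHardwareFailure(Exception):
    """Stands in for a node or GPU fault during a long training run."""


@dataclass
class TrainingJob:
    total_steps: int
    checkpoint_every: int
    fail_at_steps: set = field(default_factory=set)  # each listed step fails once
    last_checkpoint: int = 0
    restarts: int = 0

    def run_from(self, start_step: int) -> None:
        """Run steps after start_step, saving a checkpoint periodically."""
        for step in range(start_step + 1, self.total_steps + 1):
            if step in self.fail_at_steps:
                self.fail_at_steps.discard(step)  # the fault does not repeat
                raise SimulatedHardwareFailure(f"fault at step {step}")
            if step % self.checkpoint_every == 0:
                self.last_checkpoint = step


def run_with_recovery(job: TrainingJob, max_restarts: int = 5) -> int:
    """Resume from the last checkpoint after each failure; return restart count."""
    while True:
        try:
            job.run_from(job.last_checkpoint)
            return job.restarts
        except SimulatedHardwareFailure:
            if job.restarts >= max_restarts:
                raise
            job.restarts += 1


job = TrainingJob(total_steps=100, checkpoint_every=10, fail_at_steps={25, 77})
print("restarts:", run_with_recovery(job))
print("final checkpoint:", job.last_checkpoint)
```

A managed service would handle this loop on the team's behalf, along with cluster sizing. The sketch only shows why periodic checkpoints limit the work lost to each fault.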

Integrating Physical Simulations in Industry

Integrating machine learning into heavy industry and manufacturing presents unique challenges. Connecting digital models to physical factories requires precise simulations and massive computing power. NVIDIA's AI infrastructure and physical AI libraries are now available on Google Cloud, providing a foundation for organizations to simulate and automate real manufacturing workflows.

Major industrial software providers, such as Cadence and Siemens, have made their solutions available on Google Cloud, accelerated by NVIDIA infrastructure. These tools support the engineering and manufacturing of heavy machinery, aerospace platforms, and autonomous vehicles. Manufacturing companies, often equipped with legacy product lifecycle management systems, may struggle to translate geometry and physics data. By using NVIDIA Omniverse libraries and the open-source NVIDIA Isaac Sim framework via the Google Cloud Marketplace, developers can overcome these hurdles to create accurate digital twins and train robotic simulation pipelines before physical deployment.

Impact on the Accelerated Computing Ecosystem

To convert these hardware specifications into tangible financial benefits, it is crucial to examine how early adopters are leveraging this infrastructure. The wide range of options, from complete NVL72 racks to fractional G4 VMs, allows customers to precisely provision acceleration capabilities for reasoning and data processing tasks.

Thinking Machines Lab uses A4X Max VMs to accelerate the training of its Tinker API. OpenAI runs large-scale inference on NVIDIA GB300 and GB200 NVL72 systems on Google Cloud to handle demanding workloads, including ChatGPT operations. Snap has migrated its data pipelines to GPU-accelerated Spark on Google Cloud to reduce the costs associated with large-scale A/B testing. In the pharmaceutical sector, Schrödinger uses NVIDIA-accelerated computing on Google Cloud to compress drug discovery simulations from several weeks to just a few hours.

The developer ecosystem around these tools has expanded rapidly, with over 90,000 developers joining the joint NVIDIA and Google Cloud community in a year. Startups like CodeRabbit and Factory are applying NVIDIA Nemotron-based models on Google Cloud to conduct code reviews and run autonomous software development agents. Aible, Mantis AI, Photoroom, and Baseten are building enterprise data intelligence, video intelligence, and generative image solutions using the full-stack platform.

Together, NVIDIA and Google Cloud aim to provide a computing foundation designed to advance experimental agents and simulations toward production systems that secure fleets and optimize factories in the physical world.