NVIDIA Drives AI with 2 Petabytes of Open Data

Key Takeaways

1. NVIDIA has made over 2 petabytes of AI training data openly available to developers.

2. Datasets covering robotics, biology, and sovereign AI are available on Hugging Face.

3. The Nemotron Personas collection offers about 40 million synthetic personas across several countries, supporting sovereign AI.

💡 Why it matters: NVIDIA's initiative lowers the barriers to AI development, accelerating innovation and improving models across various sectors.

A Collaborative Approach to Scaling Reliable AI Systems and Agents

Progress in artificial intelligence is often measured by model capability and efficiency, but every training run depends on a dataset that shapes the model's behavior. As autonomous systems gain independence, training data becomes crucial in determining what they know, how they reason, and what they can safely accomplish. However, a significant portion of current data remains opaque, fragmented, or confined within specific teams.

Access to open data changes this dynamic. It gives developers a faster and more cost-effective way to create high-quality models, and it makes evaluation and improvement easier across the ecosystem. With this in mind, NVIDIA releases open datasets alongside its models, tools, and training techniques.

AI Data Bottlenecks

Building high-quality datasets remains one of the main obstacles in AI development. Organizations often invest millions of dollars and many months, sometimes over a year, to collect, annotate, and validate data before a single training run begins. Even after models are deployed, access to domain expertise and evaluation frameworks remains a constant challenge.

NVIDIA aims to reduce these frictions by releasing permissively licensed datasets on Hugging Face, accompanied by training recipes and evaluation frameworks available on GitHub. To date, NVIDIA has shared over 2 petabytes of AI-ready training data across more than 180 datasets, alongside more than 650 open models. And this is just the beginning.

Real-World Open Datasets

NVIDIA's open data releases cover several domains, ranging from robotics and autonomous systems to sovereign AI, biology, and evaluation benchmarks. Built by teams across NVIDIA, these datasets illustrate how data sharing can accelerate AI development in the real world.

Physical AI Collection

Robotic systems require structured multimodal data. This collection includes over 500,000 robotic trajectories, 57 million grasps, and 15 TB of multimodal data, including assets used to develop NVIDIA's vision-language-action reasoning model GR00T across various types of grippers and sensor configurations. The dataset has been downloaded over 10 million times, notably by companies like Runway, which developed its world model GWM-Robotics using the open GR00T dataset.

Nemotron Personas Collection

The Nemotron Personas are fully synthetic persona datasets grounded in real demographic distributions, producing culturally authentic and diverse individuals at scale. The collection supports the development of sovereign AI and currently includes population-scale datasets for:

United States – 6 million personas
Japan – 6 million personas
India – 21 million personas
Brazil – 6 million personas (developed with WideLabs)
Singapore – 888,000 personas (developed with AI Singapore)
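As a quick sanity check, the per-country counts listed above can be totalled to confirm the roughly 40 million figure quoted in the key takeaways. The dictionary below simply transcribes the list; the variable names are illustrative.

```python
# Persona counts transcribed from the list above (not fetched from any API).
persona_counts = {
    "United States": 6_000_000,
    "Japan": 6_000_000,
    "India": 21_000_000,
    "Brazil": 6_000_000,
    "Singapore": 888_000,
}

total = sum(persona_counts.values())
print(f"Total personas: {total:,}")              # Total personas: 39,888,000
print(f"Rounded: {round(total / 1e6)} million")  # Rounded: 40 million
```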

La-Proteina

A fully synthetic, atom-level protein dataset designed for biological modeling and drug discovery workflows. With 455,000 structures and a 73% gain in structural diversity over previous baselines, it provides design-ready molecular representations without PII or licensing concerns.

SPEED-Bench

A standardized benchmark for evaluating speculative decoding performance. It features two splits: a qualitative split that maximizes semantic diversity across 11 text categories, and a throughput split organized into buckets of input sequence length (1K–32K).
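To illustrate the length-bucketed throughput evaluation described above, here is a minimal sketch of grouping prompts by input length. The source gives only the 1K–32K range. The power-of-two bucket edges, the function names `assign_bucket` and `group_by_bucket`, and the rule that a prompt lands in the smallest bucket that fits it are all assumptions for illustration, not SPEED-Bench's actual layout.

```python
from bisect import bisect_left
from collections import defaultdict

# Assumed bucket upper bounds in tokens (1K to 32K, doubling each step).
BUCKET_EDGES = [1024, 2048, 4096, 8192, 16384, 32768]

def assign_bucket(num_tokens: int) -> int:
    """Return the smallest assumed bucket edge that fits num_tokens."""
    if num_tokens < 1 or num_tokens > BUCKET_EDGES[-1]:
        raise ValueError(f"{num_tokens} tokens is outside the 1..32K range")
    return BUCKET_EDGES[bisect_left(BUCKET_EDGES, num_tokens)]

def group_by_bucket(token_counts):
    """Map each bucket edge to the list of prompt lengths that fall in it."""
    groups = defaultdict(list)
    for n in token_counts:
        groups[assign_bucket(n)].append(n)
    return dict(groups)

groups = group_by_bucket([900, 1024, 1025, 5000, 32768])
print(groups)
```

Reporting throughput per bucket rather than as a single average keeps short and long prompts from masking each other, since speculative decoding speedups can vary with context length.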

Retrieval-Synthetic-NVDocs-v1

This synthetic retrieval dataset provides 110,000 triplets of queries, passages, and answers generated from 15,000 public documentation files from NVIDIA.
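A record in a dataset like this can be pictured as a simple query/passage/answer triplet. The sketch below is illustrative only: the field names `query`, `passage`, and `answer`, the `RetrievalTriplet` class, and the JSON Lines layout are assumptions, not the dataset's published schema, and the sample record is made up.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class RetrievalTriplet:
    query: str    # question a user might ask
    passage: str  # documentation excerpt that should be retrieved
    answer: str   # answer grounded in the passage

def parse_triplets(jsonl_text: str) -> list[RetrievalTriplet]:
    """Parse one JSON object per non-empty line into triplets."""
    return [
        RetrievalTriplet(**json.loads(line))
        for line in jsonl_text.splitlines()
        if line.strip()
    ]

# Made-up example record, not taken from the dataset.
sample = (
    '{"query": "What is a triplet?", '
    '"passage": "A triplet pairs a query with a passage and an answer.", '
    '"answer": "A query, a passage, and an answer."}\n'
)
triplets = parse_triplets(sample)
print(triplets[0].query)
```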

Nemotron Training Datasets

A major component of NVIDIA's open data work is the set of datasets used to train and align the Nemotron model family. Over the past year, these datasets have evolved to better support reasoning, coding, and multilingual capabilities in state-of-the-art language models.

Evolution of Nemotron Pre-training

Earlier versions relied heavily on general web corpora, while newer versions emphasize higher-signal domains such as mathematics, code, and STEM knowledge.

Evolution of Nemotron Post-training

As models become more capable, post-training data plays an increasingly important role in shaping model behavior. Newer versions focus on multilingual diversity, structured reasoning supervision, and agent-style interaction data.

NVIDIA is also expanding this work with open safety and reinforcement learning datasets, including Nemotron-Agentic-Safety and Nemotron-RL, a corpus of 900,000 tasks covering mathematics, coding, tools, puzzles, and reasoning.

Extreme Co-design

Designing high-quality datasets at this scale is a team effort. It requires close collaboration between data strategists, AI researchers, infrastructure engineers, and policy experts.