World Models: The Robotic Revolution Inspired by ChatGPT

Key Takeaways

1. World models simulate real environments to enhance humanoid robots by integrating the laws of physics.

2. Companies like World Labs and AMI Labs have raised over one billion dollars to develop this technology.

3. World models require vast amounts of multimodal data to accurately reproduce physical reality.

💡 Why it matters: This advancement could transform humanoid robotics, enabling robots to perform a wide variety of tasks, much as humans do.

World Models: A Major Advancement for Robotics

World models are transforming the landscape of robotics by enabling robots to simulate real environments with unprecedented accuracy. This technology allows humanoid robots to enhance their capabilities by better understanding the laws of physics. Discussions around this innovation are ubiquitous in the tech ecosystem, and massive investments testify to its potential.

Since the beginning of the year, several companies have raised considerable sums. World Labs, founded by AI pioneer Fei-Fei Li, has raised one billion dollars. Similarly, AMI Labs, co-founded by Yann LeCun, has joined the ranks of French unicorns with a funding round of $1.03 billion. The American startup Runway has also attracted $315 million in investments.

Tech giants such as Google, Meta, and NVIDIA have also recognized the potential of these models. Jeff Bezos, the founder of Amazon, has become involved as well by co-founding Project Prometheus, which focuses on physical AI.

Immense Potential for Humanoid Robotics

The applications of world models are vast, ranging from autonomous vehicles to scientific research, and even video games. However, it is in the field of humanoid robotics that this technology could have the most significant impact. Robotic world models (RWM) promise to greatly enhance the capabilities of humanoid robots by allowing them to integrate the dynamics of the physical world.

When announcing the funding round for World Labs in February, Fei-Fei Li stated: "If AI wants to be truly useful, it must understand worlds, not just words." This statement highlights the fundamental difference between LLMs like ChatGPT and world models. While LLMs predict the next word in a text, world models predict the next "state" of the world that results from an agent's action.

To achieve this goal, world models rely on vast amounts of multimodal data, including videos, images, audio data, and information from robotic sensors. This enables them to understand and integrate the characteristics of the physical world, such as gravity, friction, and interactions with objects.
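The contrast between the two kinds of prediction can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any company mentioned here builds its systems: `predict_next_word` uses simple word-pair counts as a stand-in for an LLM, and `predict_next_state` uses hand-written gravity as a stand-in for dynamics that a real world model would learn from data. All function names, the `State` fields, and the meaning of `action` (an upward acceleration) are hypothetical choices made for the example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


def train_bigram(text):
    """Count which word follows which: a toy stand-in for an LLM."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next_word(counts, word):
    """Language-model view: a word goes in, the most likely next word comes out."""
    followers = counts.get(word)
    if not followers:
        return None  # the word was never seen with a successor
    return followers.most_common(1)[0][0]


@dataclass(frozen=True)
class State:
    height: float    # metres above the floor
    velocity: float  # metres per second, positive means upward


def predict_next_state(state, action, dt=0.1, g=9.81):
    """World-model view: (state, action) goes in, the next state comes out.

    `action` is an upward acceleration in m/s^2 chosen by the agent
    (an assumption for this sketch). A learned model would replace
    the hand-written physics below.
    """
    velocity = state.velocity + (action - g) * dt
    height = state.height + velocity * dt
    if height <= 0.0:
        height, velocity = 0.0, 0.0  # the floor stops the fall
    return State(height, velocity)


counts = train_bigram("the robot opens the door and the robot waves")
print(predict_next_word(counts, "the"))                 # robot
print(predict_next_state(State(1.0, 0.0), action=0.0))  # the object starts to fall
```

The point of the sketch is the difference in signatures: the first function never sees an action, while the second cannot make a prediction without one.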

Limitations of Traditional Methods

So far, training AI for humanoid robots has relied on LLMs, video models, vision-language-action (VLA) models, 3D simulations, or real-world demonstrations guided by a teleoperator. While these methods are effective in predictable environments, they show their limitations in more complex situations.

For example, they can teach robots to move but struggle to teach them object manipulation. World models, by aggregating different types of data and integrating the laws of physics, provide a solution to these gaps. They allow robots to learn through experience, anticipating and evaluating the consequences of their actions.

Robots can thus perform thousands of iterations within the simulation, receive feedback, and adjust their behavior accordingly, without ever breaking a real object or harming anyone. This approach is akin to the learning mode of animals and humans.
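The loop of trying, getting feedback, and adjusting can be illustrated with a deliberately simple sketch. Here a robot must learn how hard to push a block so that it slides to a target distance. The simulator `slide_distance` is hand-written physics with an assumed friction coefficient, standing in for a learned world model, and the learning rule is plain trial-and-error hill climbing, chosen for brevity rather than because any robotics company is known to use it. All names and parameter values are hypothetical.

```python
import random


def slide_distance(push_speed, friction=0.3, g=9.81, dt=0.01):
    """Toy simulator: how far a block slides after leaving the hand at push_speed."""
    position, velocity = 0.0, push_speed
    while velocity > 0.0:
        position += velocity * dt
        velocity -= friction * g * dt  # friction slows the block at every step
    return position


def learn_push_speed(target, trials=2000, seed=0):
    """Try many pushes inside the simulation and keep whichever lands closest."""
    rng = random.Random(seed)
    best_speed = 1.0
    best_error = abs(slide_distance(best_speed) - target)
    for _ in range(trials):
        # Propose a small random change to the best push found so far.
        candidate = max(0.0, best_speed + rng.gauss(0.0, 0.2))
        error = abs(slide_distance(candidate) - target)
        if error < best_error:  # feedback: keep the change only if it helped
            best_speed, best_error = candidate, error
    return best_speed, best_error


speed, error = learn_push_speed(target=2.0)
print(f"learned push speed: {speed:.2f} m/s, miss distance: {error:.3f} m")
```

Every one of the 2,000 pushes happens inside the simulation, so no real block is ever moved until a good push speed has been found.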

Towards a "ChatGPT Moment" for Robotics

World models could enable the emergence of humanoid robots with capabilities closer to our own. Andy Chen, head of special projects at Runway, told Journal du Net: "In the coming months, we will experience a 'ChatGPT moment' in robotics." As world models and world simulators scale up, companies like Runway will develop increasingly large and powerful models.

This will pave the way for greater generalization, allowing robots to begin acting like humans, capable of performing a wide variety of tasks rather than being limited to specific functions.

Challenges of World Models

Before establishing themselves as a truly effective solution, world models face certain obstacles. To accurately represent reality in all its nuances, they require even larger amounts of data than LLMs. Even tasks that are simple for a human, like opening a door or picking up a glass, involve a multitude of micro-variations that can be difficult to capture.

Moreover, unlike text or images, there is currently little "action-consequence" data available. Videos alone, for instance, are insufficient, as they show what happens, not why. Finally, physical interactions are costly to record. This explains why many players in robotics and embodied AI (including 1X, Agility, Figure, and NEURA Robotics) are using Cosmos, the world-model platform launched by NVIDIA, which was trained on over 20 million hours of real-world data.
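The gap between video and "action-consequence" data is easier to see as a data structure. In the sketch below, a video is just a sequence of observations, while a training record for a world model also carries the action that led from one observation to the next. The `Transition` class, its field names, and the use of short text labels in place of real frames and motor commands are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Transition:
    observation: str       # what the robot saw before acting
    action: str            # what it did (absent from plain video)
    next_observation: str  # what it saw afterwards


def to_transitions(frames: Sequence[str], actions: Sequence[str]) -> List[Transition]:
    """Pair a frame sequence with an action log to get action-consequence records."""
    if len(actions) != len(frames) - 1:
        raise ValueError("need exactly one action between each pair of frames")
    return [
        Transition(frames[i], actions[i], frames[i + 1])
        for i in range(len(actions))
    ]


# A video alone provides only this list of frames.
frames = ["door closed", "handle turned", "door open"]
# A recorded robot session also provides the actions taken in between.
actions = ["turn handle", "pull door"]

for record in to_transitions(frames, actions):
    print(record)
```

Without the `actions` list, the records cannot be built, which is one way to picture why internet video on its own falls short and why recorded physical interactions are so valuable.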

The Importance of Data Quality

As with LLMs, another major challenge concerns the relevance of the data used to train world models. Andy Chen from Runway explains: "At Runway, we prioritize data quality over quantity." This includes, for example, collaborations with players in the film and creative industries, such as Lucasfilm. The goal is to obtain genuinely high-quality data, not just to scale up with random videos from the internet.

Are World Models the Key to AGI?

The arrival of world models promises to propel robotics into a new era, and some believe they could even be the missing link on the path to artificial general intelligence (AGI). Sam Altman and the creators of ChatGPT remain convinced that LLMs can give rise to such an entity, but many experts argue that text alone will not suffice.

Yann LeCun prefers to use the term "Advanced Machine Intelligence (AMI)," stating last February: "A chatbot can ace a law exam, but it cannot understand physical space like a cat does, the one with whiskers." The goal is no longer to generate the most probable continuation, as in language, but to construct an abstract representation of the world that can ignore unpredictable elements and retain useful structure.

By allowing AI agents to perceive the physical world in all its subtleties and interact with it, will world models pave the way toward a form of artificial consciousness? It remains difficult to say, but what is certain is that humanoid robots are about to become a little more "human."