
Imagine a child who, after seeing a ball roll behind a sofa, instinctively knows that it continues to exist and can anticipate the precise spot where it will reappear. This fundamental ability, which psychology calls object permanence, forms a cornerstone of human intelligence. We don't simply react to the images that strike our retinas; we are constantly simulating the future in our minds.
Today, artificial intelligence is attempting to cross this crucial threshold. After the era of models capable of generating text, like ChatGPT, or images, like Midjourney, a new frontier is emerging with world models. The stakes are high: it's about equipping machines with a form of physical, spatial, and logical common sense so that they stop imitating… and finally begin to understand.
These models are already showing promising results in the laboratory and in simulated environments. However, they remain immature, and their real-world deployment is still limited.
Why do current AI technologies remain limited?
The most famous AI systems today are generative models, such as Claude or ChatGPT. These excel at predicting the next word in a sentence or the next pixel in an image, relying on statistical correlations learned from monumental amounts of data.
This simple idea produced the first measurable signs of reasoning and functional common sense in the history of artificial intelligence (AI). However, as researchers in the field regularly point out, such as Yann LeCun, scientific director of AMI Labs, or Fei-Fei Li, scientific director of Worldlabs, these models do not have a consistent internal representation of physical reality.
This is what explains, in particular, their famous hallucinations: a language model can assert with absolute certainty that a cow's egg is a classic cooking ingredient, simply because it manipulates concepts without fully understanding the biological constraints of the real world. To move beyond this stage of "stochastic parroting" (stochastic describes a process that incorporates randomness in a structured way, as in a probability calculation where the unexpected becomes a key factor), AI must integrate an architecture capable of modeling causes and effects.
This ambition is not new, but it now benefits from an unprecedented technological alignment. As early as 1943, the psychologist Kenneth Craik suggested that the human brain functions by constructing small-scale models of reality to anticipate events. Thus, when we cross the street, our brain imagines in advance the trajectory of cars to know when it is safe to cross.
What has changed since then is that we now have the computing power and mathematical frameworks necessary to test this hypothesis on the scale of complex machines. Interest in these models exploded following the pioneering work of David Ha and Jürgen Schmidhuber in 2018. They showed that an AI could learn to drive in a virtual environment by training almost exclusively in its own “dreams.” These “dreams” are internal simulations, created by the AI itself, that allow it to test different strategies without interacting with the real world.
The architecture of world models
These authors introduced the concept of a "world model": an internal, structured representation of an environment that allows an agent to anticipate the consequences of its actions. The world model synthesizes observable information to construct an abstract and manipulable version of the real world, facilitating planning, simulation, and decision-making, even in complex or uncertain situations. Technically, a world model relies on a mechanism for information compression and prediction.
Rather than simply learning to identify objects like "cat" or "ball," a world model learns to represent the world in a richer and more structured way.
Initially, the system observes enormous amounts of data and extracts a compact representation of essential dynamics, such as the trajectory of an object, the rigidity of a surface, or the spatial interactions between several elements (the cat's paw playing with the ball). This abstraction is not limited to labels: it captures physical and logical regularities of the world.
In a second step, the model can simulate future scenarios using this representation (the ball goes under an armchair and the cat tries to retrieve it). Thus, if an agent equipped with such a world model considers an action, it can predict the consequences even before executing it, in a potentially uncertain or noisy environment.
In other words, unlike the simple statistical classification “this is a cat”, the world model learns a kind of internal mini-simulation of the world, which combines perception, spatial and logical understanding, and the ability to anticipate.
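To make this compression-and-prediction loop concrete, here is a deliberately minimal sketch in Python (using PyTorch). It is not the architecture of any published system: the class name `TinyWorldModel`, the layer sizes, and the latent prediction loss are all illustrative assumptions.

```python
# Minimal sketch of the two ingredients described above:
# (1) an encoder that compresses a raw observation into a compact latent state,
# (2) a dynamics model that predicts the next latent state given an action.
# Names and sizes are illustrative, not taken from any published system.
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    def __init__(self, obs_dim=64, action_dim=4, latent_dim=16):
        super().__init__()
        # Step 1: compression -- map a high-dimensional observation to a small latent code
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, latent_dim)
        )
        # Step 2: prediction -- given the current latent state and an action,
        # predict the latent state one step into the future
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 32), nn.ReLU(), nn.Linear(32, latent_dim)
        )

    def forward(self, obs, action):
        z = self.encoder(obs)  # compact representation of the observation
        z_next = self.dynamics(torch.cat([z, action], dim=-1))
        return z, z_next

# Training signal: the predicted latent should match the encoding of what the
# environment actually produced next (a simple latent prediction loss).
model = TinyWorldModel()
obs, action, next_obs = torch.randn(8, 64), torch.randn(8, 4), torch.randn(8, 64)
_, z_pred = model(obs, action)
loss = nn.functional.mse_loss(z_pred, model.encoder(next_obs).detach())
loss.backward()  # an optimizer step would follow here
```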
Here, the approach remains statistical, similar to reinforcement learning, but without direct recourse to explicit physical models; it relies solely on regularities observed in the data (balls that roll under objects either come out or get stuck). This distinction between statistical and physical approaches becomes important when dealing with complex and uncertain environments, where predictions must incorporate the natural variability of the real world.
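The planning side can be sketched in the same spirit: the agent imagines many candidate action sequences with its learned dynamics, scores the simulated outcomes, and executes only the most promising first action before replanning. In the toy sketch below, `imagined_step` and `score` are invented stand-ins for a trained dynamics model and a real objective.

```python
# Planning "in imagination": roll out candidate action sequences inside the
# learned model rather than in the real world, and keep the best one.
import numpy as np

rng = np.random.default_rng(0)

def imagined_step(state, action):
    # Placeholder for the learned latent dynamics (here: a fixed linear rule)
    return 0.9 * state + 0.1 * action

def score(state, goal):
    # Placeholder objective: negative distance to a goal state
    return -np.linalg.norm(state - goal)

def plan(state, goal, n_candidates=100, horizon=10):
    best_actions, best_score = None, -np.inf
    for _ in range(n_candidates):
        actions = rng.uniform(-1, 1, size=(horizon, state.shape[0]))
        s = state.copy()
        for a in actions:  # roll the candidate out entirely in the model
            s = imagined_step(s, a)
        if score(s, goal) > best_score:
            best_score, best_actions = score(s, goal), actions
    return best_actions[0]  # execute only the first action, then replan

first_action = plan(np.zeros(4), goal=np.ones(4))
print(first_action)
```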
Several recent proposals illustrate the potential of the statistical approach to world models. Meta's V-JEPA model, for example, learns to understand complex physical interactions simply by watching videos, without any human labeling. Meanwhile, Google DeepMind recently unveiled Genie, an architecture capable of creating interactive virtual worlds from a single photograph, suggesting that the machine has previously assimilated the laws of physics and perspective.
Applications that affect society
The repercussions of this technology are massive and extend far beyond the realm of theoretical computer science.
In robotics, for example, an agent equipped with a world model could learn to manipulate fragile objects or move around a crowded warehouse without going through thousands of hours of costly and risky physical testing.
In the field of autonomous vehicles, pioneers like Wayve claim to use world models so that cars can anticipate the unpredictable behavior of pedestrians or other drivers, where conventional systems would simply react after a delay.
In the field of healthcare, digital twins are still in the exploratory phase; they are used to simulate how a disease might evolve in response to an experimental treatment. However, these models do not provide definitive predictions: they are "probabilistic," meaning they rely on probability calculations. In other words, they estimate several possible outcomes for a patient (improvement, stability, worsening) and assign each a probability of occurring, based on available data and statistical models. Consequently, these simulations remain estimates, not certainties, and must be validated very rigorously, especially when they concern treatments that have never been tested in real-world clinical settings.
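As a hedged illustration of what "assigning each outcome a probability" can look like, the toy simulation below runs many virtual patients through an invented weekly transition model and reports outcome frequencies. The transition probabilities are made up for illustration only and have no clinical basis.

```python
# Toy "digital twin": simulate many possible disease trajectories and report
# how often each outcome occurs. All probabilities are invented.
import random
from collections import Counter

STATES = ["improvement", "stability", "worsening"]
# Hypothetical weekly transition probabilities under treatment (NOT clinical data)
TRANSITIONS = {
    "improvement": [0.60, 0.30, 0.10],
    "stability":   [0.30, 0.50, 0.20],
    "worsening":   [0.10, 0.30, 0.60],
}

def simulate_patient(weeks=12):
    state = "stability"
    for _ in range(weeks):
        state = random.choices(STATES, weights=TRANSITIONS[state])[0]
    return state

# Run many virtual patients and report outcome frequencies as probabilities
runs = 10_000
counts = Counter(simulate_patient() for _ in range(runs))
for s in STATES:
    print(f"{s}: ~{counts[s] / runs:.0%}")
```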
Advances in AI are leading us to rethink what it truly means to “understand” and “anticipate” in a complex world. Ultimately, exploring these questions could transform not only the technology itself, but also our understanding of human cognition and creativity.
It is important to temper the enthusiasm surrounding these models. Despite the progress, they remain, for the time being, at the research and development stage. For example, in robotics and autonomous vehicles, the majority of applications are still at the prototype or pilot stage, often in highly structured environments.
Large-scale adoption will require overcoming major technical and regulatory challenges, such as robustness in the face of unforeseen situations and safety in complex real-world environments. These models are therefore in an advanced experimental phase and are not yet operational everywhere and at all times, even though their prospects remain very promising.
Author Bio: Julien Perez is a Lecturer in AI and Machine Learning at EPITA