The original version of this story appeared in Quanta Magazine.
Here’s a test you can try with infants: Place a glass of water on a desk and then hide it behind a wooden board. If you move the board toward the glass and it keeps going, as if it were passing through the space the glass should occupy, do the infants seem surprised? Many 6-month-olds are indeed surprised, and by the time they reach a year, almost all kids grasp the idea of object permanence through what they observe. Now, some AI models are catching on, too.
Researchers have developed an AI system that learns about the world by watching videos and shows signs of “surprise” when it encounters information that contradicts its acquired knowledge. This model, created by Meta and known as Video Joint Embedding Predictive Architecture (V-JEPA), doesn’t start with any assumptions about the physics of the scenes in the videos. Yet it begins to make sense of how things operate in the world.
“Their claims are, a priori, very plausible, and the results are super interesting,” says Micha Heilbron, a cognitive scientist at the University of Amsterdam who examines how both brains and artificial systems interpret the world.
Higher Abstractions
As engineers developing self-driving cars know, getting an AI system to reliably interpret its surroundings can be quite a challenge. Most systems designed to “understand” videos do so either to classify their content (like “a person playing tennis”) or to identify outlines of objects, such as a car on the road. These models operate in what’s called “pixel space,” treating each pixel in the video as equally important.
However, these pixel-space models have their downsides. For instance, consider a suburban street scene filled with cars, traffic lights, and trees. The model might get bogged down in unnecessary details, like the movement of the leaves, while missing crucial information such as the traffic light’s color or the locations of nearby cars. “When you deal with images or video, you don’t want to work in [pixel] space because there are too many details you’d rather not model,” explains Randall Balestriero, a computer scientist at Brown University.
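To make that distinction concrete, here is a toy sketch in Python with NumPy. It is purely illustrative and is not Meta’s code: the “encoder” is just average pooling, a stand-in for the learned network V-JEPA uses, and it shows how comparing scenes in an abstract embedding space can ignore pixel-level jitter (like rustling leaves) that dominates a pixel-space comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video frame": an 8x8 grid of pixel values (hypothetical data,
# not anything from Meta's model).
frame = rng.normal(size=(8, 8))
# Next frame: the same scene, but fine detail has jittered slightly.
next_frame = frame + 0.1 * rng.normal(size=(8, 8))

def encode(x):
    # Stand-in "encoder": average-pool 4x4 patches down to a 2x2
    # summary. V-JEPA learns its encoder; pooling merely illustrates
    # that an abstract representation discards pixel-level detail.
    return x.reshape(2, 4, 2, 4).mean(axis=(1, 3))

# A pixel-space comparison penalizes every jittered pixel equally...
pixel_error = np.mean((frame - next_frame) ** 2)
# ...while the embedding-space comparison averages the jitter away.
latent_error = np.mean((encode(frame) - encode(next_frame)) ** 2)

print(f"pixel-space error:  {pixel_error:.4f}")
print(f"latent-space error: {latent_error:.4f}")
```

In this sketch the latent-space error comes out far smaller than the pixel-space error, because the pooled summary is barely perturbed by the per-pixel noise — a rough analogue of why models that predict in a learned representation space need not account for every leaf.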
In 2022, Yann LeCun, a computer scientist at New York University and the head of AI research at Meta, created JEPA, the predecessor of V-JEPA, which focuses on still images.