
Meta AI Learns Physics From Videos and Shows Surprise

Meta's new AI model, V-JEPA, learns intuitive physics by watching videos and shows surprise at impossible events, much like a human infant.

By Evelyn Reed

Evelyn Reed is a science correspondent for Neurozzio, specializing in the intersection of artificial intelligence, computational science, and fundamental physics. She reports on breakthroughs in advanced modeling and simulation.


Researchers at Meta have developed an artificial intelligence system that learns the basic principles of physics by watching videos. The model, called V-JEPA, demonstrates a form of surprise when it observes events that defy physical laws, a learning process similar to that seen in human infants.

The system achieves this by analyzing video content to build an internal model of how objects interact in the real world. This approach allows the AI to anticipate outcomes without being explicitly programmed with the laws of physics, representing a significant step toward creating more intuitive and adaptable AI.

Key Takeaways

  • Meta's V-JEPA model learns intuitive physics concepts, such as object permanence, by observing video data.
  • Unlike traditional models that predict individual pixels, V-JEPA predicts abstract representations, making it more efficient.
  • The AI shows a measurable "surprise"—a spike in prediction error—when presented with physically impossible scenarios.
  • On a standardized physics test for AI, V-JEPA achieved nearly 98% accuracy, far surpassing older models.
  • Despite its success, the current model has a very short memory, limiting its ability to understand longer, more complex events.

A New Approach to AI Vision

Most artificial intelligence systems designed to understand video content operate in what is known as "pixel space." This method treats every single pixel in a frame as equally important, which can lead to significant inefficiencies. Such models often get bogged down by irrelevant details, like the movement of leaves on a tree, while missing critical information, such as the color of a traffic light.

Yann LeCun, Meta's chief AI scientist, developed an architecture to overcome this limitation. The system, called Video Joint Embedding Predictive Architecture (V-JEPA), avoids pixel-level prediction. Instead, it focuses on learning higher-level abstractions of the visual information it processes.

Understanding Latent Representations

V-JEPA works by creating "latent representations," which are compact numerical summaries of essential information. For example, instead of processing thousands of pixels that make up an image of a cylinder, an AI can convert it into a few key numbers representing its height, width, and position. This process filters out noise and allows the model to focus only on what matters.

"This enables the model to discard unnecessary information and focus on more important aspects of the video," explained Quentin Garrido, a research scientist at Meta. By ignoring trivial details, V-JEPA can build a more robust and generalized understanding of the world from visual data.

How V-JEPA Learns from Observation

The training process for V-JEPA is conceptually simple. The system is shown video frames where portions are masked or hidden. However, instead of trying to guess the exact color of each hidden pixel, the model's goal is to predict the abstract, latent representation of the missing information.

This method forces the AI to develop a deeper conceptual understanding. To succeed, it must learn underlying patterns, such as the fact that solid objects do not pass through each other and that unsupported items tend to fall. It learns these rules purely through observation, without any pre-programmed knowledge of physics.
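A minimal sketch of this masked-prediction setup, using toy linear encoders and random frames purely for illustration: the model sees part of a clip and is trained to predict the latent representations, not the pixels, of the hidden part.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
frames = torch.rand(8, 1, 64, 64)        # a toy 8-frame clip (time, channels, H, W)

encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 32))         # context encoder
target_encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 32))  # target encoder
predictor = nn.Linear(32, 32)            # predicts hidden latents from visible ones

visible, hidden = frames[:4], frames[4:]  # "mask" the second half of the clip

with torch.no_grad():                     # targets supply the signal, no gradient
    target_latents = target_encoder(hidden)

predicted_latents = predictor(encoder(visible))   # toy one-to-one frame pairing

# The loss compares abstract representations, never raw pixels.
loss = nn.functional.mse_loss(predicted_latents, target_latents)
loss.backward()                           # updates the encoder and predictor
print(float(loss))
```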

"We know from developmental literature that babies don’t need a lot of exposure to learn these types of intuitive physics. It’s compelling that they show that it’s learnable in the first place, and you don’t have to come with all these innate priors."

— Micha Heilbron, Cognitive Scientist, University of Amsterdam

This self-supervised learning process is highly efficient. Once the initial training is complete, the model can be adapted for specific tasks, like identifying actions in videos, using much less human-labeled data than traditional systems require.
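A hedged sketch of that adaptation step, assuming a pretrained backbone (represented here by a stand-in toy encoder) is frozen while only a small classification head is trained on a handful of labeled clips:

```python
import torch
import torch.nn as nn

# Stand-in for the pretrained backbone; in practice this would be the
# frozen video encoder learned during self-supervised pretraining.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 32))
for p in encoder.parameters():
    p.requires_grad = False              # freeze the pretrained weights

head = nn.Linear(32, 5)                  # small classifier over 5 hypothetical actions
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

frames = torch.rand(16, 1, 64, 64)       # a tiny labeled batch (synthetic)
labels = torch.randint(0, 5, (16,))

logits = head(encoder(frames))
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()                         # only the head's weights are updated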

Measuring AI Surprise and Performance

To test its understanding of the physical world, researchers evaluated V-JEPA using the IntPhys benchmark, a test designed to see if AI can distinguish between plausible and implausible physical events. The results were impressive.

Performance on Physics Benchmark

On the IntPhys test, V-JEPA achieved nearly 98% accuracy in identifying physically impossible actions. In contrast, well-known models that predict in pixel space performed only slightly better than random chance.

The team also quantified the model's "surprise." They measured the difference between what V-JEPA predicted would happen next in a video and what actually occurred. When a video depicted an impossible event—such as a ball rolling behind a screen and failing to reappear—the model's prediction error surged. This spike is analogous to the surprise reaction observed in infants when their expectations about object permanence are violated.
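A minimal sketch of how such a surprise signal could be computed, using synthetic latent vectors in place of real model outputs: the per-step prediction error is compared against its typical range, and an outlier marks the moment expectations were violated.

```python
import torch

torch.manual_seed(0)
predicted = torch.rand(10, 32)           # predicted latents for 10 time steps
observed = predicted.clone()
observed[7] += 2.0                       # step 7 violates the model's expectation

errors = ((predicted - observed) ** 2).mean(dim=1)   # per-step prediction error
threshold = errors.mean() + 2 * errors.std()         # "surprise" = unusually large error
surprising_steps = (errors > threshold).nonzero().flatten()
print(surprising_steps.tolist())         # -> [7], the physically "impossible" moment
```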

This ability to form expectations and react to their violation is crucial for developing autonomous systems, like robots, that need to navigate and interact with the physical environment safely and effectively.

Future Development and Current Limitations

Building on its initial success, the Meta team released V-JEPA 2 in June. This next-generation model has 1.2 billion parameters and was pretrained on an extensive dataset of 22 million videos. The team has already begun applying it to robotics, using the model to help plan a robot's actions after fine-tuning it with just 60 hours of robot-specific data.

However, the system still faces significant challenges. To push the technology further, the researchers designed a more difficult benchmark called IntPhys 2. On these tougher tests, V-JEPA 2 and other leading models performed only marginally better than chance.

According to Karl Friston, a computational neuroscientist at University College London, while the model is on the right track, it currently lacks a mechanism to properly encode uncertainty. It cannot quantify its confidence in a prediction when past information is insufficient.

A more fundamental limitation is its memory. Garrido noted that the model can only process a few seconds of video at a time before its memory is effectively reset. "In a sense, the model’s memory is reminiscent of a goldfish," he said. This short-term memory prevents it from understanding events that unfold over longer periods, a key hurdle for future research to overcome.