Meta AI has released V-JEPA 2, a scalable open-source world model designed to learn from video at internet scale and enable robust visual understanding, future-state prediction, and zero-shot planning. Built on the Joint Embedding Predictive Architecture (JEPA), V-JEPA 2 demonstrates how self-supervised learning from passive internet video, combined with minimal robot interaction data, can provide a modular foundation for intelligent physical agents.
Scalable Self-Supervised Pretraining from 1M Hours of Video
V-JEPA 2 is pretrained on more than 1 million hours of internet-scale video and 1 million images. Using a visual mask denoising objective, the model learns to reconstruct masked spatiotemporal patches in a latent representation space. This approach avoids the inefficiency of pixel-level prediction by focusing on predictable scene dynamics while ignoring irrelevant noise.
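The sketch below illustrates the general shape of this objective: a context encoder sees only the unmasked space-time patches, a predictor fills in the representations of the masked ones, and the loss is computed in latent space rather than on pixels. It is a minimal toy version with assumed module names and sizes, not the released V-JEPA 2 code.

```python
# Minimal sketch of a JEPA-style masked latent prediction objective
# (illustrative assumptions only; not the released V-JEPA 2 implementation).
import torch
import torch.nn as nn

class ToyJEPA(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        # Context encoder sees only the unmasked space-time patches.
        self.encoder = nn.TransformerEncoder(layer(), num_layers=2)
        # Predictor fills in the representations of the masked patches.
        self.predictor = nn.TransformerEncoder(layer(), num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, patches, mask):
        # patches: (B, N, dim) patchified video; mask: (B, N) bool, True = hidden.
        ctx = self.encoder(patches.masked_fill(mask.unsqueeze(-1), 0.0))
        # Replace masked positions with a learnable token before prediction.
        queries = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(ctx), ctx)
        return self.predictor(queries)

model = ToyJEPA()
video = torch.randn(2, 64, 256)
mask = torch.rand(2, 64) < 0.5
with torch.no_grad():
    target = model.encoder(video)               # stand-in for an EMA teacher encoder
pred = model(video, mask)
loss = (pred - target)[mask].abs().mean()       # loss on masked latents, never on pixels
```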
To scale JEPA pretraining to this level, Meta researchers introduced four key techniques:
- Data scaling: Construction of a 22M-sample dataset (VideoMix22M) from public sources such as SSv2, Kinetics, HowTo100M, YT-Temporal-1B, and ImageNet.
- Model scaling: Expansion of the encoder capacity to more than 1B parameters using ViT-g.
- Training schedule: Adoption of a progressive resolution strategy and extension of pretraining to 252K iterations.
- Spatial and temporal scaling: Gradual training on longer, higher-resolution clips, reaching 64 frames at 384×384 resolution.
These design choices led to an average accuracy of 88.2% across six benchmark tasks, including SSv2, Diving-48, Jester, Kinetics, COIN, and ImageNet.
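To make the progressive schedule concrete, here is a small sketch of how clip length and resolution might grow over the 252K pretraining iterations. The stage boundaries and intermediate sizes are purely illustrative assumptions, not Meta's actual recipe; only the final 64-frame, 384×384 stage and the 252K total come from the description above.

```python
# Hypothetical progressive training schedule (stage boundaries are assumptions).
STAGES = [
    # (iterations, frames per clip, spatial resolution)
    (90_000, 16, 256),
    (90_000, 32, 320),
    (72_000, 64, 384),   # final stage: 64 frames at 384 x 384
]

def stage_for(iteration: int):
    """Return (frames, resolution) used at a given pretraining iteration."""
    start = 0
    for steps, frames, res in STAGES:
        if iteration < start + steps:
            return frames, res
        start += steps
    return STAGES[-1][1], STAGES[-1][2]

assert stage_for(0) == (16, 256)
assert stage_for(251_999) == (64, 384)   # 252K total iterations
```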
Understanding via Masked Representation Learning
V-JEPA 2 shows strong motion understanding. On the Something-Something v2 benchmark, it achieves 77.3% top-1 accuracy, outperforming models such as InternVideo and VideoMAEv2. For appearance understanding, it remains competitive with state-of-the-art image-text pretrained models such as DINOv2 and PEcoreG.
The encoder's representations are evaluated using attentive probes, verifying that self-supervised learning alone yields transferable, domain-agnostic visual features applicable to a variety of classification tasks.
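For readers unfamiliar with the protocol, an attentive probe keeps the pretrained encoder frozen and trains only a small attention-pooling head plus a linear classifier per task. The sketch below shows one common way to build such a probe; dimensions and the number of classes are assumptions for illustration.

```python
# Minimal attentive-probe sketch: frozen encoder features in, class logits out.
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    def __init__(self, dim=1024, num_classes=174):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))      # learned pooling query
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):
        # tokens: (B, N, dim) frozen features from the video encoder.
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)                # attention pooling
        return self.head(pooled.squeeze(1))

probe = AttentiveProbe()
features = torch.randn(4, 2048, 1024)   # placeholder for frozen V-JEPA 2 tokens
logits = probe(features)                # (4, 174), e.g. SSv2-style class logits
```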
Temporal Reasoning via Video Question Answering
To assess temporal reasoning, the V-JEPA 2 encoder is aligned with a multimodal large language model and evaluated on multiple video question-answering tasks. Despite the absence of language supervision during pretraining, the model achieves:
- 84.0% on PerceptionTest
- 76.9% on TempCompass
- 44.5% on MVP
- 36.7% on TemporalBench
- 40.3% on TOMATO
These results challenge the assumption that visual-language alignment requires joint training from the start, indicating that a pretrained video encoder can be aligned post hoc with strong generalization.
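One common way such post-hoc alignment is done is with a lightweight projector that maps frozen video-encoder tokens into the language model's embedding space, so the projected tokens can be fed to the LLM alongside text. The following sketch shows that idea under assumed dimensions; it is not the exact recipe used for V-JEPA 2.

```python
# Hedged sketch of post-hoc visual-language alignment via a projector module.
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, video_tokens, text_embeds):
        # video_tokens: (B, Nv, vision_dim) from the frozen video encoder
        # text_embeds:  (B, Nt, llm_dim) from the LLM's token embedding table
        visual = self.proj(video_tokens)
        return torch.cat([visual, text_embeds], dim=1)   # multimodal LLM input

proj = VisualProjector()
fused = proj(torch.randn(1, 256, 1024), torch.randn(1, 32, 4096))
print(fused.shape)   # torch.Size([1, 288, 4096])
```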
V-JEPA 2-AC: Learning Latent World Models for Robotic Planning
A major innovation in this release is V-JEPA 2-AC, an action-conditioned variant of the pretrained encoder. Fine-tuned using only 62 hours of unlabeled robot video from the DROID dataset, V-JEPA 2-AC learns to predict future video embeddings conditioned on robot actions and poses. The architecture is a 300M-parameter transformer with block-causal attention, trained with teacher-forcing and rollout objectives.
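A rough sketch of what an action-conditioned predictor of this kind looks like is given below: the current latent state and a robot action go in, a predicted next latent state comes out. The interface, sizes, and full (non-causal) attention here are simplifying assumptions; the released model is a ~300M-parameter block-causal transformer.

```python
# Illustrative action-conditioned latent predictor (assumed interface, not V-JEPA 2-AC code).
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    def __init__(self, state_dim=1024, action_dim=7):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, state_dim)
        layer = nn.TransformerEncoderLayer(d_model=state_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(state_dim, state_dim)

    def forward(self, state_tokens, action):
        # state_tokens: (B, N, state_dim) current-frame embeddings
        # action:       (B, action_dim) end-effector command (e.g. pose delta + gripper)
        act = self.action_proj(action).unsqueeze(1)          # (B, 1, state_dim)
        x = torch.cat([state_tokens, act], dim=1)
        return self.out(self.backbone(x)[:, :-1])            # predicted next-state tokens

# Training alternates teacher-forced one-step prediction with multi-step rollouts
# in which the model's own predictions are fed back in as inputs.
```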
The action-conditioned model enables zero-shot planning through model-predictive control. It plans action sequences by minimizing the distance between imagined future states and visual goals using the cross-entropy method (CEM). This proved highly successful on tasks such as reaching, grasping, and pick-and-place with unseen robot arms in different labs, without any reward supervision or additional data collection.
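The sketch below shows the basic CEM planning loop: sample candidate action sequences, roll each out in latent space with the action-conditioned predictor, keep the elites whose imagined final state is closest to the goal embedding, refit the sampling distribution, and execute only the first action. The function names and the toy dynamics are assumptions for illustration, not the released planning code.

```python
# Minimal cross-entropy-method (CEM) planner over a learned latent world model.
import torch

def cem_plan(predict, state, goal, horizon=5, action_dim=7,
             samples=256, elites=32, iters=4):
    # predict(state, action) -> next latent state; state/goal: (D,) embeddings.
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(iters):
        actions = mean + std * torch.randn(samples, horizon, action_dim)
        costs = torch.empty(samples)
        for i in range(samples):
            s = state
            for t in range(horizon):
                s = predict(s, actions[i, t])        # imagined rollout in latent space
            costs[i] = torch.norm(s - goal)          # distance to the goal embedding
        best = costs.topk(elites, largest=False).indices
        mean = actions[best].mean(dim=0)             # refit the sampling distribution
        std = actions[best].std(dim=0) + 1e-4
    return mean[0]                                   # execute only the first action (MPC)

# Example with toy linear dynamics standing in for the learned predictor:
A = torch.eye(8) * 0.9
B = torch.randn(8, 7) * 0.1
toy_predict = lambda s, a: A @ s + B @ a
first_action = cem_plan(toy_predict, torch.zeros(8), torch.ones(8))
```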

Benchmarks: Robust Performance and Planning Efficiency
Compared to baselines such as Octo (behavior cloning) and Cosmos (latent diffusion world model), V-JEPA 2-AC:
- Executes plans in about 16 seconds per step (versus 4 minutes for Cosmos).
- Achieves a 100% success rate on reach tasks.
- Outperforms the others on grasping and manipulation across object types.

Notably, it operates using monocular RGB cameras without calibration or environment-specific fine-tuning, underscoring the generalization capability of the learned world model.
Conclusion
Meta's V-JEPA 2 represents a significant advance in scalable self-supervised learning for physical intelligence. By decoupling observation learning from action conditioning and leveraging large-scale passive video, V-JEPA 2 demonstrates that general-purpose visual representations can be harnessed for both perception and control in the real world.
Check out the Paper, Models on Hugging Face, and the GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 99k+ ML SubReddit and subscribe to our Newsletter.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news presented in a way that is both technically sound and accessible to a wide audience. The platform draws over 2 million views each month, reflecting its popularity among readers.