V-JEPA 2

V-JEPA 2 is the successor to V-JEPA, extended more explicitly toward being a world model. Meta positions V-JEPA 2 as a model for understanding and predicting the physical world.

Developments from V-JEPA

V-JEPA was a self-supervised model that predicts the representations of masked regions of a video. V-JEPA 2 places additional emphasis on the following:

Larger-scale video training
Longer temporal context
Understanding of object motion and physical interaction
Connection to robotics / embodied AI benchmarks

Structure as a world model

Like V-JEPA, V-JEPA 2 predicts the future as a latent representation rather than generating it as pixels. This is close to the idea of latent dynamics in World Models.

Why it matters for robotics

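A robot needs to predict the consequences of its actions. Pixel-level video generation is computationally heavy, ambiguous, and includes many details that are irrelevant to the task. A representation prediction model such as V-JEPA 2 may be able to handle the following in an abstract space:

Where objects will move
What is likely to come into contact
How the important state of the scene will change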

The action-conditioned extension of V-JEPA 2 in equations

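V-JEPA 2 can be summarized as a two-stage procedure: it first learns action-free representation prediction from large-scale video, and then, adapted to robotics and embodied settings, learns how actions change the future representation. The first stage, video pretraining, can be expressed with the same feature prediction loss as V-JEPA: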

$$\mathcal{L}_{\mathrm{stage1}} = \sum_j \left\| g_\phi(z_c,m_j) - \mathrm{sg}(z_{t,j}) \right\|_1$$
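
As a rough illustration of the stage-1 loss, the following NumPy sketch sums the L1 distance between predicted and target features over masked positions. It assumes that $z_{t,j}$ denotes the target-encoder feature at masked position $j$; the function name `stage1_loss` and the toy arrays are hypothetical and are not taken from the V-JEPA 2 code.

```python
import numpy as np

def stage1_loss(predicted, target):
    """L1 feature-prediction loss summed over masked positions.

    predicted: (num_masked, dim) array, stands in for g_phi(z_c, m_j)
    target:    (num_masked, dim) array, stands in for sg(z_{t,j})
    """
    # NumPy has no autodiff, so the stop-gradient sg(.) is implicit here:
    # the target is simply treated as a constant array.
    per_position = np.abs(predicted - target).sum(axis=-1)  # L1 norm for each j
    return float(per_position.sum())                        # sum over j

# Toy features for two masked positions (made-up numbers).
predicted = np.array([[1.0, 2.0], [0.0, -1.0]])
target = np.array([[0.5, 2.0], [1.0, 1.0]])
loss = stage1_loss(predicted, target)
```

In an autodiff framework, the target would be wrapped in that framework's stop-gradient operation so that only the predictor and context encoder receive gradients from this loss.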

In the second stage, we consider an action-conditioned predictor $p_\psi$ that predicts the latent state $z_{t+k}$, $k$ steps ahead, from the current latent state $z_t$ and the action sequence $a_{t:t+k-1}$.

$$\mathcal{L}_{\mathrm{stage2}} = \sum_{k=1}^{H} \left\| p_\psi(z_t,a_{t:t+k-1}) - \mathrm{sg}(z_{t+k}) \right\|_2^2$$

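Here, $a_t$ is the action of the robot or agent, $H$ is the rollout horizon, and $p_\psi$ is the action-conditioned dynamics predictor. The intuition behind this equation is: first build a good latent that captures how the world looks from a large amount of video, then use a small amount of interaction data to learn how actions change that latent.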
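
The stage-2 loss can be sketched in the same way. The example below is a minimal illustration, not the actual V-JEPA 2 predictor: it assumes a hypothetical linear dynamics model `z <- A z + B a` as a stand-in for $p_\psi$, fixes $t = 0$, and treats the encoder latents as constants in place of $\mathrm{sg}(\cdot)$. All names and values are made up for illustration.

```python
import numpy as np

def rollout_predictor(z_t, actions, A, B):
    """Hypothetical linear stand-in for p_psi: apply z <- A z + B a once per action."""
    z = z_t
    for a in actions:
        z = A @ z + B @ a
    return z

def stage2_loss(latents, actions, A, B, horizon):
    """Sum over k = 1..H of || p_psi(z_0, a_{0:k-1}) - sg(z_k) ||_2^2.

    latents: (H + 1, dim) encoder outputs z_0..z_H, treated as constants (sg)
    actions: (H, action_dim) actions a_0..a_{H-1}
    """
    z_t = latents[0]
    total = 0.0
    for k in range(1, horizon + 1):
        pred = rollout_predictor(z_t, actions[:k], A, B)  # uses a_0..a_{k-1}
        total += float(np.sum((pred - latents[k]) ** 2))  # squared L2 error
    return total

# Toy setup in which each action simply shifts the latent state.
A = np.eye(2)
B = np.eye(2)
actions = np.array([[1.0, 0.0], [0.0, 1.0]])
latents = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
perfect = stage2_loss(latents, actions, A, B, horizon=2)

# Perturb the final target latent so the 2-step prediction is off by 2 in one axis.
off_latents = latents.copy()
off_latents[2] = [1.0, 3.0]
imperfect = stage2_loss(off_latents, actions, A, B, horizon=2)
```

When the toy dynamics match the latent trajectory exactly, the loss is zero; the perturbed trajectory contributes only through the $k = 2$ term.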

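A world model that generates pixel video has to predict lighting, texture, and fine background detail as well. In contrast, this latent-space objective can concentrate compute on the information needed for planning, such as object layout, contact, and movability. For this reason, V-JEPA 2 is better understood as a world model that ties representation and dynamics together than as a video generation model.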
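
To make the planning use case concrete, here is a minimal sketch of planning in latent space with random shooting: sample candidate action sequences, roll each one through a predictor, and keep the sequence whose predicted latent lands closest to a goal latent. Random shooting is chosen here only for brevity and is not presented as the planner used by V-JEPA 2. The additive `predict` function, the L1 goal cost, and all parameter values are assumptions for illustration.

```python
import numpy as np

def predict(z, actions):
    """Hypothetical dynamics stand-in for p_psi: each action shifts the latent."""
    return z + actions.sum(axis=0)

def plan_random_shooting(z_now, z_goal, horizon, num_candidates, rng):
    """Sample action sequences; return the one whose predicted latent is closest to the goal."""
    dim = z_now.shape[0]
    candidates = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, dim))
    # Cost (assumed): L1 distance between the predicted and goal latents.
    costs = np.array([np.abs(predict(z_now, seq) - z_goal).sum() for seq in candidates])
    best = int(np.argmin(costs))
    return candidates[best], float(costs[best])

rng = np.random.default_rng(0)
z_now = np.zeros(2)
z_goal = np.array([1.0, -0.5])
best_actions, best_cost = plan_random_shooting(
    z_now, z_goal, horizon=3, num_candidates=256, rng=rng
)
baseline_cost = float(np.abs(z_now - z_goal).sum())  # cost of taking no action
```

No pixels are generated at any point: both the rollout and the cost are computed on latent vectors, which is the property the paragraph above describes.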

Differences from generative video models

V-JEPA 2 is not a model for generating high-fidelity video. It prioritizes learning the representations needed for action over visual detail.

Aspect	Video diffusion model	V-JEPA 2
Output	Pixel / latent video	Future representation
Primary goal	Generation quality	Prediction, understanding, planning
Loss	Denoising / likelihood-based	Embedding prediction
Downstream	Video generation	Robotics / representation / world model

Main sources

V-JEPA 2 announcement: https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/
V-JEPA 2 GitHub: https://github.com/facebookresearch/vjepa2
