JEPA Overview

JEPA (Joint Embedding Predictive Architecture) is a self-supervised learning framework that predicts one part of an input from another part in embedding space rather than in pixel space. It belongs to the line of energy-based / predictive learning advocated by Yann LeCun and is closely related to representation learning for world models.

JEPA architecture map

Conceptual diagram by the author. JEPA predicts the target embedding from the context embedding and avoids pixel-level reconstruction.

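Basic idea

JEPA splits the input into a context and a target. The context is the part shown to the model; the target is the part that is hidden, or the part that lies in the future.

The key point is that the model does not reconstruct the target as raw pixels. Instead, it predicts the abstract representation that a target encoder produces for the target.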
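The context / target split described above can be sketched as follows. This is a minimal illustration, not the masking scheme of any particular JEPA paper: the function name `split_context_target`, the `target_ratio` parameter, and the use of uniform random sampling are all assumptions made for the example. Published JEPA variants use their own masking strategies.

```python
import random


def split_context_target(patches, target_ratio=0.5, seed=0):
    """Split a list of patches into a visible context and a hidden target.

    Hypothetical helper: uniform random masking is used only for illustration.
    """
    rng = random.Random(seed)
    n = len(patches)
    n_target = max(1, int(n * target_ratio))
    # Indices of the patches that are hidden from the context encoder.
    target_idx = sorted(rng.sample(range(n), n_target))
    target_set = set(target_idx)
    # Everything else stays visible as context.
    context_idx = [i for i in range(n) if i not in target_set]
    context = [patches[i] for i in context_idx]
    target = [patches[i] for i in target_idx]
    return context_idx, context, target_idx, target


patches = [f"patch_{i}" for i in range(8)]
c_idx, context, t_idx, target = split_context_target(patches, target_ratio=0.25)
print("context:", context)
print("target:", target)
```

Only the context patches would be passed to the context encoder; the target patches go through the target encoder, and their embeddings become the prediction target.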

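Why avoid pixel reconstruction?

Pixel-level reconstruction tries to reproduce even details that are unnecessary for understanding the scene, such as background texture or small changes in lighting. By predicting in embedding space, JEPA aims to drop such detail and focus on information that matters semantically and physically.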

Approach	Prediction target	Characteristics
Masked Autoencoder	Pixels / patches	Tends to reconstruct low-level detail as well
Contrastive learning	Positive / negative pair	Requires designing negative samples
JEPA	Target embedding	Predicts an abstract representation

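How collapse is avoided

Embedding prediction is prone to collapse: if the encoders map every input to the same vector, the prediction loss becomes trivially small without learning anything useful. JEPA-style methods counter this with measures such as applying stop-gradient to the target encoder, updating it as an exponential moving average (EMA) of the context encoder, and adding a predictor.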
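The EMA update mentioned above can be sketched in a few lines. This is a minimal sketch under stated assumptions: parameters are represented as flat lists of floats, and the function name `ema_update` and the `momentum` values are hypothetical choices for illustration, not settings taken from a specific paper.

```python
def ema_update(target_params, context_params, momentum=0.99):
    """Return new target-encoder parameters as an EMA of the context encoder.

    The target encoder receives no gradient; it only tracks the context
    encoder slowly through this update.
    """
    return [
        momentum * t + (1.0 - momentum) * c
        for t, c in zip(target_params, context_params)
    ]


context_params = [1.0, 2.0, 3.0]  # stand-in for context-encoder weights
target_params = [0.0, 0.0, 0.0]   # stand-in for target-encoder weights

# A low momentum is used here only to make the drift easy to see.
for _ in range(3):
    target_params = ema_update(target_params, context_params, momentum=0.5)

print(target_params)  # [0.875, 1.75, 2.625]
```

With a momentum close to 1, the target encoder changes slowly, which gives the predictor a relatively stable target to regress onto.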

Relation to world models

JEPA can be interpreted as a world model that predicts a future representation of the environment instead of generating the future in pixel space.

This is where JEPA connects to World Models.
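One way to picture this interpretation is a rollout that stays entirely in representation space. The sketch below is illustrative only: the linear `predict_next` function, the `weights` matrix, and the two-dimensional embedding are all assumptions made for the example, and a real predictor would be a learned network.

```python
def predict_next(z, weights):
    """Hypothetical predictor: maps the current embedding to the next one."""
    return [sum(w * v for w, v in zip(row, z)) for row in weights]


def rollout(z0, weights, steps):
    """Predict future embeddings step by step without decoding any pixels."""
    trajectory = [z0]
    for _ in range(steps):
        trajectory.append(predict_next(trajectory[-1], weights))
    return trajectory


weights = [[0.5, 0.0], [0.0, 2.0]]  # toy latent dynamics
trajectory = rollout([4.0, 1.0], weights, steps=2)
print(trajectory)  # [[4.0, 1.0], [2.0, 2.0], [1.0, 4.0]]
```

Each step consumes and produces an embedding, so no image is ever generated during the rollout.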

The JEPA prediction objective as a formula

JEPA predicts the representation of one part of the input from another part. Let the context encoder be $f_\theta$, the target encoder $f_{\bar{\theta}}$, and the predictor $g_\phi$. With context view $x_c$ and target view $x_t$, a typical loss can be written as follows.

$$\mathcal{L}_{\mathrm{JEPA}} = \left\| g_\phi(f_\theta(x_c)) - \mathrm{sg}(f_{\bar{\theta}}(x_t)) \right\|_2^2$$

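Here, $\mathrm{sg}$ denotes stop-gradient, so no gradient flows into the target encoder through this term. The intuition is that the model predicts an abstracted target representation rather than reconstructing pixels directly. Compared with pixel-level reconstruction, this is expected to make semantic information easier to learn.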
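The loss above can be evaluated numerically in a small sketch. To keep it self-contained, the encoders and the predictor are assumed to be plain linear maps, and the names `linear` and `jepa_loss` are hypothetical. There is no automatic differentiation here, so stop-gradient appears only as a comment; in an autodiff framework the target embedding would be wrapped in that framework's stop-gradient operation.

```python
def linear(weights, x):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def jepa_loss(x_c, x_t, theta, theta_bar, phi):
    """Squared L2 distance between predicted and target embeddings."""
    z_c = linear(theta, x_c)      # f_theta(x_c): context embedding
    pred = linear(phi, z_c)       # g_phi(f_theta(x_c)): predicted target embedding
    z_t = linear(theta_bar, x_t)  # f_theta_bar(x_t): treated as a constant (stop-gradient)
    return sum((p - t) ** 2 for p, t in zip(pred, z_t))


identity = [[1.0, 0.0], [0.0, 1.0]]
x_c = [1.0, 2.0]  # toy context view
x_t = [1.5, 2.0]  # toy target view
loss = jepa_loss(x_c, x_t, theta=identity, theta_bar=identity, phi=identity)
print(loss)  # 0.25
```

The loss is zero exactly when the predicted embedding matches the target embedding, which is also why a collapsed encoder that outputs a constant vector would minimize it trivially.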

Main sources

I-JEPA paper: https://arxiv.org/abs/2301.08243
V-JEPA paper: https://ai.meta.com/research/publications/revisiting-feature-prediction-for-learning-visual-representations-from-video/
V-JEPA 2 announcement: https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/
JEPA / world model tutorial: LeCun, A Path Towards Autonomous Machine Intelligence: https://openreview.net/forum?id=BZ5a1r-kVsf
