JEPA vs Generative Models

JEPA is often compared with masked autoencoders and diffusion models. All of them appear to "predict missing information", but what they actually predict differs substantially.

What each method predicts

Method	Prediction target	Characteristics
MAE	Masked pixels / patches	Tends to recover low-level detail
Diffusion model	Noise / clean sample	Capable of high-quality generation
Contrastive learning	Pulls positive pairs together	Negative sampling and augmentation design matter
JEPA	Target embedding	Predicts an abstract representation

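Limits of pixel prediction

Pixel prediction tries to reconstruct the future, or a hidden region, exactly. In practice the future is multi-modal and therefore ambiguous at the pixel level. For example, whether a person moves right or left, or how the wrinkles in their clothes fall, is not uniquely determined.

By predicting a target representation instead of pixels, JEPA avoids this ambiguity to some extent.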
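The ambiguity can be made concrete with a toy case. The sketch below is an illustration under assumptions that are not in the original: a 1-D "frame" of five pixels, two equally likely futures, and a hypothetical hand-written embedding `embed` that keeps how far something moved and drops the direction. It shows that the single pixel-space prediction minimizing expected squared error is the average of both futures, which matches neither of them.

```python
# Toy illustration (assumed setup): one bright pixel moves left or right
# with equal probability, so the future frame is bimodal.
future_left = [0.0, 1.0, 0.0, 0.0, 0.0]
future_right = [0.0, 0.0, 0.0, 1.0, 0.0]
futures = [future_left, future_right]


def sq_error(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def expected_error(pred, targets):
    """Mean squared error of one prediction over equally likely targets."""
    return sum(sq_error(pred, t) for t in targets) / len(targets)


# The minimizer of expected squared error is the per-pixel mean.
pixel_mean = [sum(col) / len(futures) for col in zip(*futures)]


def embed(frame):
    """Hypothetical embedding: keeps 'how far it moved', drops direction."""
    center = (len(frame) - 1) / 2
    return [sum(v * abs(i - center) for i, v in enumerate(frame))]


print("blurred pixel prediction:", pixel_mean)
print("pixel error, mean prediction:", expected_error(pixel_mean, futures))
print("pixel error, committing to one future:", expected_error(future_left, futures))
print("embedding targets:", [embed(f) for f in futures])
```

Both futures map to the same value under this hand-written embedding, so a predictor in that space has a single well-defined target. A real JEPA learns its embedding rather than using a fixed rule, and whether a detail such as direction is kept or discarded depends on training, which is why the ambiguity is reduced only to some extent.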

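The objectives as equations

The differences between the approaches become much easier to organize once you look at which space the error is measured in. Writing the context as $x_c$ and the target as $x_t$, MAE reconstructs the masked target in pixel space.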

\mathcal{L}_{\mathrm{MAE}} = \left\|x_t-\hat{x}_t(x_c)\right\|_2^2
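As a minimal sketch of this objective, the function below computes the squared error over masked patches only. The names `mae_loss` and `mask`, the patch layout as nested lists, and the per-pixel averaging are assumptions for illustration and do not come from any particular implementation.

```python
def mae_loss(target_patches, predicted_patches, mask):
    """Mean squared error over masked patches only.

    target_patches / predicted_patches: lists of patches (lists of floats).
    mask: list of bools, True where the patch was hidden from the encoder.
    """
    total, count = 0.0, 0
    for tgt, pred, hidden in zip(target_patches, predicted_patches, mask):
        if not hidden:
            continue  # visible patches do not contribute in this sketch
        total += sum((t - p) ** 2 for t, p in zip(tgt, pred))
        count += len(tgt)
    # Averaging per pixel is a normalization choice assumed here.
    return total / count if count else 0.0


target = [[0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
predicted = [[9.0, 9.0], [1.0, 0.0], [0.5, 0.5]]
mask = [False, True, True]  # first patch is visible, so its error is ignored
print(mae_loss(target, predicted, mask))
```

The large error on the first patch does not affect the result, because the loss is taken only where the input was hidden.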

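A diffusion model can be trained to predict the noise $\epsilon$ that was added to a noisy sample. In this equation only, $t$ is the diffusion timestep and $x_t$ is the noisy sample at that step, not the masked target defined above.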

\mathcal{L}_{\mathrm{diffusion}} = \mathbb{E}_{t,\epsilon} \left\|\epsilon-\epsilon_\theta(x_t,t)\right\|_2^2
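The following sketch shows what the network is asked to regress. It assumes the common DDPM-style forward process $x_t = \sqrt{\bar\alpha}\,x_0 + \sqrt{1-\bar\alpha}\,\epsilon$, which the text above does not spell out. `oracle_eps` is a hypothetical stand-in for $\epsilon_\theta$ that cheats by knowing the clean sample, and is there only to show that the loss reaches zero when the noise is recovered exactly.

```python
import math
import random


def noisy_sample(x0, alpha_bar, rng):
    """Forward noising (assumed DDPM form): sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
           for x, e in zip(x0, eps)]
    return x_t, eps


def noise_loss(eps, eps_pred):
    """Squared L2 error between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


def oracle_eps(x_t, x0, alpha_bar):
    """Stand-in for eps_theta that knows x0 (illustration only)."""
    return [(xt - math.sqrt(alpha_bar) * x) / math.sqrt(1.0 - alpha_bar)
            for xt, x in zip(x_t, x0)]


rng = random.Random(0)
x0 = [0.2, -0.5, 1.0]
x_t, eps = noisy_sample(x0, alpha_bar=0.7, rng=rng)
print(noise_loss(eps, oracle_eps(x_t, x0, 0.7)))  # close to zero
print(noise_loss(eps, [0.0] * len(eps)))          # error of always predicting zero
```

The regression target `eps` has the same shape as the sample itself, which is one way to see that the error is still measured in a space tied to raw appearance.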

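Contrastive learning pulls the positive representation $z^+$ toward the anchor $z$ and identifies the correct pair within a candidate set that includes negatives. Here $\mathrm{sim}$ is a similarity function and $\tau$ is a temperature.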

\mathcal{L}_{\mathrm{contrastive}} = -\log \frac{\exp(\mathrm{sim}(z,z^+)/\tau)} {\sum_k\exp(\mathrm{sim}(z,z_k)/\tau)}
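A small sketch of this loss follows. It assumes cosine similarity for $\mathrm{sim}$, a temperature of 0.1, and a candidate set made of the positive plus the negatives. None of these choices is fixed by the formula, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def info_nce(z, z_pos, negatives, tau=0.1):
    """Negative log-softmax of the positive among [positive] + negatives."""
    logits = [cosine(z, z_pos) / tau] + [cosine(z, n) / tau for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


z = [1.0, 0.0]
negatives = [[-1.0, 0.0], [0.0, -1.0]]
print(info_nce(z, [0.9, 0.1], negatives))  # positive close to z: small loss
print(info_nce(z, [0.0, 1.0], negatives))  # positive far from z: larger loss
```

The loss depends on the negatives as much as on the positive, which is why negative sampling and augmentation design matter for this family of methods.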

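JEPA predicts the representation produced by a target encoder, rather than the target's pixels. Here $f_\theta$ is the context encoder, $g_\phi$ is the predictor, $f_{\bar{\theta}}$ is the target encoder, and $\mathrm{sg}$ denotes a stop-gradient.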

\mathcal{L}_{\mathrm{JEPA}} = \left\| g_\phi(f_\theta(x_c)) - \mathrm{sg}(f_{\bar{\theta}}(x_t)) \right\|_2^2
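To make the roles of the three components concrete, here is a minimal sketch that uses plain linear maps in place of real encoders. Everything in it is an assumption for illustration: the helper names, the linear "encoders", and in particular the exponential-moving-average rule for $\bar{\theta}$. That rule is a common choice (I-JEPA uses it, for example), but the equation above does not specify it.

```python
def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def jepa_loss(theta, phi, theta_bar, x_context, x_target):
    """|| g_phi(f_theta(x_c)) - sg(f_theta_bar(x_t)) ||^2 with linear maps."""
    predicted = linear(phi, linear(theta, x_context))
    # Stop-gradient: the target embedding is treated as a constant. Plain
    # Python has no autograd, so here that only means theta_bar is never
    # updated from this loss.
    target = linear(theta_bar, x_target)
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def ema_update(theta_bar, theta, momentum=0.99):
    """Assumed update rule: theta_bar <- m * theta_bar + (1 - m) * theta."""
    return [[momentum * tb + (1.0 - momentum) * t for tb, t in zip(rb, r)]
            for rb, r in zip(theta_bar, theta)]


identity = [[1.0, 0.0], [0.0, 1.0]]
theta, phi = identity, identity
theta_bar = [[0.5, 0.0], [0.0, 0.5]]
print(jepa_loss(theta, phi, theta_bar, [1.0, 2.0], [2.0, 4.0]))
print(ema_update(theta_bar, theta, momentum=0.5))
```

The error is computed between two embeddings and never touches pixels. In an autograd framework the stop-gradient would be an explicit operation that detaches the target from the computation graph.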

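The intuition behind this comparison is as follows. MAE and diffusion measure error in a space close to raw appearance. Contrastive learning shapes the representation space by discriminating a sample from other samples. JEPA directly predicts a target representation that is meant to capture what is semantically and physically important. So if the goal is to generate images or video, diffusion is the natural choice, whereas if the goal is a representation for planning or recognition, learning in embedding space as JEPA does makes it easier to avoid spending capacity on irrelevant pixel detail.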

JEPA and world models

What a world model needs is not necessarily a future frame that looks natural. To choose actions, an agent needs the semantic and physical changes in the future state.

Because JEPA predicts this future state in embedding space, it fits well with representation learning for world models.
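As a rough illustration of why an embedding-space predictor is enough for action selection, the sketch below does one-step lookahead entirely in latent space. It rests on assumptions that go beyond the text above: the predictor is conditioned on an action, `predict_next` is a hypothetical hand-written stand-in for a learned model, and the goal is given as an embedding.

```python
def predict_next(z, action):
    """Hypothetical latent predictor, standing in for a learned g_phi(z, a)."""
    return [zi + ai for zi, ai in zip(z, action)]


def latent_cost(z, z_goal):
    """Squared distance between a predicted embedding and the goal embedding."""
    return sum((a - b) ** 2 for a, b in zip(z, z_goal))


def choose_action(z, z_goal, candidate_actions):
    """Pick the action whose predicted next embedding is closest to the goal."""
    return min(candidate_actions,
               key=lambda a: latent_cost(predict_next(z, a), z_goal))


actions = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
print(choose_action([0.0, 0.0], [0.0, 2.0], actions))
```

No frame is ever rendered here: candidate actions are compared using predicted embeddings alone.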

Which to use

To generate images or video → Diffusion / autoregressive model
For a strong visual representation → JEPA / contrastive / masked model
For latent dynamics usable in planning → JEPA + world model / Dreamer family
When pixel-level reconstruction is required → MAE / generative model
