V-JEPA

V-JEPA is a Joint Embedding Predictive Architecture (JEPA) for video. It uses part of a video as context and predicts, in embedding space, the representations of masked spatiotemporal regions.

Why video?

Video is closer than still images to the kind of data a world model needs: objects move, the camera moves, and interactions occur. V-JEPA aims to learn physical and semantic representations from this temporal structure.

Basic architecture

Masks are defined not over the patches of a single image but as spatiotemporal tubes or regions. Predicting the target representations from the context requires the model to capture properties such as object permanence, motion, and scene layout.
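The tube-style masking described above can be illustrated with a minimal sketch. It assumes the clip has already been split into a grid of patches per frame; the function names (`make_tube_mask`, `split_indices`), the grid sizes, and the single-block sampling scheme are hypothetical choices for illustration, not the masking strategy of the V-JEPA codebase.

```python
import random

def make_tube_mask(num_frames, grid_h, grid_w, block_h, block_w, seed=0):
    """Return a [num_frames][grid_h][grid_w] boolean mask; True marks hidden patches.

    One spatial block is sampled once and repeated on every frame, so the
    hidden region forms a "tube" through time (an illustrative assumption).
    """
    rng = random.Random(seed)
    top = rng.randint(0, grid_h - block_h)    # randint bounds are inclusive
    left = rng.randint(0, grid_w - block_w)
    frame_mask = [
        [top <= r < top + block_h and left <= c < left + block_w
         for c in range(grid_w)]
        for r in range(grid_h)
    ]
    # Copy the same spatial mask to every time step.
    return [[row[:] for row in frame_mask] for _ in range(num_frames)]

def split_indices(mask):
    """Split (t, row, col) patch positions into context (visible) and target (hidden)."""
    context, target = [], []
    for t, frame in enumerate(mask):
        for r, row in enumerate(frame):
            for c, hidden in enumerate(row):
                (target if hidden else context).append((t, r, c))
    return context, target

mask = make_tube_mask(num_frames=4, grid_h=6, grid_w=6, block_h=3, block_w=2)
context_idx, target_idx = split_indices(mask)
print(len(context_idx), len(target_idx))  # 120 24
```

Because the hidden block covers the same spatial location on every frame, the model cannot recover the target simply by copying the same patch from a neighboring frame; it has to reason about what lies behind the mask over time.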

V-JEPA in equations

V-JEPA can be viewed as I-JEPA with its target blocks extended to spatiotemporal blocks in video. Let $x_{t_0:t_0+T}$ be a video clip and $\mathcal{T}_j$ a spatiotemporal target region. The representation that the teacher encoder produces for the target region can be written as:

$$z_{t,j} = f_{\bar{\theta}}(x_{t_0:t_0+T},\mathcal{T}_j)$$

Let $z_c$ be the output of the context encoder and $m_j$ the positional information of the target region. The V-JEPA feature prediction loss can then be written as:

$$\mathcal{L}_{\mathrm{V\text{-}JEPA}} = \sum_j \left\| g_\phi(z_c,m_j) - \mathrm{sg}(z_{t,j}) \right\|_1$$

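Here, $\mathcal{T}_j$ denotes a spatiotemporal mask such as a tube or block, and $T$ is the temporal length of the clip. $g_\phi$ is the predictor, and $\mathrm{sg}(\cdot)$ is the stop-gradient operator, so no gradient flows into the teacher's targets. The intuition behind the loss is to predict what meaning and motion the hidden region carries within the video, as features rather than as pixels.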
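As a rough numerical illustration of the loss above, the following sketch sums the L1 distance between predicted and target feature vectors over target regions. The helper name `l1_feature_loss` and the toy vectors are hypothetical. Only the forward value is computed, so the stop-gradient has no visible effect here; in an autodiff framework the targets would additionally be detached from the graph (for example with `jax.lax.stop_gradient` or `torch.Tensor.detach`).

```python
def l1_feature_loss(predicted, targets):
    """Sum over target regions j of || g(z_c, m_j) - sg(z_{t,j}) ||_1.

    predicted, targets: lists of equal-length float vectors, one per region.
    The targets are treated as constants (the role of sg in the formula).
    """
    total = 0.0
    for pred_j, target_j in zip(predicted, targets):
        total += sum(abs(p - t) for p, t in zip(pred_j, target_j))
    return total

# Two target regions with 3-dimensional toy features (made-up values).
predicted = [[0.5, -1.0, 2.0], [0.0, 0.25, 1.0]]
targets = [[0.0, -1.0, 1.0], [0.5, 0.25, 0.0]]

loss = l1_feature_loss(predicted, targets)
print(loss)  # 3.0
```

Each region contributes 1.5 here (0.5 + 0 + 1.0), and a perfect prediction gives a loss of exactly zero.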

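The future of a video is inherently multi-modal: from the same context, people and objects can move in several plausible ways. Trying to pin down a single pixel-level outcome tends to yield averaged, blurry predictions, so V-JEPA measures error on abstract representations instead. This makes it easier to learn the information that downstream tasks need, such as object permanence, motion, and scene layout, rather than fine details of appearance.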
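The averaging effect can be seen in a toy example. The sketch below assumes a 5-pixel, one-dimensional "frame" in which a bright object moves either left or right with equal probability, and scores a single deterministic prediction by mean squared error in pixel space. The setup and names are hypothetical and chosen only to illustrate the argument; they are not taken from the V-JEPA paper.

```python
# Two equally likely futures for the same context: the object ends up
# at pixel 1 (moved left) or at pixel 3 (moved right).
future_left = [0.0, 1.0, 0.0, 0.0, 0.0]
future_right = [0.0, 0.0, 0.0, 1.0, 0.0]
futures = [future_left, future_right]

def mse(pred, target):
    """Mean squared error between two frames of equal length."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def expected_mse(pred):
    """Average pixel MSE of one fixed prediction over all possible futures."""
    return sum(mse(pred, f) for f in futures) / len(futures)

# The per-pixel mean minimizes expected squared error over the futures.
mean_pred = [sum(px) / len(futures) for px in zip(*futures)]

print(mean_pred)  # [0.0, 0.5, 0.0, 0.5, 0.0]
print(expected_mse(mean_pred), expected_mse(future_left))
```

The mean prediction shows a half-bright object in both places at once. It matches neither possible future, yet it achieves a lower expected pixel error than committing to either sharp outcome, which is why pixel regression is pulled toward blurry predictions.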

Not pixel prediction

V-JEPA does not generate future frames directly. Prediction happens in feature space, which avoids pixel-level ambiguity and yields abstract representations that are easier to use for planning.

Positioning as a world model

Because V-JEPA predicts the future of a video in representation space, it can be interpreted as a foundation for world models. The original V-JEPA, however, is better described as a video representation learner than as an action-conditioned planning model.

Main sources

V-JEPA paper: https://ai.meta.com/research/publications/revisiting-feature-prediction-for-learning-visual-representations-from-video/
V-JEPA GitHub: https://github.com/facebookresearch/jepa
