World Models Overview

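A world model is a model that learns, from the experience an agent observes, an internal representation of the environment's state and how it changes. Rather than only predicting the next frame, it aims to predict how the environment changes in response to actions, what leads to reward, and which futures are desirable, and to use those predictions for planning and control.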

Figure: World Models map (author's own conceptual diagram). A world model learns a latent representation and dynamics from experience, and connects them to the agent's actions through prediction and planning.

Basic components of a world model

Many world models have the following components.

Component	Role
Encoder	Maps an observation to a latent state.
Dynamics model	Predicts the future state from the current state and action.
Decoder / prediction head	Predicts observations, reward, termination, semantic state, and similar quantities.
Planner / policy	Runs imagined rollouts inside the model and selects actions.
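
As a rough illustration of how the components in the table fit together, here is a minimal sketch in plain Python. The class name `ToyWorldModel`, the 1-D toy environment, and all three lambda functions are assumptions made for this example; a real world model would use learned neural networks for each part, and the planner here only looks one step ahead.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ToyWorldModel:
    """Wires the components together; every callable is a hypothetical stand-in."""
    encode: Callable[[float], float]        # Encoder: observation -> latent state
    step: Callable[[float, float], float]   # Dynamics model: (latent, action) -> next latent
    reward: Callable[[float], float]        # Prediction head: latent -> predicted reward

    def choose_action(self, observation: float, actions: Sequence[float]) -> float:
        """Planner: imagine one step per candidate action and keep the best one."""
        z = self.encode(observation)
        return max(actions, key=lambda a: self.reward(self.step(z, a)))


# Assumed toy task: the agent sits at a 1-D position and wants to reach the origin.
model = ToyWorldModel(
    encode=lambda o: o / 10.0,       # rescale the raw observation
    step=lambda z, a: z + 0.1 * a,   # an action shifts the latent position
    reward=lambda z: -abs(z),        # closer to the origin is better
)

best = model.choose_action(observation=5.0, actions=[-1.0, 0.0, 1.0])
print(best)
```

In this toy setup the agent starts on the positive side, so the imagined step that moves toward the origin scores highest and the planner returns the negative action.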

World models vs. video generation

A video generation model prioritizes producing future video that looks natural. A world model, in contrast, prioritizes the predictions an agent needs in order to act.

Aspect	Video generation	World model
Input	text / video context	observation + action
Output	future frames	future latent, reward, state, observation
Goal	realism	planning / control
Evaluation	visual quality	task success, prediction utility

That said, video diffusion models, robotics, and autonomous driving simulators have been converging in recent years, and the boundary between the two is becoming blurred.

Trends at top conferences

Directions that have come up frequently in recent years include:

Latent dynamics models (Dreamer family)
Video world models trained without action labels (Genie family)
Autonomous driving world models (GAIA-1, DriveDreamer, and others)
Future generation with diffusion models and transformers
Representation-first world models based on JEPA / V-JEPA
Planning and control in embodied AI and robotics

World models in equations

A world model can be written as a model that predicts future latent states and observations from an observation $o_t$ and an action $a_t$.

$$\mathbf{z}_t=E(o_t), \qquad \mathbf{z}_{t+1}\sim p_\theta(\mathbf{z}_{t+1}\mid \mathbf{z}_t,a_t)$$

Here $E$ is the encoder and $\mathbf{z}_t$ is the latent state. When the model has an observation decoder, it reconstructs images or sensor observations as follows.

$$\hat{o}_t\sim p_\theta(o_t\mid \mathbf{z}_t)$$
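
To make the encode, transition, and decode steps concrete, the following is a minimal sketch that assumes scalar states and a linear-Gaussian transition. The coefficients `0.5`, `0.9`, `0.2`, the noise scale `sigma`, and the function names are all invented for illustration and are not taken from any particular world model.

```python
import random


def encode(o: float) -> float:
    """E: map an observation to a latent state (assumed: simple rescaling)."""
    return 0.5 * o


def sample_next_latent(z: float, a: float, rng: random.Random, sigma: float = 0.05) -> float:
    """Draw z_{t+1} ~ p_theta(. | z_t, a_t), assumed Gaussian around a linear mean."""
    mean = 0.9 * z + 0.2 * a
    return rng.gauss(mean, sigma)


def decode_mean(z: float) -> float:
    """Mean of p_theta(o_t | z_t); here simply the inverse of the toy encoder."""
    return 2.0 * z


rng = random.Random(0)           # fixed seed so the sketch is reproducible
z = encode(4.0)                  # z_t = E(o_t)
trajectory = [z]
for a in [1.0, 1.0, -1.0]:       # a short action sequence
    z = sample_next_latent(z, a, rng)
    trajectory.append(z)

reconstructions = [decode_mean(z) for z in trajectory]
print(reconstructions)
```

Note that after the first step the model never looks at a real observation again: every later latent state, and every reconstruction, comes from the model's own samples.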

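Planning is the problem of choosing, over a horizon $H$ and with discount factor $\gamma$, the action sequence that maximizes the expected future reward under the latent dynamics.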

$$\max_{a_{t:t+H}}\mathbb{E}\left[\sum_{k=0}^{H}\gamma^k r(\mathbf{z}_{t+k},a_{t+k})\right]$$
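
One simple way to approximate this maximization is random shooting: sample many candidate action sequences, score each one by rolling it out inside the model, and keep the best. The sketch below assumes deterministic toy dynamics (so the expectation reduces to a single rollout), a hand-written reward, and hypothetical function names. Practical systems typically use stronger optimizers or a learned policy instead.

```python
import random


def dynamics(z: float, a: float) -> float:
    """Deterministic toy latent dynamics (assumed): the action shifts the state."""
    return z + a


def reward(z: float, a: float) -> float:
    """r(z, a): stay near the origin and prefer small actions (assumed)."""
    return -(z * z) - 0.01 * (a * a)


def imagined_return(z0: float, actions: list[float], gamma: float) -> float:
    """Evaluate sum_k gamma^k r(z_{t+k}, a_{t+k}) by rolling out inside the model."""
    z, total = z0, 0.0
    for k, a in enumerate(actions):
        total += (gamma ** k) * reward(z, a)
        z = dynamics(z, a)
    return total


def plan(z0: float, horizon: int = 5, candidates: int = 500,
         gamma: float = 0.95, seed: int = 0) -> tuple[list[float], float]:
    """Random shooting: sample action sequences and keep the best imagined one."""
    rng = random.Random(seed)
    best_seq, best_ret = [], float("-inf")
    for _ in range(candidates):
        # One action per step k = 0..H, i.e. a_t ... a_{t+H}.
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon + 1)]
        ret = imagined_return(z0, seq, gamma)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq, best_ret


best_seq, best_ret = plan(z0=2.0)
print(best_seq[0], best_ret)
```

In a receding-horizon setup, only the first action `best_seq[0]` would be executed, and planning would be repeated from the next observed state.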

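The intuition behind the planning objective is: instead of repeated trial and error in the real world, imagine the future with an internal model and then decide how to act. The performance of a world model depends strongly on how much prediction error accumulates over long rollouts.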
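
The effect of error accumulation can be seen in a deliberately simple sketch. It assumes scalar linear dynamics where the true system keeps the state constant and the learned model overestimates the one-step gain by 2%. Both gains and the helper `rollout` are made up for illustration.

```python
def rollout(z0: float, gain: float, steps: int) -> list[float]:
    """Apply z_{k+1} = gain * z_k repeatedly and return the whole trajectory."""
    traj = [z0]
    for _ in range(steps):
        traj.append(gain * traj[-1])
    return traj


true_traj = rollout(1.0, gain=1.00, steps=50)    # assumed ground-truth dynamics
model_traj = rollout(1.0, gain=1.02, steps=50)   # model with a 2% one-step error

errors = [abs(m - t) for m, t in zip(model_traj, true_traj)]
for h in (1, 10, 50):
    print(f"horizon {h:2d}: abs error = {errors[h]:.3f}")
```

Under these assumptions the one-step error is 0.02, but because each prediction feeds the next, the gap compounds to roughly 1.7 after 50 steps. This is why long-horizon rollout quality, not just one-step accuracy, matters for planning.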

Main sources

World Models (Ha and Schmidhuber): https://worldmodels.github.io/
DreamerV3: https://arxiv.org/abs/2301.04104
Genie: https://sites.google.com/view/genie-2024/
GAIA-1: https://arxiv.org/abs/2309.17080
