Genie

Genie is a generative interactive environment from DeepMind. It learns a world model that an agent can control, using Internet video that carries no action labels.

What makes it notable

A conventional world model needs paired observations and actions for training. Internet video, however, usually has no action labels. Genie discovers latent actions from unlabeled video and learns a model whose environment can be controlled through those latent actions.

Latent action

Genie learns discrete latent actions that explain the change between consecutive frames. These are not the keys or controller inputs a human actually pressed, but inside the model they act as controllable factors such as "move left" or "jump".
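
As a rough illustration of what "discrete" means here, the sketch below assigns an observed frame-to-frame change to the nearest entry of a small codebook and returns that entry's index as the latent action. Everything in it is an assumption made for illustration: frames are reduced to a 2-D position, and the codebook values, its size, and the names `CODEBOOK` and `infer_latent_action` are hypothetical. In Genie the mapping is learned from video, not written by hand.

```python
from typing import List, Sequence, Tuple

# Hypothetical codebook: each entry stands in for one discrete latent action.
CODEBOOK: List[Tuple[float, float]] = [
    (-1.0, 0.0),  # index 0: a change resembling "move left"
    (1.0, 0.0),   # index 1: a change resembling "move right"
    (0.0, 1.0),   # index 2: a change resembling "jump"
    (0.0, 0.0),   # index 3: no change
]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def infer_latent_action(prev_pos: Sequence[float], next_pos: Sequence[float]) -> int:
    """Return the index of the codebook entry closest to the observed change."""
    change = tuple(n - p for p, n in zip(prev_pos, next_pos))
    return min(
        range(len(CODEBOOK)),
        key=lambda i: squared_distance(change, CODEBOOK[i]),
    )
```

The returned index is only a label. Nothing ties index 0 to a real "left" key, which matches the point above that latent actions are not actual controller inputs.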

Video tokenizer

Rather than processing high-dimensional video directly, Genie converts it into discrete tokens. This makes the input easier for the transformer dynamics model to handle.
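
To make the idea of turning pixels into discrete tokens concrete, here is a minimal sketch that replaces each square patch of a grayscale frame with one integer. It uses a fixed uniform quantizer of the patch mean as a stand-in for a learned tokenizer; the function name `tokenize_frame`, the patch size, and the vocabulary size are all hypothetical, and the frame dimensions are assumed to be divisible by the patch size.

```python
from typing import List

Frame = List[List[float]]  # grayscale frame with values in [0, 1]


def tokenize_frame(frame: Frame, patch: int = 2, vocab_size: int = 4) -> List[int]:
    """Map each patch x patch block to one discrete token (its quantized mean)."""
    height, width = len(frame), len(frame[0])
    tokens: List[int] = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            block = [
                frame[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            mean = sum(block) / len(block)
            # Clamp so that a mean of exactly 1.0 maps to the last token.
            tokens.append(min(int(mean * vocab_size), vocab_size - 1))
    return tokens
```

With these defaults a 4x4 frame of 16 pixel values becomes a sequence of 4 integers, which is the kind of short discrete sequence a transformer can consume.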

Dynamics model

The dynamics model predicts the next video tokens from the current video tokens and a latent action:

z_{t+1} \sim p_\theta(z_{t+1} \mid z_t, a_t^{\text{latent}})

Rolling out this model step by step lets it be used like an interactive environment.
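
The rollout can be sketched as a loop that repeatedly samples z_{t+1} from the model given z_t and a chosen latent action. The code below is a toy under stated assumptions: `toy_dynamics` is a hypothetical stand-in for p_theta (it shifts each token by the action index and occasionally samples a random token), not a transformer, and the names `rollout`, `vocab_size`, and `seed` are illustrative.

```python
import random
from typing import Callable, List, Sequence

Tokens = List[int]
Dynamics = Callable[[Tokens, int, random.Random], Tokens]


def toy_dynamics(tokens: Tokens, action: int, rng: random.Random,
                 vocab_size: int = 8) -> Tokens:
    """Hypothetical stand-in for p_theta: sample z_{t+1} given z_t and a_t."""
    next_tokens: List[int] = []
    for tok in tokens:
        shifted = (tok + action) % vocab_size
        # Usually take the action-shifted token, sometimes a random one.
        if rng.random() < 0.9:
            next_tokens.append(shifted)
        else:
            next_tokens.append(rng.randrange(vocab_size))
    return next_tokens


def rollout(z0: Sequence[int], actions: Sequence[int],
            dynamics: Dynamics, seed: int = 0) -> List[Tokens]:
    """Apply the dynamics once per action, feeding each output back in."""
    rng = random.Random(seed)
    trajectory: List[Tokens] = [list(z0)]
    for action in actions:
        trajectory.append(dynamics(trajectory[-1], action, rng))
    return trajectory
```

An agent or a human would supply `actions` one step at a time; the list form here just keeps the example short. For instance, `rollout([0, 1, 2], [1, 0, 2], toy_dynamics)` returns the initial tokens followed by three sampled steps.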

Significance as a world model

Genie pointed toward building controllable world models from video without action labels. For robotics and embodied AI, this widens the possibility of exploiting large amounts of passive video data.

Main sources

Genie project page: https://sites.google.com/view/genie-2024/
Genie paper: https://arxiv.org/abs/2402.15391
