Video Diffusion Models

Video Diffusion Models are a family of models that extend diffusion models from image generation to video generation. Because an image can be viewed as a video with $1$ frame, video generation can be regarded as a superset of image generation.

Differences from image generation

Video generation is a harder task than image generation, for two main reasons.

Temporal consistency is required across frames, so the model has to encode more world knowledge.
Compared with text or images, large-scale, high-quality video data is hard to collect, and text-video pair data especially so.

Consequently, the central challenges for Video Diffusion Models are how to handle temporal coherence and how to make use of image-text data.

Two main approaches

Design approaches for Video Diffusion Models fall broadly into the following two categories.

Training video diffusion from scratch
The diffusion model is trained directly on video data, without relying on an image generator. Architectures used include 3D U-Net and Diffusion Transformer. Representative examples are Imagen Video and Sora.
Inflating a pre-trained image diffusion model to video
Temporal layers are inserted into an existing text-to-image diffusion model, and either only the new layers are fine-tuned on video data or the model is adapted without any training. Because this reuses the prior knowledge learned from text-image pair data, it eases the shortage of text-video pair data. Representative examples include Make-A-Video, Tune-A-Video, Gen-1, Video LDM and Stable Video Diffusion, Lumiere, Text2Video-Zero, and ControlVideo.
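The inflation idea can be illustrated with a small NumPy sketch. It is only a toy under stated assumptions, not the architecture of any of the models listed: `spatial_layer` stands in for a frozen pre-trained image layer (here a per-pixel channel mix), `temporal_layer` is a hypothetical new layer that mixes frames with a `(T, T)` weight matrix, and the identity initialization of that matrix is one possible choice assumed here so that the inflated block initially behaves like the image model applied frame by frame.

```python
import numpy as np

def spatial_layer(frames, w_spatial):
    # Stand-in for a frozen pre-trained image layer: per-pixel channel mixing.
    # frames: (N, H, W, C), w_spatial: (C, C)
    return frames @ w_spatial

def temporal_layer(video, w_temporal):
    # Hypothetical new layer: mixes information across frames at each location.
    # video: (B, T, H, W, C), w_temporal: (T, T)
    return np.einsum("st,bthwc->bshwc", w_temporal, video)

def inflated_block(video, w_spatial, w_temporal):
    b, t, h, w, c = video.shape
    frames = video.reshape(b * t, h, w, c)      # fold time into the batch axis
    frames = spatial_layer(frames, w_spatial)   # image prior, applied per frame
    video = frames.reshape(b, t, h, w, c)       # unfold time again
    return temporal_layer(video, w_temporal)    # the only part that would be trained

rng = np.random.default_rng(0)
video = rng.standard_normal((2, 4, 8, 8, 3))    # toy batch: B=2, T=4, H=W=8, C=3
w_spatial = rng.standard_normal((3, 3))         # pretend these weights are pre-trained
w_identity = np.eye(4)                          # assumed identity init over frames
w_average = np.full((4, 4), 0.25)               # a mixing that averages all frames

out_identity = inflated_block(video, w_spatial, w_identity)
out_average = inflated_block(video, w_spatial, w_average)
print(out_identity.shape)
```

With the identity weights, the block output equals the image layer applied to each frame independently; with the averaging weights, every output frame becomes the same temporal mean, which shows that only the temporal weights control how frames interact.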

Video diffusion in equations

In video diffusion, a video tensor $\mathbf{x}_{1:T}$ is generated instead of an image $x$. The forward process can be written as adding noise to the entire video.

$$q(\mathbf{x}_t\mid \mathbf{x}_0)=\mathcal{N}\left(\mathbf{x}_t;\sqrt{\bar{\alpha}_t}\mathbf{x}_0,(1-\bar{\alpha}_t)\mathbf{I}\right)$$
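A minimal sketch of sampling from this forward process, using the equivalent form $\mathbf{x}_t=\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. The linear beta schedule, the function names `make_alpha_bar` and `q_sample`, and the toy video shape are assumptions made for illustration; the source does not specify a schedule.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    # Assumed linear beta schedule; alpha_bar[t] is the cumulative product of (1 - beta).
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    # Draw x_t ~ q(x_t | x_0) for the whole video tensor at diffusion step t.
    eps = rng.standard_normal(x0.shape)   # independent noise for every element
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(8, 16, 16, 3))  # toy clean video: frames, H, W, C
x_t, eps = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
print(x_t.shape)
```

All frames share the same diffusion step and receive noise at once, which is what "adding noise to the entire video" means in this formulation.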

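Here, $\mathbf{x}_0\in\mathbb{R}^{T\times H\times W\times C}$ is the clean video. Note that $T$ is the number of frames, whereas the subscript $t$ is the diffusion timestep. The noise prediction loss is as follows.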

$$\mathcal{L}=\mathbb{E}_{t,\epsilon,\mathbf{x}_0}\left[\|\epsilon-\epsilon_\theta(\mathbf{x}_t,t,c)\|^2\right]$$

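Here $\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ is the noise added in the forward process and $c$ is the conditioning input (for example, a text prompt). This expression has the same form as in image diffusion, but the network must handle correlations along the time axis as well as the spatial axes. Video models are hard because, in addition to per-frame image quality, object identity and motion must remain consistent over time.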
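The loss can be estimated with a single random draw of $t$ and $\epsilon$, as in the rough sketch below. `toy_eps_theta` is a hypothetical placeholder for the denoising network (it always predicts zero noise), the linear beta schedule is an assumption, and the squared norm is averaged over elements rather than summed, which only rescales the loss.

```python
import numpy as np

def toy_eps_theta(x_t, t, c):
    # Hypothetical placeholder for the denoising network: always predicts zero noise.
    return np.zeros_like(x_t)

def noise_prediction_loss(x0, c, alpha_bar, eps_theta, rng):
    # One-sample Monte Carlo estimate of E[||eps - eps_theta(x_t, t, c)||^2],
    # averaged over tensor elements instead of summed.
    t = int(rng.integers(0, len(alpha_bar)))   # uniformly sampled diffusion step
    eps = rng.standard_normal(x0.shape)        # target noise
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    residual = eps - eps_theta(x_t, t, c)
    return float(np.mean(residual ** 2))

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 2e-2, 1000))  # assumed schedule
x0 = rng.uniform(-1.0, 1.0, size=(8, 16, 16, 3))             # toy video: frames, H, W, C
loss = noise_prediction_loss(x0, c=None, alpha_bar=alpha_bar,
                             eps_theta=toy_eps_theta, rng=rng)
print(loss)
```

Because the placeholder ignores its input, the loss here is simply the mean squared noise value, close to 1. A real network would take the full `(T, H, W, C)` tensor and would have to exploit temporal as well as spatial structure to bring the loss down.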
