Video Diffusion Models | Haruk1y Wiki

Video Diffusion Models

Summarizes the overall picture of diffusion models for video generation, how they differ from image generation, and the two main approaches: training from scratch and adaptation.

Summarizes the difference between ε-prediction and v-prediction, which matters for video diffusion, and the derivation using angular coordinates.

Summarizes reconstruction guidance for generating new frames conditioned on an existing video.

Summarizes the 3D U-Net and Diffusion Transformer architectures used in video diffusion.

Summarizes Imagen Video's cascaded diffusion architecture, spatial and temporal super-resolution, and progressive distillation.

Summarizes Sora's spacetime patches, Diffusion Transformer architecture, and scaling strategy.

Summarizes Make-A-Video's pseudo-3D layers, temporal inflation, frame interpolation, and training pipeline.

Summarizes Tune-A-Video, which performs text-to-video editing by one-shot fine-tuning on a single video.

Summarizes how Runway's Gen-1 separates structure from content and performs depth-conditioned video editing.

Summarizes temporal layer insertion, the temporal autoencoder, and data curation in Video LDM and Stable Video Diffusion.

Summarizes Lumiere's Space-Time U-Net (STUNet) architecture and its single-pass generation of the whole clip without temporal super-resolution (TSR).

Summarizes Text2Video-Zero, which repurposes an image diffusion model for video generation without additional training.

Summarizes ControlNet-based, training-free text-to-video generation, along with cross-frame attention, the interleaved smoother, and the hierarchical sampler.
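
As a quick reference for the ε-prediction and v-prediction entry listed above, the relationship can be sketched as follows. This assumes a variance-preserving schedule with $\alpha_t^2 + \sigma_t^2 = 1$ and the commonly used symbols $x$ (clean data), $\epsilon$ (noise), $z_\phi$ (noised latent), and $\phi$ (angular coordinate). None of this notation is fixed by this index page, so the linked page may use different conventions.

```latex
% Assumed notation: \alpha_t = \cos\phi, \sigma_t = \sin\phi (variance-preserving)
\begin{align}
  z_\phi &= \cos\phi \, x + \sin\phi \, \epsilon \\
  v_\phi &\equiv \frac{d z_\phi}{d\phi} = \cos\phi \, \epsilon - \sin\phi \, x \\
  x &= \cos\phi \, z_\phi - \sin\phi \, v_\phi \\
  \epsilon &= \sin\phi \, z_\phi + \cos\phi \, v_\phi
\end{align}
```

Under these assumptions, a network that predicts $v_\phi$ determines both $x$ and $\epsilon$ from $z_\phi$ by a rotation, which is why the two parameterizations are interchangeable at inference time even though their training losses weight timesteps differently.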