Reconstruction Guidance for Video Extension

Video diffusion models often need to sample one video $\mathbf{x}^b$ conditioned on another video $\mathbf{x}^a$, for example to extend an already generated video in time, or to fill in the frames between those of a low-frame-rate video to raise its frame rate. Reconstruction guidance is a method for this kind of conditional sampling, proposed in Video Diffusion Models (VDM; Ho, Salimans et al., 2022).

Purpose

$\mathbf{x}^b$ is treated either as an autoregressive extension of $\mathbf{x}^a$ or as the missing frames that sit between the frames of a low-frame-rate $\mathbf{x}^a$. Sampling $\mathbf{x}^b$ therefore has to be conditioned on $\mathbf{x}^a$ in addition to its own noisy latent.
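As a small illustration of the two settings, the sketch below builds the frame indices that would belong to $\mathbf{x}^a$ and $\mathbf{x}^b$ in each case. The function names and the even spacing of known frames are assumptions made for this example; they are not taken from VDM.

```python
def extension_split(num_known, num_new):
    """Autoregressive extension: x^a is the known prefix, x^b the frames after it."""
    a_idx = list(range(num_known))
    b_idx = list(range(num_known, num_known + num_new))
    return a_idx, b_idx


def interpolation_split(num_known, factor):
    """Frame-rate upsampling: x^a frames sit every `factor` steps, x^b fills the gaps."""
    total = (num_known - 1) * factor + 1
    a_idx = list(range(0, total, factor))
    b_idx = [i for i in range(total) if i % factor != 0]
    return a_idx, b_idx


# Hypothetical usage: 3 known frames, then either 2 new frames or 2x interpolation.
ext_a, ext_b = extension_split(num_known=3, num_new=2)
int_a, int_b = interpolation_split(num_known=3, factor=2)
```

In both cases the model denoises all frames jointly; only the frames indexed by `b_idx` are actually sampled, while those in `a_idx` supply the conditioning signal.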

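Conditional expectation

VDM writes the conditional expectation of $\mathbf{x}^b$ given $\mathbf{x}^a$ as a sum of two terms, where $\mathbf{z}_t = [\mathbf{z}_t^a, \mathbf{z}_t^b]$ is the joint noisy latent: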

\mathbb{E}_q[\mathbf{x}^b \mid \mathbf{z}_t, \mathbf{x}^a] = \mathbb{E}_q[\mathbf{x}^b \mid \mathbf{z}_t] + \frac{\sigma_t^2}{\alpha_t}\, \nabla_{\mathbf{z}_t^b} \log q(\mathbf{x}^a \mid \mathbf{z}_t)

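The second term is a guidance term: it points in the direction in which $\mathbf{x}^a$ becomes easier to reconstruct from $\mathbf{z}_t$. Because $q(\mathbf{x}^a \mid \mathbf{z}_t)$ has no known closed form, it is approximated by the following Gaussian: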

q(\mathbf{x}^a \mid \mathbf{z}_t) \approx \mathcal{N}\!\left(\hat{\mathbf{x}}^a_\theta(\mathbf{z}_t),\, \frac{\sigma_t^2}{\alpha_t^2}\mathbf{I}\right)
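As a worked step, the Gaussian approximation can be substituted into the guidance term. This is a sketch that follows from the two formulas above; the normalizing constant is dropped because it does not depend on $\mathbf{z}_t$. The log-density becomes a scaled squared reconstruction error:

\log q(\mathbf{x}^a \mid \mathbf{z}_t) \approx -\frac{\alpha_t^2}{2\sigma_t^2}\, \|\mathbf{x}^a - \hat{\mathbf{x}}^a_\theta(\mathbf{z}_t)\|_2^2 + \text{const}

Multiplying its gradient by $\sigma_t^2 / \alpha_t$ cancels the noise variance and leaves a single factor of $\alpha_t$:

\frac{\sigma_t^2}{\alpha_t}\, \nabla_{\mathbf{z}_t^b} \log q(\mathbf{x}^a \mid \mathbf{z}_t) \approx -\frac{\alpha_t}{2}\, \nabla_{\mathbf{z}_t^b}\, \|\mathbf{x}^a - \hat{\mathbf{x}}^a_\theta(\mathbf{z}_t)\|_2^2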

Adjusted denoising model

Under this approximation, the denoising model for $\mathbf{x}^b$ is adjusted as follows:

\tilde{\mathbf{x}}^b_\theta(\mathbf{z}_t) = \hat{\mathbf{x}}^b_\theta(\mathbf{z}_t) - \frac{w_r \alpha_t}{2}\, \nabla_{\mathbf{z}_t^b}\, \|\mathbf{x}^a - \hat{\mathbf{x}}^a_\theta(\mathbf{z}_t)\|_2^2

Here $\hat{\mathbf{x}}^a_\theta$ and $\hat{\mathbf{x}}^b_\theta$ are the denoising model's reconstructions of $\mathbf{x}^a$ and $\mathbf{x}^b$, and $w_r$ is the guidance weight. Using a large weight, $w_r > 1$, is reported to improve sample quality.
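The adjusted denoising model can be sketched in a few lines of Python. This is a toy under stated assumptions, not the VDM implementation: each frame is a single float, `toy_denoiser` is a hypothetical stand-in for $\hat{\mathbf{x}}_\theta$ that simply rescales and smooths neighbouring frames, and the gradient is taken by central finite differences where a real model would use backpropagation.

```python
def toy_denoiser(z, alpha_t):
    """Hypothetical stand-in for x_hat_theta: rescale by alpha_t, then smooth over time."""
    x = [v / alpha_t for v in z]
    n = len(x)
    return [
        0.25 * x[max(i - 1, 0)] + 0.5 * x[i] + 0.25 * x[min(i + 1, n - 1)]
        for i in range(n)
    ]


def recon_loss(z, x_a, a_idx, alpha_t):
    """Squared error between x^a and the model's reconstruction of the x^a frames."""
    x_hat = toy_denoiser(z, alpha_t)
    return sum((x_a[k] - x_hat[i]) ** 2 for k, i in enumerate(a_idx))


def grad_wrt_b(z, x_a, a_idx, b_idx, alpha_t, eps=1e-5):
    """Central finite-difference gradient of recon_loss with respect to z_t^b."""
    grad = []
    for j in b_idx:
        z_plus, z_minus = list(z), list(z)
        z_plus[j] += eps
        z_minus[j] -= eps
        diff = recon_loss(z_plus, x_a, a_idx, alpha_t) - recon_loss(z_minus, x_a, a_idx, alpha_t)
        grad.append(diff / (2 * eps))
    return grad


def adjusted_x_b(z, x_a, a_idx, b_idx, alpha_t, w_r):
    """x_tilde^b = x_hat^b - (w_r * alpha_t / 2) * grad_{z^b} ||x^a - x_hat^a||^2."""
    x_hat = toy_denoiser(z, alpha_t)
    grad = grad_wrt_b(z, x_a, a_idx, b_idx, alpha_t)
    return [x_hat[j] - 0.5 * w_r * alpha_t * g for j, g in zip(b_idx, grad)]


# Example: frames 0-1 are the conditioning video x^a, frames 2-3 are x^b.
z_t = [0.9, 1.1, 0.2, -0.3]
x_b_guided = adjusted_x_b(z_t, x_a=[1.0, 1.0], a_idx=[0, 1], b_idx=[2, 3], alpha_t=0.8, w_r=2.0)
```

The correction is nonzero only because the reconstruction of the $\mathbf{x}^a$ frames depends on $\mathbf{z}_t^b$ through the joint model. In this toy, frame 3 never influences the reconstructed $\mathbf{x}^a$ frames, so its estimate is left unchanged, while frame 2 is adjusted.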

Application to spatial super-resolution

The same mechanism also applies when generating a high-resolution video conditioned on a low-resolution one. Reconstruction guidance thus provides a natural conditioning mechanism for both temporal extension and spatial super-resolution.
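One way to picture the super-resolution case is to compare the low-resolution conditioning video with a downsampled copy of the model's high-resolution reconstruction. The sketch below does this under assumptions that are not spelled out in this section: the signal is a 1D list of floats, `downsample` is average pooling over groups whose size divides the signal length, `toy_denoiser` is a hypothetical stand-in for the model, and finite differences replace backpropagation.

```python
def downsample(x, factor=2):
    """Average-pool non-overlapping groups of `factor` samples (len(x) must divide evenly)."""
    return [sum(x[i:i + factor]) / factor for i in range(0, len(x), factor)]


def toy_denoiser(z, alpha_t):
    """Hypothetical stand-in for the high-resolution denoiser x_hat_theta."""
    return [v / alpha_t for v in z]


def lowres_loss(z, x_a, alpha_t):
    """Squared error between low-res x^a and the downsampled reconstruction."""
    x_hat_a = downsample(toy_denoiser(z, alpha_t))
    return sum((t - p) ** 2 for t, p in zip(x_a, x_hat_a))


def guided_highres(z, x_a, alpha_t, w_r, eps=1e-5):
    """Apply the reconstruction-guidance correction to every high-res sample."""
    x_hat = toy_denoiser(z, alpha_t)
    out = []
    for i in range(len(z)):
        z_plus = z[:i] + [z[i] + eps] + z[i + 1:]
        z_minus = z[:i] + [z[i] - eps] + z[i + 1:]
        grad_i = (lowres_loss(z_plus, x_a, alpha_t) - lowres_loss(z_minus, x_a, alpha_t)) / (2 * eps)
        out.append(x_hat[i] - 0.5 * w_r * alpha_t * grad_i)
    return out


# Hypothetical usage: 4 high-res samples guided by 2 low-res samples.
x_guided = guided_highres([0.4, 0.8, -0.2, 0.6], x_a=[1.0, 0.0], alpha_t=0.8, w_r=2.0)
```

In this linear toy, setting `w_r = 2.0` makes the downsampled output match `x_a` up to numerical error. That is a property of the toy denoiser and the pooling operator chosen here, not a claim about real models.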
