Score Distillation Sampling

Score Distillation Sampling (SDS) is a technique that optimizes a 3D representation by using a pretrained 2D diffusion model as a source of gradients. It was introduced in DreamFusion and became the starting point for subsequent text-to-3D research.

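What problem it solves

Given parameters $\theta$ that represent a 3D scene (for example a NeRF or a set of Gaussians), the goal is to optimize $\theta$ so that the image rendered from any camera looks like a natural image that matches the prompt.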

The basic SDS formula

Noise $\epsilon$ is added to the image $x = g(\theta)$ rendered from some camera, and the prediction residual of the diffusion model $\epsilon_\phi$ is used as the gradient signal.

$$
\nabla_\theta \mathcal{L}_{\text{SDS}} = \mathbb{E}_{t, \epsilon}\!\left[ w(t)\, \bigl(\epsilon_\phi(x_t; y, t) - \epsilon\bigr)\, \frac{\partial x}{\partial \theta} \right]
$$

Here,

$x_t = \alpha_t x + \sigma_t \epsilon$ is the noisy image
$y$ is the text prompt
$w(t)$ is a timestep-dependent weight
$\partial x / \partial \theta$ is the Jacobian of the differentiable renderer

The difference from ordinary denoising score matching is that the diffusion model's weights $\phi$ stay frozen and only the 3D parameters $\theta$ are updated.
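As a worked sketch of where the SDS gradient comes from: differentiating the usual diffusion training loss with respect to $\theta$ through $x = g(\theta)$ produces an extra factor, the Jacobian of the denoiser with respect to its input. SDS, as described in the DreamFusion paper, drops that factor. The name $\mathcal{L}_{\text{Diff}}$ is introduced here only for illustration, transposes and index ordering are suppressed, and the constant $2\alpha_t$ is assumed to be absorbed into $w(t)$ in the last line.

```latex
\begin{aligned}
\mathcal{L}_{\text{Diff}}(\theta)
  &= \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\lVert \epsilon_\phi(x_t; y, t) - \epsilon \rVert_2^2 \right],
  \qquad x_t = \alpha_t\, g(\theta) + \sigma_t \epsilon \\
\nabla_\theta \mathcal{L}_{\text{Diff}}
  &= \mathbb{E}_{t,\epsilon}\!\left[ 2\,\alpha_t\, w(t)\,\bigl(\epsilon_\phi(x_t; y, t) - \epsilon\bigr)\,
     \frac{\partial \epsilon_\phi(x_t; y, t)}{\partial x_t}\,
     \frac{\partial x}{\partial \theta} \right] \\
\nabla_\theta \mathcal{L}_{\text{SDS}}
  &= \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\bigl(\epsilon_\phi(x_t; y, t) - \epsilon\bigr)\,
     \frac{\partial x}{\partial \theta} \right]
\end{aligned}
```

Omitting $\partial \epsilon_\phi / \partial x_t$ means no backpropagation through the diffusion network is needed: the residual is computed with a single forward pass and then pushed through the renderer only.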

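Why this works

Intuitively, SDS moves $\theta$ so that images rendered from random viewpoints land in high-density regions of the diffusion model's prior. Because the diffusion model knows what natural images look like, the 3D representation takes shape in the direction that brings its renders closer to that distribution.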
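The intuition above can be checked on a toy problem. The following is a minimal sketch, not DreamFusion's implementation. It assumes a 1D "image" with an identity renderer ($x = \theta$, so $\partial x / \partial \theta = 1$), a data distribution $\mathcal{N}(\mu, s^2)$ for which the optimal noise predictor has a closed form, a cosine noise schedule, and the weight $w(t) = \sigma_t^2$. All function names and constants are hypothetical.

```python
import math
import random


def noise_schedule(t):
    """Variance-preserving cosine schedule (an assumption): alpha^2 + sigma^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def eps_model(x_t, t, mu=2.0, s=0.5):
    """Stand-in for a frozen diffusion model.

    For data x0 ~ N(mu, s^2), the optimal predictor is
    E[eps | x_t] = sigma * (x_t - alpha * mu) / (alpha^2 s^2 + sigma^2).
    """
    alpha, sigma = noise_schedule(t)
    var_t = alpha ** 2 * s ** 2 + sigma ** 2
    return sigma * (x_t - alpha * mu) / var_t


def sds_grad(theta, rng, n_samples=64):
    """Monte Carlo estimate of the SDS gradient for the identity renderer."""
    total = 0.0
    for _ in range(n_samples):
        t = rng.uniform(0.02, 0.98)      # random timestep
        eps = rng.gauss(0.0, 1.0)        # random noise
        alpha, sigma = noise_schedule(t)
        x = theta                        # "render": x = g(theta) = theta
        x_t = alpha * x + sigma * eps    # noisy image
        w = sigma ** 2                   # weighting w(t), an assumed choice
        dx_dtheta = 1.0                  # renderer Jacobian
        total += w * (eps_model(x_t, t) - eps) * dx_dtheta
    return total / n_samples


def optimize(theta0=-1.0, steps=500, lr=0.1, seed=0):
    """Plain gradient descent on theta; the model parameters never change."""
    rng = random.Random(seed)
    theta = theta0
    for _ in range(steps):
        theta -= lr * sds_grad(theta, rng)
    return theta


if __name__ == "__main__":
    print(optimize())
```

In expectation the residual is proportional to $\theta - \mu$, so the update pulls $\theta$ toward the mode of the toy data distribution. It also illustrates mode-seeking behavior in the simplest case: $\theta$ settles at a single point rather than reproducing the spread $s$ of the distribution.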

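Challenges and extensions

SDS has the following known problems.

Problem	Description
Mode-seeking	Tends to converge to a single "average-looking" result, with little diversity
Janus problem	A face appears in multiple views; symmetry breaks down
Over-saturated colors	Excess saturation caused by combining strong CFG (classifier-free guidance) with SDS
Slow	Requires per-asset optimization

To address these, methods such as VSD (Variational Score Distillation, from ProlificDreamer), CSD, and ISM have been proposed.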
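To make the over-saturation entry in the table above more concrete, here is a small sketch of how classifier-free guidance enters the SDS residual. It assumes one common CFG convention, $\hat\epsilon = \epsilon_{\text{uncond}} + s\,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$, and operates on plain Python lists standing in for noise predictions. The function names are hypothetical and do not come from any library.

```python
def cfg_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance under the assumed convention:
    extrapolate from the unconditional toward the conditional prediction."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def sds_residual(eps_uncond, eps_cond, eps, scale):
    """Residual (eps_hat - eps) that SDS pushes through the renderer."""
    guided = cfg_noise(eps_uncond, eps_cond, scale)
    return [g - e for g, e in zip(guided, eps)]


if __name__ == "__main__":
    eps_u, eps_c, eps = [0.0, 1.0], [0.5, 0.5], [0.25, 0.25]
    for s in (1.0, 4.0, 100.0):
        print(s, sds_residual(eps_u, eps_c, eps, s))
```

With `scale = 1.0` the guided prediction equals the conditional one. As the scale grows, the term $s\,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$ dominates the residual. DreamFusion reportedly uses a guidance weight on the order of 100, far larger than typical values for image sampling, which is consistent with the saturation issue listed in the table.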

Main sources

DreamFusion paper: https://arxiv.org/abs/2209.14988
ProlificDreamer (VSD): https://arxiv.org/abs/2305.16213
