Stable Diffusion

Stable Diffusion is a Latent Diffusion Model that generates images from a text prompt. Instead of running diffusion directly in pixel space, it denoises in a latent space compressed by a VAE, which lets it generate high-resolution images relatively efficiently.

Stable Diffusion architecture

Conceptual diagram drawn by the author. Stable Diffusion consists of a VAE, a denoising U-Net that operates in latent space, and a CLIP-based Text Encoder. The text condition is injected through Cross-Attention inside the U-Net.

Three main components

Stable Diffusion is built from three main components.

Component	Role
VAE	Compresses an image into a latent and decodes a latent back into an image
Denoising U-Net	Predicts the noise in a noisy latent and removes it step by step
Text Encoder	Converts the prompt into text embeddings and passes them to the U-Net as the condition

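Why run diffusion in latent space?

Images in pixel space are high-dimensional. For example, working directly on a 512 × 512 × 3 image is computationally expensive. Stable Diffusion compresses the image into a low-dimensional latent with the VAE encoder and learns the diffusion process on that latent.

This makes the computation lighter than directly denoising every detail in pixel space.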
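
To make the size difference concrete, the following sketch counts the values in a pixel-space image and in its latent. The 8x spatial downsampling factor and the 4 latent channels are assumptions based on the Stable Diffusion v1 VAE; they are not stated in the text above.

```python
# Compare the number of values in pixel space and in latent space.
# Assumption: an SD v1-style VAE with 8x spatial downsampling and 4 latent channels.

def num_elements(height: int, width: int, channels: int) -> int:
    """Return the number of scalar values in an H x W x C tensor."""
    return height * width * channels

DOWNSAMPLE_FACTOR = 8  # assumed VAE spatial downsampling factor
LATENT_CHANNELS = 4    # assumed number of latent channels

pixel_size = num_elements(512, 512, 3)
latent_size = num_elements(
    512 // DOWNSAMPLE_FACTOR, 512 // DOWNSAMPLE_FACTOR, LATENT_CHANNELS
)
ratio = pixel_size / latent_size

print(pixel_size, latent_size, ratio)  # 786432 16384 48.0
```

Under these assumptions the U-Net processes 48 times fewer values per denoising step than it would in pixel space.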

Diffusion process

During training, noise is added to the clean latent $z_0$ to produce $z_t$.

$$z_t = \sqrt{\bar{\alpha}_t}\,z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
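
The noising equation can be sketched in NumPy as follows. The linear beta schedule, its endpoint values, the 1000 steps, and the names `make_alpha_bar` and `add_noise` are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def make_alpha_bar(num_steps: int = 1000,
                   beta_start: float = 1e-4,
                   beta_end: float = 0.02) -> np.ndarray:
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0: np.ndarray, t: int, alpha_bar: np.ndarray,
              rng: np.random.Generator):
    """Return (z_t, eps) with z_t = sqrt(ab_t) * z0 + sqrt(1 - ab_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return z_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.standard_normal((4, 64, 64))  # stand-in for a clean latent

z_early, eps_early = add_noise(z0, 10, alpha_bar, rng)   # still close to z0
z_late, eps_late = add_noise(z0, 999, alpha_bar, rng)    # almost pure noise
```

Because $\bar{\alpha}_t$ shrinks as $t$ grows, `z_early` stays close to `z0` while `z_late` is dominated by the noise term.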

The U-Net takes the noisy latent $z_t$, the timestep $t$, and the text embedding $c$ as input and predicts the added noise $\epsilon$.

$$\mathcal{L} = \mathbb{E}_{z_0, t, \epsilon}\left[\|\epsilon - \epsilon_\theta(z_t, t, c)\|^2\right]$$
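
As a rough illustration of this objective, the sketch below estimates the loss on one batch. The `zero_model` placeholder stands in for the U-Net $\epsilon_\theta$, and the schedule, batch shapes, and function names are assumptions made for the example.

```python
import numpy as np

def noise_prediction_loss(eps_model, z0, c, alpha_bar, rng):
    """Monte Carlo estimate of the noise-prediction loss on one batch.

    Uses the mean squared error, i.e. the squared norm averaged over elements.
    """
    batch = z0.shape[0]
    t = rng.integers(0, len(alpha_bar), size=batch)   # one timestep per sample
    eps = rng.standard_normal(z0.shape)               # ground-truth noise
    ab = alpha_bar[t].reshape(batch, 1, 1, 1)         # broadcast over C, H, W
    z_t = np.sqrt(ab) * z0 + np.sqrt(1.0 - ab) * eps  # noisy latent
    eps_hat = eps_model(z_t, t, c)                    # predicted noise
    return float(np.mean((eps - eps_hat) ** 2))

def zero_model(z_t, t, c):
    """Hypothetical placeholder for the U-Net: always predicts zero noise."""
    return np.zeros_like(z_t)

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
z0 = rng.standard_normal((8, 4, 16, 16))  # stand-in clean latents
c = rng.standard_normal((8, 77, 768))     # stand-in text embeddings

loss = noise_prediction_loss(zero_model, z0, c, alpha_bar, rng)
print(loss)
```

A model that always predicts zero scores roughly 1.0 here, which is the variance of the unit Gaussian noise; training aims to push the loss below that baseline.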

Generation runs in reverse: it starts from random latent noise and denoises step by step using the noise predicted by the U-Net. Finally, the VAE decoder converts the latent back into an RGB image.
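
The generation loop can be sketched as below. The text does not name a sampler, so this example assumes a deterministic DDIM-style update; the schedule, the step count, and the `make_oracle` toy model (which knows the target latent and exists only to make the loop checkable) are also assumptions. VAE decoding is omitted.

```python
import numpy as np

def ddim_step(z_t, eps_hat, ab_t, ab_prev):
    """One deterministic DDIM-style update (eta = 0)."""
    # Estimate the clean latent implied by the predicted noise.
    z0_pred = (z_t - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)
    # Re-noise that estimate to the previous (less noisy) level.
    return np.sqrt(ab_prev) * z0_pred + np.sqrt(1.0 - ab_prev) * eps_hat

def sample(eps_model, c, shape, alpha_bar, num_steps, rng):
    """Start from Gaussian noise and denoise over num_steps timesteps."""
    timesteps = np.linspace(len(alpha_bar) - 1, 0, num_steps).round().astype(int)
    z = rng.standard_normal(shape)
    for i, t in enumerate(timesteps):
        ab_t = alpha_bar[t]
        # After the last step the latent is treated as noise-free.
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        z = ddim_step(z, eps_model(z, t, c), ab_t, ab_prev)
    return z  # a real pipeline would now pass this latent to the VAE decoder

def make_oracle(target, alpha_bar):
    """Hypothetical stand-in for the U-Net that knows the target latent."""
    def eps_model(z_t, t, c):
        return (z_t - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])
    return eps_model

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
target = rng.standard_normal((4, 8, 8))
result = sample(make_oracle(target, alpha_bar), None, target.shape,
                alpha_bar, num_steps=20, rng=rng)
```

With the oracle the loop recovers `target`; with a trained U-Net, the same loop would instead produce a latent consistent with the prompt.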

What happens inside the U-Net

The Stable Diffusion U-Net is an encoder-decoder network for image generation. It has a downsampling path, a middle block, and an upsampling path, and skip connections preserve fine detail.

The timestep enters the ResBlocks as a condition. The text prompt is embedded by the Text Encoder and enters the U-Net's attention blocks through Cross-Attention.
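
One common way to feed a timestep into a ResBlock is to turn it into a sinusoidal embedding, project it to one value per channel, and add it to the feature map. The sketch below assumes that scheme; the text does not specify it, the real blocks also pass the embedding through learned layers, and `timestep_embedding`, `condition_features`, and `proj` are hypothetical names.

```python
import numpy as np

def timestep_embedding(t: int, dim: int, max_period: float = 10000.0) -> np.ndarray:
    """Sinusoidal embedding of a scalar timestep (dim is assumed to be even)."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

def condition_features(h: np.ndarray, t_emb: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Add a per-channel shift derived from the timestep embedding.

    h: (C, H, W) feature map, t_emb: (dim,), proj: (C, dim) stand-in for a
    learned linear layer.
    """
    shift = proj @ t_emb               # one value per channel, shape (C,)
    return h + shift[:, None, None]    # broadcast over the spatial axes

rng = np.random.default_rng(0)
h = rng.standard_normal((16, 8, 8))    # stand-in ResBlock features
proj = rng.standard_normal((16, 32))   # stand-in projection weights
t_emb = timestep_embedding(500, 32)
out = condition_features(h, t_emb, proj)
```

Every spatial location in a channel receives the same shift, so the timestep acts as a global signal telling the block how noisy its input is.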

Self-Attention and Cross-Attention

The attention blocks in Stable Diffusion rely on two mechanisms: Self-Attention, where latent features attend to each other, and Cross-Attention, where latent features attend to the text embedding.

Attention	Q	K / V	What it does
Self-Attention	latent feature	latent feature	Relates distant regions within the image
Cross-Attention	latent feature	text embedding	Reflects the meaning of the prompt in the generated image

Through Cross-Attention, a text condition such as "a red car" influences which information is written to which latent location.
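
A single-head Cross-Attention layer can be sketched in NumPy as follows. The shapes (64 latent locations, 77 tokens of width 768) and the random projection weights are assumptions for illustration; a real U-Net uses learned, multi-head projections.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent, text, w_q, w_k, w_v):
    """Q comes from latent features; K and V come from text embeddings."""
    q = latent @ w_q                           # (N, d)
    k = text @ w_k                             # (T, d)
    v = text @ w_v                             # (T, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, T)
    weights = softmax(scores, axis=-1)         # each location attends over tokens
    return weights @ v, weights

rng = np.random.default_rng(0)
num_locations, latent_dim = 64, 32   # e.g. an 8 x 8 feature map, flattened
num_tokens, text_dim, d = 77, 768, 16

latent = rng.standard_normal((num_locations, latent_dim))
text = rng.standard_normal((num_tokens, text_dim))
w_q = rng.standard_normal((latent_dim, d)) / np.sqrt(latent_dim)
w_k = rng.standard_normal((text_dim, d)) / np.sqrt(text_dim)
w_v = rng.standard_normal((text_dim, d)) / np.sqrt(text_dim)

out, weights = cross_attention(latent, text, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (64, 16) (64, 77)
```

Each row of `weights` shows how strongly one latent location attends to each prompt token. Feeding latent-derived features in place of `text` (with matching weight shapes) turns the same function into Self-Attention.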

Text Encoder

In the Stable Diffusion v1 series, the CLIP text encoder converts the prompt into token embeddings. The U-Net attends to these embeddings through Cross-Attention. In other words, Stable Diffusion does not simply use the text as a class label; it injects the token-level representation of the prompt into the denoising process.

Classifier-Free Guidance

Stable Diffusion commonly uses Classifier-Free Guidance (CFG) to control how closely the output follows the prompt. CFG amplifies the difference between the conditional and unconditional predictions.

$$\epsilon_{\mathrm{guided}} = \epsilon_{\mathrm{uncond}} + s\left(\epsilon_{\mathrm{cond}} - \epsilon_{\mathrm{uncond}}\right)$$

Here $s$ is the guidance scale. Larger values make the output follow the prompt more closely, but setting it too high can make images look unnatural.
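
The guidance formula is a one-line combination of two U-Net predictions. In the sketch below, the two small arrays are made-up stand-ins for the unconditional and conditional noise predictions, and 7.5 is used only as an example scale.

```python
import numpy as np

def classifier_free_guidance(eps_uncond: np.ndarray,
                             eps_cond: np.ndarray,
                             scale: float) -> np.ndarray:
    """eps_guided = eps_uncond + scale * (eps_cond - eps_uncond)."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Made-up predictions for illustration only.
eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 1.0])

guided_0 = classifier_free_guidance(eps_uncond, eps_cond, 0.0)    # equals eps_uncond
guided_1 = classifier_free_guidance(eps_uncond, eps_cond, 1.0)    # equals eps_cond
guided_75 = classifier_free_guidance(eps_uncond, eps_cond, 7.5)   # extrapolates

print(guided_75)  # [7.5 1. ]
```

A scale of 1 reproduces the conditional prediction, and values above 1 extrapolate past it along the direction in which the prompt changed the prediction. Components where the two predictions agree are left unchanged.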

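Differences between training and generation

Phase	Inputs	Goal
Training	clean image, noise, timestep, prompt	Learn to predict the noise in a noisy latent
Generation	random latent noise, prompt	Remove noise step by step to produce an image latent

During training, the ground-truth noise is known. During generation there is no ground truth, so the latent is updated over many steps using the U-Net's predictions.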

Why Stable Diffusion is easy to work with

Computing in latent space keeps high-resolution generation tractable.
The Text Encoder and Cross-Attention make prompt conditioning flexible.
It works well with extensions such as LoRA, ControlNet, and Textual Inversion.
Libraries such as Diffusers make implementation and experimentation easy.

Main sources

Qiita, 世界に衝撃を与えた画像生成AI「Stable Diffusion」を徹底解説！: https://qiita.com/omiita/items/ecf8d60466c50ae8295b
High-Resolution Image Synthesis with Latent Diffusion Models: https://arxiv.org/abs/2112.10752
Stable Diffusion GitHub: https://github.com/CompVis/stable-diffusion
Diffusers documentation: https://huggingface.co/docs/diffusers/index
