VQ-VAE

VQ-VAE, short for Vector Quantized Variational Autoencoder, is a VAE-family model that uses a discrete latent representation instead of continuous latent variables.

Basic idea

In a standard VAE, the latent variable $z$ is sampled from a continuous distribution such as a Gaussian. In VQ-VAE, by contrast, the encoder output is quantized to the nearest embedding vector in a codebook.

Figure: VQ-VAE. Image source: Lilian Weng, “From Autoencoder to Beta-VAE”. The encoder output is assigned to a codebook embedding, and the decoder generates the reconstruction from that embedding.

Codebook

VQ-VAE maintains a codebook $e = \{e_1, e_2, \dots, e_K\}$ of $K$ embedding vectors. The embedding closest to the encoder output $z_e(x)$ is selected and used as the quantized latent $z_q(x)$.

$$
z_q(x) = e_k, \quad k = \arg\min_j \|z_e(x) - e_j\|_2
$$

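This discrete latent representation makes it easier to model data such as images, audio, and language, whose underlying structure can often be described by discrete symbols (language, for example, is inherently discrete).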
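The nearest-neighbor lookup defined above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the helper names `squared_l2` and `quantize` and the toy codebook values are assumptions made for this example, and a practical model would use an array library and operate on batches. Minimizing the squared distance selects the same index as minimizing the distance itself.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(z_e, codebook):
    """Return (k, e_k): the index and value of the codebook vector nearest to z_e.

    Indices are 0-based here, whereas the text numbers embeddings from 1 to K.
    """
    k = min(range(len(codebook)), key=lambda j: squared_l2(z_e, codebook[j]))
    return k, codebook[k]


# Hypothetical toy codebook with K = 3 two-dimensional embeddings.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
z_e = [0.8, 1.3]  # stand-in for an encoder output z_e(x)

k, z_q = quantize(z_e, codebook)
print(k, z_q)  # 1 [1.0, 1.0]
```

The encoder output is discarded after the lookup; only the index `k` (the discrete code) and the embedding `z_q` are passed on.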

VQ-VAE-2

VQ-VAE-2 extends VQ-VAE hierarchically. The top-level latents capture global structure, while the bottom-level latents capture local detail.
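The split between global structure and local detail can be illustrated with a deliberately simplified toy. The sketch below is not the VQ-VAE-2 architecture: it replaces the learned encoders with hypothetical stand-ins (average pooling for the top level, the leftover residual for the bottom level), uses scalar codebooks, and works on a 1-D signal. All names and values are assumptions chosen only to show one coarse code per block plus one fine code per sample.

```python
def nearest(value, codebook):
    """Return (index, entry) of the scalar codebook entry closest to value."""
    index = min(range(len(codebook)), key=lambda j: (value - codebook[j]) ** 2)
    return index, codebook[index]


def two_level_quantize(signal, top_codebook, bottom_codebook, factor):
    """Toy coarse-to-fine quantization of a 1-D signal.

    Hypothetical stand-ins: the "top encoder" is average pooling over blocks
    of `factor` samples, and the "bottom encoder" is the residual left after
    subtracting the top-level code from each sample.
    """
    top_indices, bottom_indices, reconstruction = [], [], []
    for start in range(0, len(signal), factor):
        block = signal[start:start + factor]
        # Top level: one code per block summarizes its overall level.
        top_index, top_value = nearest(sum(block) / len(block), top_codebook)
        top_indices.append(top_index)
        for sample in block:
            # Bottom level: one code per sample encodes the local deviation.
            bottom_index, bottom_value = nearest(sample - top_value, bottom_codebook)
            bottom_indices.append(bottom_index)
            reconstruction.append(top_value + bottom_value)
    return top_indices, bottom_indices, reconstruction


signal = [0.9, 1.1, 1.0, 1.2, 4.8, 5.1, 5.0, 4.9]
top, bottom, recon = two_level_quantize(
    signal, top_codebook=[1.0, 5.0], bottom_codebook=[-0.1, 0.0, 0.1], factor=4
)
print(top)     # [0, 1]
print(bottom)  # [0, 2, 1, 2, 0, 2, 1, 0]
```

Two top-level codes describe the coarse shape of the whole signal, while the eight bottom-level codes only have to describe small deviations around it.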

Figure: VQ-VAE-2. Image source: Lilian Weng, “From Autoencoder to Beta-VAE”. VQ-VAE-2 uses discrete latent variables at multiple levels.

Figure: VQ-VAE-2 algorithm. Image source: Lilian Weng, “From Autoencoder to Beta-VAE”. The figure shows the training and sampling procedures of VQ-VAE-2.

VQ-VAE codebook quantization in equations

In VQ-VAE, the encoder output $z_e(x)$ is not used as a continuous value. Instead, it is replaced by its nearest-neighbor vector in the codebook $\{\mathbf{e}_k\}_{k=1}^{K}$.

$$
k^*=\arg\min_k\|z_e(x)-\mathbf{e}_k\|_2^2, \qquad z_q(x)=\mathbf{e}_{k^*}
$$

The decoder reconstructs the input from the quantized latent $z_q$.

$$
\hat{x}=D_\theta(z_q(x))
$$

The training loss splits into three terms: reconstruction, codebook, and commitment.

$$
\mathcal{L}=\|x-\hat{x}\|^2 +\|\mathrm{sg}[z_e(x)]-\mathbf{e}_{k^*}\|^2 +\beta\|z_e(x)-\mathrm{sg}[\mathbf{e}_{k^*}]\|^2
$$

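Here $\mathrm{sg}[\cdot]$ is the stop-gradient operator: it acts as the identity in the forward pass but blocks gradients in the backward pass, and $\beta$ weights the commitment term. The intuition behind the loss is that the encoder produces a representation close to one of the codebook vectors, the codebook tracks the encoder outputs, and the decoder reconstructs the input from the discrete latent. Once the latent has been turned into discrete tokens, it combines easily with autoregressive models and transformers.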
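To make the role of each term concrete, the loss can be evaluated by hand on toy vectors. The sketch below is an illustration under stated assumptions, not a training implementation: the function names, the toy values, the learning rate, and the choice $\beta = 0.25$ are all picked for this example, and the gradients are written out analytically instead of using an autodiff library. Because the stop-gradient is the identity in the forward pass, the codebook and commitment terms share the same squared distance and differ only in which quantity receives the gradient.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def loss_terms(x, x_hat, z_e, e_k, beta):
    """Forward values of the reconstruction, codebook, and commitment terms."""
    distance = squared_l2(z_e, e_k)  # shared by the last two terms
    return squared_l2(x, x_hat), distance, beta * distance


def codebook_grad(z_e, e_k):
    """Gradient of ||sg[z_e] - e_k||^2 with respect to e_k (z_e held fixed)."""
    return [2.0 * (e - z) for z, e in zip(z_e, e_k)]


def commitment_grad(z_e, e_k, beta):
    """Gradient of beta * ||z_e - sg[e_k]||^2 with respect to z_e (e_k held fixed)."""
    return [2.0 * beta * (z - e) for z, e in zip(z_e, e_k)]


# Hypothetical toy values, chosen only for illustration.
x, x_hat = [1.0, 0.0], [0.5, 0.0]
z_e, e_k = [1.0, 2.0], [0.0, 0.0]
beta = 0.25  # assumed commitment weight

reconstruction, codebook, commitment = loss_terms(x, x_hat, z_e, e_k, beta)
print(reconstruction, codebook, commitment)  # 0.25 5.0 1.25

# One gradient-descent step on the codebook term moves e_k toward z_e.
learning_rate = 0.1  # assumed step size
e_k_updated = [e - learning_rate * g for e, g in zip(e_k, codebook_grad(z_e, e_k))]
print(squared_l2(e_k_updated, z_e) < squared_l2(e_k, z_e))  # True
```

The codebook gradient points from the encoder output toward the embedding, so descending it pulls the embedding toward the encoder output; the commitment gradient does the reverse for the encoder, scaled by $\beta$.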
