Generative Image Evaluation Basics

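Generative image evaluation is a framework for measuring how good generated images are. In conventional classification, object detection, and segmentation, outputs can be compared against ground truth, but in generative tasks such as text-to-image, many valid outputs exist for the same prompt. Evaluation therefore treats visual quality, prompt alignment, diversity, and human preference as separate axes.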

Differences from conventional vision tasks

In conventional vision tasks, the correct labels or annotations are relatively well defined.

Task	Typical metrics	Ground truth
Image classification	accuracy	class label
Object detection	mAP, IoU	bounding box and class
Semantic segmentation	mIoU, pixel accuracy	pixel-level mask
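To make "comparison against ground truth" concrete, here is a minimal sketch of IoU for two axis-aligned bounding boxes. The function name `box_iou` and the `(x_min, y_min, x_max, y_max)` box layout are assumptions made for this illustration, not something the sources prescribe.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max).

    The box layout is an assumption for this sketch.
    """
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A score like this only makes sense because a ground-truth box exists. That is exactly what is missing in open-ended generation.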

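For generated images, by contrast, no single image can be designated as the correct answer for a prompt. For example, the prompt "an elephant flying through space" admits endless variation in composition, color, realism, and style. Generative image evaluation therefore relies not only on the difference from ground truth but also on the output distribution and on proxy metrics that approximate human perception.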

Reference-based and no-reference

Type	Typical metrics	Suitable tasks	Caveats
Reference-based	PSNR, SSIM, LPIPS	restoration, super-resolution, inpainting	Closeness to the original image is not always what makes an output good
No-reference	Aesthetic Score, CLIP-IQA, Q-Align	text-to-image, creative generation	Inherits the biases of the scoring model
Distribution-level	FID, FD-DINO, CMMD	comparing models as a whole	Requires many samples
Prompt-aware	CLIPScore, VQA Score, HPS, PickScore	prompt-conditioned generation	Text-image alignment and image quality get mixed together

Visual quality

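Visual quality asks whether an image looks natural, has few distortions or artifacts, and is visually pleasing. PSNR and SSIM measure pixel-level and structure-level closeness to the original image, respectively. LPIPS measures perceptual distance in a deep feature space. Aesthetic Score, CLIP-IQA, and Q-Align can estimate quality from a single image without a reference.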
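As an illustration of the pixel-level end of this spectrum, the following is a minimal sketch of PSNR using NumPy. It assumes 8-bit images, so the peak value `max_value` defaults to 255. The function name `psnr` is chosen for this example only.

```python
import numpy as np

def psnr(reference, output, max_value=255.0):
    """Peak signal-to-noise ratio in dB between a reference and an output image.

    Assumes both arrays have the same shape and share the value range
    [0, max_value]; 255 corresponds to 8-bit images.
    """
    reference = np.asarray(reference, dtype=np.float64)
    output = np.asarray(output, dtype=np.float64)
    mse = np.mean((reference - output) ** 2)
    if mse == 0:
        # Identical images: the ratio is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Higher values mean the output is closer to the reference pixel by pixel. This is why PSNR suits restoration tasks but says little about open-ended generation.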

Prompt alignment

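Prompt alignment asks whether the objects, attributes, relations, and style described in the prompt are reflected in the image. CLIPScore uses the cosine similarity between the image embedding and the text embedding. VQA-based metrics such as VQA Score and Gecko Score turn the prompt into one or more questions and measure alignment through yes / no judgments on the generated image.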
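The cosine-similarity step can be sketched as follows. The embeddings here are assumed to be precomputed vectors from some image and text encoder (no real CLIP model is loaded). The default `scale=2.5` and the clipping at zero follow the commonly cited CLIPScore formulation, and should be treated as assumptions of this sketch.

```python
import numpy as np

def cosine_similarity(image_emb, text_emb):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(image_emb, dtype=np.float64)
    b = np.asarray(text_emb, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_style_score(image_emb, text_emb, scale=2.5):
    """Rescaled, non-negative cosine similarity.

    scale=2.5 and the clipping at zero are assumptions taken from the
    usual CLIPScore definition; the embeddings are placeholders for the
    outputs of a real image / text encoder pair.
    """
    return scale * max(cosine_similarity(image_emb, text_emb), 0.0)
```

Because the score depends only on two embedding vectors, it inherits whatever the encoder does or does not represent. Counting and spatial relations are typical weak spots.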

Human preference

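Human preference metrics train a scoring model on pairwise preference data, that is, on records of which of two images people prefer. HPS, PickScore, and ImageReward are representative examples. They can account for image quality and prompt alignment at the same time, but they depend heavily on the tastes, styles, and domains present in their training data.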
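One common way to learn from pairwise preferences is a Bradley-Terry style logistic loss on the score difference. The sketch below illustrates that general idea under this assumption. It is not the exact training objective of HPS, PickScore, or ImageReward, and the function name is hypothetical.

```python
import numpy as np

def pairwise_preference_loss(score_preferred, score_rejected):
    """Mean negative log-likelihood of the observed preferences.

    Assumes a Bradley-Terry model: P(preferred beats rejected) is
    sigmoid(score_preferred - score_rejected). The inputs are the scores
    a (hypothetical) scoring model assigns to each image of a pair.
    """
    diff = (np.asarray(score_preferred, dtype=np.float64)
            - np.asarray(score_rejected, dtype=np.float64))
    # -log(sigmoid(diff)) == log(1 + exp(-diff)), evaluated stably.
    return float(np.mean(np.logaddexp(0.0, -diff)))
```

The loss shrinks as the model scores the human-preferred image higher than the rejected one. This is also why the learned score reflects the preferences of whoever produced the comparisons.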

Classification of representative metrics

Metric	Input	Unit of output	Lower is better?	Main use
PSNR	reference + output	image pair	No	reconstruction
SSIM	reference + output	image pair	No	reconstruction
LPIPS	reference + output	image pair	Yes	perceptual similarity
IS	generated set	model / set	No	GAN evaluation
FID	real set + generated set	model / set	Yes	distribution similarity
CLIPScore	prompt + output	image	No	prompt alignment
VQA Score	question + output	image	No	compositional prompt alignment
Aesthetic Score	output	image	No	visual quality
HPS / PickScore	prompt + output	image / pair	No	human preference

Using metrics in practice

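First, decide on the task. The appropriate metrics differ depending on whether it is text-to-image, inpainting, or video generation.
Next, check whether a reference is available.
Measure prompt alignment and visual quality separately.
If needed, build a composite score, but validate its weights by their correlation with human evaluation.
Finally, run a human evaluation, even a small one, to confirm that the automatic metrics are valid.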
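The weight-validation step can be sketched as a small search over candidate weights, scored by Spearman rank correlation with human ratings. Everything here is an assumption for illustration: the two-metric linear composite, the candidate grid, and the simple ranking, which ignores ties.

```python
import numpy as np

def ranks(values):
    """Rank positions (0 = smallest). Ties are not averaged in this sketch."""
    values = np.asarray(values, dtype=np.float64)
    order = np.argsort(values)
    result = np.empty(len(values), dtype=np.float64)
    result[order] = np.arange(len(values))
    return result

def spearman(a, b):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

def composite(alignment, quality, weight):
    """Linear blend of two per-image metrics (an assumed composite form)."""
    return (weight * np.asarray(alignment, dtype=np.float64)
            + (1.0 - weight) * np.asarray(quality, dtype=np.float64))

def best_weight(alignment, quality, human,
                candidates=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return (weight, correlation) for the candidate that tracks humans best."""
    results = [(w, spearman(composite(alignment, quality, w), human))
               for w in candidates]
    return max(results, key=lambda pair: pair[1])
```

Choosing the weight and reporting the correlation on the same images is optimistic. With enough human ratings, it is safer to pick the weight on one split and check the correlation on another.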

FID and precision / recall in formulas

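FID approximates the feature distributions of real and generated images as Gaussians and measures the Fréchet distance between them. Let $(\boldsymbol{\mu}_r,\boldsymbol{\Sigma}_r)$ be the mean and covariance of the real-image features and $(\boldsymbol{\mu}_g,\boldsymbol{\Sigma}_g)$ those of the generated-image features. FID can then be written as follows.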

$$
\mathrm{FID}=\|\boldsymbol{\mu}_r-\boldsymbol{\mu}_g\|_2^2+\mathrm{Tr}\left(\boldsymbol{\Sigma}_r+\boldsymbol{\Sigma}_g-2(\boldsymbol{\Sigma}_r\boldsymbol{\Sigma}_g)^{1/2}\right)
$$

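Intuitively, the formula measures whether, in feature space, the average location and spread of the generated images are close to those of the real images. Because it depends on the Gaussian approximation and on the feature extractor, however, it cannot capture every fine-grained failure mode.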
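The Fréchet distance between two sets of feature vectors can be sketched with NumPy alone. The sketch assumes the features have already been extracted (in practice usually by an Inception network, which is not loaded here) and have at least two dimensions. It computes the trace of the matrix square root from the eigenvalues of $\boldsymbol{\Sigma}_r\boldsymbol{\Sigma}_g$, one of several possible numerical choices.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_samples, feature_dim) and is assumed to be
    precomputed by some feature extractor.
    """
    feats_real = np.asarray(feats_real, dtype=np.float64)
    feats_gen = np.asarray(feats_gen, dtype=np.float64)

    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    sigma_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    sigma_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # Tr((Sigma_r Sigma_g)^(1/2)) is the sum of the square roots of the
    # eigenvalues of Sigma_r Sigma_g, which are real and non-negative for
    # covariance matrices. Small negative or imaginary parts are numerical noise.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))

    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(sigma_r) + np.trace(sigma_g)
                 - 2.0 * trace_sqrt)
```

Two identical feature sets give a distance of approximately zero. Shifting every feature by a constant changes only the mean term, which matches the "location and spread" reading of the formula.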

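In precision / recall style evaluation of generative models, precision asks whether generated samples lie on the real-data manifold, and recall asks how much of the real-data manifold the generated samples cover. Intuitively, a model with high precision produces few broken images, and a model with high recall produces diverse outputs.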
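One way to make "lies on the manifold" operational is to approximate each manifold by the union of k-nearest-neighbour balls around its samples. The sketch below follows that idea as an assumption; the text above does not fix a particular estimator. The choice `k=3` and the function names are illustrative, and each set must contain more than `k` points.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    dists = pairwise_distances(points, points)
    # After sorting, column 0 is the zero distance of a point to itself.
    return np.sort(dists, axis=1)[:, k]

def coverage(queries, support, k):
    """Fraction of queries inside the k-NN ball of at least one support point."""
    radii = knn_radii(support, k)
    dists = pairwise_distances(queries, support)
    return float(np.mean(np.any(dists <= radii[None, :], axis=1)))

def precision_recall(real_feats, gen_feats, k=3):
    """k-NN manifold estimate of (precision, recall) in feature space."""
    real_feats = np.asarray(real_feats, dtype=np.float64)
    gen_feats = np.asarray(gen_feats, dtype=np.float64)
    precision = coverage(gen_feats, real_feats, k)  # generated inside real manifold
    recall = coverage(real_feats, gen_feats, k)     # real inside generated manifold
    return precision, recall
```

A generator that collapses onto a single realistic mode would score high precision and low recall under this estimate. That is the "few broken images but little diversity" case described above.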

Main sources

LINEヤフー Tech Blog, "AIで生成された画像をどのように評価するのか？（基本編）" (How do we evaluate AI-generated images? Basics): https://techblog.lycorp.co.jp/ja/20250812a
Improved Techniques for Training GANs: https://arxiv.org/abs/1606.03498
GANs Trained by a Two Time-Scale Update Rule: https://arxiv.org/abs/1706.08500
The Unreasonable Effectiveness of Deep Features as a Perceptual Metric: https://arxiv.org/abs/1801.03924
