Aesthetic and Preference Metrics

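Aesthetic and preference metrics estimate whether an image looks natural and appealing to humans and whether it is a preferable output for the prompt. FID and CLIPScore alone do not adequately capture whether people actually find an image good, so these metrics train an evaluation model on human ratings or pairwise preference data.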

Aesthetic Score

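Aesthetic Score is a model that takes a CLIP image feature as input and predicts the aesthetic rating a human would give on a scale such as 1 to 10. The LAION Aesthetics family of models is the representative example.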

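Aesthetic Score needs no reference image and produces a score for a single image on its own. This makes it easy to use even when only a small number of images need to be evaluated, as in inpainting or filtering.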
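As a rough illustration of the idea, the sketch below applies a linear regression head to an L2-normalized image feature. The 2-D feature, the weights, and the bias are toy values assumed for this example; a real predictor would use a CLIP embedding with hundreds of dimensions and a head trained on human ratings, and may not be purely linear.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def aesthetic_score(image_feature, weights, bias):
    """Linear head on an L2-normalized image feature.

    `weights` and `bias` are hypothetical; in practice they would be
    learned from human aesthetic ratings.
    """
    if len(image_feature) != len(weights):
        raise ValueError("feature and weight dimensions must match")
    x = l2_normalize(image_feature)
    return sum(w * xi for w, xi in zip(weights, x)) + bias


# Toy example: no reference image is involved, only the image's own feature.
toy_feature = [3.0, 4.0]   # stand-in for a CLIP image feature
toy_weights = [1.0, 2.0]   # assumed, not trained
toy_bias = 5.0             # assumed, not trained
toy_score = aesthetic_score(toy_feature, toy_weights, toy_bias)
```

Because the score depends only on the image feature, the same function can be mapped over a handful of images or over a large dataset for filtering.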

CLIP-IQA

CLIP-IQA performs image quality assessment with CLIP features. For example, you prepare a contrasting pair of text prompts such as "This is a high-quality image" and "This image has a lot of noise", then check which one the image embedding is closer to.

Unlike approaches that rely only on an explicitly trained regressor over human scores, this method lets you change the evaluation axis flexibly through the text prompts.
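The comparison against a contrasting prompt pair can be sketched as a two-way softmax over cosine similarities. The embeddings below are toy 2-D vectors standing in for CLIP outputs (the encoder is not shown), and the `scale` value of 100 is an assumption modeled on typical CLIP logit scaling.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def prompt_pair_score(image_emb, positive_emb, negative_emb, scale=100.0):
    """Probability that the image is closer to the positive prompt.

    Softmax over the two scaled similarities; `scale` is an assumed value.
    """
    s_pos = scale * cosine(image_emb, positive_emb)
    s_neg = scale * cosine(image_emb, negative_emb)
    m = max(s_pos, s_neg)  # subtract the max for numerical stability
    e_pos = math.exp(s_pos - m)
    e_neg = math.exp(s_neg - m)
    return e_pos / (e_pos + e_neg)


# Toy embeddings: the "high-quality" and "noisy" prompts are orthogonal here.
high_quality_emb = [1.0, 0.0]
noisy_emb = [0.0, 1.0]
clean_image_score = prompt_pair_score([1.0, 0.0], high_quality_emb, noisy_emb)
ambiguous_image_score = prompt_pair_score([1.0, 1.0], high_quality_emb, noisy_emb)
```

Swapping in a different prompt pair (for example about sharpness or colorfulness) changes the evaluation axis without retraining anything.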

Q-Align

Q-Align uses a Large Multimodal Model (LMM) and learns image quality and aesthetic assessment as discrete text-defined levels. Instead of regressing a number directly, it rates visual quality through text-defined levels such as poor, fair, and good.

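Because it draws on the language understanding of the LMM, its advantage is that image quality is easy to tie to evaluation criteria expressed in natural language.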
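One way to turn text-defined levels back into a number is to take a probability-weighted average over the levels. The sketch below assumes five level names mapped to the values 1 through 5 and assumes the model exposes one logit per level; both are illustrative choices, not a description of any specific released API.

```python
import math

# Assumed mapping from level name to numeric value (illustrative).
LEVEL_VALUES = {"bad": 1, "poor": 2, "fair": 3, "good": 4, "excellent": 5}


def level_score(level_logits):
    """Expected level value under a softmax over the level logits.

    `level_logits` maps each level name to the (hypothetical) logit the
    model assigns to that level's token.
    """
    levels = list(LEVEL_VALUES)
    m = max(level_logits[name] for name in levels)
    exps = {name: math.exp(level_logits[name] - m) for name in levels}
    total = sum(exps.values())
    return sum(LEVEL_VALUES[name] * exps[name] / total for name in levels)


# Toy logits: an undecided model and one that strongly favors "good".
uniform_score = level_score({name: 0.0 for name in LEVEL_VALUES})
confident_logits = {"bad": -20.0, "poor": -20.0, "fair": -20.0,
                    "good": 20.0, "excellent": -20.0}
confident_score = level_score(confident_logits)
```

The weighted average keeps the output continuous even though training targets are discrete levels.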

HPS

Human Preference Score (HPS) is a metric learned from pairwise preference data that records which of two images humans prefer. HPS v2 uses preferences over image pairs for a given prompt to train a CLIP-based model to be preference-aware.

A pairwise preference can be written as follows.

$$
P(x_w \succ x_l \mid p) = \frac{\exp(s(p, x_w) / \tau)}{\exp(s(p, x_w) / \tau) + \exp(s(p, x_l) / \tau)}
$$

Here, $p$ is the prompt, $x_w$ is the winner image, $x_l$ is the loser image, $s$ is the prompt-image score, and $\tau$ is a temperature parameter.
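The probability above is a two-way softmax over temperature-scaled scores. A minimal sketch, assuming the two prompt-image scores $s(p, x_w)$ and $s(p, x_l)$ have already been computed by some scoring model (not shown) and are passed in as plain floats:

```python
import math


def preference_probability(score_winner, score_loser, tau=1.0):
    """P(x_w preferred over x_l | p) as a softmax over two scores.

    `score_winner` and `score_loser` stand for s(p, x_w) and s(p, x_l);
    `tau` is the temperature.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    a = score_winner / tau
    b = score_loser / tau
    m = max(a, b)  # subtract the max so exp() cannot overflow
    e_a = math.exp(a - m)
    e_b = math.exp(b - m)
    return e_a / (e_a + e_b)


# Equal scores give no preference; a larger gap pushes the probability to 1.
p_equal = preference_probability(0.3, 0.3)
p_gap = preference_probability(math.log(3.0), 0.0)
# A smaller temperature sharpens the same score gap.
p_sharp = preference_probability(math.log(3.0), 0.0, tau=0.5)
```

Only the score difference divided by $\tau$ matters, so adding a constant to both scores leaves the probability unchanged.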

PickScore

PickScore is a text-image evaluation model trained on user preferences from the Pick-a-Pic dataset. It uses a pairwise learning framework close to that of HPS, but it is distinctive in handling near-tie labels for images of comparable quality.

Because PickScore reflects the preferences behind images that people actually chose, it is well suited to practical evaluation of text-to-image models.
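Tie handling can be illustrated with a soft-label cross-entropy over the two candidate scores, where a tie is encoded as a label of 0.5. This is a sketch under that assumed label encoding and is not taken from any released training code.

```python
import math


def pairwise_soft_label_loss(score_a, score_b, label_a):
    """Cross-entropy between a soft preference label and softmax(scores).

    `label_a` is 1.0 if image A was preferred, 0.0 if image B was
    preferred, and 0.5 for a tie (assumed encoding).
    """
    if not 0.0 <= label_a <= 1.0:
        raise ValueError("label_a must lie in [0, 1]")
    m = max(score_a, score_b)
    log_z = m + math.log(math.exp(score_a - m) + math.exp(score_b - m))
    log_p_a = score_a - log_z
    log_p_b = score_b - log_z
    return -(label_a * log_p_a + (1.0 - label_a) * log_p_b)


# For a tie, the loss is smallest when the two scores are equal.
tie_equal = pairwise_soft_label_loss(0.0, 0.0, 0.5)
tie_unequal = pairwise_soft_label_loss(2.0, 0.0, 0.5)
# For a clear preference, a correctly ordered pair is penalized less.
clear_right = pairwise_soft_label_loss(2.0, 0.0, 1.0)
clear_wrong = pairwise_soft_label_loss(0.0, 2.0, 1.0)
```

With a tie label, the model is pushed toward assigning similar scores to both images rather than being forced to pick a winner.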

ImageReward

ImageReward learns a reward model for text-to-image models from human feedback. Beyond serving as an evaluation metric, it can also be used for fine-tuning generative models and for reward-guided sampling.
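One simple form of reward-guided sampling is best-of-N selection: generate several candidates and keep the one with the highest reward. In the sketch below, `reward_fn` is a hypothetical callable standing in for a learned reward model, and the candidates are short caption strings standing in for generated images.

```python
def best_of_n(prompt, candidates, reward_fn):
    """Return (reward, candidate) for the highest-reward candidate.

    `reward_fn(prompt, candidate)` is a hypothetical scoring callable.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    scored = [(reward_fn(prompt, c), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])


def toy_reward(prompt, candidate):
    """Toy stand-in reward: number of prompt words present in the candidate."""
    return float(len(set(prompt.split()) & set(candidate.split())))


toy_prompt = "a red cat"
toy_candidates = ["a blue dog", "a red cat on a sofa", "red car"]
best_reward, best_candidate = best_of_n(toy_prompt, toy_candidates, toy_reward)
```

The same reward function could instead supply a training signal for fine-tuning, which is where reward hacking becomes a concern.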

Choosing Between Metrics

Metric	Input	Strengths	Caveats
Aesthetic Score	image	Fast evaluation of a single image's appearance	Does not consider prompt alignment
CLIP-IQA	image + quality prompts	Evaluation axis is easy to specify in text	Depends on prompt design
Q-Align	image / video + query	Leverages the language ability of an LMM	Large model size and inference cost
HPS	prompt + image	Close to human preference	Biased toward its training domain
PickScore	prompt + image	Reflects user preference well	Watch for style and dataset bias
ImageReward	prompt + image	Easy to use as a reward model	Watch for reward hacking

Practical Considerations

An image with a high Aesthetic Score does not necessarily match the prompt.
HPS and PickScore are prompt-aware, but they do not fully guarantee task-specific constraints.
Human preference datasets are influenced by the annotators' culture, language, and style preferences.
When building a composite score, the scale of each metric must be normalized.
Ultimately, the correlation with human evaluation should be checked.
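The scale normalization for a composite score can be sketched with a z-score per metric followed by a weighted sum. The metric names, values, and weights below are toy assumptions, and z-scoring is only one of several reasonable normalizations.

```python
import statistics


def z_normalize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def composite_scores(metric_table, weights):
    """Weighted sum of z-normalized metrics, one value per image.

    `metric_table` maps a metric name to a list of per-image scores;
    `weights` maps the same names to their (assumed) weights.
    """
    lengths = {len(vals) for vals in metric_table.values()}
    if len(lengths) != 1:
        raise ValueError("all metrics must score the same number of images")
    n_images = lengths.pop()
    normalized = {name: z_normalize(vals) for name, vals in metric_table.items()}
    return [
        sum(weights[name] * normalized[name][i] for name in metric_table)
        for i in range(n_images)
    ]


# Toy scores on very different scales for three images.
toy_metrics = {
    "aesthetic": [4.0, 6.0, 8.0],      # roughly a 1 to 10 scale
    "preference": [0.20, 0.25, 0.30],  # roughly a 0 to 1 scale
}
toy_weights = {"aesthetic": 0.5, "preference": 0.5}
toy_composite = composite_scores(toy_metrics, toy_weights)
```

Without normalization, the metric with the larger numeric range would dominate the sum regardless of the chosen weights.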

The Preference Model in Equations

An aesthetic or preference model assigns a scalar score $r_\phi(y)$ to an image or response. Given pairwise preference data $(y^+, y^-)$, the Bradley-Terry model gives the probability that the chosen item is preferred as follows.

$$
P(y^+ \succ y^-) = \sigma(r_\phi(y^+) - r_\phi(y^-))
$$

The training loss is as follows.

$$
\mathcal{L}_{pref} = -\log \sigma(r_\phi(y^+) - r_\phi(y^-))
$$

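The intuition behind this formulation is that the score function is learned from relative comparisons of which item is preferred, rather than by predicting an absolute score directly. Aesthetic models readily reflect the annotator bias, style bias, and cultural differences in their datasets, so using one as the sole optimization target can lead to reward hacking.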
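The Bradley-Terry loss can be sketched numerically as follows. The function takes two precomputed scalar scores, $r_\phi(y^+)$ and $r_\phi(y^-)$, as plain floats; the score model itself is assumed and not shown. The log-sigmoid is written in a form that avoids overflow for large score gaps.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(r_chosen, r_rejected):
    """-log sigma(r_chosen - r_rejected) for one preference pair."""
    return -log_sigmoid(r_chosen - r_rejected)


# Equal scores: the model is undecided, so the loss is log(2).
loss_undecided = preference_loss(0.0, 0.0)
# Correct ordering is cheap, reversed ordering is expensive.
loss_correct = preference_loss(5.0, 0.0)
loss_reversed = preference_loss(0.0, 5.0)
# Extreme gaps stay finite thanks to the stable formulation.
loss_extreme = preference_loss(-1000.0, 0.0)
```

Only the difference between the two scores enters the loss, which is why the learned scores are meaningful relative to each other rather than as absolute values.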

Main Sources

LINE Yahoo (LINEヤフー) Tech Blog, "How Should AI-Generated Images Be Evaluated? (Basics)" (in Japanese): https://techblog.lycorp.co.jp/ja/20250812a
LAION Aesthetics Predictor: https://github.com/LAION-AI/aesthetic-predictor
CLIP-IQA: https://arxiv.org/abs/2207.12396
Q-Align: https://arxiv.org/abs/2312.17090
HPS v2: https://arxiv.org/abs/2306.09341
Pick-a-Pic / PickScore: https://arxiv.org/abs/2305.01569
ImageReward: https://arxiv.org/abs/2304.05977
