Reward Model

A reward model (RM) takes a prompt and a response and outputs a scalar score estimating how much a human would prefer that response. It is the central component of classical RLHF (PPO) and is trained from preference data.

Bradley-Terry model

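The Bradley-Terry model is the standard way to model pairwise preferences probabilistically. Given a prompt $x$ and two responses, $y_w$ (winner) and $y_l$ (loser), the probability that $y_w$ is preferred is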

P(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)

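Here $r(x, y)$ is the reward and $\sigma$ is the logistic sigmoid. The reward model $r_\phi$ is trained to maximize the likelihood of the observed preferences, that is, to minimize the negative log-likelihood over the preference dataset $\mathcal{D}$: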

\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[\log \sigma\bigl(r_\phi(x, y_w) - r_\phi(x, y_l)\bigr)\right]
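
As a minimal sketch of the loss above, the following pure-Python functions compute the batch-mean negative log-likelihood from lists of scalar rewards. The names `log_sigmoid` and `bradley_terry_loss` are illustrative and not taken from any particular library; a real training loop would compute the same quantity on tensors so that gradients flow back into $r_\phi$.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign of z so that exp() never receives a large positive argument.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bradley_terry_loss(chosen_rewards, rejected_rewards) -> float:
    """Mean of -log sigmoid(r_w - r_l) over a batch of preference pairs."""
    if len(chosen_rewards) != len(rejected_rewards) or not chosen_rewards:
        raise ValueError("need two non-empty lists of equal length")
    losses = [
        -log_sigmoid(r_w - r_l)
        for r_w, r_l in zip(chosen_rewards, rejected_rewards)
    ]
    return sum(losses) / len(losses)


# Illustrative rewards for three preference pairs (made-up numbers).
loss = bradley_terry_loss([1.2, 0.3, 2.0], [0.4, 0.9, -1.0])
```

When both rewards in a pair are equal the per-pair loss is $\log 2$, and it shrinks toward zero as the winner's margin grows.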

Architecture

In practice:

Put a scalar head on top of the final layer of the base LLM
Usually initialize from the SFT model
Feed the hidden state of the last response token into the head
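
The points above can be sketched as follows. This is only an illustration using plain Python lists: `hidden_states` stands in for the final-layer output of the base LLM (one vector per token), and `head_weight` / `head_bias` are hypothetical parameters of the scalar head. It assumes the attention mask marks real tokens with 1 and padding with 0, and that the response is the last segment of the sequence.

```python
def last_token_index(attention_mask) -> int:
    """Index of the last position whose mask value is 1 (the last real token)."""
    for i in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[i] == 1:
            return i
    raise ValueError("sequence contains no real tokens")


def scalar_reward(hidden_states, attention_mask, head_weight, head_bias=0.0) -> float:
    """Project the last real token's hidden state to a scalar with a linear head."""
    h = hidden_states[last_token_index(attention_mask)]
    if len(h) != len(head_weight):
        raise ValueError("hidden size and head size differ")
    return sum(w * v for w, v in zip(head_weight, h)) + head_bias


# Toy example: 3 positions, hidden size 2, last position is padding.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
reward = scalar_reward(hidden_states, attention_mask, head_weight=[0.5, -2.0], head_bias=0.1)
```

In the toy example the padded position is skipped, so the head sees the second token's hidden state.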

Outcome RM and Process RM

Type	What it scores
Outcome RM (ORM)	A single score for the final response as a whole
Process RM (PRM)	Correctness / quality of each reasoning step

PRMs matter for training reasoning models and for test-time search (Best-of-N, tree search). Representative work includes OpenAI's PRM800K dataset and Math-Shepherd.
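
One way a PRM can drive Best-of-N selection is sketched below. How per-step scores are aggregated into a single response score (minimum or product here) is an assumption made for illustration; the text above does not prescribe one, and the function names are hypothetical. Step scores are assumed to lie in $[0, 1]$, for example the estimated probability that a step is correct.

```python
import math


def aggregate_step_scores(step_scores, how="min") -> float:
    """Collapse per-step PRM scores into one score for the whole response."""
    if not step_scores:
        raise ValueError("response has no scored steps")
    if how == "min":
        return min(step_scores)        # a chain is as good as its weakest step
    if how == "prod":
        return math.prod(step_scores)  # treats steps as independent probabilities
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(candidates, how="min"):
    """candidates: list of (response_text, step_scores); returns the best response_text."""
    best = max(candidates, key=lambda c: aggregate_step_scores(c[1], how))
    return best[0]


# Two made-up candidates whose ranking depends on the aggregation rule.
candidates = [
    ("A", [0.6, 0.6, 0.6, 0.6]),
    ("B", [0.5, 1.0, 1.0, 1.0]),
]
pick_min = best_of_n(candidates, how="min")
pick_prod = best_of_n(candidates, how="prod")
```

The two rules can disagree, as they do for these candidates, so the aggregation choice is worth validating on held-out data.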

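Reward model pitfalls

Problem	Description
Reward hacking	The policy generates responses that exploit the RM's blind spots
Overoptimization	The RM score keeps rising while true human preference gets worse
Distribution drift	As the policy changes, its outputs leave the region where the RM is accurate
Inherited annotator bias	The RM picks up annotator biases such as a preference for verbosity or sycophancy
Style ≠ quality	Long, well-formatted responses tend to be rated highly regardless of content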

Common mitigations include:

KL regularization to keep the policy from drifting too far from the SFT model
Reward model ensembling and uncertainty estimation
Continued collection of on-policy preference data
Length normalization
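
As an illustration of KL regularization, a common approach is to subtract a penalty proportional to the policy/reference log-ratio from the RM score. The sketch below assumes per-token log-probabilities of the sampled response are available under both the policy and the SFT (reference) model; `beta` and the function name are illustrative choices, not values from the text.

```python
def kl_shaped_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1) -> float:
    """RM score minus beta times the summed per-token log-ratio.

    The summed log-ratio on a sampled response is a single-sample estimate
    of KL(policy || reference) for that prompt.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must cover the same tokens")
    log_ratio = sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * log_ratio


# Made-up numbers: the policy is more confident than the reference on both tokens.
shaped = kl_shaped_reward(
    rm_score=1.0,
    policy_logprobs=[-0.1, -0.2],
    ref_logprobs=[-0.5, -0.6],
    beta=0.1,
)
```

A larger `beta` keeps the policy closer to the SFT model at the cost of slower reward improvement.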

Relationship to DPO

DPO does not maintain an explicit reward model; the reward is represented implicitly as the log-ratio between the policy and a reference policy. See DPO for details.
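
To make the implicit reward concrete, here is a minimal sketch following the usual statement of DPO, in which the reward is $\beta$ times the policy/reference log-ratio (up to a prompt-only term that cancels in pairwise comparisons). Inputs are assumed to be sequence-level log-probabilities; the function names and the default `beta` are illustrative.

```python
import math


def implicit_reward(policy_logprob, ref_logprob, beta=0.1) -> float:
    """beta * log(pi(y|x) / pi_ref(y|x)), given sequence-level log-probs."""
    return beta * (policy_logprob - ref_logprob)


def dpo_pair_loss(pol_w, ref_w, pol_l, ref_l, beta=0.1) -> float:
    """-log sigmoid of the implicit-reward margin between winner and loser."""
    margin = implicit_reward(pol_w, ref_w, beta) - implicit_reward(pol_l, ref_l, beta)
    # Stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# If the policy equals the reference, both implicit rewards are 0 and the loss is log 2.
loss_at_init = dpo_pair_loss(pol_w=-12.0, ref_w=-12.0, pol_l=-15.0, ref_l=-15.0)
```

The loss has the same Bradley-Terry form as the reward model loss, with the implicit reward in place of $r_\phi$.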

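Pairwise training of the reward model in equations

A reward model assigns a scalar score $r_\phi(x,y)$ to a prompt $x$ and a response $y$. Preference data consists of pairs of a chosen response $y^+$ and a rejected response $y^-$ (the same roles as $y_w$ and $y_l$ in the Bradley-Terry section). The preference probability is: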

P(y^+\succ y^-\mid x)=\sigma(r_\phi(x,y^+)-r_\phi(x,y^-))

The training loss can be written as:

\mathcal{L}_{RM}=-\mathbb{E}_{(x,y^+,y^-)}\left[ \log\sigma(r_\phi(x,y^+)-r_\phi(x,y^-)) \right]

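The intuition is to push the score of the response humans preferred above the score of the one they did not. Because the reward model serves as a proxy objective for policy optimization, its calibration, its resistance to overfitting, and its reliability on OOD prompts matter a great deal.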
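
A rough way to check a trained reward model on held-out preference pairs is sketched below: pairwise accuracy together with the mean confidence that the Bradley-Terry probability assigns to the predicted winner. Comparing the two is only a coarse calibration signal, and the function names are illustrative assumptions.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def evaluate_pairs(chosen_scores, rejected_scores):
    """Return (pairwise accuracy, mean confidence in the predicted winner)."""
    if len(chosen_scores) != len(rejected_scores) or not chosen_scores:
        raise ValueError("need two non-empty lists of equal length")
    margins = [c - r for c, r in zip(chosen_scores, rejected_scores)]
    n = len(margins)
    # Correct when the human-chosen response gets the higher score.
    accuracy = sum(1 for m in margins if m > 0) / n
    # sigmoid(|margin|) is the probability given to whichever side the RM favors.
    mean_confidence = sum(sigmoid(abs(m)) for m in margins) / n
    return accuracy, mean_confidence


# Made-up held-out scores for three pairs.
accuracy, mean_confidence = evaluate_pairs([2.0, 0.5, -1.0], [1.0, 1.5, -3.0])
```

If mean confidence sits well above accuracy, the model is overconfident on that data; a large gap on OOD prompts is a warning sign before using the RM for policy optimization.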

When using a reward model ensemble, uncertainty can be measured as the variance of the scores across ensemble members $m$:

\mathrm{Unc}(x,y)=\mathrm{Var}_{m}\left[r_{\phi_m}(x,y)\right]

Samples with high uncertainty are candidates for active learning: they can be routed to human annotation.
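
The variance formula and the active-learning step can be sketched as follows. It assumes each sample already has one score per ensemble member; using the population variance (`statistics.pvariance`) rather than the sample variance is a choice made here, and the function names are hypothetical.

```python
from statistics import pvariance


def ensemble_uncertainty(member_scores) -> float:
    """Variance over ensemble members m of r_{phi_m}(x, y) for one (x, y)."""
    return pvariance(member_scores)


def select_for_annotation(samples, k):
    """samples: list of (sample_id, [score per ensemble member]).

    Returns the ids of the k samples with the highest score variance.
    """
    ranked = sorted(samples, key=lambda s: ensemble_uncertainty(s[1]), reverse=True)
    return [sample_id for sample_id, _ in ranked[:k]]


# Made-up scores from a three-member ensemble.
samples = [
    ("a", [1.0, 1.1, 0.9]),   # members roughly agree
    ("b", [2.0, -1.0, 0.5]),  # members disagree strongly
    ("c", [0.0, 0.0, 0.0]),   # members agree exactly
]
to_annotate = select_for_annotation(samples, k=2)
```

In this toy example the sample where ensemble members disagree most is ranked first for annotation.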

Main sources

InstructGPT: https://arxiv.org/abs/2203.02155
Scaling Laws for Reward Model Overoptimization: https://arxiv.org/abs/2210.10760
Let's Verify Step by Step (PRM800K): https://arxiv.org/abs/2305.20050
Math-Shepherd: https://arxiv.org/abs/2312.08935
