RLHF and Alignment | Haruk1y Wiki

RLHF and Alignment Overview

Summarizes the overall landscape of RLHF, reward models, DPO-style direct preference optimization, Constitutional AI, and RLAIF.

Summarizes how to collect pairwise preferences, rankings, point-wise ratings, and AI feedback, along with the pitfalls of each.

Summarizes the Bradley-Terry model, reward model training, reward hacking, and process reward models.

Summarizes the InstructGPT-style RLHF pipeline: the reward model, KL penalty, value model, and PPO updates.

Summarizes the derivation of Direct Preference Optimization, its loss, how it compares with PPO-based RLHF, and implementation caveats.

Summarizes representative DPO variants (IPO, KTO, ORPO, SimPO, cDPO).

Summarizes iterative / online DPO, which takes DPO one step beyond a fixed dataset by repeating a loop of policy sampling, judging, and preference updates.

Summarizes Group Relative Policy Optimization (GRPO), the core of reasoning RL in DeepSeek-R1.

Summarizes DAPO, a refinement of GRPO for large-scale reasoning RL, along with Clip-Higher, Dynamic Sampling, token-level loss, and overlong reward shaping.

Summarizes alignment with AI feedback: Constitutional AI, RLAIF, and principle-based critique.

Summarizes failure modes such as reward hacking, Goodhart's law, sycophancy, jailbreaks, and deceptive alignment.