Robustness Evaluation

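Robustness evaluation is the procedure for checking whether a defense actually withstands adversarial attacks. The history of adversarial robustness contains many cases where robustness was overestimated because the defense was evaluated only against weak attacks, so the evaluation protocol is critical.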

Clean accuracy and robust accuracy

Clean accuracy is the accuracy on unperturbed inputs.

\mathrm{CleanAcc}=\frac{1}{n}\sum_i \mathbf{1}[f(x_i)=y_i]

Robust accuracy is the fraction of inputs that are still classified correctly after the attack.

\mathrm{RobustAcc}_{\mathcal{A}}=\frac{1}{n}\sum_i \mathbf{1}[f(\mathcal{A}(x_i,y_i))=y_i]

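Here $\mathcal{A}$ is the attack algorithm. Strictly speaking, robust accuracy is defined as the worst case over the entire perturbation set, but in practice it is approximated with a strong attack. Because any concrete attack can miss adversarial examples that exist, the measured value is an upper bound on the true robust accuracy.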
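The two definitions above can be sketched in a few lines of NumPy. This is only an illustration: `toy_predict` and `toy_attack` are hypothetical stand-ins for a real model and a real attack, and the data is made up.

```python
import numpy as np

def accuracy(predict, x, y):
    """Fraction of samples whose predicted label equals the true label."""
    return float(np.mean(predict(x) == y))

def robust_accuracy(predict, attack, x, y):
    """Accuracy measured on attack(x, y) instead of x."""
    x_adv = attack(x, y)
    return float(np.mean(predict(x_adv) == y))

# Hypothetical 1-D threshold classifier: predicts 1 when the input is positive.
def toy_predict(x):
    return (x > 0).astype(int)

# Hypothetical L_inf attack with budget eps: shift each input toward the wrong side.
def toy_attack(x, y, eps=0.5):
    direction = np.where(y == 1, -1.0, 1.0)
    return x + eps * direction

x = np.array([-2.0, -0.3, 0.2, 1.5])
y = np.array([0, 0, 1, 1])

clean = accuracy(toy_predict, x, y)
robust = robust_accuracy(toy_predict, toy_attack, x, y)
print(f"clean={clean:.2f} robust={robust:.2f}")
```

In this toy setup the two points closest to the threshold are flipped by the attack while the two far points survive, which is the gap between clean and robust accuracy in miniature.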

AutoAttack

AutoAttack is a benchmark attack that evaluates robustness with a standardized ensemble of strong attacks, reducing the need for manual tuning. Its main components are as follows.

Component	Type	Description
APGD-CE	white-box	Auto-PGD on the cross-entropy loss
APGD-DLR	white-box	Auto-PGD on the Difference of Logits Ratio (DLR) loss
FAB	white-box	minimum-norm attack that searches for the nearest point on the decision boundary
Square Attack	black-box, score-based	gradient-free attack using square-shaped perturbations

Because AutoAttack does not depend on a single PGD configuration, it is widely used as a baseline for evaluating defenses.
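An ensemble evaluation of this kind is usually scored per sample in the worst case: a point counts as robust only if none of the attacks breaks it. The sketch below illustrates that aggregation with made-up per-attack results; it does not call the AutoAttack library, and the function name `ensemble_robust_accuracy` is hypothetical.

```python
import numpy as np

def ensemble_robust_accuracy(correct_after_attack):
    """Worst-case aggregation over attacks.

    correct_after_attack maps an attack name to a boolean array that is
    True where the sample is still classified correctly after that attack.
    """
    stacked = np.stack(list(correct_after_attack.values()))
    robust_mask = stacked.all(axis=0)  # robust only if every attack failed
    return float(robust_mask.mean()), robust_mask

# Made-up outcomes for five samples (illustration only).
results = {
    "APGD-CE": np.array([True, True, False, True, True]),
    "APGD-DLR": np.array([True, False, False, True, True]),
    "FAB": np.array([True, True, False, True, True]),
    "Square": np.array([True, True, True, False, True]),
}

ensemble_acc, robust_mask = ensemble_robust_accuracy(results)
per_attack = {name: float(mask.mean()) for name, mask in results.items()}
print(per_attack, ensemble_acc)
```

The ensemble value can never exceed the lowest single-attack value, and it is often strictly lower because different attacks break different samples.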

Detecting gradient masking

Gradient masking is a state in which a defense is not actually robust but merely obstructs gradient-based attacks.

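Suspect gradient masking if any of the following signs appear.

The loss does not increase when the number of attack iterations is increased.
Black-box attacks are stronger than white-box attacks.
Robust accuracy drops sharply when the number of random restarts is increased.
The pipeline contains non-differentiable preprocessing.
A stochastic defense is evaluated without EOT.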
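The first three signs can be checked mechanically from numbers an evaluation already produces. The following is a minimal sketch: the function name, the input format, and the thresholds `tol` and `restart_drop` are assumptions chosen for illustration, not established cutoffs.

```python
def gradient_masking_flags(loss_by_steps, white_box_acc, black_box_acc,
                           acc_by_restarts, tol=0.02, restart_drop=0.05):
    """Return a list of warning flags for suspected gradient masking.

    loss_by_steps:   dict {attack iterations: mean attack loss}
    acc_by_restarts: dict {random restarts: robust accuracy}
    tol, restart_drop: illustrative thresholds (assumptions).
    """
    flags = []

    # Sign 1: more iterations do not raise the attack loss.
    steps = sorted(loss_by_steps)
    if loss_by_steps[steps[-1]] <= loss_by_steps[steps[0]] + tol:
        flags.append("loss_plateau")

    # Sign 2: the black-box attack leaves lower accuracy than the white-box one.
    if black_box_acc < white_box_acc - tol:
        flags.append("black_box_stronger")

    # Sign 3: robust accuracy falls sharply with more random restarts.
    restarts = sorted(acc_by_restarts)
    if acc_by_restarts[restarts[0]] - acc_by_restarts[restarts[-1]] > restart_drop:
        flags.append("restart_sensitive")

    return flags

# Made-up numbers for a suspicious defense and a healthy one.
suspicious = gradient_masking_flags(
    {10: 0.31, 100: 0.31, 1000: 0.32}, 0.62, 0.35, {1: 0.62, 10: 0.41})
healthy = gradient_masking_flags(
    {10: 0.80, 100: 2.10}, 0.45, 0.70, {1: 0.46, 10: 0.45})
print(suspicious, healthy)
```

A non-empty result is a prompt to investigate with an adaptive attack, not proof that the defense is broken.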

The need for adaptive attacks

When a defense $D$ is present, the attack must target the full pipeline $f(D(x))$.

\max_{\delta\in\Delta}\ell(f_\theta(D(x+\delta)),y)

For a non-differentiable $D$, use Backward Pass Differentiable Approximation (BPDA); for a randomized defense, use Expectation over Transformation (EOT). EOT maximizes the loss in expectation over the defense's randomness $\omega$:

\max_{\delta\in\Delta}\mathbb{E}_{\omega}\left[\ell(f_\theta(D_\omega(x+\delta)),y)\right]
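To make the two techniques concrete, here is a minimal NumPy sketch of an $L_\infty$ PGD attack that combines them. Everything in it is an assumption for illustration: the model is a hypothetical logistic regression with weights `w` and bias `b`, and the defense adds Gaussian noise and then quantizes the input. The forward pass runs the real defense, the backward pass treats the defense as the identity (BPDA), and the gradient is averaged over several draws of the noise (EOT).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def defense(x, rng, sigma=0.05, levels=8):
    """Hypothetical randomized, non-differentiable preprocessing."""
    noisy = x + rng.normal(0.0, sigma, size=x.shape)
    return np.round(noisy * levels) / levels

def loss_and_grad(z, y, w, b):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. z."""
    p = sigmoid(z @ w + b)
    loss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return loss, (p - y) * w

def pgd_bpda_eot(x, y, w, b, eps=0.3, alpha=0.05, steps=20,
                 eot_samples=16, seed=0):
    rng = np.random.default_rng(seed)
    delta = np.zeros_like(x)
    for _ in range(steps):
        grad = np.zeros_like(x)
        for _ in range(eot_samples):
            z = defense(x + delta, rng)       # forward: real defense
            _, g = loss_and_grad(z, y, w, b)
            grad += g                         # backward: dD/dx taken as identity (BPDA)
        grad /= eot_samples                   # EOT: average over the randomness
        delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)
    return delta

def expected_loss(x, y, w, b, n=200, seed=1):
    """Monte Carlo estimate of the loss through the randomized defense."""
    rng = np.random.default_rng(seed)
    return float(np.mean([loss_and_grad(defense(x, rng), y, w, b)[0]
                          for _ in range(n)]))

w = np.array([1.0, -2.0, 0.5])
b = 0.1
x = np.array([0.5, -0.25, 0.4])
y = 1

delta = pgd_bpda_eot(x, y, w, b)
clean_loss = expected_loss(x, y, w, b)
adv_loss = expected_loss(x + delta, y, w, b)
print(delta, clean_loss, adv_loss)
```

Attacking only the undefended model, or differentiating through the rounding step (whose gradient is zero almost everywhere), would make this defense look stronger than it is. With a neural network the same structure applies, with the framework's autograd replacing the hand-written gradient.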

What to report

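Item	Reason
Clean accuracy	To show the trade-off with robustness
Robust accuracy	To measure performance under attack
Norm and epsilon	To make the threat model explicit
Attack steps / restarts	To make the attack strength reproducible
Targeted / untargeted	Because the success criteria differ
White-box / black-box	Because the attacker's knowledge differs
Adaptive attack	To check whether the defense withstands an attacker who knows it
Confidence interval	To show the uncertainty due to the finite number of evaluation samples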
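For the confidence interval, one common choice for a proportion such as robust accuracy is the Wilson score interval. The sketch below is one possible way to compute it; the choice of the Wilson interval, the 95% level (`z=1.96`), and the example counts are assumptions, not requirements of any benchmark.

```python
import math

def wilson_interval(num_correct, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = num_correct / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Made-up example: 530 of 1000 test points remain correct under attack.
low, high = wilson_interval(530, 1000)
print(f"robust accuracy 0.530, 95% CI [{low:.3f}, {high:.3f}]")
```

With 1000 evaluation samples the interval is roughly three percentage points on either side, so differences of one point between two defenses are not distinguishable at this sample size.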

RobustBench

RobustBench is a benchmark that compares robust models under standardized datasets, norms, epsilons, and attack protocols. It supports comparing $L_\infty$ / $L_2$ robustness on CIFAR-10, CIFAR-100, ImageNet, and other datasets.

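Differences from LLM / agent evaluation

In LLM adversarial evaluation, inputs are discrete tokens, and the success criteria become things like unsafe output, policy violation, tool misuse, and data exfiltration. A simple norm constraint analogous to the $L_p$ ball for images is therefore hard to apply, and the prompt distribution, attack budget, judge model, and human evaluation must be stated explicitly.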
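One way to make these choices explicit is to record them next to the metric. The sketch below is purely illustrative: `LLMEvalSpec`, `attack_success_rate`, the keyword-based `toy_judge`, and the sample outputs are all hypothetical, and a real evaluation would use an actual judge model or human raters.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMEvalSpec:
    """Hypothetical record of the settings that define the evaluation."""
    prompt_source: str   # where the evaluation prompts come from
    attack_budget: int   # maximum adversarial attempts per prompt
    judge: str           # judge model or human-review procedure

def attack_success_rate(attempts_per_prompt, is_violation, budget):
    """Fraction of prompts broken by at least one of the first `budget` attempts."""
    broken = sum(
        any(is_violation(output) for output in attempts[:budget])
        for attempts in attempts_per_prompt
    )
    return broken / len(attempts_per_prompt)

# Toy judge: flags outputs containing a marker string (stand-in for a real judge).
def toy_judge(output):
    return "UNSAFE" in output

# Made-up model outputs: one list of attack attempts per prompt.
outputs = [
    ["refusal", "refusal"],
    ["refusal", "UNSAFE content"],
    ["UNSAFE content"],
]

spec = LLMEvalSpec(prompt_source="toy prompts", attack_budget=2, judge="toy keyword judge")
asr_budget_1 = attack_success_rate(outputs, toy_judge, budget=1)
asr_budget_2 = attack_success_rate(outputs, toy_judge, budget=spec.attack_budget)
print(asr_budget_1, asr_budget_2)
```

The same outputs give different success rates under different budgets, which is why the budget has to be reported alongside the number.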

Main sources

AutoAttack: https://arxiv.org/abs/2003.01690
Obfuscated Gradients Give a False Sense of Security: https://arxiv.org/abs/1802.00420
On Evaluating Adversarial Robustness: https://arxiv.org/abs/1902.06705
RobustBench: https://robustbench.github.io/
Adversarial Robustness - Theory and Practice: https://adversarial-ml-tutorial.org/
