Threat Models and Notation

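When evaluating an adversarial attack, first state the threat model explicitly. A threat model specifies what the attacker knows, what the attacker can change, and what counts as success. If it is left vague, the strength of different defenses cannot be compared.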

Basic notation

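Consider a classifier $f_\theta : \mathcal{X}\to\mathcal{Y}$, its logits $z_\theta(x)$, and a loss $\ell(f_\theta(x),y)$. Where a label is needed, $f_\theta(x)$ denotes the predicted class, that is, the class with the largest logit.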

An untargeted attack maximizes the loss at the true class $y$ over perturbations $\delta$ in an allowed set $\Delta$.

\max_{\delta \in \Delta} \ell(f_\theta(x+\delta), y)

A targeted attack minimizes the loss with respect to a target class $t$, so that the input is classified as $t$.

\min_{\delta \in \Delta} \ell(f_\theta(x+\delta), t)

Alternatively, it uses a margin objective that pushes the target-class logit above the logits of all other classes.
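
The three objectives above can be compared on a small example. The following is a minimal sketch that works directly on logits; the logit values are made up for illustration, and `targeted_margin` is one common way to write a margin objective, not the only one.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, label):
    """Cross-entropy loss for `label`, computed from logits z."""
    return -math.log(softmax(z)[label])

def targeted_margin(z, target):
    """Largest non-target logit minus the target logit.

    Negative exactly when the target class has the strictly largest logit.
    """
    best_other = max(v for j, v in enumerate(z) if j != target)
    return best_other - z[target]

# Made-up logits for illustration: z(x) and z(x + delta).
z_clean = [2.0, 0.5, -1.0]
z_adv = [0.2, 1.4, -0.8]
y, t = 0, 1  # true class and target class

# Untargeted view: the loss at the true class should go up.
untargeted_gain = cross_entropy(z_adv, y) - cross_entropy(z_clean, y)
# Targeted view: the loss at the target class should go down.
targeted_gain = cross_entropy(z_clean, t) - cross_entropy(z_adv, t)

print(untargeted_gain > 0, targeted_gain > 0, targeted_margin(z_adv, t) < 0)
```

The margin form is often convenient because its sign directly states whether the targeted attack has succeeded, whereas a cross-entropy value alone does not.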

Perturbation set

The most commonly used constraint is an $L_p$ ball.

\Delta_p(\epsilon)=\{\delta : \|\delta\|_p \le \epsilon,\; x+\delta \in [0,1]^d\}

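Norm	Meaning	Typical attacks
$L_\infty$	Limits the maximum change to any single pixel	FGSM, PGD
$L_2$	Limits the overall Euclidean distance	DeepFool, CW $L_2$ (randomized smoothing is a certified defense for this norm, not an attack)
$L_0$	Limits the number of features that are changed	JSMA, one-pixel attack
Spatial / patch	Limits the location, the deformation, or the pasted patch	physical attack, adversarial patch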

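For images, an $L_\infty$ constraint such as $\epsilon=8/255$ is common, but the appropriate value depends heavily on the domain and on preprocessing, for example on whether pixels are scaled to $[0,1]$ or normalized per channel.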
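
To make the constraint set concrete, here is a minimal sketch of mapping an arbitrary perturbation back into $\Delta_p(\epsilon)$. It assumes $x$ already lies in $[0,1]^d$. The function names are hypothetical. The $L_\infty$ version is an exact projection because both constraints are boxes; the $L_2$ version is only a feasibility heuristic (rescale, then clip) and does not in general return the nearest feasible point.

```python
import math

def project_linf(x, delta, eps):
    """Exact projection onto {delta : |delta_i| <= eps, x + delta in [0, 1]^d}."""
    out = []
    for xi, di in zip(x, delta):
        lo = max(-eps, 0.0 - xi)  # keeps x + delta >= 0
        hi = min(eps, 1.0 - xi)   # keeps x + delta <= 1
        out.append(min(max(di, lo), hi))
    return out

def project_l2_then_clip(x, delta, eps):
    """Heuristic for L2: rescale into the ball, then clip to the pixel range.

    Clipping never increases |delta_i| when x is in [0, 1], so the result
    is feasible, but it is not necessarily the closest feasible point.
    """
    norm = math.sqrt(sum(d * d for d in delta))
    scale = min(1.0, eps / norm) if norm > 0 else 1.0
    scaled = [d * scale for d in delta]
    return [min(max(xi + di, 0.0), 1.0) - xi for xi, di in zip(x, scaled)]

def eps_in_normalized_space(eps, std):
    """Equivalent L_inf budget after the preprocessing x -> (x - mean) / std."""
    return eps / std

x = [0.0, 0.5, 0.99]
delta = [-0.1, 0.2, 0.05]
eps = 8 / 255

d_inf = project_linf(x, delta, eps)
d_l2 = project_l2_then_clip(x, delta, 0.1)
print(d_inf)
print(d_l2)
```

The last helper illustrates the preprocessing dependence: a budget of $8/255$ in $[0,1]$ pixel space corresponds to a different number once inputs are divided by a per-channel standard deviation.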

Attacker knowledge

Setting	What the attacker knows	Representative methods
White-box	Model architecture, weights, gradients, and the defense	FGSM, PGD, CW, AutoAttack
Gray-box	Partial information such as the training data or the model family	transfer attack, surrogate attack
Black-box score-based	Can query confidence scores or logits for chosen inputs	NES, SPSA, Square Attack
Black-box decision-based	Can query only the final label	Boundary Attack, HopSkipJump
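
The gap between white-box and score-based black-box access can be illustrated with a toy model. The sketch below estimates the input gradient from queried probabilities only, using antithetic Gaussian finite differences in the style of NES, and compares it with the exact gradient that a white-box attacker could compute. The linear model `W`, the sample count, and `sigma` are all assumptions chosen for illustration.

```python
import math
import random

# Hypothetical 2-class linear model. A black-box attacker cannot read W.
W = [[1.0, -2.0], [-1.0, 2.0]]

def query_scores(x):
    """Score-based access: class probabilities for input x."""
    z = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss_from_query(x, y):
    """Cross-entropy for label y, computed from queried scores only."""
    return -math.log(query_scores(x)[y])

def estimate_gradient(x, y, rng, sigma=0.01, n_pairs=200):
    """Antithetic finite differences (NES-style); uses 2 * n_pairs queries."""
    grad = [0.0] * len(x)
    for _ in range(n_pairs):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        plus = loss_from_query([xi + sigma * ui for xi, ui in zip(x, u)], y)
        minus = loss_from_query([xi - sigma * ui for xi, ui in zip(x, u)], y)
        coeff = (plus - minus) / (2.0 * sigma)
        grad = [g + coeff * ui for g, ui in zip(grad, u)]
    return [g / n_pairs for g in grad]

def white_box_gradient(x, y):
    """Exact input gradient W^T (p - onehot(y)); requires the weights."""
    p = query_scores(x)
    err = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]
    return [sum(W[k][i] * err[k] for k in range(len(W))) for i in range(len(x))]

x, y = [0.3, 0.6], 1
g_est = estimate_gradient(x, y, random.Random(0))
g_true = white_box_gradient(x, y)
same_sign = all((a > 0) == (b > 0) for a, b in zip(g_est, g_true))
print(g_est, g_true, same_sign)
```

The estimate costs hundreds of queries for a two-dimensional input, which is why the query budget belongs in the threat model for black-box settings.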

When evaluating a defense, the white-box adaptive attack matters most. Claiming that a defense is strong based only on attacks that are unaware of it risks mistaking gradient masking for robustness.

Success conditions

An untargeted attack succeeds when the prediction differs from the true label $y$.

f_\theta(x+\delta) \ne y

A targeted attack succeeds when the input is classified as the specified target class $t$.

f_\theta(x+\delta) = t

Robust accuracy is the fraction of samples on which the model stays correct under every perturbation in the allowed set.

\mathrm{RobustAcc}(\epsilon)= \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\left[ \forall \delta \in \Delta(\epsilon),\; f_\theta(x_i+\delta)=y_i \right]

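In practice the allowed set cannot be searched exhaustively, so robust accuracy is approximated with strong attacks. An attack finds only some of the adversarial examples that exist, so the measured value is an upper bound on the true robust accuracy.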
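
The difference between the exact definition and an attack-based estimate can be seen on a toy problem where exhaustive checking happens to be possible. The sketch below uses a hypothetical one-dimensional threshold classifier; because it is monotone, checking the two endpoints of the allowed interval is exact. The data points and the deliberately weak random attack are made up for illustration.

```python
import random

def f(x):
    """Toy 1-D threshold classifier (hypothetical): class 1 iff x >= 0.5."""
    return 1 if x >= 0.5 else 0

def exactly_robust(x, y, eps):
    """Exact check. Valid only because f is monotone, so endpoints suffice."""
    lo, hi = max(0.0, x - eps), min(1.0, x + eps)
    return f(lo) == y and f(hi) == y

def survives_random_attack(x, y, eps, rng, tries=3):
    """Weak attack: a few random perturbations inside the allowed set."""
    if f(x) != y:  # already misclassified at delta = 0
        return False
    for _ in range(tries):
        x_adv = min(1.0, max(0.0, x + rng.uniform(-eps, eps)))
        if f(x_adv) != y:
            return False
    return True

# Made-up (input, label) pairs; the last one is misclassified even when clean.
data = [(0.05, 0), (0.30, 0), (0.45, 0), (0.48, 0),
        (0.52, 1), (0.58, 1), (0.80, 1), (0.95, 1), (0.55, 0)]
eps = 0.1
rng = random.Random(0)

exact = sum(exactly_robust(x, y, eps) for x, y in data) / len(data)
estimate = sum(survives_random_attack(x, y, eps, rng) for x, y in data) / len(data)
print(f"exact={exact:.3f} estimate={estimate:.3f}")
```

A sample that is truly robust can never be broken by any attack, so the estimate can only be equal to or larger than the exact value. A weaker attack therefore makes a defense look better, never worse.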

Adaptive attack

An adaptive attack is an attack that knows the defense and optimizes against the full system, defense included.

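For example, if the defense is an input transformation, the attack targets the loss computed through that transformation. For a non-differentiable transformation, use BPDA (Backward Pass Differentiable Approximation); for a randomized defense, use EOT (Expectation over Transformation).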
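
Both ideas can be sketched on a toy model with an analytic gradient. The code below assumes a hypothetical logistic classifier, a quantization defense (non-differentiable) and an additive Gaussian noise defense (randomized). `bpda_grad` runs the real transformation forward and treats it as the identity on the backward pass; `eot_grad` averages gradients over draws of the defense's randomness. This is a simplified illustration, not the reference implementation of either technique.

```python
import math
import random

# Hypothetical binary logistic classifier p(y=1 | u) = sigmoid(w . u + b).
W = [1.5, -2.0, 0.5]
B = 0.1

def loss_grad(u, y):
    """Gradient of the logistic loss with respect to the classifier input u."""
    s = sum(w * ui for w, ui in zip(W, u)) + B
    p = 1.0 / (1.0 + math.exp(-s))
    return [(p - y) * w for w in W]

def quantize(x, levels=8):
    """Non-differentiable input transformation: snap each value to a grid."""
    return [round(xi * levels) / levels for xi in x]

def bpda_grad(x, y):
    """BPDA: real transform on the forward pass, identity on the backward pass."""
    return loss_grad(quantize(x), y)

def eot_grad(x, y, rng, sigma=0.1, n_samples=50):
    """EOT: average the gradient over samples of the randomized defense."""
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]  # defense adds noise
        total = [t + g for t, g in zip(total, loss_grad(noisy, y))]
    return [t / n_samples for t in total]

x, y = [0.30, 0.70, 0.52], 1

# A tiny nudge does not change the quantized input, so a naive
# finite-difference gradient through the defense would be zero here.
nudged = [x[0] + 1e-4] + x[1:]
masked = quantize(nudged) == quantize(x)

g_bpda = bpda_grad(x, y)
g_eot = eot_grad(x, y, random.Random(0))
print(masked, g_bpda, g_eot)
```

The `masked` check shows why an attack that ignores the defense can fail without the model being robust: the gradient signal disappears, but the approximated gradient still points in a useful direction.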

Evaluation protocol

When comparing adversarial robustness, state the following explicitly.

Dataset and preprocessing
Norm and $\epsilon$
Attack algorithm and number of iterations
Number of random restarts
White-box or black-box access
Targeted or untargeted attack
Whether the attack is adaptive, that is, aware of the defense
Both clean accuracy and robust accuracy
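
One way to keep these items from being forgotten is to record them in a single structure alongside the results. The following is a sketch only: the class name `EvalProtocol`, its field names, and the example values are hypothetical placeholders, not a standard schema and not measured results.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class EvalProtocol:
    dataset: str
    preprocessing: str
    norm: str
    epsilon: float
    attack: str
    iterations: int
    random_restarts: int
    access: str            # "white-box" or "black-box"
    targeted: bool
    adaptive: bool
    clean_accuracy: Optional[float] = None
    robust_accuracy: Optional[float] = None

    def missing_results(self):
        """Names of result fields that have not been filled in yet."""
        return [name for name in ("clean_accuracy", "robust_accuracy")
                if getattr(self, name) is None]

# Placeholder settings for illustration; no accuracies are reported here.
protocol = EvalProtocol(
    dataset="CIFAR-10",
    preprocessing="pixels scaled to [0, 1]",
    norm="Linf",
    epsilon=8 / 255,
    attack="PGD",
    iterations=50,
    random_restarts=5,
    access="white-box",
    targeted=False,
    adaptive=True,
)

print(asdict(protocol))
print("missing:", protocol.missing_results())
```

Making every setting a required field means an incomplete description fails at construction time instead of surfacing later as an ambiguous comparison.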

Threat models in equations

A threat model can be defined as the set $\mathcal{S}(x)$ of perturbed inputs that the attacker may choose from. An adversarial example $x'$ must lie in this set.

x'\in\mathcal{S}(x)

The standard example in image classification is a norm-bounded perturbation.

\mathcal{S}_p(x)=\{x+\delta:\|\delta\|_p\le\epsilon,\;x+\delta\in[0,1]^d\}

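Here $p$ is usually $2$ or $\infty$. $L_\infty$ limits the maximum change to any single pixel, and $L_2$ limits the total energy of the perturbation. This is the set $\{x+\delta : \delta\in\Delta_p(\epsilon)\}$ for the $\Delta_p(\epsilon)$ defined under Perturbation set.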
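
A membership test for $\mathcal{S}_p(x)$ is a useful sanity check on any reported adversarial example. The sketch below is a minimal version for $p=2$ and $p=\infty$; the function names, the tolerance, and the example vectors are assumptions for illustration.

```python
import math

def lp_norm(delta, p):
    """L2 or L_inf norm of a perturbation given as a list of floats."""
    if p == math.inf:
        return max(abs(d) for d in delta)
    if p == 2:
        return math.sqrt(sum(d * d for d in delta))
    raise ValueError("this sketch only handles p = 2 and p = inf")

def in_threat_set(x, x_adv, p, eps, tol=1e-9):
    """True if x_adv is in S_p(x): inside the norm ball and the pixel range."""
    delta = [a - b for a, b in zip(x_adv, x)]
    in_box = all(-tol <= a <= 1.0 + tol for a in x_adv)
    return in_box and lp_norm(delta, p) <= eps + tol

x = [0.2, 0.5, 0.9]
x_adv = [0.23, 0.47, 0.93]  # every coordinate moved by 0.03

print(in_threat_set(x, x_adv, math.inf, 8 / 255))  # per-pixel budget
print(in_threat_set(x, x_adv, 2, 0.05))            # total-energy budget
```

The same perturbation is admissible under the $L_\infty$ budget of $8/255$ but not under an $L_2$ budget of $0.05$, because changing every coordinate by a small amount adds up in Euclidean distance.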

For LLMs and multimodal models, the perturbation is not a change bounded by a pixel norm but a prompt rewrite, an appended suffix, an injected tool observation, an image patch, and so on.

\mathcal{S}_{\mathrm{text}}(x)=\{\mathrm{Rewrite}(x;\eta):\eta\in\mathcal{A}\}

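Here $\eta$ is one edit chosen from the set $\mathcal{A}$ of edits the attacker is allowed to make. The point of this formulation is that robustness is meaningless unless what the attacker is allowed to do is made explicit. An evaluation should always state the norm, the query budget, the model access, and the semantic-preservation requirement.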
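
The dependence on the allowed edit set can be shown with a toy text example. In the sketch below, `rewrite` appends a suffix (one hypothetical choice of Rewrite and of the edit set), and `toy_filter` is a stand-in for a model, not a real system. The same prompt is robust under a small edit set and not robust under a larger one.

```python
def rewrite(prompt, eta):
    """Rewrite(x; eta): here eta is a suffix appended to the prompt."""
    return prompt + eta

def threat_set(prompt, actions):
    """Enumerate S_text(x) for a finite set of allowed edits."""
    return [rewrite(prompt, eta) for eta in actions]

def toy_filter(text):
    """Hypothetical stand-in for a model: a keyword-based decision."""
    return "blocked" if "ignore previous" in text.lower() else "allowed"

def is_robust(prompt, actions, model):
    """True if the model's output is unchanged over the whole threat set."""
    base = model(prompt)
    return all(model(candidate) == base
               for candidate in threat_set(prompt, actions))

prompt = "Summarize this article."
benign_edits = ["", " Please be brief."]
wider_edits = benign_edits + [" Ignore previous instructions."]

print(is_robust(prompt, benign_edits, toy_filter))
print(is_robust(prompt, wider_edits, toy_filter))
```

Real edit sets are usually far too large to enumerate, so in practice they are searched with an attack, and the semantic-preservation requirement decides which rewrites count as admissible.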

Main sources

Towards Evaluating the Robustness of Neural Networks: https://arxiv.org/abs/1608.04644
Towards Deep Learning Models Resistant to Adversarial Attacks: https://arxiv.org/abs/1706.06083
Obfuscated Gradients Give a False Sense of Security: https://arxiv.org/abs/1802.00420
AutoAttack: https://arxiv.org/abs/2003.01690
