Mixture of Experts

Mixture of Experts (MoE) is a sparse architecture that activates only a subset of specialized sub-networks (experts) for each token. It is a key design in recent large LLMs, including Mixtral, DeepSeek-V3, Qwen-MoE, GLaM, and Switch Transformer.

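Basic idea

The dense FFN is replaced by several expert FFNs plus a router.

For each token, the router selects the top-k experts (typically k = 1 or 2, although fine-grained designs such as DeepSeek-V3 activate more) and combines their outputs as a weighted sum, where g_e(x) is the gate weight the router assigns to expert e.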

y = \sum_{e \in \text{TopK}(x)} g_e(x) \cdot \text{Expert}_e(x)

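Why it works

The total parameter count is huge, but each token uses only a fraction of it
For the same FLOPs per token, the model can hold more parameters
Each expert can more easily specialize in a different domain or pattern
Scaling can continue while per-token inference compute stays low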
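The gap between total and active parameters can be made concrete with a little arithmetic. The sketch below counts parameters for a single MoE FFN layer. The layer sizes are hypothetical, each expert is assumed to be a plain two-matrix FFN without biases, and the function name `moe_ffn_params` is made up for illustration.

```python
def moe_ffn_params(d_model, d_ff, num_experts, top_k):
    """Return (total, active) parameter counts for one MoE FFN layer.

    Assumes each expert is an up-projection plus a down-projection
    with no biases, and that the router is a single linear map.
    """
    expert = 2 * d_model * d_ff            # parameters in one expert FFN
    router = d_model * num_experts         # router weight matrix
    total = num_experts * expert + router  # everything stored in the layer
    active = top_k * expert + router       # what a single token actually touches
    return total, active


# Hypothetical sizes, chosen only to keep the numbers readable.
total, active = moe_ffn_params(d_model=1024, d_ff=4096, num_experts=8, top_k=2)
print(f"total={total:,} active={active:,} ratio={total / active:.2f}")
```

With 8 experts and top-2 routing, the layer stores about four times as many parameters as any one token uses, which is the trade-off the list above describes.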

Load balancing

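Without any constraint, the router tends to collapse onto a few experts, which then receive a disproportionate share of tokens. The standard remedy is to add an auxiliary load balancing loss that evens out usage across experts.

Switch Transformer, GShard, ST-MoE, and DeepSeek-V3 each propose a different balancing strategy. DeepSeek-V3 also adopts an auxiliary-loss-free design that dynamically adjusts a per-expert routing bias.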
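The bias-based idea can be sketched in a few lines. This is a simplified illustration, not DeepSeek-V3's implementation. It assumes that the bias is added to the routing scores only when choosing experts (the gate weights would still come from the unbiased scores), and that after each step the bias is lowered for experts above the mean load and raised for those below it. The names `select_top_k`, `update_bias`, and `step_size`, and the toy scores, are hypothetical.

```python
def select_top_k(scores, bias, k):
    """Pick k experts by biased score; the bias influences selection only."""
    biased = [s + b for s, b in zip(scores, bias)]
    order = sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)
    return order[:k]


def update_bias(bias, counts, step_size):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(counts) / len(counts)
    updated = []
    for b, c in zip(bias, counts):
        if c > mean_load:
            updated.append(b - step_size)
        elif c < mean_load:
            updated.append(b + step_size)
        else:
            updated.append(b)
    return updated


# Toy batch: every token has the same raw scores, which favor expert 0.
tokens = [[0.5, 0.25, 0.125, 0.125]] * 8
bias = [0.0] * 4
for step in range(4):
    counts = [0] * 4
    for scores in tokens:
        for e in select_top_k(scores, bias, k=1):
            counts[e] += 1
    print(step, counts, bias)
    bias = update_bias(bias, counts, step_size=0.125)
```

Because the adjustment happens outside the loss, it steers load without adding a gradient term that competes with the language modeling objective.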

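Representative MoE LLMs

Model	Experts	Active / total parameters
Switch Transformer	up to 2,048 (Switch-C)	up to about 1.6T total (Switch-C)
GLaM	64 per MoE layer	about 97B active / 1.2T total
Mixtral 8x7B	8 (top-2)	12.9B active / 46.7B total
DeepSeek-V3	256 routed + 1 shared	37B active / 671B total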

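Challenges

Inference memory must hold all parameters, not only the active ones
Distributed implementation for expert parallelism is complex
Routing instability
Domain shift can skew routing toward a few experts
Interactions with the KV cache and batching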
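The memory point in the list above is easy to quantify. The sketch below estimates weight memory only, using the Mixtral 8x7B figures from the table and assuming 2 bytes per parameter (16-bit weights). It ignores the KV cache, activations, and framework overhead, and the helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Rough weight footprint in GB (10^9 bytes); weights only."""
    return num_params * bytes_per_param / 1e9


mixtral_total = 46.7e9   # every expert has to be resident in memory
mixtral_active = 12.9e9  # parameters a single token actually uses

resident = weight_memory_gb(mixtral_total, bytes_per_param=2)
used_per_token = weight_memory_gb(mixtral_active, bytes_per_param=2)
print(f"resident weights: {resident:.1f} GB")
print(f"weights touched per token: {used_per_token:.1f} GB")
```

Compute per token scales with the active parameters, but the hardware has to be provisioned for the resident total.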

Router and load balancing in equations

In an MoE layer, the router maps each token's hidden state $\mathbf{x}$ to a probability distribution over experts, where $\mathbf{W}_r$ is the router weight matrix.

p(e\mid \mathbf{x})=\mathrm{softmax}(\mathbf{W}_r\mathbf{x})_e

Top-k routing selects the set of k experts with the highest probability.

\mathcal{E}_k(\mathbf{x})=\operatorname{TopK}_e\,p(e\mid \mathbf{x})

The output of the MoE layer is the weighted sum of the selected experts' outputs.

\mathbf{y}=\sum_{e\in\mathcal{E}_k(\mathbf{x})}p(e\mid\mathbf{x})\,\mathrm{Expert}_e(\mathbf{x})
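The three equations above translate almost line by line into code. The following is a minimal single-token sketch in plain Python. The experts are toy functions that merely scale their input, the router matrix is hand-picked, and the names `softmax`, `matvec`, and `moe_forward` are made up for this illustration. Following the formula above, the weights are the full-softmax probabilities; some implementations (Mixtral, for example) instead normalize over the selected experts only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [v / s for v in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def moe_forward(x, W_r, experts, k):
    """Route one token: softmax router, top-k selection, weighted sum."""
    probs = softmax(matvec(W_r, x))                      # p(e | x)
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    y = [0.0] * len(x)
    for e in top:                                        # only k experts run
        out = experts[e](x)
        y = [yi + probs[e] * oi for yi, oi in zip(y, out)]
    return y, top, probs


# Toy setup: four experts that scale the input by 1, 2, 3, and 4.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
W_r = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
y, top, probs = moe_forward([1.0, 0.0], W_r, experts, k=2)
print(top, y)
```

Only the k selected expert functions are evaluated, which is where the compute saving comes from.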

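Without any constraint, the router tends to concentrate tokens on a few experts. Let $f_e$ be the fraction of tokens actually dispatched to expert $e$, $p_e$ the router probability for expert $e$ averaged over tokens, and $E$ the number of experts. Conceptually, the load balancing loss can then be written as follows.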

\mathcal{L}_{aux}=E\sum_{e=1}^{E} f_e p_e

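The intuition is to keep the router from favoring particular experts too heavily, both in the probabilities it outputs and in the actual assignments. MoE lets the total parameter count grow, but when routing is skewed a few experts become overloaded, and both quality and efficiency suffer.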
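A small numeric check shows how the loss behaves. The sketch below evaluates the formula above for a batch with top-1 assignments. The function name `load_balancing_loss` and the toy batches are hypothetical, and the scaling coefficient that usually multiplies this term in training is omitted.

```python
def load_balancing_loss(probs_per_token, assignments, num_experts):
    """Compute E * sum_e f_e * p_e for one batch.

    probs_per_token: router probabilities, one list per token
    assignments: index of the expert each token was dispatched to
    """
    n = len(probs_per_token)
    f = [0.0] * num_experts
    for e in assignments:
        f[e] += 1.0 / n                                  # fraction of tokens per expert
    p = [sum(tok[e] for tok in probs_per_token) / n      # mean router probability
         for e in range(num_experts)]
    return num_experts * sum(fe * pe for fe, pe in zip(f, p))


# Perfectly balanced routing over two experts.
balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)
# Fully collapsed routing: every token goes to expert 0.
collapsed = load_balancing_loss([[1.0, 0.0]] * 4, [0, 0, 0, 0], num_experts=2)
print(balanced, collapsed)
```

Under uniform routing, $f_e = p_e = 1/E$ and the loss equals 1. When everything collapses onto one expert it rises to $E$, so minimizing it pushes the router toward even usage.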

Main sources

Switch Transformer: https://arxiv.org/abs/2101.03961
GShard: https://arxiv.org/abs/2006.16668
Mixtral 8x7B: https://arxiv.org/abs/2401.04088
DeepSeek-V3 technical report: https://arxiv.org/abs/2412.19437
