Published at ICLR 2026

LD-MOLE

Learnable Dynamic Routing for Mixture of LoRA Experts
A principled, differentiable alternative to Top-K routing

$ \text{Learnable} \rightarrow \text{Differentiable} \rightarrow \text{Dynamic} $

Motivation: Breaking the Top-K Shackles

Combining Mixture-of-Experts (MoE) with Parameter-Efficient Fine-Tuning (PEFT, e.g. LoRA) has become a mainstream recipe for adapting large models. Yet existing MoLE (Mixture of LoRA Experts) methods remain chained to conventional Top-K routing.

Observed Problems

  • Non-differentiability: Top-K selection is discrete and non-differentiable, blocking end-to-end joint optimization.
  • Fixed budget: Every token is assigned the same number of experts, whether it is trivial or hard, leaving no flexibility.
  • Hyperparameter trap: $K$ requires careful manual tuning and is often suboptimal across layers.
LD-MOLE architecture overview
Fig. 1: LD-MOLE overview — learnable dynamic routing embedded inside LoRA adapters.
Significance: LD-MOLE introduces a learnable sparsity parameter that delivers token-aware, layer-aware dynamic expert allocation, improving performance while staying efficient.

Formulation: From Discrete to Continuous

The core idea of LD-MOLE is to reformulate routing as a projection onto the probability simplex and to leverage the Sparsegen operator for a closed-form solution.

1. Expert Score Computation

Given a token embedding $x \in \mathbb{R}^d$, the gating score vector $u \in \mathbb{R}^E$ is:

$$ u = W_{gate}x $$

2. Learnable Dynamic Routing (Sparsegen)

Instead of Top-K selection, we solve:

$$ p = \arg\min_{p \in \mathbb{R}^E} \|p - u\|_2^2 - \lambda \|p\|_2^2 \quad \text{s.t.} \quad p \ge 0,\ \mathbf{1}^\top p = 1, \qquad \text{with } \lambda < 1 $$

where $\lambda$ is a token-specific sparsity factor predicted by a lightweight shared MLP: $\lambda = f(x)$.

3. Closed-Form Solution & Threshold

By Proposition 1, the $i$-th component of the routing distribution $p$ is:

$$ p_i = \left[ \frac{u_i - \tau}{1 - \lambda} \right]_+ $$

with threshold $\tau$ given by:

$$ \tau = \frac{U_k - 1 + \lambda}{k} $$

where $k = \max\{k \in [E] \mid 1 - \lambda + k u_{(k)} > U_k\}$ and $U_k = \sum_{j=1}^{k} u_{(j)}$ is the prefix sum of the sorted scores. This makes the entire routing procedure fully differentiable.
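The closed-form projection above is compact enough to sketch in a few lines. The following NumPy implementation is illustrative only (not the authors' code); the function name `sparsegen_route` and its argument names are ours:

```python
import numpy as np

def sparsegen_route(u, lam):
    """Sparsegen projection onto the simplex, per the closed form above.

    u   : (E,) gating scores
    lam : scalar sparsity factor, must satisfy lam < 1
    Returns p with p_i = [(u_i - tau)/(1 - lam)]_+ and sum(p) = 1.
    """
    u_sorted = np.sort(u)[::-1]              # u_(1) >= u_(2) >= ... >= u_(E)
    U = np.cumsum(u_sorted)                  # prefix sums U_k
    ks = np.arange(1, len(u) + 1)
    # largest k with 1 - lam + k * u_(k) > U_k (support size)
    k = ks[(1.0 - lam + ks * u_sorted) > U].max()
    tau = (U[k - 1] - 1.0 + lam) / k         # threshold from the support
    return np.maximum(0.0, (u - tau) / (1.0 - lam))
```

Note that with $\lambda = 0$ this reduces to the familiar sparsemax projection; larger $\lambda$ yields sparser routing distributions.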

Algorithm: The Art of Dynamic Allocation

LD-MOLE predicts each token's resource demand on the fly during the forward pass:

Algorithm: LD-MOLE Forward Pass
1. Input: Token features $x_t$, Pre-trained weights $W_{base}$
2. Logits: Compute gate scores $u = W_{gate}x_t$
3. Predict Sparsity: $\lambda_t = \text{MLP}_{shared}(x_t)$
4. Sort: $u_{(1)} \ge u_{(2)} \ge \dots \ge u_{(E)}$
5. Determine $k$: Find the largest $k$ satisfying $1 - \lambda_t + k u_{(k)} > U_k$
6. Compute Weights: $p_{t,i} = \max(0, (u_i - \tau)/(1-\lambda_t))$
7. Aggregate: $h_t = W_{base}x_t + \sum_{i=1}^E p_{t,i}(A_i B_i x_t)$
8. Return: Output embedding $h_t$
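To make the eight steps concrete, here is a toy single-token forward pass in NumPy. All shapes, random weights, and the fixed `lam` value (standing in for the shared MLP's prediction in step 3) are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, E, r = 16, 8, 4                       # toy hidden dim, experts, LoRA rank

W_base = rng.normal(size=(d, d)) * 0.1   # frozen pre-trained weight
W_gate = rng.normal(size=(E, d)) * 0.1   # gating projection
A = rng.normal(size=(E, d, r)) * 0.1     # per-expert up-projections (A_i)
B = rng.normal(size=(E, r, d)) * 0.1     # per-expert down-projections (B_i)

def sparsegen(u, lam):
    # Closed-form projection (see Proposition 1 in the text).
    s = np.sort(u)[::-1]
    U = np.cumsum(s)
    ks = np.arange(1, len(u) + 1)
    k = ks[(1.0 - lam + ks * s) > U].max()
    tau = (U[k - 1] - 1.0 + lam) / k
    return np.maximum(0.0, (u - tau) / (1.0 - lam))

def forward(x, lam):
    """LD-MOLE-style forward pass for one token (sketch)."""
    u = W_gate @ x                        # step 2: gating logits
    p = sparsegen(u, lam)                 # steps 4-6 (lam replaces MLP(x))
    h = W_base @ x                        # frozen base path
    for i in np.nonzero(p)[0]:            # step 7: weighted LoRA experts
        h = h + p[i] * (A[i] @ (B[i] @ x))
    return h, p

h, p = forward(rng.normal(size=d), lam=0.5)
```

Because $1 - \lambda > 0$, the support-size condition always admits $k = 1$, so at least one expert fires for every token, which is the "zero-activation" guarantee discussed in the results.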
Routing detail
Fig. 2: Learnable dynamic routing vs. Top-K routing.

Experimental Setup: A Rigorous Foundation

Experiments span multiple model scales and task types to validate the generality of the conclusions.

Base Models

  • Qwen3-1.7B
  • Llama-3.2-3B
  • Llama-3.1-8B

Hyperparameters

  • LoRA Rank: $r=8$
  • LoRA Alpha: $\alpha_{lora}=16$
  • Number of experts: $E=8$
  • Batch Size: 16
  • Learning Rate: $1 \times 10^{-4}$
  • Epochs: 10 (3 for Llama-8B)

Loss Function Design

The total loss has three components:

$$ \mathcal{L}_{total} = \mathcal{L}_{LM} + \alpha \mathcal{L}_{lb} + \beta \mathcal{L}_{sparse} $$
  • $\mathcal{L}_{LM}$: standard cross-entropy language-modeling loss.
  • $\mathcal{L}_{lb}$: load-balancing loss, preventing routing collapse.
  • $\mathcal{L}_{sparse}$: a novel analytic sparsity loss (Proposition 2) that pushes $\lambda$ into the target range.
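As a sketch of how the three terms combine: the snippet below uses a common switch-style balance penalty as a stand-in for $\mathcal{L}_{lb}$, and treats $\mathcal{L}_{sparse}$ as a given scalar since the paper's analytic form (Proposition 2) is not reproduced here. All names and the weights `alpha`, `beta` are illustrative assumptions:

```python
import numpy as np

alpha, beta = 0.01, 0.01   # aux-loss weights (illustrative, not the paper's)

def load_balance_loss(P):
    """Switch-style balance penalty on routing distributions P (T x E):
    E * sum_i f_i * m_i, where f_i is the fraction of tokens that use
    expert i and m_i its mean routing weight. (A common form; the
    paper's exact L_lb may differ.)"""
    T, E = P.shape
    f = (P > 0).mean(axis=0)   # fraction of tokens routed to expert i
    m = P.mean(axis=0)         # mean routing probability of expert i
    return E * float(f @ m)

def total_loss(lm_loss, P, sparsity_penalty):
    # L_total = L_LM + alpha * L_lb + beta * L_sparse
    return lm_loss + alpha * load_balance_loss(P) + beta * sparsity_penalty
```

The balance term is minimized when tokens spread evenly over experts, which is what prevents the routing collapse mentioned above.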

Results: Leading Across the Board

1. Core Performance Comparison

LD-MOLE achieves the best average score across 9 benchmarks covering reasoning, commonsense, and language ability.

| Model | Method | MMLU Pro | ARC-C | HellaS | Avg |
|---|---|---|---|---|---|
| Llama-3.2-3B | MoLA (Top-2) | 42.31 | 71.91 | 87.31 | 78.72 |
| Llama-3.2-3B | ReMoLE (ReLU) | 48.01 | 75.25 | 93.44 | 81.42 |
| Llama-3.2-3B | LD-MOLE (Ours) | 49.58 | 74.58 | 93.60 | 82.05 |

2. Key Insights

  • Do deeper layers need more experts? $\lambda$ varies significantly across layers: Fig. 4 shows lower layers activate more experts, with routing growing sparser with depth.
  • Rare tokens demand more experts: Fig. 5 reveals that low-frequency (typically harder) tokens activate noticeably more experts than common ones.
  • Zero-activation solved: Unlike ReLU-based routing (which can leave tokens with no experts), LD-MOLE theoretically guarantees at least one expert is active.

Reviewer Verdict: Strong Accept, with a Sharp Critique

Pros

1. Mathematical elegance: Sparsegen turns the hand-picked $K$ into a continuous learnable parameter, with solid theoretical backing.

2. Dynamic allocation in practice: Genuinely token-level, fine-grained resource allocation, which is very promising for long-context and complex tasks.

3. Engineering completeness: Not just a method; the $\mathcal{L}_{sparse}$ loss enables explicit compute control, making it deployment-ready.

Cons & Suggestions

1. MLP overhead: The paper claims the routing MLP is lightweight, but at ultra-large-scale inference, adding this micro-op to every layer may hurt throughput. A detailed latency analysis would help.

2. Cold-start stability: Routing crystallizes quickly during training (essentially frozen by the end of Epoch 1). Is this a local optimum? Exploration mechanisms in the early phase (e.g. noise perturbation) could help.

3. Cross-task generalization: Experiments focus on NLU; routing behavior under code generation or long-context regimes is under-explored.

One More Thing: Expert Activation Heatmap

The paper compares routing ratios at Epoch 1 and Epoch 10. There is no "big reshuffle": training mainly fine-tunes a structure established early on. This stability is both an asset (fast training) and a limitation (small search space).

Epoch 1 vs Epoch 10 routing heatmap
Routing activation patterns remain highly stable throughout training.