Published at ICLR 2026

LD-MOLE

Learnable Dynamic Routing for Mixture of LoRA Experts
A principled, differentiable alternative to Top-K routing

$ \text{Learnable} \rightarrow \text{Differentiable} \rightarrow \text{Dynamic} $

Motivation: Breaking the Top-K Shackles

Combining Mixture-of-Experts (MoE) with Parameter-Efficient Fine-Tuning (PEFT, e.g. LoRA) has become a mainstream recipe for adapting large models. Yet existing MoLE (Mixture of LoRA Experts) methods remain chained to conventional Top-K routing.

Observed Problems

  • Non-differentiability: Top-K selection is discrete and non-differentiable, blocking end-to-end joint optimization.
  • Fixed budget: Every token is assigned the same number of experts, whether it is trivial or hard, leaving no flexibility.
  • Hyperparameter trap: $K$ requires careful manual tuning and is often suboptimal across layers.
LD-MOLE architecture overview
Fig. 1: LD-MOLE overview — learnable dynamic routing embedded inside LoRA adapters.
Significance: LD-MOLE introduces a learnable sparsity parameter that delivers token-aware, layer-aware dynamic expert allocation, improving performance while staying efficient.

Formulation: From Discrete to Continuous

The core idea of LD-MOLE is to reformulate routing as a projection onto the probability simplex and to leverage the Sparsegen operator for a closed-form solution.

1. Expert Score Computation

Given a token embedding $x \in \mathbb{R}^d$, the gating score vector $u \in \mathbb{R}^E$ is:

$$ u = W_{gate}x $$

2. Learnable Dynamic Routing (Sparsegen)

Instead of Top-K selection, we solve:

$$ p = \arg\min_{p \in \mathbb{R}^E} \|p - u\|_2^2 - \lambda \|p\|_2^2 \quad \text{s.t.} \quad p \ge 0,\ \mathbf{1}^\top p = 1, \qquad \text{with } \lambda < 1 $$

where $\lambda$ is a token-specific sparsity factor predicted by a lightweight shared MLP: $\lambda = f(x)$.

3. Closed-Form Solution & Threshold

By Proposition 1, the $i$-th component of the routing distribution $p$ is:

$$ p_i = \left[ \frac{u_i - \tau}{1 - \lambda} \right]_+ $$

with threshold $\tau$ given by:

$$ \tau = \frac{U_k - 1 + \lambda}{k} $$

where $k = \max\{k \in [E] \mid 1 - \lambda + k u_{(k)} > U_k\}$ and $U_k = \sum_{j=1}^{k} u_{(j)}$ is the prefix sum of the sorted scores. This makes the entire routing procedure fully differentiable.
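The closed-form projection above is compact enough to sketch in a few lines. The following NumPy implementation is illustrative only (not the authors' code); the function name `sparsegen_route` and its argument names are ours:

```python
import numpy as np

def sparsegen_route(u, lam):
    """Sparsegen projection onto the simplex, per the closed form above.

    u   : (E,) gating scores
    lam : scalar sparsity factor, must satisfy lam < 1
    Returns p with p_i = [(u_i - tau)/(1 - lam)]_+ and sum(p) = 1.
    """
    u_sorted = np.sort(u)[::-1]              # u_(1) >= u_(2) >= ... >= u_(E)
    U = np.cumsum(u_sorted)                  # prefix sums U_k
    ks = np.arange(1, len(u) + 1)
    # largest k with 1 - lam + k * u_(k) > U_k (support size)
    k = ks[(1.0 - lam + ks * u_sorted) > U].max()
    tau = (U[k - 1] - 1.0 + lam) / k         # threshold from the support
    return np.maximum(0.0, (u - tau) / (1.0 - lam))
```

Note that with $\lambda = 0$ this reduces to the familiar sparsemax projection; larger $\lambda$ yields sparser routing distributions.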

Algorithm: The Art of Dynamic Allocation

LD-MOLE predicts each token's resource demand on the fly during the forward pass:

Algorithm: LD-MOLE Forward Pass
1. Input: Token features $x_t$, Pre-trained weights $W_{base}$
2. Logits: Compute gate scores $u = W_{gate}x_t$
3. Predict Sparsity: $\lambda_t = \text{MLP}_{shared}(x_t)$
4. Sort: $u_{(1)} \ge u_{(2)} \ge \dots \ge u_{(E)}$
5. Determine $k$: Find the largest $k$ satisfying $1 - \lambda_t + k u_{(k)} > U_k$
6. Compute Weights: $p_{t,i} = \max(0, (u_i - \tau)/(1-\lambda_t))$
7. Aggregate: $h_t = W_{base}x_t + \sum_{i=1}^E p_{t,i}(A_i B_i x_t)$
8. Return: Output embedding $h_t$
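To make the eight steps concrete, here is a toy single-token forward pass in NumPy. All shapes, random weights, and the fixed `lam` value (standing in for the shared MLP's prediction in step 3) are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, E, r = 16, 8, 4                       # toy hidden dim, experts, LoRA rank

W_base = rng.normal(size=(d, d)) * 0.1   # frozen pre-trained weight
W_gate = rng.normal(size=(E, d)) * 0.1   # gating projection
A = rng.normal(size=(E, d, r)) * 0.1     # per-expert up-projections (A_i)
B = rng.normal(size=(E, r, d)) * 0.1     # per-expert down-projections (B_i)

def sparsegen(u, lam):
    # Closed-form projection (see Proposition 1 in the text).
    s = np.sort(u)[::-1]
    U = np.cumsum(s)
    ks = np.arange(1, len(u) + 1)
    k = ks[(1.0 - lam + ks * s) > U].max()
    tau = (U[k - 1] - 1.0 + lam) / k
    return np.maximum(0.0, (u - tau) / (1.0 - lam))

def forward(x, lam):
    """LD-MOLE-style forward pass for one token (sketch)."""
    u = W_gate @ x                        # step 2: gating logits
    p = sparsegen(u, lam)                 # steps 4-6 (lam replaces MLP(x))
    h = W_base @ x                        # frozen base path
    for i in np.nonzero(p)[0]:            # step 7: weighted LoRA experts
        h = h + p[i] * (A[i] @ (B[i] @ x))
    return h, p

h, p = forward(rng.normal(size=d), lam=0.5)
```

Because $1 - \lambda > 0$, the support-size condition always admits $k = 1$, so at least one expert fires for every token, which is the "zero-activation" guarantee discussed in the results.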
Routing detail
Fig. 2: Learnable dynamic routing vs. Top-K routing.

Experimental Setup: A Rigorous Foundation

Experiments span multiple model scales and task types to validate the generality of the conclusions.

Base Models

  • Qwen3-1.7B
  • Llama-3.2-3B
  • Llama-3.1-8B

Hyperparameters

  • LoRA Rank: $r=8$
  • LoRA Alpha: $\alpha_{lora}=16$
  • Number of experts: $E=8$
  • Batch Size: 16
  • Learning Rate: $1 \times 10^{-4}$
  • Epochs: 10 (3 for Llama-8B)

Loss Function Design

The total loss has three components:

$$ \mathcal{L}_{total} = \mathcal{L}_{LM} + \alpha \mathcal{L}_{lb} + \beta \mathcal{L}_{sparse} $$
  • $\mathcal{L}_{LM}$: standard cross-entropy language-modeling loss.
  • $\mathcal{L}_{lb}$: load-balancing loss, preventing routing collapse.
  • $\mathcal{L}_{sparse}$: a novel analytic sparsity loss (Proposition 2) that pushes $\lambda$ into the target range.
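As a sketch of how the three terms combine: the snippet below uses a common switch-style balance penalty as a stand-in for $\mathcal{L}_{lb}$, and treats $\mathcal{L}_{sparse}$ as a given scalar since the paper's analytic form (Proposition 2) is not reproduced here. All names and the weights `alpha`, `beta` are illustrative assumptions:

```python
import numpy as np

alpha, beta = 0.01, 0.01   # aux-loss weights (illustrative, not the paper's)

def load_balance_loss(P):
    """Switch-style balance penalty on routing distributions P (T x E):
    E * sum_i f_i * m_i, where f_i is the fraction of tokens that use
    expert i and m_i its mean routing weight. (A common form; the
    paper's exact L_lb may differ.)"""
    T, E = P.shape
    f = (P > 0).mean(axis=0)   # fraction of tokens routed to expert i
    m = P.mean(axis=0)         # mean routing probability of expert i
    return E * float(f @ m)

def total_loss(lm_loss, P, sparsity_penalty):
    # L_total = L_LM + alpha * L_lb + beta * L_sparse
    return lm_loss + alpha * load_balance_loss(P) + beta * sparsity_penalty
```

The balance term is minimized when tokens spread evenly over experts, which is what prevents the routing collapse mentioned above.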

Results: Leading Across the Board

1. Core Performance Comparison

LD-MOLE achieves the best average score across 9 benchmarks covering reasoning, commonsense, and language ability.

| Model | Method | MMLU Pro | ARC-C | HellaS | Avg |
|---|---|---|---|---|---|
| Llama-3.2-3B | MoLA (Top-2) | 42.31 | 71.91 | 87.31 | 78.72 |
| Llama-3.2-3B | ReMoLE (ReLU) | 48.01 | 75.25 | 93.44 | 81.42 |
| Llama-3.2-3B | LD-MOLE (Ours) | 49.58 | 74.58 | 93.60 | 82.05 |

2. Key Insights

  • Do deeper layers need more experts? $\lambda$ varies significantly across layers: Fig. 4 shows lower layers activate more experts, with routing growing sparser with depth.
  • Rare tokens demand more experts: Fig. 5 reveals that low-frequency (typically harder) tokens activate noticeably more experts than common ones.
  • Zero-activation solved: Unlike ReLU-based routing (which can leave tokens with no experts), LD-MOLE theoretically guarantees at least one expert is active.

Reviewer Verdict: Strong Accept, with a Sharp Critique

Pros

1. Mathematical elegance: Sparsegen turns the hand-picked $K$ into a continuous learnable parameter, with solid theoretical backing.

2. Dynamic allocation in practice: Genuinely token-level, fine-grained resource allocation, which is very promising for long-context and complex tasks.

3. Engineering completeness: Not just a method; the $\mathcal{L}_{sparse}$ loss enables explicit compute control, making it deployment-ready.

Cons & Suggestions

1. MLP overhead: The paper claims the routing MLP is lightweight, but at ultra-large-scale inference, adding this micro-op to every layer may hurt throughput. A detailed latency analysis would help.

2. Cold-start stability: Routing crystallizes quickly during training (essentially frozen by the end of Epoch 1). Is this a local optimum? Exploration mechanisms in the early phase (e.g. noise perturbation) could help.

3. Cross-task generalization: Experiments focus on NLU; routing behavior under code generation or long-context regimes is under-explored.

One More Thing: Expert Activation Heatmap

The paper compares routing ratios at Epoch 1 and Epoch 10. There is no "big reshuffle": training mainly fine-tunes a structure established early on. This stability is both an asset (fast training) and a limitation (small search space).

Epoch 1 vs Epoch 10 routing heatmap
Routing activation patterns remain highly stable throughout training.