Learnable Dynamic Routing for Mixture of LoRA Experts
A principled, differentiable alternative to Top-K routing
Combining Mixture-of-Experts (MoE) with Parameter-Efficient Fine-Tuning (PEFT, e.g. LoRA) has become a mainstream recipe for adapting large models. Yet existing MoLE (Mixture of LoRA Experts) methods remain tied to conventional Top-K routing.
The core idea of LD-MOLE is to reformulate routing as a projection onto the probability simplex and leverage the Sparsegen operator for a closed-form solution.
Given a token embedding $x \in \mathbb{R}^d$, a gating layer first produces the score vector $u \in \mathbb{R}^E$, one score per expert.
Instead of Top-K selection, we solve:

$$p^{*} = \operatorname*{arg\,min}_{p \in \Delta^{E-1}} \; \lVert p - u \rVert_2^2 - \lambda \lVert p \rVert_2^2$$
where $\lambda$ is a token-specific sparsity factor predicted by a lightweight shared MLP: $\lambda = f(x)$.
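The prediction head $f$ can be sketched as a small MLP whose output is squashed below 1 (the Sparsegen projection requires $\lambda < 1$). The hidden width, nonlinearity, and scaling here are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: d = model dim, h = hidden width of the shared MLP.
d, h = 16, 8
W1, b1 = rng.normal(size=(h, d)) * 0.1, np.zeros(h)
w2, b2 = rng.normal(size=h) * 0.1, 0.0

def predict_lambda(x, lam_max=0.99):
    """Map a token embedding x to a sparsity factor lambda in (0, lam_max).

    Sparsegen requires lambda < 1, so the sigmoid output is rescaled
    by lam_max. The two-layer tanh MLP is an assumed architecture.
    """
    hidden = np.tanh(W1 @ x + b1)                         # hidden layer
    return lam_max / (1.0 + np.exp(-(w2 @ hidden + b2)))  # scaled sigmoid

x = rng.normal(size=d)
lam = predict_lambda(x)
```

Because $\lambda$ is produced per token, each token gets its own sparsity level rather than a global hand-tuned $K$.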
By Proposition 1, the $i$-th component of the routing distribution $p$ is:

$$p_i = \max\!\left(\frac{u_i - \tau(u)}{1 - \lambda},\; 0\right)$$
with threshold

$$\tau(u) = \frac{U_k - (1 - \lambda)}{k},$$
where $k = \max\{k' \in [E] \mid 1 - \lambda + k' u_{(k')} > U_{k'}\}$ and $U_{k'} = \sum_{j=1}^{k'} u_{(j)}$ is the prefix sum of the scores sorted in descending order. This makes the entire routing procedure fully differentiable.
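Under these definitions, the closed-form projection fits in a few lines of NumPy. This is a minimal single-token sketch (the function name and layout are mine, not the paper's reference implementation):

```python
import numpy as np

def sparsegen(u, lam):
    """Sparsegen projection of gating scores u (shape (E,)) onto the simplex.

    lam < 1 controls sparsity: larger lam concentrates mass on fewer
    experts; lam = 0 recovers sparsemax.
    """
    assert lam < 1.0
    E = u.shape[-1]
    z = np.sort(u)[::-1]               # scores in descending order: u_(1) >= ... >= u_(E)
    U = np.cumsum(z)                   # prefix sums U_k
    ks = np.arange(1, E + 1)
    # support size: largest k with 1 - lam + k * u_(k) > U_k
    k = ks[1.0 - lam + ks * z > U][-1]
    tau = (U[k - 1] - (1.0 - lam)) / k  # threshold tau(u)
    return np.maximum((u - tau) / (1.0 - lam), 0.0)
```

For example, `sparsegen(np.array([1.0, 0.8, 0.1]), 0.0)` keeps two experts and zeroes out the third, and the output always sums to 1, so it is a valid routing distribution.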
LD-MOLE predicts each token's resource demand on the fly during the forward pass.
Experiments span multiple model scales and task types to validate the generality of the conclusions.
The total loss combines three components, including a sparsity regularizer $\mathcal{L}_{sparse}$ that puts the compute budget under explicit control.
LD-MOLE achieves the best average score across 9 benchmarks covering reasoning, commonsense, and language.
| Model | Method | MMLU Pro | ARC-C | HellaS | Avg |
|---|---|---|---|---|---|
| Llama-3.2-3B | MoLA (Top-2) | 42.31 | 71.91 | 87.31 | 78.72 |
| | ReMoLE (ReLU) | 48.01 | 75.25 | 93.44 | 81.42 |
| | LD-MOLE (Ours) | 49.58 | 74.58 | 93.60 | 82.05 |
1. Mathematical elegance: Sparsegen turns the ad-hoc $K$ into a continuous learnable parameter, with solid theoretical backing.
2. Dynamic allocation in practice: genuinely token-level, fine-grained resource allocation, which is very promising for long-context and complex tasks.
3. Engineering completeness: beyond the method itself, the $\mathcal{L}_{sparse}$ loss enables explicit compute control, making the approach deployment-ready.
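The text above does not spell out $\mathcal{L}_{sparse}$. As a purely hypothetical illustration of how such a regularizer can budget compute, one option is to penalize the effective number of active experts per token via the inverse participation ratio (this surrogate is my own choice, not the paper's definition):

```python
import numpy as np

def sparsity_loss(P, target_k):
    """Hypothetical sparsity penalty (NOT the paper's exact L_sparse).

    P: (tokens, E) routing distributions on the simplex.
    1 / sum(p_i^2) is the inverse participation ratio, a differentiable
    proxy for the number of active experts; the loss pushes it toward
    a compute budget target_k.
    """
    eff_k = 1.0 / np.sum(P ** 2, axis=-1)
    return np.mean((eff_k - target_k) ** 2)

# Uniform routing over 4 experts has effective size 4; one-hot has 1.
P = np.array([[0.25, 0.25, 0.25, 0.25],
              [1.0, 0.0, 0.0, 0.0]])
loss = sparsity_loss(P, target_k=1.0)
```

Any term of this shape lets the trainer dial average expert usage up or down, which is what makes explicit compute control possible.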
1. MLP overhead: the paper claims the MLP is lightweight, but at ultra-large-scale inference, adding this micro-op to every layer may hurt throughput. A detailed latency analysis would help.
2. Cold-start stability: routing crystallizes quickly during training (essentially frozen by the end of Epoch 1). Is this a local optimum? Exploration mechanisms (e.g. noise perturbation) could help.
3. Cross-task generalization: experiments focus on NLU; routing behavior under code generation or long-context regimes is under-explored.
The paper shows routing ratios at Epoch 1 vs. Epoch 10. There is no "big reshuffle": training mainly fine-tunes an already-established structure. This stability is both an asset (fast convergence) and a limitation (small search space).
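The exploration idea raised in the weaknesses above could be prototyped as annealed noise on the gating scores before projection. The linear schedule and noise scale here are illustrative assumptions, not something the paper evaluates:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_scores(u, step, total_steps, sigma0=1.0):
    """Exploration heuristic (a reviewer suggestion, not part of LD-MOLE):
    perturb gating scores with Gaussian noise whose scale anneals
    linearly to zero, so early routing decisions are not frozen into
    a local optimum while late training remains deterministic.
    """
    sigma = sigma0 * max(0.0, 1.0 - step / total_steps)
    return u + sigma * rng.normal(size=u.shape)

u = np.array([1.0, 0.8, 0.1])
early = noisy_scores(u, step=0, total_steps=1000)      # perturbed
late = noisy_scores(u, step=1000, total_steps=1000)    # noise fully annealed
```

Because the perturbation is added before the simplex projection, the rest of the routing pipeline is unchanged.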