In March 2025, two heavyweight papers landed simultaneously. Despite their near-identical names, their core agendas are strikingly different:
Search-R1 is a "general-framework" study, while R1-Searcher is a "capability-shaping" study.
| Dimension | Search-R1 (UIUC/Google) | R1-Searcher (RUC) |
|---|---|---|
| Core question | How to unify "search + reasoning" inside an RL framework | How to explicitly incentivize the model to learn to search |
| Paper character | Framework-oriented, methodology-driven | Training-recipe-oriented, capability-shaping |
| Training logic | Mainly single-stage outcome-based RL | Two-stage RL: first learn to search, then learn to answer with the retrieved information |
| Primary RL algorithm | PPO / GRPO | Modified Reinforce++ (with a GRPO analysis) |
| Reward design | Simple: mostly final-answer correctness (EM) | Staged: retrieval reward first, then answer reward (F1) |
| Task coverage | General QA + multi-hop QA, 7 datasets | Focused on multi-hop QA, 4 datasets |
| Key innovation | "Search engine as environment" RL formulation + retrieved-token masking | Explicit incentive for search capability + two-stage training curriculum |
| Overall style | Closer to a search-augmented RL framework | Closer to an RL-trained search-agent recipe |
**Search-R1's problem framing.** Existing methods either retrieve only once after the input, or make the model "look like it can search" (tool use) without truly learning the behavior during training. The authors set out to answer:
👉 a problem set flavored toward "framework & optimization"
**R1-Searcher's problem framing.** Existing LRMs rely on internal knowledge and therefore hallucinate, while prior methods either generalize poorly or cost too much. The authors set out to answer:
👉 a problem set flavored toward "capability incentive & behavior shaping"
R1-Searcher's answer is a two-stage curriculum:
**Stage 1:** encourage the model to issue retrieval calls. Only the format and the search action earn reward (retrieval reward + format reward); answer correctness does not matter yet.
**Stage 2:** keep reasoning and retrieving, but now the final answer's accuracy is what counts.
Stage 2's reward: answer reward (F1) + format penalty.

**The shared secret both papers uncovered: Retrieved-Token Loss Masking**
When computing the RL gradient, text returned by the external search engine must be excluded. Without masking, the model will try to "predict" Wikipedia content, leading to extremely unstable optimization (reward hacking or collapse).
Search-R1's PPO objective states this clearly:

$$
\mathcal{J}_{\mathrm{PPO}}(\theta)=\mathbb{E}\!\left[\frac{1}{\sum_{t=1}^{|y|}\mathbb{I}(y_t)}\sum_{t=1}^{|y|}\textcolor{red}{\mathbb{I}(y_t)}\cdot\min\!\left(\frac{\pi_\theta(y_t\mid x,y_{<t};\mathcal{R})}{\pi_{\mathrm{old}}(y_t\mid x,y_{<t};\mathcal{R})}A_t,\;\mathrm{clip}\!\left(\frac{\pi_\theta(y_t\mid x,y_{<t};\mathcal{R})}{\pi_{\mathrm{old}}(y_t\mid x,y_{<t};\mathcal{R})},\,1-\epsilon,\,1+\epsilon\right)A_t\right)\right]
$$

where $\mathcal{R}$ denotes the search engine interleaved into the rollout.
where $\textcolor{red}{\mathbb{I}(y_t)}$ is the core masking function:

$$
\mathbb{I}(y_t)=\begin{cases}1, & y_t \text{ is generated by the LLM}\\ 0, & y_t \text{ is returned by the search engine}\end{cases}
$$
💡 Both papers' ablations agree: without this mask, performance drops sharply or training collapses.
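The mask is straightforward to apply during loss computation. A minimal pure-Python sketch (function and variable names are illustrative, not from either paper's codebase):

```python
# Minimal sketch of retrieved-token loss masking. Each rollout token
# carries a flag: True if the policy generated it, False if it was pasted
# in from the search engine. Only policy-generated tokens contribute.

def masked_policy_loss(per_token_losses, generated_mask):
    """Average per-token RL losses over generated tokens only.

    per_token_losses: list[float], one loss term per token in the rollout.
    generated_mask:   list[bool], True where the token was produced by the
                      LLM, False where it came back from the retriever.
    """
    assert len(per_token_losses) == len(generated_mask)
    kept = [l for l, keep in zip(per_token_losses, generated_mask) if keep]
    if not kept:  # degenerate rollout with no generated tokens
        return 0.0
    # Normalize by the number of *generated* tokens, mirroring the
    # 1 / sum(I(y_t)) factor in Search-R1's objective.
    return sum(kept) / len(kept)


# Toy rollout: reasoning tokens (generated), a retrieved passage (masked),
# then the final answer (generated).
losses = [0.5, 0.3, 9.0, 9.0, 0.2]        # retriever tokens have huge loss
mask   = [True, True, False, False, True]  # but they are excluded
print(masked_policy_loss(losses, mask))    # only 0.5, 0.3, 0.2 contribute
```

Without the mask, the huge loss terms on the Wikipedia tokens would dominate the gradient, which is exactly the instability both ablations report.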
| Dimension | Search-R1 | R1-Searcher |
|---|---|---|
| Training data | NQ + HotpotQA merged | HotpotQA + 2Wiki, with difficulty filtering |
| Test data | NQ, TriviaQA, PopQA, HotpotQA, 2Wiki, Musique, Bamboogle | HotpotQA, 2Wiki, Musique, Bamboogle |
| Task type | General QA + multi-hop QA | Multi-hop QA as the core |
| Test models | Qwen2.5-3B/7B Base/Instruct, plus a 14B extension | Qwen2.5-7B-Base, Llama-3.1-8B-Instruct |
| Retrieval system | 2018 Wikipedia + E5; top-3 passages by default | KILT 2019 Wikipedia + BGE-large-en-v1.5; up to 8 retrievals |
| Core metric | EM (Exact Match) | CEM + LLM-as-Judge |
| Online search? | Mostly a local retrieval environment | Also tested generalization to the online Google API |
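Mechanically, both setups share the same search-interleaved rollout loop: the model generates until it emits a search query, the environment injects retrieved passages, and generation resumes. A minimal sketch (tag names follow Search-R1's prompt format; `llm_generate` and `retrieve` are hypothetical stand-ins for a real policy and retriever):

```python
# Sketch of the search-interleaved rollout loop both papers build on.
# `llm_generate(context, stop=...)` and `retrieve(query)` are assumed
# interfaces, not real library calls.

def rollout(question, llm_generate, retrieve, max_searches=8):
    """Generate until <answer>...</answer>, pausing at every
    <search>query</search> to inject retrieved passages as <information>."""
    context = question
    for _ in range(max_searches):
        chunk = llm_generate(context, stop=["</search>", "</answer>"])
        context += chunk
        if chunk.endswith("</answer>"):
            return context  # final answer produced; rollout ends
        if not chunk.endswith("</search>"):
            break  # model stopped without searching or answering
        # Pull the query out of the most recent <search>...</search> span.
        query = context.rsplit("<search>", 1)[1].removesuffix("</search>")
        # Retrieved text joins the context but is loss-masked during RL.
        context += f"<information>{retrieve(query)}</information>"
    return context  # search budget exhausted (R1-Searcher caps this at 8)
```

During training, every token inside the `<information>` blocks gets $\mathbb{I}(y_t)=0$, i.e. it is excluded from the policy loss.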
**Search-R1: broad coverage, tight control.** It tests both multi-hop and general QA, and holds the retriever, corpus, training data, and pre-trained LLM fixed for fair comparison. Baselines include not just RAG but also SFT, R1, and rejection sampling.
Headline result: Qwen2.5-7B averages a 24% relative improvement over the RAG baseline. It reads like a complete "benchmark + method validation" paper.
**R1-Searcher: stronger behavioral analysis, more engineering flavor.** Beyond the main results, it dissects the training recipe: RL vs. SFT, different answer rewards (F1 beats EM), data-difficulty distributions, and data diversity. It also tests the online-search scenario.
Headline result: switching Bamboogle to online search improves further over local retrieval and beats Search-o1 32B. It reads like a "training recipe + behavior study" paper.
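R1-Searcher's finding that F1 beats EM as an answer reward is easy to see from the shapes of the two metrics. A minimal sketch of both (normalization simplified to lowercasing and whitespace splitting; not either paper's exact implementation):

```python
# Two candidate answer-reward signals: exact match (the sparse 0/1 reward
# Search-R1 uses) vs token-level F1 (R1-Searcher's denser choice).

def exact_match(pred, gold):
    """1.0 iff the normalized strings are identical, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())

def f1_score(pred, gold):
    """Token-overlap F1, which gives partial credit for near-miss answers."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the Eiffel Tower", "Eiffel Tower"))      # 0.0, no partial credit
print(round(f1_score("the Eiffel Tower", "Eiffel Tower"), 3))  # 0.8
```

EM returns a sparse all-or-nothing signal, while F1 still rewards an answer that is almost right; for RL on multi-hop QA, that denser signal is what R1-Searcher credits for more stable training.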
Search-R1's core claim: "Search-augmented RL is viable, trainable, generalizable, and analyzable."
R1-Searcher's core claim: "If you really want to train a model into a search agent, reward design and training curriculum are critical."
If you're writing a Related Work or Method Comparison section, Search-R1 is the "canonical" choice: it is the framework archetype.
If you want to reproduce or build on this line of work, R1-Searcher gives you a better feel: it is full of hands-on lessons (why the model refuses to search, why staged rewards help, and so on).
Don't pit their benchmark scores against each other. In different ways, they jointly opened the door to the era of LLM autonomous retrieval.