Concurrent Works

The Dawn of Autonomous Search in LLMs

In March 2025, two heavyweight papers landed simultaneously. Despite almost identical names, their core agendas are strikingly different:
Search-R1 is a "general-framework" study; R1-Searcher is a "capability-shaping" study.

At a Glance: The Essential Differences

| Dimension | Search-R1 (UIUC/Google) | R1-Searcher (RUC) |
| --- | --- | --- |
| Core question | How to unify "search + reasoning" inside an RL framework | How to explicitly incentivize the model to learn to search |
| Paper character | Framework-oriented, methodology-driven | Training-recipe-oriented, capability-shaping |
| Training logic | Mostly single-stage, outcome-based RL | Two-stage RL: first learn to search, then learn to answer with retrieved info |
| Primary RL algorithm | PPO / GRPO | Modified Reinforce++ (with a GRPO analysis) |
| Reward design | Simple: mostly final-answer correctness (EM) | Staged: retrieval reward first, then answer reward (F1) |
| Task coverage | General QA + multi-hop QA, 7 datasets total | Focused on multi-hop QA, 4 datasets |
| Key innovation | "Search engine as environment" RL formulation + retrieved-token masking | Explicit incentive for search capability + two-stage training curriculum |
| Overall style | Closer to a search-augmented RL framework | Closer to an RL-trained search-agent recipe |

Motivation: What Are They Actually Solving?

Search-R1: Establishing a General Paradigm

Existing methods either retrieve once after the input, or make the model "look like it can search" (tool use) without truly learning the behavior during training. The authors aim to answer:

  • How to integrate search behavior into RL rollouts?
  • How to interleave multi-turn retrieval with reasoning?
  • Is outcome reward alone sufficient to learn this capability?
  • How to keep RL stable once retrieved text is injected?

👉 A "framework & optimization" flavored problem set 👉 问题意识更偏"框架与优化"

R1-Searcher: Shaping the Capability

Existing LRMs rely on internal knowledge and hallucinate, while prior methods generalize poorly or are too expensive. The authors aim to answer:

  • How to make the model actively realize it should search?
  • How to teach the "search invocation format" and habit first?
  • Can RL alone (no SFT cold start) instill the capability?

👉 A "capability incentive & behavior shaping" flavored problem set 👉 问题意识更偏"能力激励与行为塑造"

Methodology: Interleaved Generation & the Golden Trick

Search-R1 Flow

Question $x$
  → <think> reasoning
  → if external knowledge is needed: <search> query </search>
  → <information> docs </information>
  → continue <think>
  → <answer> final answer </answer>
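
To make the control flow concrete, here is a minimal Python sketch of such an interleaved rollout loop. The tag names follow the paper; `llm_generate` and `retrieve` are hypothetical stand-ins, and the loop structure is an assumption for illustration, not Search-R1's actual implementation:

```python
# Minimal sketch of a Search-R1-style interleaved rollout (illustrative only;
# `llm_generate` and `retrieve` are hypothetical stand-ins, not the paper's code).

MAX_SEARCH_TURNS = 4  # assumed cap on search calls per rollout

def rollout(question: str, llm_generate, retrieve) -> str:
    """Generate until the model commits to <answer>, injecting retrieved
    passages whenever it emits a <search> query."""
    context = question
    for _ in range(MAX_SEARCH_TURNS):
        # Assume stop strings are consumed but not returned in `chunk`.
        chunk = llm_generate(context, stop=["</search>", "</answer>"])
        context += chunk
        if "<answer>" in chunk:
            break  # final answer reached; no more retrieval
        if "<search>" in chunk:
            query = chunk.split("<search>")[-1].strip()
            docs = retrieve(query, top_k=3)  # Search-R1 defaults to top-3 passages
            # Retrieved text enters the context as <information>; during RL
            # training these tokens are loss-masked (see the Golden Trick below).
            context += f"</search>\n<information> {docs} </information>\n"
    return context
```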

R1-Searcher Flow

Stage 1: Learn to "search, in the right format"

Encourage the model to issue retrieval calls. Only format and action matter, not answer correctness.

retrieval reward + format reward

Stage 2: Learn to "search correctly & use it right"

Keep reasoning and retrieving, now caring about final-answer accuracy.

answer reward (F1) + format penalty
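
A hedged sketch of what these two reward stages could look like in code. The reward magnitudes, the tag regex, and the F1 tokenization are all assumptions; only the staged structure (retrieval/format first, answer F1 later) follows the paper:

```python
import re

def format_ok(traj: str) -> bool:
    """Trajectory contains at least one well-formed search call."""
    return re.search(r"<search>.+?</search>", traj, re.DOTALL) is not None

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between predicted and gold answers."""
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return 0.0
    overlap = sum(min(p.count(t), g.count(t)) for t in set(p))
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(g)
    return 2 * prec * rec / (prec + rec)

def stage1_reward(traj: str) -> float:
    # Stage 1: only format and action matter, not answer correctness.
    retrieval = 0.5 if "<search>" in traj else 0.0  # coefficients are assumed
    fmt = 0.5 if format_ok(traj) else 0.0
    return retrieval + fmt

def stage2_reward(traj: str, pred: str, gold: str) -> float:
    # Stage 2: answer F1 drives the reward; malformed output is penalized.
    penalty = 0.0 if format_ok(traj) else -0.5
    return token_f1(pred, gold) + penalty
```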

The Golden Trick: Retrieved-Token Loss Masking

The shared secret both papers uncovered: retrieved-token loss masking.

When computing the RL gradient, text returned by the external search engine must be excluded. Without masking, the model tries to "predict" Wikipedia content, leading to extremely unstable optimization (reward hacking or collapse).
Search-R1's PPO objective states this clearly:

$$\mathcal{J}_{PPO}(\theta) = \mathbb{E} \left[ \sum_{t=1}^{|y|} \textcolor{red}{\mathbb{I}(y_t)} \min \left( \frac{\pi_\theta(y_t | x, y_{<t}; \mathcal{R})}{\pi_{old}(y_t | x, y_{<t}; \mathcal{R})} \hat{A}_t, \text{clip}(\dots) \right) \right]$$

where $\textcolor{red}{\mathbb{I}(y_t)}$ is the core masking function:

$$\mathbb{I}(y_t) = \begin{cases} 1, & \text{if } y_t \text{ is an LLM-generated token} \\ 0, & \text{if } y_t \text{ is a retrieved document token} \end{cases}$$

💡 Both papers' ablations agree: without this mask, performance tanks or training collapses.
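
A minimal PyTorch sketch of how this mask enters the clipped objective. Variable names and shapes are assumptions; this is neither paper's actual implementation:

```python
import torch

def masked_pg_loss(logprobs, old_logprobs, advantages, loss_mask, clip_eps=0.2):
    """PPO-style clipped policy-gradient loss over one rollout of length T.

    All arguments are tensors of shape [T]; loss_mask plays the role of
    I(y_t): 1.0 for LLM-generated tokens, 0.0 for retrieved-document tokens.
    """
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token = torch.min(unclipped, clipped)
    # The mask is the whole trick: <information> tokens contribute nothing,
    # so the model is never pushed to "predict" Wikipedia text.
    return -(per_token * loss_mask).sum() / loss_mask.sum().clamp(min=1.0)
```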

Experiments: Who Tests Broader? Who Tests Deeper?

| Dimension | Search-R1 | R1-Searcher |
| --- | --- | --- |
| Training data | NQ + HotpotQA, merged | HotpotQA + 2Wiki, with difficulty filtering |
| Test data | NQ, TriviaQA, PopQA, HotpotQA, 2Wiki, Musique, Bamboogle | HotpotQA, 2Wiki, Musique, Bamboogle |
| Task type | General QA + multi-hop QA | Multi-hop QA as the core |
| Test models | Qwen2.5-3B/7B Base/Instruct, plus a 14B extension | Qwen2.5-7B-Base, Llama-3.1-8B-Instruct |
| Retrieval system | 2018 Wikipedia + E5; default top-3 passages | KILT 2019 Wikipedia + BGE-large-en-v1.5; up to 8 retrievals |
| Core metric | EM (Exact Match) | CEM + LLM-as-Judge |
| Online search? | Mostly a local retrieval environment | Also tested online Google API generalization |
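
Since the two papers score answers differently, here is a minimal illustration of how EM and CEM typically differ. The normalization is an assumption; real benchmarks vary in details such as article and punctuation stripping:

```python
def normalize(s: str) -> str:
    """Lowercase and collapse whitespace (real benchmarks usually also
    strip articles and punctuation; omitted here for brevity)."""
    return " ".join(s.lower().split())

def em(pred: str, gold: str) -> bool:
    """Exact Match: normalized prediction equals the gold answer exactly."""
    return normalize(pred) == normalize(gold)

def cem(pred: str, gold: str) -> bool:
    """Cover Exact Match: the gold answer appears inside the prediction,
    so a verbose but correct answer still counts."""
    return normalize(gold) in normalize(pred)

# em("Paris", "paris") -> True, but em("It is Paris", "Paris") -> False,
# while cem("It is Paris", "Paris") -> True.
```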

Search-R1: Rigorous Control

Broad coverage and tight control. It tests both multi-hop and general QA, with fair comparisons under the same retriever, corpus, training data, and pre-trained LLM, against RAG, SFT, R1, and rejection-sampling baselines.

Headline result: Qwen2.5-7B averages a 24% relative improvement over the RAG baseline. It reads like a "complete benchmark + method validation" paper.

R1-Searcher: Capability Validation

Stronger behavioral analysis with a more engineering flavor. Beyond the main results, it studies RL vs. SFT, different answer rewards (F1 > EM), data difficulty distributions, and data diversity, and it also tests the online-search scenario.

Headline result: switching Bamboogle to online search yields further gains over local retrieval and beats Search-o1 (32B). It reads like a "training recipe + behavior study" paper.

Conclusions: What Did Each One Prove?

1. Search-R1: A Robust RL Paradigm

  • A search engine can be formally and stably embedded into RL rollouts.
  • Outcome reward alone can teach effective search-and-reasoning behavior.
  • Retrieved-token masking is the key stabilizer.
  • Both PPO and GRPO work, but PPO is more stable.

Core claim: "Search-augmented RL is viable, trainable, generalizable, and analyzable." 核心定调:"search-augmented RL 是成立的,而且是可训练、可泛化、可分析的。"

2. R1-Searcher: A Practical Playbook for Autonomous Search Agents

  • Pure RL (no distillation or SFT cold start) can also elicit the search capability.
  • The two-stage reward design works; F1 beats EM/CEM as the answer reward.
  • Data difficulty and diversity strongly shape search behavior.
  • The learned capability transfers seamlessly to online search.

Core claim: "If you really want to train a model into a search agent, reward design and training curriculum are critical." 核心定调:"如果你真想把模型训成会搜的 agent,奖励设计和训练课程很关键。"

💡 A Researcher's Final Verdict

If you're writing Related Work or a Method Comparison, Search-R1 is the "canonical" choice: a framework archetype.

If you want to reproduce or build on top, R1-Searcher gives you a better feel for the problem, full of hands-on lessons (why the model fails to search, why staged rewards, etc.).

Don't pit their benchmark scores against each other. In different ways, they jointly opened the door to the era of LLM autonomous retrieval.

Academic Q&A Assistant学术问答助手
Hi! Anything I can help with about R1-Searcher and Search-R1? 你好!关于 R1-SearcherSearch-R1 这两篇 RL-RAG 论文,有什么我可以帮你的吗?