Sharpening or Discovery? The Role of RL in LLM Reasoning

Published: January 01, 2026

Reinforcement learning came roaring back in 2025. One big reason is the rise of “reasoning” post-training: models like OpenAI’s o1 series and DeepSeek’s R1 are widely seen as benefiting from RL-based training, with noticeable boosts on math, coding, and many other tasks.

But here’s the catch: what RL is actually doing in LLM post-training—especially in RLVR—still isn’t settled. A lot of 2025 research revolves around a deceptively simple question: is RL mostly polishing the base model’s existing knowledge (by shifting probabilities toward better trajectories), or can it truly push the model beyond its original reasoning boundary and unlock new behaviors? In this post, I’ll re-read these papers with that question in mind and share a more careful take on what we can conclude.

A Quick Review of RLVR

We begin with a brief recap of Reinforcement Learning with Verifiable Rewards (RLVR). Conceptually, RLVR is closely related to RLHF: both apply reinforcement learning during LLM post-training. The key difference is that, in RLVR, the reward signal can be automatically verified, without relying on human preferences or learned reward models.

For many tasks, such verifiable rewards arise naturally. In mathematics, the model’s predicted answer \( \hat{y} \) can be directly compared against the ground-truth solution \( y \): if the answer is correct, the reward is set to 1; otherwise, it is 0. In coding tasks, correctness can be determined via compilation or unit tests. When combined with chain-of-thought prompting, this setup makes it straightforward to evaluate whether a model’s reasoning ultimately leads to a correct outcome.

Formally, we view a language model as a policy \( \pi_\theta(y \mid x) \), where \( x \) denotes an input prompt (e.g., a math problem) and \( y = (y_1, \ldots, y_T) \) is the generated response. RLVR assumes access to a verifier \( r(x, y) \in \{0, 1\} \), which checks whether the output is correct. Crucially, this reward is independent of human preferences or learned reward models; it depends only on whether the final answer matches a deterministic criterion. In this sense, RLVR uses an outcome reward, rather than a process-level reward that evaluates intermediate reasoning steps.
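To make the verifier \( r(x, y) \) concrete, here is a minimal sketch of an outcome reward for math-style answers. The `Answer:` marker, the normalization rules, and the function names are my own assumptions for illustration; real RLVR pipelines usually need more careful answer parsing and equivalence checking than a string match.

```python
def extract_final_answer(response: str, marker: str = "Answer:") -> str:
    """Return the text after the last `marker`, or "" if the marker is absent.

    The "Answer:" convention is a hypothetical output format, not something
    prescribed by RLVR itself.
    """
    _, found, tail = response.rpartition(marker)
    return tail.strip() if found else ""


def normalize_answer(text: str) -> str:
    """Light normalization: trim whitespace, surrounding '$', trailing period."""
    return text.strip().strip("$").strip().rstrip(".").lower()


def verifiable_reward(response: str, ground_truth: str) -> int:
    """Outcome reward r(x, y) in {0, 1}: 1 if the final answer matches, else 0.

    Only the final answer is checked; the intermediate reasoning is ignored,
    which is what makes this an outcome reward rather than a process reward.
    """
    predicted = extract_final_answer(response)
    if not predicted:
        return 0
    return int(normalize_answer(predicted) == normalize_answer(ground_truth))


if __name__ == "__main__":
    good = "First, 17 + 25 = 42. Answer: 42."
    bad = "First, 17 + 25 = 41. Answer: 41"
    print(verifiable_reward(good, "42"), verifiable_reward(bad, "42"))
```

Note that a response with flawed reasoning but a correct final answer would still receive reward 1 under this check, which is exactly the outcome-versus-process distinction described above.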

In practice, RLVR is typically optimized with policy-gradient methods such as PPO, GRPO, REINFORCE++, or DAPO. As one concrete example, the DeepSeek models adopt Group-Relative Policy Optimization (GRPO). GRPO optimizes the objective \[ \mathcal{L}_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{X},\, y \sim \pi_\theta(\cdot \mid x)}[\hat{A}(x, y)]-\beta \cdot \mathbb{E}_{x \sim \mathcal{X}}[\mathrm{KL}(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x))]. \]

A few remarks help clarify this objective:

Advantage estimation.
The term \( \hat{A}(x, y) \) is an estimate of the advantage of a sampled response. Optimizing the advantage instead of the raw reward is equivalent, in expectation, to maximizing reward, but it substantially reduces the variance of the gradient estimate. GRPO does not introduce a separate critic to estimate values. Instead, for a given prompt \( x \), it samples a group of \( G \) responses, obtains rewards \( \{r_1, \ldots, r_G\} \), and computes a relative advantage \[ \hat{A}(x, y)=\frac{r(x, y)-\mathrm{mean}(\{r_1, \ldots, r_G\})}{\mathrm{std}(\{r_1, \ldots, r_G\})}. \]
Behavior regularization via KL.
As in PPO, the KL term constrains updates to stay close to a reference policy \( \pi_{\mathrm{ref}} \), typically the pre-trained base model. This highlights a fundamental distinction between RL in LLMs and classical reinforcement learning. Traditional RL often starts from scratch and explicitly balances exploration and exploitation. In contrast, RL for LLMs begins from a strong pre-trained prior. While this prior is widely believed to be a major source of performance gains, it also imposes a potentially restrictive constraint—one that will become central to our later discussion.
On-policy learning with verifiable rewards.
These algorithms learn exclusively from on-policy samples, i.e., responses generated by the current model. From the perspective of verifiable rewards, the training objective effectively increases the log-likelihood of responses with correct answers while decreasing the likelihood of incorrect ones.
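
The remarks above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the DeepSeek implementation: the "policy" is a softmax over three candidate responses rather than a language model, the KL term is dropped (\( \beta = 0 \)), the standard deviation is the population version, and a small `eps` is added to the denominator so that groups with identical rewards yield zero advantage instead of a division by zero. All function names are hypothetical.

```python
import math
import random
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group: (r - mean) / (std + eps).

    `eps` and the choice of population std are assumptions of this sketch.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grpo_toy_step(logits, reward_fn, group_size, lr, rng):
    """One on-policy policy-gradient step for a toy softmax policy (no KL term)."""
    probs = softmax(logits)
    # On-policy: the group of responses is sampled from the current policy.
    actions = rng.choices(range(len(logits)), weights=probs, k=group_size)
    rewards = [reward_fn(a) for a in actions]
    advantages = group_relative_advantages(rewards)
    new_logits = list(logits)
    for action, adv in zip(actions, advantages):
        for j in range(len(logits)):
            # d log pi(action) / d logit_j = 1[j == action] - probs[j]
            grad = (1.0 if j == action else 0.0) - probs[j]
            new_logits[j] += lr * adv * grad / group_size
    return new_logits


def binary_reward(action):
    """Verifier stand-in: only candidate response 0 counts as correct."""
    return 1.0 if action == 0 else 0.0


rng = random.Random(0)
logits = [0.0, 0.0, 0.0]  # uniform prior over three candidate responses
p_before = softmax(logits)[0]
for _ in range(50):
    logits = grpo_toy_step(logits, binary_reward, group_size=8, lr=0.5, rng=rng)
p_after = softmax(logits)[0]
print(f"P(correct) before: {p_before:.3f}, after: {p_after:.3f}")
```

Two things are worth noticing in this sketch. First, when every response in a group receives the same reward, all advantages are zero and that prompt contributes no update. Second, the update can only move probability mass among responses the current policy actually samples, which is the toy version of the "strong prior as constraint" point made above.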
