RL Training For Math Reasoning
Introduction and Motivation
Reinforcement Learning (RL) algorithms, especially Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), have proven essential for improving model capabilities on reasoning-related tasks. In this blog we'd like to share the lessons and design decisions from developing our RL infrastructure and training math-reasoning models with RL. For illustration purposes, the results shown below are based on smaller open-source models, but most of the findings apply to larger models as well.
The goal of this RL model training exploration is two-fold: 1) share our lessons on how to train models to reach state-of-the-art math reasoning performance, equipping the team with the right knowledge on data manipulation, data-mixing recipes, training best practices, RL algorithm nuances, and general performance optimization; and 2) apply these learnings to real production use cases to improve Perplexity products.
A summary of the key findings:
- Infrastructure: we've developed the GRPO algorithm on the torchtune library as well as the Nemo suite, with vLLM-based rollout integrated. Nemo will be our short-term go-to infra for RL training while we develop torchtune GRPO support, which will be our preferred infra in the longer run thanks to self-contained maintenance (no external dependencies) and a simpler framework architecture.
- Math datasets vetted: GSM8K, MATH, NuminaMath, Open Reasoning Zero (ORZ), and the AIME series.
- Math reasoning model training:
  - The data mixture across difficulty levels matters.
  - RL proves able to further improve large language models' (LLMs) reasoning capability beyond supervised fine-tuning (SFT).
  - The capability of the base model matters a lot. In particular, the long-CoT capability of the base model is important for further scaling with RL.
  - A good SFT starting checkpoint helps with the above. Light SFT serves two purposes:
    - It makes the RL base model comfortable with generating long-CoT responses. Without this ability, when RL pushes the model to scale its response length, the model collapses into self-repetition.
    - It teaches some reasoning capability to the base model, enabling the RL process to start higher and learn faster from the beginning, compared to a model that only knows how to generate long-CoT responses with weak reasoning capability.
Training infrastructure exploration
Although several open-source RL training frameworks are available, many of them do not fit our situation. Ideally, we want the following properties:
- Scales well to the large model sizes of our production models.
- Good training speed optimizations.
- Easy to implement new algorithms and maintain, without too many external dependencies.
- A simple and extensible framework architecture design.
- Ideally, unified with our SFT training framework.
Framework comparison
A comparison of the frameworks we considered is shown below [the comparison was done in Feb 2025; note that many of the then-missing algorithms were implemented later]:

We chose Nemo-Aligner as the short-term option and ruled out the rest for the following reasons:
- Nemo-Aligner: with the most complete feature set already implemented, as well as partnership support from Nvidia, we chose this option as our short-term focus. However, its complex setup, with dependencies on multiple repos, adds some maintenance overhead.
- torchtune: this is the SFT framework we use at Perplexity. The framework is elegantly designed and generally easy to extend. However, because it is fairly new, it still lacks many features. We aim to shift to torchtune for RL in the long run: once we get Nemo-Aligner to a good state, we will invest in maintaining an in-house version of torchtune with our own implementations of the desired algorithms.
- VeRL: although it integrates both FSDP and the more powerful Megatron-LM backend, support for the latter is very limited because the community's demand is mostly for smaller models where FSDP is sufficient. FSDP generally has weaker support for tensor parallelism, which is crucial for larger models, especially in RL training. However, VeRL has quickly become a popular choice in the community and has developed significantly in recent months. Given its selling points on throughput optimizations, and multiple recent papers on reasoning/agentic model training built on this framework (e.g. [1], [2]), it is worth revisiting this option in the near future.
- OpenRLHF: popular in the academic community. However, the DeepSpeed backend makes it less scalable to large models, so we ruled out this option.
Algorithm development and validation
In this section, we first provide a brief introduction to the GRPO algorithm and discuss the associated technical enhancements that contribute to its implementation complexity. Subsequently, we describe the infrastructure developed to address these challenges.

Comparison of PPO vs GRPO. Reference: https://arxiv.org/abs/2402.03300
PPO is a popular RL algorithm widely used for fine-tuning LLMs due to its simplicity and effectiveness, as depicted above. However, PPO’s reliance on a separate value model introduces significant computational and memory overhead.
To address this limitation, GRPO modifies PPO by removing the separate value model and introducing a group-based advantage calculation, as illustrated in the figure above. Specifically, GRPO generates multiple candidate responses for each input question, computes their reward scores collectively, and determines advantages relative to these grouped outputs. While this simplifies certain aspects of training, it introduces new implementation complexities, such as efficiently generating multiple long-sequence rollouts.
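To make the group-based advantage calculation concrete, here is a minimal sketch (the function name and tensor shapes are our own illustrative choices, not Nemo-Aligner code) of how GRPO-style advantages can be computed from per-rollout rewards:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages from per-rollout rewards.

    rewards: tensor of shape (num_prompts, group_size), one scalar reward per
    sampled response. Each advantage is the reward normalized by the mean and
    standard deviation of its own prompt's group.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts with 4 sampled responses each (reward 1.0 = correct answer).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(grpo_advantages(rewards))
```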
Implementation details
Although similar RL algorithms (PPO, REINFORCE) are already in place in Nemo-Aligner, it was surprisingly time-consuming to make GRPO work properly for our use case. A summary of the improvements we made:
- GRPO algorithm implementation.
- A more robust KL-divergence estimator (details); a minimal sketch appears after this list.
- Incorporating a format reward and enhanced rules for mathematical accuracy, covering both numerical answers and symbolic ground-truth expressions (roughly sketched after this list).
- Working with the Nvidia support team to integrate vLLM-based rollout, which improved rollout efficiency by 30% and replaced the buggy TensorRT-LLM path that had a log-probability mismatch (see next point).
- Log-probability alignment with HF.
  - Note: this effort was by far the most time-consuming. To verify that the log-probabilities computed by Nemo-Aligner are correct, we track the following metric:
    $$\exp\left(\frac{1}{L}\sum_{i=1}^{L}\left|\log p_i - \log p'_i\right|\right)$$
    where L is the length of a rollout, $\log p_i$ is the log-probability of the i-th token from Nemo-Aligner, and $\log p'_i$ is the reference log-probability obtained by running model.forward on the corresponding Hugging Face model directly. This metric needs to stay very close to 1.0. We went through several iterations of the Nemo model converter, repo updates, and image rebuilds to bring it down from roughly 1e5 to a normal range within [1, 1.05]. A simplified version of this check appears after this list.
- Multiple rounds of hyper-parameter search to optimize memory usage. Because GRPO requires multiple samples per prompt and math-reasoning rollouts are usually long, we frequently hit CUDA OOM errors; hyper-parameters, especially the parallelism setup, need to be chosen carefully to keep training running smoothly.
- Minor issues with dataloader consumption and early stopping, as well as tensor-shape issues in corner cases.
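On the KL-divergence estimator: the sketch below uses the non-negative, low-variance per-token estimator popularized by the GRPO paper (often called the k3 estimator). We assume per-token log-probabilities from the policy and the frozen reference model are already available as tensors; the exact estimator in our trainer is documented in the linked details.

```python
import torch

def kl_penalty(logp_policy: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Per-token KL estimate: exp(r) - r - 1, with r = log p_ref - log p_policy.

    Unlike the naive estimator (logp_policy - logp_ref), this quantity is always
    non-negative and has lower variance, so the KL penalty is less likely to
    destabilize training on long rollouts.
    """
    log_ratio = logp_ref - logp_policy
    return torch.exp(log_ratio) - log_ratio - 1.0
```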
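The reward shaping is roughly along these lines; the specific format tags, the \boxed{} extraction, and the SymPy-based symbolic fallback below are illustrative assumptions rather than our exact production rules.

```python
import re
import sympy

def math_reward(response: str, ground_truth: str) -> float:
    """Format reward plus accuracy reward for a single rollout (illustrative)."""
    reward = 0.0

    # Format reward: the response should wrap its reasoning in <think> ... </think>.
    if "<think>" in response and "</think>" in response:
        reward += 0.1

    # Accuracy reward: compare the last \boxed{...} answer with the ground truth.
    boxed = re.findall(r"\\boxed\{([^{}]+)\}", response)
    if not boxed:
        return reward
    answer = boxed[-1].strip()

    if answer == ground_truth.strip():
        return reward + 1.0  # exact string/numerical match
    try:
        # Symbolic check for expressions that are equivalent but written differently.
        if sympy.simplify(sympy.sympify(answer) - sympy.sympify(ground_truth)) == 0:
            reward += 1.0
    except (sympy.SympifyError, TypeError, SyntaxError):
        pass
    return reward
```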
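The log-probability check itself can be reproduced with a few lines of Hugging Face code. The model name below is a placeholder, and `trainer_logps` stands for the per-token log-probabilities exported from Nemo-Aligner for the same rollout; the aggregation follows the metric shown above.

```python
import torch
from transformers import AutoModelForCausalLM

@torch.no_grad()
def logprob_alignment(trainer_logps: torch.Tensor, token_ids: torch.Tensor,
                      model_name: str = "Qwen/Qwen2.5-7B") -> float:
    """exp(mean |log p_i - log p'_i|) over one rollout; should stay close to 1.0.

    trainer_logps: shape (L-1,), trainer log-probs for tokens 1..L-1 of the rollout.
    token_ids:     shape (L,), the full rollout token ids.
    """
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    logits = model(token_ids.unsqueeze(0)).logits[0].float()        # (L, vocab)
    ref = torch.log_softmax(logits[:-1], dim=-1)                    # predicts tokens 1..L-1
    ref = ref.gather(-1, token_ids[1:].unsqueeze(-1)).squeeze(-1)   # (L-1,)
    return torch.exp((trainer_logps - ref).abs().mean()).item()
```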
Experiments
In this section, we present the experimental setup and results, building upon the previously described infrastructure.
Experimental setup
We evaluate the models on the MATH-500 dataset and report pass@1, defined below. During evaluation, we set the sampling temperature to 0.7 and the top-p value to 0.95 to generate k rollouts per problem:

$$\text{pass@1} = \frac{1}{k}\sum_{i=1}^{k} p_i,$$

where $p_i \in \{0, 1\}$ indicates whether the i-th rollout is correct.
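As a concrete illustration, pass@1 over k sampled rollouts can be computed with vLLM roughly as follows; the model path, the prompt formatting, and the `is_correct` answer checker are placeholders for our actual evaluation harness.

```python
from vllm import LLM, SamplingParams

def is_correct(response: str, answer: str) -> bool:
    # Placeholder: in practice this reuses the same answer-matching rules as the reward.
    return answer in response

def evaluate_pass_at_1(model_path: str, problems: list[dict], k: int = 16) -> float:
    """problems: list of {"prompt": str, "answer": str} items (e.g. MATH-500)."""
    llm = LLM(model=model_path)
    params = SamplingParams(n=k, temperature=0.7, top_p=0.95, max_tokens=8192)
    outputs = llm.generate([p["prompt"] for p in problems], params)

    total = 0.0
    for problem, output in zip(problems, outputs):
        # pass@1 for this problem: fraction of the k rollouts that are correct.
        correct = sum(is_correct(o.text, problem["answer"]) for o in output.outputs)
        total += correct / k
    return total / len(problems)
```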