Qwen MaxRL on GSM8K

A single-GPU template for teaching a Qwen base model to solve grade-school math using MaxRL (Maximum Likelihood RL, from CMU). MaxRL is a one-line change to GRPO's advantage that emphasizes hard problems and reverses pass@k degradation. Fork it, hit launch, and watch it train.

OpenResearch lets you fork ML templates and launch experiments effortlessly. A project by alphaXiv.

Train MaxRL on GSM8K in one click

One sandbox, one GPU, ~30 minutes to a fine-tuned Qwen3-1.7B. MaxRL keeps the entire GRPO pipeline intact and changes only how the per-group advantage is normalized: it divides the centered reward by the group mean (the pass rate) instead of the group standard deviation. W&B logging is on by default.

Diff from GRPO: 1 line

Hardware: 1 GPU (≥ 24 GB VRAM)

Base model: Qwen3-1.7B

Training reward on GSM8K (raw, faint) with a 25-step moving average (bold). From an actual run of this template.

How MaxRL differs from GRPO

Standard RL maximizes the expected pass rate, which is only a first-order approximation of the true maximum-likelihood objective. MaxRL recovers the missing inverse-probability reweighting: where GRPO divides the centered reward by the group standard deviation, MaxRL divides by the group mean (the pass rate). This puts more weight on hard, low-pass-rate prompts, exactly where the learning signal matters. It also improves pass@k diversity instead of collapsing it.
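The reweighting can be written out in two lines. This is a sketch in notation introduced here for illustration, not taken from the template: p_θ(x) is the pass rate of the policy on prompt x, J_RL is the expected pass rate, and J_ML is the expected log pass rate.

```latex
\begin{aligned}
J_{\mathrm{RL}}(\theta) &= \mathbb{E}_{x}\big[p_\theta(x)\big], &
\nabla_\theta J_{\mathrm{RL}} &= \mathbb{E}_{x}\big[\nabla_\theta p_\theta(x)\big] \\
J_{\mathrm{ML}}(\theta) &= \mathbb{E}_{x}\big[\log p_\theta(x)\big], &
\nabla_\theta J_{\mathrm{ML}} &= \mathbb{E}_{x}\!\left[\frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}\right]
\end{aligned}
```

The only difference between the two gradients is the factor 1/p_θ(x). A prompt with pass rate 0.1 is weighted 10×, while a prompt with pass rate 0.9 is weighted about 1.1×. Since log p = (p − 1) − (p − 1)²/2 + …, maximizing p agrees with maximizing log p only to first order around p = 1.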

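Easy prompt (pass rate 0.9)

GRPO: A = (r − mean) / std, with mean = 0.9 and std = 0.3

MaxRL: A = (r − mean) / mean, with mean = 0.9

Hard prompt (pass rate 0.1)

GRPO: A = (r − mean) / std, with mean = 0.1 and std = 0.3

MaxRL: A = (r − mean) / mean, with mean = 0.1

For 0/1 rewards the group std is √(p(1 − p)), where p is the pass rate.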

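Same rollouts, two normalizers. Dividing by the mean amplifies the gradient on the hard prompt (small pass rate) and damps it on the easy one. GRPO's std normalizer cannot make that distinction here: with 0/1 rewards the group std is √(p(1 − p)), which is 0.3 at both pass rates, so both prompts are scaled by the same factor.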
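To make the comparison concrete, here is a minimal NumPy sketch of the two normalizers applied to the same rollouts. The function name `group_advantages`, the `maxrl` argument, and the `eps` guard are illustrative choices, not the template's actual code. Groups of 10 rollouts are used only so the pass rates come out to exactly 0.9 and 0.1.

```python
import numpy as np

def group_advantages(rewards, maxrl=True, eps=1e-6):
    """Per-rollout advantages for one prompt's group of 0/1 rewards.

    maxrl=True  divides the centered reward by the group mean (pass rate).
    maxrl=False divides by the group standard deviation (GRPO-style).
    eps is a hypothetical guard against division by zero.
    """
    r = np.asarray(rewards, dtype=np.float64)
    centered = r - r.mean()
    denom = r.mean() if maxrl else r.std()
    return centered / (denom + eps)

easy = [1.0] * 9 + [0.0]   # 9 of 10 rollouts correct: pass rate 0.9
hard = [1.0] + [0.0] * 9   # 1 of 10 rollouts correct: pass rate 0.1

for name, group in (("easy", easy), ("hard", hard)):
    a_grpo = group_advantages(group, maxrl=False)
    a_maxrl = group_advantages(group, maxrl=True)
    # In both groups, index 0 is a correct rollout and index -1 an incorrect one.
    print(name,
          "GRPO:", round(float(a_grpo[0]), 3), round(float(a_grpo[-1]), 3),
          "MaxRL:", round(float(a_maxrl[0]), 3), round(float(a_maxrl[-1]), 3))
```

On the hard prompt, the single correct rollout gets an advantage of about 9 under MaxRL versus about 3 under GRPO. On the easy prompt, a correct rollout gets about 0.11 under MaxRL versus about 0.33 under GRPO. A group where every rollout fails yields all-zero advantages under either normalizer, so it contributes no gradient.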

Under the hood

Algorithm	MaxRL (mean-normalized GRPO), 32 rollouts per prompt
Objective	Divide centered reward by group mean (pass rate), not std. Recovers w(p) → 1/p reweighting
Base model	unsloth/Qwen3-1.7B-Base
Fine-tuning	LoRA adapters via Unsloth + vLLM rollouts
Reward	Sparse correctness: 1 if final answer matches, else 0
Schedule	500 steps, linear LR decay peaking at 5e-6
Hardware	1 GPU (≥ 24 GB VRAM)
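The sparse reward in the table can be sketched as follows. This is an illustration, not the template's scoring code: the helper names are hypothetical, and taking the last number in the completion as the "final answer" is an assumption. The one dataset-specific fact used is that GSM8K reference solutions end with `#### <answer>`.

```python
import re

# Matches integers and decimals, with optional sign and thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_final_number(text):
    """Return the last number in the text as a string, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")

def gsm8k_reference(answer_field):
    """Pull the reference answer that follows '####' in a GSM8K solution."""
    return answer_field.split("####")[-1].strip().replace(",", "")

def correctness_reward(completion, answer_field):
    """1.0 if the completion's final number equals the reference, else 0.0."""
    pred = extract_final_number(completion)
    if pred is None:
        return 0.0
    try:
        return 1.0 if float(pred) == float(gsm8k_reference(answer_field)) else 0.0
    except ValueError:
        return 0.0

reference = "16 - 3 - 4 = 9 eggs. 9 * 2 = 18 dollars.\n#### 18"
print(correctness_reward("She has 9 eggs left, so she makes $18", reference))
print(correctness_reward("I think the answer is 20", reference))
```

Because the reward is all-or-nothing, the mean reward over a prompt's rollouts is exactly its empirical pass rate, which is the quantity MaxRL divides by.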

Once it's running, stack child experiments: bump the rollout group size (MaxRL optimizes a higher-order approximation of the likelihood as the group grows), flip the `MAXRL` flag to 0 to A/B against vanilla GRPO, or swap GSM8K for MATH, AIME, or your own verifiable dataset.
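One way to see why group size matters, under an assumption that is not stated on this page: suppose the approximation is the standard series log p = −Σ_{k≥1} (1 − p)^k / k, truncated at order T, with T set by the rollout group size. Differentiating the truncated series with respect to p gives a per-prompt weight w_T(p) = Σ_{k=1..T} (1 − p)^(k−1) = (1 − (1 − p)^T) / p. The function name `truncated_weight` is hypothetical.

```python
def truncated_weight(p, T):
    """Gradient weight on a prompt with pass rate p under an order-T truncation.

    This is d/dp of -sum_{k=1..T} (1 - p)**k / k, i.e. a partial geometric sum.
    T = 1 gives weight 1 for every prompt (plain expected pass rate);
    as T grows the weight approaches 1 / p (full log-likelihood).
    """
    return sum((1.0 - p) ** (k - 1) for k in range(1, T + 1))

p_hard = 0.1
for T in (1, 8, 32, 128):
    closed_form = (1.0 - (1.0 - p_hard) ** T) / p_hard
    print(T, round(truncated_weight(p_hard, T), 3), round(closed_form, 3))
```

For a prompt with pass rate 0.1 the weight starts at 1 for T = 1 and climbs toward 1/p = 10 as T grows, which is consistent with the w(p) → 1/p entry in the table above. Under this reading, larger groups matter most for the hardest prompts, where (1 − p)^T decays slowly.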