Template
Qwen MaxRL on GSM8K
A single-GPU template for teaching a Qwen base model to solve grade-school math with MaxRL (Maximum Likelihood RL, from CMU). MaxRL is a one-line change to GRPO's advantage computation that emphasizes hard problems and reverses pass@k degradation. Fork it, hit launch, and watch it train.
OpenResearch lets you fork ML templates and launch experiments effortlessly. A project by alphaXiv.
Train MaxRL on GSM8K in one click
One sandbox, one GPU, ~30 minutes to a fine-tuned Qwen3-1.7B. MaxRL keeps the entire GRPO pipeline intact and changes only how the per-group advantage is normalized: it divides the centered reward by the group mean (the pass rate) instead of the group standard deviation. W&B logging is on by default.
Training reward on GSM8K (raw, faint) with a 25-step moving average (bold). From an actual run of this template.
How MaxRL differs from GRPO
Standard RL maximizes the expected pass rate, which is only a first-order approximation of the true maximum-likelihood objective. MaxRL recovers the missing inverse-probability reweighting: where GRPO divides the centered reward by the group standard deviation, MaxRL divides it by the group mean (the pass rate). This puts more weight on hard, low-pass-rate prompts, where the learning signal matters most, and it improves pass@k instead of degrading it.
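Same rollouts, two normalizers. Dividing by the mean amplifies the gradient on the hard prompt (small pass rate) and damps it on the easy one. GRPO's std normalization does not make that distinction: with binary rewards the standard deviation is √(p(1−p)), which is the same for pass rates p and 1−p.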
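The two normalizers described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the template's actual code: the function name `group_advantages`, the `eps` guard, and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import fmean, pstdev

def group_advantages(rewards, maxrl=True, eps=1e-6):
    """Advantages for one prompt's group of rollouts (hypothetical helper).

    maxrl=True  -> divide the centered reward by the group mean (pass rate).
    maxrl=False -> divide by the group standard deviation, as in GRPO.
    """
    mean = fmean(rewards)
    # Assumption: population std; some GRPO implementations use the sample std.
    scale = mean if maxrl else pstdev(rewards)
    # eps (assumed) avoids division by zero for all-wrong or uniform groups,
    # where every centered reward is already zero.
    return [(r - mean) / (scale + eps) for r in rewards]

hard = [1, 0, 0, 0, 0, 0, 0, 0]  # pass rate 1/8
easy = [1, 1, 1, 1, 1, 1, 1, 0]  # pass rate 7/8

for name, rewards in [("hard", hard), ("easy", easy)]:
    for maxrl in (True, False):
        adv = group_advantages(rewards, maxrl=maxrl)
        label = "MaxRL" if maxrl else "GRPO"
        print(name, label, [round(a, 2) for a in adv])
```

On the hard group, the single correct rollout gets an advantage of about 7 under MaxRL and about 2.65 under GRPO. On the easy group, each correct rollout gets only about 0.14 under MaxRL.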
Under the hood
| Component | Details |
|---|---|
| Algorithm | MaxRL (mean-normalized GRPO), 32 rollouts per prompt |
| Objective | Divide centered reward by group mean (pass rate), not std. Recovers w(p) → 1/p reweighting |
| Base model | unsloth/Qwen3-1.7B-Base |
| Fine-tuning | LoRA adapters via Unsloth + vLLM rollouts |
| Reward | Sparse correctness: 1 if final answer matches, else 0 |
| Schedule | 500 steps, linear LR decay peaking at 5e-6 |
| Hardware | 1 GPU (≥ 24 GB VRAM) |
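The sparse correctness reward in the table could look roughly like the sketch below. GSM8K reference solutions end with `#### <answer>`, which the code relies on. Everything else is an assumption for illustration: the helper names are hypothetical, and taking the last number in the completion as the model's final answer is just one simple parsing rule, not necessarily the one this template uses.

```python
import re

# Matches integers and decimals, with optional sign and thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def gold_answer(solution: str) -> str:
    """Extract the reference answer that follows '####' in a GSM8K solution."""
    return solution.split("####")[-1].strip().replace(",", "")

def final_number(completion: str):
    """Assumed parsing rule: the last number in the completion is the answer."""
    matches = _NUMBER.findall(completion)
    return matches[-1].replace(",", "") if matches else None

def correctness_reward(completion: str, solution: str) -> float:
    """Return 1.0 if the parsed final answer equals the reference, else 0.0."""
    pred = final_number(completion)
    if pred is None:
        return 0.0
    try:
        return 1.0 if float(pred) == float(gold_answer(solution)) else 0.0
    except ValueError:
        # Unparseable prediction or reference counts as incorrect.
        return 0.0

reference = "She sells 9 eggs at $2 each, so 9 * 2 = 18.\n#### 18"
print(correctness_reward("9 * 2 = 18, so the answer is 18.", reference))
print(correctness_reward("I think the answer is 20.", reference))
```

Because the reward is binary, the group mean of these values is exactly the empirical pass rate that MaxRL divides by.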
Once it's running, stack child experiments: bump the rollout group size (MaxRL optimizes a higher-order approximation of the likelihood as the group grows), flip the `MAXRL` flag to 0 to A/B against vanilla GRPO, or swap GSM8K for MATH, AIME, or your own verifiable dataset.
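When comparing a MaxRL run against a GRPO baseline, pass@k can be estimated from n sampled completions per prompt with the standard unbiased estimator 1 − C(n−c, k) / C(n, k), where c is the number of correct samples. The sketch below is an assumption about how you might evaluate the two runs; the template itself may report pass@k differently, and the per-prompt counts shown are made-up inputs used only to demonstrate the call.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over prompts, given the correct count for each prompt."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Illustrative inputs only: correct counts out of n=32 samples for four prompts.
counts = [0, 4, 16, 32]
for k in (1, 8, 32):
    print(k, round(mean_pass_at_k(counts, n=32, k=k), 3))
```

Running the same evaluation on checkpoints trained with `MAXRL` set to 1 and to 0 gives a like-for-like comparison at several values of k.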