Qwen GRPO on GSM8K
A single-GPU template for teaching a Qwen base model to solve grade-school math problems using GRPO, the RL algorithm behind DeepSeek-R1. Fork it, hit launch, and you get a working RL loop you can iterate on in minutes.
OpenResearch lets you fork ML templates and launch experiments effortlessly. A project by alphaXiv.
Train GRPO on GSM8K in one click
One sandbox, one GPU, ~30 minutes to a fine-tuned Qwen3-1.7B that's measurably better at grade-school math. W&B logging is on by default, so you can watch the reward curve climb live.
Training reward on GSM8K (raw, faint) with a 25-step moving average (bold). From an actual run of this template.
How GRPO + GSM8K works
GSM8K is a benchmark of 8.5K grade-school math word problems where the reward is trivially verifiable: parse the final number from the completion and score it 1 if it matches the gold answer, else 0. GRPO (Group Relative Policy Optimization) turns that sparse signal into an update by sampling a group of completions per prompt and reinforcing the ones that beat the group average. No critic network, no reward model, no human labels.
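The two ideas above (a verifiable 0/1 reward and a group-relative baseline) can be sketched in a few lines of plain Python. This is a minimal illustration, not the template's actual code: the function names, the "last number in the text" parsing rule, and the `eps` constant are all assumptions, and dividing by the group standard deviation follows the common GRPO formulation rather than anything confirmed for this template.

```python
import re
from statistics import mean, pstdev


def extract_final_number(text):
    """Return the last number found in text as a float, or None if there is none."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    # Drop thousands separators such as "1,234" before converting.
    return float(matches[-1].replace(",", ""))


def correctness_reward(completion, gold):
    """Sparse reward: 1.0 if the final number matches the gold answer, else 0.0."""
    pred = extract_final_number(completion)
    if pred is None:
        return 0.0
    return 1.0 if abs(pred - float(gold)) < 1e-6 else 0.0


def group_advantages(rewards, eps=1e-4):
    """Score each completion relative to its own group (same prompt).

    Completions above the group mean get a positive advantage and are
    reinforced; those below get a negative one. eps (an assumed value)
    avoids division by zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled completions (the template samples 16).
completions = [
    "16 - 3 - 4 = 9 eggs, 9 * 2 = 18",
    "She sells 10 eggs, so the answer is 20",
    "She makes $18 per day.",
    "I am not sure.",
]
rewards = [correctness_reward(c, "18") for c in completions]
advantages = group_advantages(rewards)
```

Note that when every completion in a group gets the same reward (all right or all wrong), every advantage is zero and that prompt contributes no learning signal, which is one reason group size matters.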
Under the hood
| Component | Setting |
| --- | --- |
| Algorithm | GRPO (TRL GRPOTrainer), 16 rollouts per prompt |
| Base model | unsloth/Qwen3-1.7B-Base |
| Fine-tuning | LoRA adapters via Unsloth + vLLM rollouts |
| Reward | Sparse correctness: 1 if final answer matches, else 0 |
| Schedule | 1000 steps, cosine LR decay peaking at 5e-6 |
| Hardware | 1 GPU (≥ 24 GB VRAM) |
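To make the schedule row concrete, here is a rough sketch of what a cosine decay from a 5e-6 peak over 1000 steps looks like. The function name is hypothetical, and two details are assumptions not stated above: that the rate decays all the way to zero, and that there is no warmup by default (the optional `warmup_steps` argument shows how a linear warmup would slot in).

```python
import math


def cosine_lr(step, total_steps=1000, peak_lr=5e-6, warmup_steps=0):
    """Learning rate at a given step for a cosine decay schedule.

    Assumptions: optional linear warmup up to peak_lr, then a half-cosine
    decay from peak_lr down to zero at total_steps.
    """
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at the start, midpoint, and end of training.
schedule = {step: cosine_lr(step) for step in (0, 500, 1000)}
```

Under these assumptions the rate starts at the peak, passes through half the peak at the midpoint, and reaches zero at the final step.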
Once it's running, stack child experiments to swap the base model, change reward shaping (format reward, partial credit, etc.), tune LR / KL / group size, or swap GSM8K for MATH, AIME, or your own verifiable dataset.
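As one example of a reward-shaping child experiment, the sketch below adds a small format bonus on top of the sparse correctness reward. Everything here is illustrative: the function names and the 0.1 bonus weight are hypothetical, and the format check assumes completions are asked to end with a `#### <number>` line, mirroring how GSM8K writes its gold answers.

```python
import re

# Matches a final line such as "#### 18" or "#### 1,234.5".
ANSWER_LINE = re.compile(r"####\s*(-?[\d,]*\.?\d+)\s*$")
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def format_reward(completion, bonus=0.1):
    """Small bonus (assumed weight) if the completion ends with '#### <number>'."""
    return bonus if ANSWER_LINE.search(completion.strip()) else 0.0


def correctness_reward(completion, gold):
    """1.0 if the last number in the completion matches the gold answer, else 0.0."""
    numbers = NUMBER.findall(completion)
    if not numbers:
        return 0.0
    pred = float(numbers[-1].replace(",", ""))
    return 1.0 if abs(pred - float(gold)) < 1e-6 else 0.0


def shaped_reward(completion, gold):
    """Correctness plus format bonus, so well-formatted wrong answers still get signal."""
    return correctness_reward(completion, gold) + format_reward(completion)


examples = {
    "right_and_formatted": shaped_reward("9 * 2 = 18\n#### 18", "18"),
    "right_unformatted": shaped_reward("The answer is 18", "18"),
    "wrong_but_formatted": shaped_reward("9 + 2 = 11\n#### 11", "18"),
    "no_answer": shaped_reward("I am not sure.", "18"),
}
```

Keeping the bonus much smaller than the correctness reward is a deliberate choice in this sketch: it nudges the model toward a parseable output format without making formatting alone more attractive than getting the answer right.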