Template

Qwen GRPO on GSM8K

A single-GPU template for teaching a Qwen base model to solve grade-school math problems using GRPO, the RL algorithm behind DeepSeek-R1. Fork it, hit launch, and you get a working RL loop you can iterate on in minutes.

OpenResearch lets you fork ML templates and launch experiments effortlessly. A project by alphaXiv.

Train GRPO on GSM8K in one click

One sandbox, one GPU, ~30 minutes to a fine-tuned Qwen3-1.7B that's measurably better at grade-school math. W&B logging is on by default, so you can watch the reward curve climb live.

Wall clock: ~30 min
Hardware: 1 GPU (≥ 24 GB VRAM)
Base model: Qwen3-1.7B
[Figure: reward (y-axis, 0.0 to 1.0) plotted against training step (x-axis, 0 to 2998).]

Training reward on GSM8K (raw, faint) with a 25-step moving average (bold). From an actual run of this template.

How GRPO + GSM8K works

GSM8K is a benchmark of 8.5K grade-school math word problems whose answers are trivially verifiable: parse the final number from the completion and assign a reward of 1 if it matches the gold answer, else 0. GRPO (Group Relative Policy Optimization) turns that sparse signal into an update by sampling a group of completions per prompt, scoring each one, and reinforcing those that beat the group average while discouraging those that fall below it. No critic network, no reward model, no human labels.
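The reward rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the template's actual reward code: the helper names `extract_final_number` and `correctness_reward` are hypothetical, and it assumes the gold answer is a string in the GSM8K style, ending in `#### <number>`.

```python
import re

# Integers or decimals, optional sign, optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number found in `text` as a float, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def correctness_reward(completion, gold_answer):
    """Sparse reward: 1.0 if the final numbers match, else 0.0."""
    predicted = extract_final_number(completion)
    gold = extract_final_number(gold_answer)
    if predicted is None or gold is None:
        return 0.0  # nothing parseable counts as wrong
    return 1.0 if abs(predicted - gold) < 1e-6 else 0.0


print(correctness_reward("3 + 15 = 18, so the answer is 18.", "#### 18"))
print(correctness_reward("The answer is 20.", "#### 18"))
```

Taking the last number in the completion is a deliberately crude parsing choice. A stricter parser would require an explicit answer marker, which trades some false positives for more zero-reward rollouts early in training.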

Prompt: "Janet has 3 apples…"

Group of k rollouts:

completion 1    r = 1    advantage = +0.50
completion 2    r = 0    advantage = −0.50
completion 3    r = 1    advantage = +0.50
completion 4    r = 0    advantage = −0.50

mean = 0.5 → advantage = r − mean

Policy update: ↑ winners, ↓ losers
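The group-relative advantage in the diagram is just each rollout's reward minus the group mean. The sketch below reproduces those numbers. The function name `group_advantages` and its `normalize` and `eps` parameters are illustrative assumptions. The optional branch divides by the group standard deviation, as in the original GRPO formulation. Whether this template enables that scaling is not stated here.

```python
def group_advantages(rewards, normalize=False, eps=1e-4):
    """Compute per-rollout advantages for one prompt's group of rewards."""
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mean = sum(rewards) / len(rewards)
    centered = [r - mean for r in rewards]
    if not normalize:
        return centered  # advantage = r - mean, as in the diagram
    # Optional: scale by the group's (population) standard deviation.
    std = (sum(c * c for c in centered) / len(rewards)) ** 0.5
    return [c / (std + eps) for c in centered]


print(group_advantages([1, 0, 1, 0]))  # [0.5, -0.5, 0.5, -0.5]
print(group_advantages([1, 1, 1, 1]))  # [0.0, 0.0, 0.0, 0.0]
```

The second call shows a practical consequence of the sparse reward: when every rollout in a group gets the same score, all advantages are zero and that prompt contributes no learning signal for the step.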

Under the hood

Algorithm: GRPO (TRL GRPOTrainer), 16 rollouts per prompt
Base model: unsloth/Qwen3-1.7B-Base
Fine-tuning: LoRA adapters via Unsloth + vLLM rollouts
Reward: Sparse correctness: 1 if final answer matches, else 0
Schedule: 1000 steps, cosine LR decay peaking at 5e-6
Hardware: 1 GPU (≥ 24 GB VRAM)
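To make the schedule row concrete, here is a rough sketch of a cosine decay from a 5e-6 peak over 1000 steps. It is a standalone illustration, not the scheduler the template actually uses. The function name `cosine_lr` is hypothetical, and `warmup_steps` is an assumed parameter that defaults to 0 because the warmup length is not given here.

```python
import math


def cosine_lr(step, total_steps=1000, peak_lr=5e-6, warmup_steps=0):
    """Learning rate at `step`: optional linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        # Assumed linear ramp from 0 up to the peak.
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step))
```

With no warmup the rate starts at the peak, passes through half the peak at step 500, and reaches zero at step 1000.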

Once it's running, stack child experiments to swap the base model, change reward shaping (format reward, partial credit, etc.), tune LR / KL / group size, or swap GSM8K for MATH, AIME, or your own verifiable dataset.
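As one example of what a reward-shaping child experiment might try, the sketch below adds a small format bonus and partial credit for near misses on top of the correctness reward. Everything in it is an assumption for illustration: the function name `shaped_reward`, the bonus values, the 5% tolerance, and the requirement that the model end its answer with a GSM8K-style `#### <number>` line.

```python
import re

# Hypothetical answer format: a "#### <number>" marker in the completion.
_ANSWER_LINE = re.compile(r"####\s*(-?\d[\d,]*(?:\.\d+)?)")


def shaped_reward(completion, gold, format_bonus=0.1,
                  near_miss_credit=0.3, tolerance=0.05):
    """Correctness reward plus a format bonus and partial credit."""
    match = _ANSWER_LINE.search(completion)
    if match is None:
        return 0.0  # no answer line: no format bonus, no credit
    reward = format_bonus  # followed the expected output format
    predicted = float(match.group(1).replace(",", ""))
    if predicted == gold:
        reward += 1.0  # full credit for the right answer
    elif gold != 0 and abs(predicted - gold) / abs(gold) <= tolerance:
        reward += near_miss_credit  # within 5% of the gold answer
    return reward


print(shaped_reward("3 + 15 = 18\n#### 18", 18.0))
print(shaped_reward("I think it is about 40\n#### 40", 18.0))
print(shaped_reward("No idea.", 18.0))
```

Shaped rewards give the policy a gradient before it gets any answer fully right, but they can also be gamed, so it is worth comparing the exact-match accuracy of a shaped run against the sparse baseline.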