Qwen GRPO on GSM8K

A single-GPU template for teaching a Qwen base model to solve grade-school math problems using GRPO, the RL algorithm behind DeepSeek-R1. Fork it, hit launch, and you get a working RL loop you can iterate on in minutes.


OpenResearch lets you fork ML templates and launch experiments effortlessly. A project by alphaXiv.

Train GRPO on GSM8K in one click

One sandbox, one GPU, and about 30 minutes to a fine-tuned Qwen3-1.7B that's measurably better at grade-school math. W&B logging is on by default, so you can watch the reward curve climb live.

~30 min wall clock

1 GPU, ≥ 24 GB VRAM

Qwen3-1.7B base model

Training reward on GSM8K (raw, faint) with a 25-step moving average (bold). From an actual run of this template.

How GRPO + GSM8K works

GSM8K is a benchmark of 8.5K grade-school math word problems where the reward is trivially verifiable: parse the final number from the completion and score it 1 if it matches the gold answer, else 0. GRPO (Group Relative Policy Optimization, the RL algorithm behind DeepSeek-R1) turns that sparse signal into an update by sampling a group of completions per prompt, subtracting the group's mean reward from each completion's reward, and reinforcing the completions that beat that average. No critic network, no reward model, no human labels.
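The reward described above can be sketched in a few lines. This is a minimal illustration, not the template's actual reward code: the helper names `extract_final_number` and `correctness_reward` are hypothetical, and it assumes "final number" means the last number that appears in the text. GSM8K gold answers end in `#### <number>`, so the same parser can be applied to both sides.

```python
import re

# Integers or decimals, with an optional sign and thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number in `text` as a float, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def correctness_reward(completion, gold_answer):
    """Sparse reward: 1.0 if the final number matches the gold answer, else 0.0."""
    predicted = extract_final_number(completion)
    gold = extract_final_number(gold_answer)
    if predicted is None or gold is None:
        return 0.0
    return 1.0 if abs(predicted - gold) < 1e-6 else 0.0
```

A completion with no parseable number scores 0, the same as a wrong answer, which is what keeps the signal sparse.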

Prompt: "Janet has 3 apples…"

→ Group of k rollouts

Completion	Reward r	Advantage
completion 1	1	+0.50
completion 2	0	−0.50
completion 3	1	+0.50
completion 4	0	−0.50

mean = 0.5 → advantage = r − mean

→ Policy update: ↑ winners, ↓ losers
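The arithmetic in the diagram is small enough to check by hand. The sketch below reproduces it with plain mean subtraction, as shown above; the function name `group_relative_advantages` is made up for illustration, and many GRPO implementations additionally divide by the group's standard deviation, which this sketch leaves out.

```python
def group_relative_advantages(rewards):
    """Advantage of each rollout = its reward minus the group's mean reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# The four rollouts from the diagram: two correct, two incorrect.
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))  # [0.5, -0.5, 0.5, -0.5]
```

One consequence worth knowing: if every rollout in a group gets the same reward (all right or all wrong), every advantage is zero and that prompt contributes no learning signal.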

Under the hood

Algorithm	GRPO (TRL GRPOTrainer), 16 rollouts per prompt
Base model	unsloth/Qwen3-1.7B-Base
Fine-tuning	LoRA adapters via Unsloth + vLLM rollouts
Reward	Sparse correctness: 1 if final answer matches, else 0
Schedule	1000 steps, cosine LR decay peaking at 5e-6
Hardware	1 GPU (≥ 24 GB VRAM)
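To get a feel for the schedule row, here is a rough sketch of a cosine decay over 1000 steps from a 5e-6 peak. It assumes no warmup and a decay all the way to zero, neither of which the table confirms, and `cosine_lr` is a hypothetical helper rather than anything from TRL or Unsloth.

```python
import math


def cosine_lr(step, total_steps=1000, peak_lr=5e-6, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of the run.
for step in (0, 500, 1000):
    print(step, cosine_lr(step))
```

Under these assumptions the learning rate is at half its peak by step 500, so most of the aggressive updates happen early in the run.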

Once it's running, stack child experiments to swap the base model, change reward shaping (format reward, partial credit, etc.), tune LR / KL / group size, or swap GSM8K for MATH, AIME, or your own verifiable dataset.
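As one example of what a reward-shaping child experiment might look like, the sketch below adds a small format bonus on top of correctness. Everything here is an assumption made for illustration: the `Answer: <number>` output convention, the 0.1 bonus, and the `shaped_reward` name are not part of the template.

```python
import re

# Hypothetical convention: the completion must end with a line like "Answer: 42".
_ANSWER_LINE = re.compile(r"Answer:\s*(-?\d+(?:\.\d+)?)\s*$")


def shaped_reward(completion, gold, format_bonus=0.1):
    """Correctness (0 or 1) plus a small bonus for following the answer format."""
    match = _ANSWER_LINE.search(completion.strip())
    if match is None:
        return 0.0  # no recognizable answer line: no bonus, no credit
    correct = 1.0 if abs(float(match.group(1)) - gold) < 1e-6 else 0.0
    return correct + format_bonus
```

A well-formatted wrong answer now earns 0.1 instead of 0, which gives the policy something to climb toward before it starts getting answers right. The trade-off is that the reward is no longer purely a measure of correctness.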
