Template
Qwen DAPO on Math-17K
A single-GPU template for RL fine-tuning a Qwen base model on competition math with DAPO, the open-source recipe from ByteDance Seed and Tsinghua AIR. In the original paper, DAPO took Qwen2.5-32B to 50 points on AIME 2024, ahead of DeepSeek-R1-Zero-Qwen-32B's 47, in half the training steps. Fork it, hit launch, and you get the full SFT-then-DAPO pipeline.
Train DAPO on Math-17K in one click
One sandbox, one GPU, an SFT warmup followed by DAPO RL on the 17K-problem math set. W&B logging on by default, so you can watch the reward curve climb live.
Training reward on DAPO-Math-17K, 25-step moving average. From an actual run of this template.
How DAPO works
DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) is GRPO with four targeted fixes for long chain-of-thought RL. Same group sampling and verifiable rewards as GRPO, but each technique below patches a failure mode that otherwise stalls training.
Clip-Higher: decouples the upper and lower PPO clip ranges and raises the upper one, so low-probability tokens can still gain probability, keeping exploration alive.
Dynamic Sampling: oversamples prompts and drops groups where every rollout is right or every rollout is wrong, so each batch carries real signal.
Token-Level Loss: averages the policy-gradient loss over tokens, not samples, so long reasoning traces aren't down-weighted.
Overlong Reward Shaping: applies a soft, length-aware penalty as responses approach the length limit, instead of a hard zero on truncation, cutting reward noise.
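The clip, sampling, and loss-averaging fixes above can be sketched in a few lines of plain Python. This is a minimal illustration, not the template's actual training code: the function names `dapo_token_loss` and `keep_group` are hypothetical, and the default clip values (0.2 lower, 0.28 upper) are the ones reported in the DAPO paper, which this template may or may not use.

```python
import math


def dapo_token_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate loss averaged over all tokens in a group.

    logp_new / logp_old: one list of per-token log-probs per rollout,
        under the current and the sampling policy respectively.
    advantages: one scalar advantage per rollout.
    eps_low / eps_high: decoupled clip ranges (assumed paper defaults).
    """
    total, n_tokens = 0.0, 0
    for new_seq, old_seq, adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new_seq, old_seq):
            ratio = math.exp(lp_new - lp_old)
            # Decoupled clip: the upper bound is looser than the lower one.
            clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    # Token-level averaging: divide by the token count, not the rollout count.
    return -total / n_tokens


def keep_group(rewards):
    """Dynamic sampling filter: keep a group only if its outcomes are mixed."""
    return min(rewards) != max(rewards)
```

With unchanged log-probs, a 3-token rollout with advantage +1 and a 1-token rollout with advantage -1 give a loss of -0.5 under this token-level average, where a per-sample average would give 0. That is the sense in which long traces keep their weight.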
Under the hood
| Setting | Value |
|---|---|
| Algorithm | DAPO (decoupled clip + dynamic sampling), 8 rollouts per prompt |
| Base model | unsloth/Qwen3-4B-Base |
| Pipeline | SFT warmup on OpenMathReasoning, then DAPO RL |
| Dataset | DAPO-Math-17K (filtered to ~12.7K prompts) |
| Reward | Sparse correctness: 1 if final answer matches, else 0 |
| Hardware | 1 GPU (≥ 48 GB VRAM) |
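For the reward row, here is a rough sketch of how a sparse correctness reward, a soft overlong penalty, and group-normalized advantages might fit together. All function names are hypothetical, exact string matching stands in for whatever answer checking the template really does, and the penalty follows the piecewise-linear form described in the DAPO paper rather than this template's confirmed settings.

```python
import statistics


def correctness_reward(predicted, reference):
    """Sparse reward: 1 if the final answer matches, else 0.

    Assumption: plain string equality after trimming whitespace.
    """
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def soft_overlong_penalty(length, max_len, cache_len):
    """Length-aware penalty in [-1, 0].

    Zero up to (max_len - cache_len), then linearly down to -1 at
    max_len, and -1 for anything longer.
    """
    soft_start = max_len - cache_len
    if length <= soft_start:
        return 0.0
    if length <= max_len:
        return (soft_start - length) / cache_len
    return -1.0


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's group.

    Assumption: population standard deviation plus a small epsilon.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

A group whose rewards are all identical gets all-zero advantages from `group_advantages`, which is why dynamic sampling discards such groups: they contribute no gradient.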
Once it's running, stack child experiments to swap the base model, tune the clip ranges or group size, adjust the overlong penalty, or swap DAPO-Math for your own verifiable dataset.