
Template

Qwen DAPO on Math-17K

A single-GPU template for RL fine-tuning a Qwen base model on competition math with DAPO, the open-source recipe from ByteDance and Tsinghua. In the DAPO paper, a Qwen2.5-32B model trained with this recipe outscored DeepSeek-R1-Zero-Qwen-32B on AIME 2024 using half the training steps. Fork it, hit launch, and you get the full pipeline: an SFT warmup followed by GRPO-style RL.

Train DAPO on Math-17K in one click

One sandbox, one GPU, an SFT warmup followed by DAPO RL on the 17K-problem math set. W&B logging on by default, so you can watch the reward curve climb live.

~3 hr wall clock
1 GPU, ≥ 48 GB VRAM
Qwen3-4B base model
[Figure: training reward (y-axis, 0.0 to 0.6) against training step (x-axis, 0 to 10,324).]

Training reward on DAPO-Math-17K, 25-step moving average. From an actual run of this template.

How DAPO works

DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) is GRPO with four targeted fixes for long chain-of-thought RL. Same group sampling and verifiable rewards as GRPO, but each technique below patches a failure mode that otherwise stalls training.
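The group sampling that DAPO inherits from GRPO can be sketched in a few lines: each rollout's reward is normalized against the other rollouts for the same prompt. This is a minimal illustration, not this template's code. The function name `group_advantages`, the `eps` constant, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts.

    Hypothetical helper: subtract the group mean and divide by the
    group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, 8 rollouts, 2 of them correct (reward 1).
advs = group_advantages([1, 0, 0, 0, 1, 0, 0, 0])
print([round(a, 3) for a in advs])
```

Correct rollouts get a positive advantage and incorrect ones a negative advantage. If every rollout in the group earns the same reward, every advantage is zero, which is the failure mode Dynamic Sampling targets.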

Clip-Higher: fixes entropy collapse

Decouples the upper and lower PPO clip ranges so low-probability tokens can still grow, keeping exploration alive.
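As a rough sketch of the idea, here is the per-token clipped surrogate with separate lower and upper bounds. The function name is hypothetical, and the defaults `eps_low=0.2` and `eps_high=0.28` are the values reported in the DAPO paper, not necessarily what this template ships with.

```python
def clip_higher_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate for one token, with decoupled bounds.

    ratio:     new_policy_prob / old_policy_prob for the token
    advantage: advantage assigned to the token's rollout
    eps_low / eps_high: assumed defaults taken from the DAPO paper
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic bound: take the smaller of the raw and clipped terms.
    return min(ratio * advantage, clipped * advantage)

# A token whose probability grew by 25% on a positively rewarded rollout.
print(clip_higher_objective(1.25, 1.0))                # not clipped
print(clip_higher_objective(1.25, 1.0, eps_high=0.2))  # symmetric clip cuts it off
```

With a symmetric 0.2 range the update is capped at a ratio of 1.2, while the wider upper bound lets the token keep contributing up to 1.28. The lower bound is unchanged, so the penalty side behaves as in standard PPO clipping.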

Dynamic Sampling: fixes zero-gradient batches

Oversamples prompts and drops groups where every rollout is right or every rollout is wrong, so each batch carries real signal.
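The filtering step might look something like the sketch below. `has_signal`, `fill_batch`, and the `sample_group` callback are hypothetical names. In a real trainer, `sample_group` would generate and score one group of rollouts for a fresh prompt.

```python
def has_signal(rewards):
    """A group is useful only if its rollouts did not all score the same."""
    return len(set(rewards)) > 1

def fill_batch(sample_group, batch_size, max_tries=1000):
    """Keep sampling groups until batch_size informative ones are collected.

    sample_group: zero-argument callable returning one group's rewards
                  (hypothetical stand-in for rollout generation + scoring).
    """
    batch = []
    for _ in range(max_tries):
        if len(batch) == batch_size:
            break
        rewards = sample_group()
        if has_signal(rewards):
            batch.append(rewards)
    return batch

# Toy stream of groups: all-wrong, mixed, all-right, mixed.
stream = iter([[0, 0, 0, 0], [1, 0, 0, 1], [1, 1, 1, 1], [0, 1, 0, 0]])
batch = fill_batch(lambda: next(stream), batch_size=2)
print(batch)
```

Only the two mixed groups survive. The cost is extra generation: the harder or easier the prompts are for the current policy, the more groups get thrown away before a batch fills up.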

Token-Level Loss: fixes long-CoT imbalance

Averages the policy-gradient loss over tokens, not samples, so long reasoning traces aren't down-weighted.
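A small numeric comparison makes the difference concrete. Both function names and the toy per-token loss values are made up for illustration.

```python
def sample_level_loss(per_token_losses):
    """Average within each sample first, then across samples."""
    per_sample = [sum(tokens) / len(tokens) for tokens in per_token_losses]
    return sum(per_sample) / len(per_sample)

def token_level_loss(per_token_losses):
    """Average once over every token in the batch."""
    total = sum(sum(tokens) for tokens in per_token_losses)
    count = sum(len(tokens) for tokens in per_token_losses)
    return total / count

short_trace = [1.0] * 2   # 2 tokens, loss 1.0 each
long_trace = [3.0] * 8    # 8 tokens, loss 3.0 each

print(sample_level_loss([short_trace, long_trace]))  # 2.0
print(token_level_loss([short_trace, long_trace]))   # 2.6
```

Under sample-level averaging each of the long trace's 8 tokens carries a quarter of the weight of a short-trace token. Token-level averaging gives every token equal weight, so the long trace accounts for 8 of the 10 tokens in the batch loss.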

Overlong Shaping: fixes truncation noise

Applies a soft length-aware penalty to truncated samples instead of a hard zero, cutting reward noise.
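One way to write such a penalty is the piecewise-linear form described in the DAPO paper, sketched below. The function name is hypothetical, and `max_len=20480` and `buffer_len=4096` are the paper's settings, used here only as illustrative defaults. This template's limits may differ.

```python
def soft_overlong_penalty(length, max_len=20480, buffer_len=4096):
    """Length penalty added to the correctness reward.

    No penalty up to (max_len - buffer_len) tokens, then a linear ramp
    down to -1 at max_len, and a flat -1 beyond it.
    Default values are assumptions taken from the DAPO paper.
    """
    soft_start = max_len - buffer_len
    if length <= soft_start:
        return 0.0
    if length <= max_len:
        return (soft_start - length) / buffer_len
    return -1.0

for n in (8000, 16384, 18432, 20480):
    print(n, soft_overlong_penalty(n))
```

Responses inside the budget are untouched, a response halfway through the buffer loses 0.5, and one that hits the hard limit loses the full 1.0. The ramp means a sound answer that runs slightly long is nudged rather than scored like a completely wrong one.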

Under the hood

Algorithm: DAPO (decoupled clip + dynamic sampling), 8 rollouts per prompt
Base model: unsloth/Qwen3-4B-Base
Pipeline: SFT warmup on OpenMathReasoning, then DAPO RL
Dataset: DAPO-Math-17K (filtered to ~12.7K prompts)
Reward: Sparse correctness: 1 if final answer matches, else 0
Hardware: 1 GPU (≥ 48 GB VRAM)
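For a sense of what a sparse correctness reward involves, here is a minimal sketch. It assumes completions end with a line of the form `Answer: <value>` and that answers are compared as exact strings. Both are assumptions for illustration, and the helper names are hypothetical. The template's actual answer extraction and matching rules may differ.

```python
import re

def extract_final_answer(completion):
    """Return the text after the last 'Answer:' marker, or None.

    The 'Answer:' marker is an assumed output format.
    """
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None

def correctness_reward(completion, reference):
    """1.0 if the extracted final answer equals the reference, else 0.0."""
    predicted = extract_final_answer(completion)
    if predicted is None:
        return 0.0
    return 1.0 if predicted == reference.strip() else 0.0

print(correctness_reward("So the total is 42.\nAnswer: 42", "42"))
print(correctness_reward("I think it is 41.\nAnswer: 41", "42"))
print(correctness_reward("No final line here.", "42"))
```

Exact string matching is the simplest possible check and will reject equivalent forms such as `42.0` or `\frac{84}{2}`. A production verifier would normally normalize answers before comparing.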

Once it's running, stack child experiments to swap the base model, tune the clip ranges or group size, adjust the overlong penalty, or swap DAPO-Math for your own verifiable dataset.