Qwen VPO on ToolRL

Trains a Qwen policy on the ToolRL function-calling task with Vector Policy Optimization (VPO). VPO trains the policy to produce a set of solutions that spread across the reward components rather than collapsing onto one answer, so a test-time search over the candidates keeps improving as the sample budget grows.

Train Qwen3-1.7B on the GPU of your choice and claim $30 in free compute credits on OpenResearch.

OpenResearch lets you fork ML templates and launch experiments effortlessly. A project by alphaXiv.

The training run

An SFT warmup teaches the multi-answer chain format, then GRPO optimizes a set-level reward. Each rollout emits M candidate answers in a single pass, and the set is scored by its best member under randomly sampled reward weightings, so spreading across reward components scores higher than restating one answer. W&B logging and a best@k eval are on by default.
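The set-level scoring described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the template's code: the function names, the Monte Carlo sample count, and the example reward vectors are all hypothetical, and the weightings are drawn from a uniform Dirichlet by normalizing exponential draws.

```python
import random

def sample_dirichlet_uniform(dim, rng):
    """Draw one weighting from Dir(1, ..., 1) by normalizing Exp(1) draws."""
    draws = [rng.expovariate(1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]

def set_reward(reward_vectors, num_weightings=256, seed=0):
    """Monte Carlo estimate of the expected best-of-set scalarized reward."""
    rng = random.Random(seed)
    dim = len(reward_vectors[0])
    total = 0.0
    for _ in range(num_weightings):
        w = sample_dirichlet_uniform(dim, rng)
        # Score the set by its best member under this sampled weighting.
        total += max(sum(wi * ri for wi, ri in zip(w, r)) for r in reward_vectors)
    return total / num_weightings

# Hypothetical 4-dim reward vectors for M = 3 candidates in one rollout.
restated = [[0.6, 0.6, 0.6, 0.6]] * 3            # one answer repeated three times
specialized = [
    [1.0, 1.0, 0.2, 0.2],                        # strong on the first two components
    [0.2, 0.2, 1.0, 1.0],                        # strong on the last two components
    [0.6, 0.6, 0.6, 0.6],                        # balanced fallback
]

print(set_reward(restated), set_reward(specialized))
```

Because the specialized set contains the balanced answer plus two specialists, its best member is at least as good as the restated set's under every weighting, so its estimated set reward comes out higher.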

Reward vector: 4-dim
Hardware: 1 GPU, ≥ 24 GB VRAM
Base model: Qwen3-1.7B
[Plot: reward (0.0 to 1.0) against training step (0 to 200)]

Set reward (smoothed) over the first 200 GRPO steps on ToolRL. From an actual run of this template.

How VPO differs from GRPO

Scalar RL optimizes one fixed reward weighting and concentrates probability mass onto a narrow set of responses, so additional samples are near-duplicates and search has nothing new to exploit. VPO keeps the reward vector-valued: for each rollout it samples a random weighting from a Dirichlet distribution over the simplex and rewards the candidate set by its best member under that weighting. A collapsed set wins only in a narrow slice of the simplex, while a set spanning the Pareto frontier wins across many weightings, so the policy learns to keep diverse, specialized candidates.

[Figure: two scatter panels, "Scalar GRPO: mode-collapsed set" and "VPO: Pareto-spread set", each plotting reward dim A against reward dim B]

Each dot is a candidate's reward vector. GRPO clusters candidates in one corner; VPO spreads them along the frontier, so the best member under any deployment weighting is strong.
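The geometric argument can be checked numerically in two reward dimensions. The sketch below is illustrative only: the candidate reward vectors are made up, `win_fraction` is a hypothetical helper, and it relies on the fact that a uniform Dirichlet over two components is a uniform draw of the first weight from [0, 1).

```python
import random

def best_scalarized(candidates, w):
    """Best weighted-sum reward among a set of candidate reward vectors."""
    return max(sum(wi * ri for wi, ri in zip(w, r)) for r in candidates)

def win_fraction(set_a, set_b, num_weightings=2000, seed=0):
    """Fraction of sampled weightings where set_a's best member beats set_b's."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(num_weightings):
        w_a = rng.random()            # Dir(1, 1): uniform on w_a + w_b = 1
        w = (w_a, 1.0 - w_a)
        if best_scalarized(set_a, w) > best_scalarized(set_b, w):
            wins += 1
    return wins / num_weightings

# Hypothetical (dim A, dim B) reward vectors.
collapsed = [(0.90, 0.30), (0.89, 0.31), (0.91, 0.29)]   # near-duplicates in one corner
spread = [(0.95, 0.10), (0.70, 0.70), (0.10, 0.95)]      # spans the frontier

print("collapsed wins:", win_fraction(collapsed, spread))
print("spread wins:   ", win_fraction(spread, collapsed))
```

With these numbers the collapsed set is ahead only for a band of weightings that favor dim A moderately; the spread set wins everywhere else, which is the behavior the set reward is meant to encourage.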

Under the hood

Algorithm: VPO (multi-answer chains + stochastic scalarization) on top of GRPO
Objective: Set reward R(S) = expected best-of-set scalarized reward over w ~ Dir(1), rewarding Pareto-frontier coverage
Base model: unsloth/Qwen3-1.7B
Fine-tuning: LoRA adapters via Unsloth + vLLM rollouts
Reward: 4-dim ToolRL vector: format, tool-name F1, param-name F1, param-value correctness
Schedule: SFT warmup, then ~1000 GRPO steps, 16 rollouts × 3 candidates
Hardware: 1 GPU (≥ 24 GB VRAM)
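To show how a set reward might feed the GRPO update, here is a rough sketch of group-relative advantage normalization as it appears in the original GRPO formulation: each rollout's reward is centered on the group mean and divided by the group standard deviation. Treating the 16 rollouts as one group, the `eps` constant, and the reward values are all assumptions for illustration; the template's actual normalization may differ.

```python
import statistics

def group_relative_advantages(set_rewards, eps=1e-6):
    """Center each rollout's set reward on the group mean and scale by the group std."""
    mean = statistics.fmean(set_rewards)
    std = statistics.pstdev(set_rewards)
    # eps keeps the division finite when every rollout scores the same.
    return [(r - mean) / (std + eps) for r in set_rewards]

# Hypothetical set rewards for a group of 16 rollouts on one prompt.
rewards = [0.42, 0.55, 0.61, 0.48, 0.70, 0.52, 0.66, 0.45,
           0.58, 0.63, 0.50, 0.74, 0.47, 0.60, 0.53, 0.68]

advantages = group_relative_advantages(rewards)
best = max(range(len(rewards)), key=rewards.__getitem__)
print(best, advantages[best])
```

Rollouts whose candidate sets cover the reward components better than their group peers receive positive advantages, and the rest receive negative ones.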

Once it's running, stack child experiments: loosen the vLLM sampler to diversify candidates, add an explicit pairwise-L1 diversity bonus to the set reward, bump VPO_M / VPO_K, or swap ToolRL for any task with a genuinely multi-dimensional reward.
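One of the child experiments above, the pairwise-L1 diversity bonus, could look something like the following. It is a sketch under assumptions: the bonus is computed over the candidates' reward vectors (the text does not say which representation to use), and `bonus_weight` and both function names are hypothetical.

```python
from itertools import combinations

def pairwise_l1_diversity(reward_vectors):
    """Mean L1 distance over all pairs of candidate reward vectors."""
    pairs = list(combinations(reward_vectors, 2))
    if not pairs:
        return 0.0   # a single candidate has no pairwise diversity
    total = sum(sum(abs(a - b) for a, b in zip(u, v)) for u, v in pairs)
    return total / len(pairs)

def set_reward_with_bonus(base_set_reward, reward_vectors, bonus_weight=0.1):
    """Add a scaled diversity term on top of an already computed set reward."""
    return base_set_reward + bonus_weight * pairwise_l1_diversity(reward_vectors)

# Hypothetical candidate sets with two reward dimensions each.
identical = [[0.6, 0.6], [0.6, 0.6], [0.6, 0.6]]
opposed = [[1.0, 0.0], [0.0, 1.0]]

print(pairwise_l1_diversity(identical), pairwise_l1_diversity(opposed))
```

A small `bonus_weight` is probably the safer starting point, since a large one could reward candidates for being different rather than for being good.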