Qwen VPO on ToolRL
Trains a Qwen policy on the ToolRL function-calling task with Vector Policy Optimization (VPO). VPO trains the policy to produce a set of solutions that spread across the reward components rather than collapsing onto one answer, so a test-time search over the candidates keeps improving as the sample budget grows.
Train Qwen3-1.7B on the GPU of your choice and claim $30 in free compute credits on OpenResearch.
OpenResearch lets you fork ML templates and launch experiments effortlessly. A project by alphaXiv.
The training run
An SFT warmup teaches the multi-answer chain format, then GRPO optimizes a set-level reward. Each rollout emits M candidate answers in a single pass, and the set is scored by its best member under randomly sampled reward weightings, so spreading across reward components scores higher than restating one answer. W&B logging and a best@k eval are on by default.
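As a rough illustration of the GRPO step, the set rewards of the rollouts for one prompt can be turned into group-relative advantages by standardizing them within the group. This is a minimal sketch of the standard GRPO normalization, not the template's actual code; the function name `group_relative_advantages` and the `eps` value are assumptions made for the example.

```python
import numpy as np

def group_relative_advantages(set_rewards, eps=1e-6):
    """Standardize set rewards within one prompt's rollout group.

    set_rewards: one scalar set reward per rollout for the same prompt.
    eps is a hypothetical stabilizer for groups with identical rewards.
    """
    r = np.asarray(set_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Rollouts whose candidate sets score above the group mean get a positive advantage.
advantages = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
```

Because the advantage is relative to the group, a rollout is pushed up only when its whole candidate set beats the other sets sampled for the same prompt.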
Set reward (smoothed) over the first 200 GRPO steps on ToolRL. From an actual run of this template.
How VPO differs from GRPO
Scalar RL optimizes one fixed reward weighting and concentrates probability mass onto a narrow set of responses, so additional samples are near-duplicates and search has nothing new to exploit. VPO keeps the reward vector-valued: for each rollout it samples a random weighting from a Dirichlet distribution over the simplex and rewards the set by its best member under that weighting. A collapsed set wins only in a narrow slice of the simplex, while a set spanning the Pareto frontier wins across many weightings, so the policy learns to keep diverse, specialized candidates.
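The effect of stochastic scalarization can be checked numerically. The sketch below estimates the expected best-of-set reward by averaging over many sampled weightings, whereas the description above samples one weighting per rollout. The function name `set_reward`, the `num_weightings` parameter, and the two example sets are hypothetical and chosen only to illustrate the idea.

```python
import numpy as np

def set_reward(reward_vectors, num_weightings=1024, seed=0):
    """Monte Carlo estimate of E_{w ~ Dir(1)}[max_i w . r_i] for one candidate set."""
    rng = np.random.default_rng(seed)
    r = np.asarray(reward_vectors, dtype=float)           # shape (M, D)
    # Each row of w is one random weighting on the D-dimensional simplex.
    w = rng.dirichlet(np.ones(r.shape[1]), size=num_weightings)  # shape (K, D)
    scalarized = w @ r.T                                  # shape (K, M)
    # Score the set by its best member under each weighting, then average.
    return float(scalarized.max(axis=1).mean())

# Hypothetical 4-dim reward vectors for M = 3 candidates.
collapsed = [[0.6, 0.6, 0.6, 0.6]] * 3                    # three copies of one answer
spread = [[1.0, 1.0, 0.2, 0.2],                           # specialist on dims 0-1
          [0.2, 0.2, 1.0, 1.0],                           # specialist on dims 2-3
          [0.6, 0.6, 0.6, 0.6]]                           # generalist

print(set_reward(collapsed), set_reward(spread))
```

In this toy case the collapsed set scores about 0.6 under every weighting, while the spread set matches it in the worst case and beats it whenever the weighting favors one pair of components, so its average is higher.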
Each dot is a candidate's reward vector. GRPO clusters candidates in one corner; VPO spreads them along the frontier, so the best member under any deployment weighting is strong.
Under the hood
| Component | Details |
| --- | --- |
| Algorithm | VPO (multi-answer chains + stochastic scalarization) on top of GRPO |
| Objective | Set reward R(S) = expected best-of-set scalarized reward over w ~ Dir(1), rewarding Pareto-frontier coverage |
| Base model | unsloth/Qwen3-1.7B |
| Fine-tuning | LoRA adapters via Unsloth + vLLM rollouts |
| Reward | 4-dim ToolRL vector: format, tool-name F1, param-name F1, param-value correctness |
| Schedule | SFT warmup, then ~1000 GRPO steps, 16 rollouts × 3 candidates |
| Hardware | 1 GPU (≥ 24 GB VRAM) |
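Written out, the Objective row corresponds to roughly the following expression. The notation is an assumption made here for readability: S is the set of M candidates from one rollout, r(y) is the 4-dimensional ToolRL reward vector of candidate y, and w is a weighting drawn uniformly from the simplex.

$$
R(S) = \mathbb{E}_{w \sim \mathrm{Dir}(\mathbf{1})}\left[\max_{y \in S} \; w^{\top} r(y)\right], \qquad S = \{y_1, \dots, y_M\}, \quad r(y) \in \mathbb{R}^{4}
$$

Since the maximum is taken inside the expectation, a set gains from any candidate that is best under some weighting, even if that candidate is weak under most others.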
Once it's running, stack child experiments: loosen the vLLM sampler to diversify candidates, add an explicit pairwise-L1 diversity bonus to the set reward, bump VPO_M / VPO_K, or swap ToolRL for any task with a genuinely multi-dimensional reward.
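One possible shape for the pairwise-L1 diversity bonus is sketched below. It assumes the bonus is computed on the candidates' reward vectors and added to the set reward with a small coefficient; the function name `pairwise_l1_bonus` and the `coef` default are hypothetical, not part of the template.

```python
import numpy as np

def pairwise_l1_bonus(reward_vectors, coef=0.1):
    """Mean pairwise L1 distance between candidates' reward vectors, scaled by coef."""
    r = np.asarray(reward_vectors, dtype=float)           # shape (M, D)
    m = r.shape[0]
    if m < 2:
        return 0.0                                        # a single candidate has no pairs
    # dists[i, j] is the L1 distance between candidates i and j.
    dists = np.abs(r[:, None, :] - r[None, :, :]).sum(axis=-1)
    upper = np.triu_indices(m, k=1)                       # each unordered pair once
    return float(coef * dists[upper].mean())

# Identical candidates earn no bonus; candidates with different strengths do.
bonus = pairwise_l1_bonus([[1.0, 0.0], [0.0, 1.0]], coef=1.0)
```

A bonus like this rewards spread directly, so the coefficient would likely need to stay small enough that the policy cannot raise the set reward simply by emitting deliberately poor candidates.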