
Qwen VPO on ToolRL

Trains a Qwen policy on the ToolRL function-calling task with Vector Policy Optimization (VPO). VPO trains the policy to produce a set of solutions that spread across the reward components rather than collapsing onto one answer, so a test-time search over the candidates keeps improving as the sample budget grows.

Train Qwen3-1.7B on the GPU of your choice and claim $30 in free compute credits on OpenResearch.


OpenResearch lets you fork ML templates and launch experiments effortlessly. A project by alphaXiv.

The training run

An SFT warmup teaches the multi-answer chain format, then GRPO optimizes a set-level reward. Each rollout emits M candidate answers in a single pass, and the set is scored by its best member under randomly sampled reward weightings, so spreading across reward components scores higher than restating one answer. W&B logging and a best@k eval are on by default.
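The GRPO stage can be pictured as normalizing each rollout's set reward against the other rollouts for the same prompt. The sketch below shows the standard group-relative advantage used by GRPO, applied to set-level rewards; the function name and the `eps` constant are illustrative assumptions, not identifiers taken from this template.

```python
import numpy as np

def group_relative_advantages(set_rewards, eps=1e-6):
    """Standard GRPO-style advantage: z-score each reward within its group.

    set_rewards: one scalar set reward per rollout of the same prompt.
    eps is an assumed stabilizer for groups whose rewards are all equal.
    """
    r = np.asarray(set_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Three rollouts for one prompt: the rollout whose candidate set scored
# highest gets the largest positive advantage.
advantages = group_relative_advantages([0.2, 0.5, 0.8])
```

Because the advantage is relative to the group, a rollout is pushed up only if its candidate set outscores its siblings, which is what lets the set-level reward shape the policy.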

4-dim reward vector · 1 GPU (≥ 24 GB VRAM) · Qwen3-1.7B base model

Set reward (smoothed) over the first 200 GRPO steps on ToolRL. From an actual run of this template.

How VPO differs from GRPO

Scalar RL optimizes one fixed reward weighting and concentrates probability mass onto a narrow set of responses, so additional samples are near-duplicates and search has nothing new to exploit. VPO keeps the reward vector-valued: for each rollout, it samples a random weighting from a Dirichlet distribution over the simplex and rewards the set by its best member under that weighting. A collapsed set wins only in a narrow slice of the simplex, while a set spanning the Pareto frontier wins across many weightings, so the policy learns to keep diverse, specialized candidates.
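To make the comparison concrete, here is a minimal sketch of a set reward computed as a Monte Carlo average of the best scalarized member over weightings drawn from Dir(1). The function name, the `num_weightings` parameter, and the toy reward vectors are assumptions for illustration; the template's own implementation may sample and aggregate differently.

```python
import numpy as np

def set_reward(reward_vectors, num_weightings=1024, seed=0):
    """Estimate E_{w ~ Dir(1)}[ max_i  w . r_i ] for one candidate set.

    reward_vectors: M candidates, each a D-dimensional reward vector.
    num_weightings and seed are assumed knobs, not template settings.
    """
    rng = np.random.default_rng(seed)
    r = np.asarray(reward_vectors, dtype=float)                   # (M, D)
    w = rng.dirichlet(np.ones(r.shape[1]), size=num_weightings)   # (K, D)
    scalarized = w @ r.T                                          # (K, M)
    # Score the set by its best member under each weighting, then average.
    return float(scalarized.max(axis=1).mean())

# Toy 4-dim reward vectors: one set repeats a single answer, the other
# holds three candidates that each specialize in a different component.
collapsed = [[0.9, 0.1, 0.1, 0.1]] * 3
spread = [
    [0.9, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.1, 0.1],
    [0.1, 0.1, 0.9, 0.1],
]
collapsed_score = set_reward(collapsed)
spread_score = set_reward(spread)
```

With the same sampled weightings, the spread set scores at least as high as the collapsed one on every draw, because its best member is chosen from a superset of reward vectors. That is the pressure toward diverse candidates described above.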

Figure: two panels, "Scalar GRPO: mode-collapsed set" and "VPO: Pareto-spread set". Each dot is a candidate's reward vector. GRPO clusters candidates in one corner; VPO spreads them along the frontier, so the best member under any deployment weighting is strong.

Under the hood

Algorithm	VPO (multi-answer chains + stochastic scalarization) on top of GRPO
Objective	Set reward R(S) = expected best-of-set scalarized reward over w ~ Dir(1), rewarding Pareto-frontier coverage
Base model	unsloth/Qwen3-1.7B
Fine-tuning	LoRA adapters via Unsloth + vLLM rollouts
Reward	4-dim ToolRL vector: format, tool-name F1, param-name F1, param-value correctness
Schedule	SFT warmup, then ~1000 GRPO steps, 16 rollouts × 3 candidates
Hardware	1 GPU (≥ 24 GB VRAM)
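For readers unfamiliar with the F1 components of the reward vector, the sketch below computes a set-based F1 between predicted and expected names, as might be used for tool names or parameter names. This is the textbook F1 definition; the function name and the handling of empty sets are assumptions, and ToolRL's exact scoring rules may differ.

```python
def set_f1(predicted, expected):
    """Set-based F1 between predicted and expected names.

    Assumption: two empty sets count as a perfect match (1.0).
    """
    pred, gold = set(predicted), set(expected)
    if not pred and not gold:
        return 1.0
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# One correct tool plus one spurious tool: precision 0.5, recall 1.0.
score = set_f1(["get_weather", "search"], ["get_weather"])
```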

Once it's running, stack child experiments: loosen the vLLM sampler to diversify candidates, add an explicit pairwise-L1 diversity bonus to the set reward, bump VPO_M / VPO_K, or swap ToolRL for any task with a genuinely multi-dimensional reward.
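As a starting point for the diversity-bonus child experiment, the following sketch adds a mean pairwise L1 distance between candidates' reward vectors to a base set reward. It assumes the distance is taken in reward-vector space; the function names and the `bonus_coef` weight are hypothetical, not settings from this template.

```python
import numpy as np

def pairwise_l1_diversity(reward_vectors):
    """Mean L1 distance over all unordered pairs of candidate reward vectors."""
    r = np.asarray(reward_vectors, dtype=float)   # (M, D)
    m = r.shape[0]
    if m < 2:
        return 0.0
    # dists[i, j] = sum_d |r[i, d] - r[j, d]|
    dists = np.abs(r[:, None, :] - r[None, :, :]).sum(axis=-1)
    upper = np.triu_indices(m, k=1)               # each pair counted once
    return float(dists[upper].mean())

def set_reward_with_bonus(base_set_reward, reward_vectors, bonus_coef=0.1):
    """Add the diversity bonus to an already computed set reward.

    bonus_coef is an assumed hyperparameter to tune per task.
    """
    return base_set_reward + bonus_coef * pairwise_l1_diversity(reward_vectors)

identical = pairwise_l1_diversity([[1.0, 0.0], [1.0, 0.0]])   # 0.0
opposite = pairwise_l1_diversity([[1.0, 0.0], [0.0, 1.0]])    # 2.0
```

A bonus like this rewards spread directly, so it is worth keeping the coefficient small enough that candidates are not pushed toward low-quality corners of the reward space just to look different.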
