MaxRL Fable

GRPO vs. MaxRL on GSM8K (Qwen3-1.7B + LoRA): does normalizing the advantage by the group mean, rather than by the group standard deviation, help?
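
To make the question concrete, here is a minimal sketch of the two normalizations for one group of sampled completions. It is not the project's training code. It assumes a scalar reward per completion (for example 1.0 for a correct GSM8K answer and 0.0 otherwise), that both variants center rewards on the group mean, and that a small epsilon guards the division. The function names and the epsilon value are hypothetical.

```python
from statistics import fmean, pstdev

EPS = 1e-6  # assumed guard against division by zero; not taken from the project


def grpo_advantages(rewards, eps=EPS):
    """Center on the group mean, then divide by the group standard deviation."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


def mean_normalized_advantages(rewards, eps=EPS):
    """Center on the group mean, then divide by the group mean (assumed MaxRL-style form)."""
    mu = fmean(rewards)
    return [(r - mu) / (mu + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical group of four completions for one prompt, one of them correct.
    rewards = [1.0, 0.0, 0.0, 0.0]
    print("std-normalized: ", grpo_advantages(rewards))
    print("mean-normalized:", mean_normalized_advantages(rewards))
```

For binary rewards with group pass rate p, a correct sample gets sqrt((1 - p) / p) under the std form and (1 - p) / p under the mean form, so under these assumptions the mean form weights rare successes on hard prompts more heavily.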

rehaanahmad2013/qwen-maxrl-b724706e
No experiments yet.