GRPO vs MaxRL on GSM8K (Qwen3-1.7B + LoRA): does normalizing the advantage by group mean instead of std help?
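
To make the question concrete, here is a minimal sketch of the two normalizations on a single group of rewards. It assumes the mean-normalized variant has the form (r_i - mean) / mean, which is what the title implies but does not spell out. The function names, the `eps` constant, and the example group are all hypothetical, chosen for illustration only.

```python
from statistics import fmean, pstdev

EPS = 1e-6  # assumed small constant to avoid division by zero


def std_normalized_advantages(rewards, eps=EPS):
    """GRPO-style advantage: (r_i - group mean) / (group std + eps)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def mean_normalized_advantages(rewards, eps=EPS):
    """Assumed MaxRL-style advantage: (r_i - group mean) / (group mean + eps)."""
    mu = fmean(rewards)
    return [(r - mu) / (mu + eps) for r in rewards]


# Hypothetical group of 8 sampled completions with binary rewards, 1 correct.
group = [1, 0, 0, 0, 0, 0, 0, 0]
std_adv = std_normalized_advantages(group)
mean_adv = mean_normalized_advantages(group)
```

On this hard-prompt group (pass rate 1/8), the correct sample gets an advantage of roughly 2.65 under std normalization and roughly 7 under mean normalization, so under this assumed form the mean-normalized variant puts more weight on rare successes. A group with no correct samples gets all-zero advantages in both cases.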