
![WenSun1 Avatar](https://lunarcrush.com/gi/w:24/cr:twitter::824609918.png) Wen Sun [@WenSun1](/creator/twitter/WenSun1) on X · XXX followers
Created: 2025-07-16 02:45:37 UTC

Does RL actually learn positively under random rewards when optimizing Qwen on MATH? Is Qwen really so magical that even RL on random rewards can make it reason better?

Following prior work on spurious rewards in RL, we ablated algorithms. It turns out that if you deploy algorithms like REINFORCE and REBEL (a generalization of Natural Policy Gradient), RL does not learn under random rewards. These two simple algorithms behave exactly as we would expect in this case.
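
For intuition, here is the standard argument for why an unbiased policy-gradient method should not learn here: if the reward $r$ is drawn independently of the sampled response $y$, the expected REINFORCE gradient vanishes.

$$
\nabla_\theta J(\theta)
= \mathbb{E}_{y \sim \pi_\theta}\big[\, r \,\nabla_\theta \log \pi_\theta(y) \,\big]
= \mathbb{E}[r]\;\underbrace{\mathbb{E}_{y \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(y)\big]}_{=\;\nabla_\theta \sum_y \pi_\theta(y)\;=\;\nabla_\theta 1\;=\;0}
= 0.
$$

So an unbiased estimator of this gradient only adds zero-mean noise to the parameters; there is nothing systematic to learn from random rewards.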

GRPO and PPO can indeed behave strangely. They can learn positively or negatively depending on the random seed. The clipping heuristic introduces a bias into the objective function, which causes such unexpected behavior (this happens even in a bandit setting, which has nothing to do with LLMs or reasoning). Perhaps it is time to abandon the clipping heuristic...
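
To see how the clip alone can break unbiasedness, here is a minimal, hypothetical bandit simulation (my own sketch, not the experiments behind this post; the function names and all hyperparameters are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def run(seed, clip_eps=0.2, steps=500, batch=64, inner_epochs=4, lr=0.1):
    """Two-armed bandit with rewards drawn independently of the action.

    Applies a PPO/GRPO-style clipped surrogate with several inner epochs
    per batch. Because the clip zeroes gradients asymmetrically (positive
    advantages at ratio > 1+eps, negative advantages at ratio < 1-eps),
    the averaged update is not an unbiased gradient of the true objective,
    so the policy can drift even though the rewards carry no signal.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)  # policy logits
    for _ in range(steps):
        pi_old = softmax(theta)
        a = rng.choice(2, size=batch, p=pi_old)    # sample actions
        r = rng.standard_normal(batch)             # reward independent of a
        adv = (r - r.mean()) / (r.std() + 1e-8)    # GRPO-style group normalization
        for _ in range(inner_epochs):              # reuse the batch, PPO-style
            pi = softmax(theta)
            ratio = pi[a] / pi_old[a]
            # grad of min(ratio*adv, clip(ratio)*adv) is zero where the clip binds
            active = ~(((adv > 0) & (ratio > 1 + clip_eps)) |
                       ((adv < 0) & (ratio < 1 - clip_eps)))
            dlogp = np.eye(2)[a] - pi              # d/dtheta log pi(a), softmax policy
            w = np.where(active, ratio * adv, 0.0)
            theta += lr * (w[:, None] * dlogp).mean(axis=0)
    return softmax(theta)[0]

print("final P(action 0) across seeds:",
      np.round([run(s) for s in range(10)], 3))
# Setting inner_epochs=1 and clip_eps=np.inf recovers plain REINFORCE,
# whose update is unbiased here, so P(action 0) stays near 0.5 on average.
```

The key ingredients are the ratio clip plus multiple inner epochs per batch: once the ratio moves away from 1, the clip discards gradients asymmetrically between positive and negative advantages, so the surviving terms no longer average to zero even though the rewards are pure noise.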


XXXXXX engagements

![Engagements Line Chart](https://lunarcrush.com/gi/w:600/p:tweet::1945313845804724697/c:line.svg)

**Related Topics**
[qwen](/topic/qwen)
[rl](/topic/rl)
[wen](/topic/wen)

[Post Link](https://x.com/WenSun1/status/1945313845804724697)
