Rohan Paul @rohanpaul_ai on X, 73.3K followers
Created: 2025-07-14 10:05:04 UTC
Shared layers mean reasoning and judging run together without extra latency.
MetaStone-S1 folds the reward model into the policy itself, so a 32B network scores its own reasoning and reaches o3‑mini‑level math with XX% fewer extra parameters.
Traditional test-time scaling uses a separate 7B‑72B process reward model that is costly to train, slow to run, and often misaligned with the policy it judges.
The paper introduces a reflective generative form in which all transformer layers stay shared. Two lightweight heads handle token prediction and a X to X step score, adding only 5‑26M weights.
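The shared-trunk idea can be sketched in a few lines: one forward pass through the shared layers feeds both a token-prediction head and a tiny scoring head, so judging costs no extra trunk compute. Everything below is illustrative (toy dimensions, stand-in math), not the paper's actual architecture or code.

```python
# Sketch: one shared trunk feeds two lightweight heads, so scoring a
# reasoning step reuses the same forward pass as token prediction.
# All names, shapes, and weights here are hypothetical stand-ins.
import random

HIDDEN = 16  # toy hidden size; the real backbones use thousands of dims


def shared_trunk(token_ids):
    """Stand-in for the shared transformer layers: one hidden vector
    per token (deterministic pseudo-features, seeded by the input)."""
    rng = random.Random(hash(tuple(token_ids)) & 0xFFFF)
    return [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in token_ids]


def lm_head(hidden_states, vocab_size=32):
    """Token-prediction head: logits over a toy vocabulary,
    computed from the last token's hidden state."""
    last = hidden_states[-1]
    return [sum(h * ((v + 1) / vocab_size) for h in last) for v in range(vocab_size)]


def score_head(hidden_states):
    """Step-scoring head: squashes the last hidden state to a (0, 1)
    score. The point is that this head is tiny relative to the trunk."""
    s = sum(hidden_states[-1]) / HIDDEN
    return 1.0 / (1.0 + 2.718281828459045 ** (-s))  # logistic squash


hidden = shared_trunk([3, 1, 4, 1, 5])
logits = lm_head(hidden)           # reasoning path
step_score = score_head(hidden)    # judging path: same trunk output, no extra pass
```

The design choice the tweet highlights is exactly this: because both heads read the same trunk activations, the "judge" adds only the head's parameters and essentially no latency.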
Step scores learn without manual traces. The model keeps a step only when its own prediction matches the final answer, filtering noise and producing an "aha moment" where correct and wrong paths pull apart.
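That self-supervised labeling rule is simple enough to state as code: a sampled step becomes a positive training example only when continuing from it reproduces the known final answer. The function and data shapes below are hypothetical, meant only to show the filter.

```python
# Sketch of the self-supervised step labels: keep a step (label 1) only
# when the model's own continuation from that step matches the known
# final answer; otherwise treat it as noise (label 0).
# `predict_answer_from` is a hypothetical stand-in for rolling out the model.

def label_steps(steps, final_answer, predict_answer_from):
    """Return (step, label) pairs; no human-annotated traces needed."""
    labeled = []
    for step in steps:
        keep = predict_answer_from(step) == final_answer
        labeled.append((step, 1 if keep else 0))
    return labeled


# Toy run: only the step whose rollout reaches the answer 42 is kept.
toy_predict = lambda step: step["guess"]
steps = [{"text": "s1", "guess": 42}, {"text": "s2", "guess": 7}]
labels = label_steps(steps, 42, toy_predict)
# labels -> [({'text': 's1', 'guess': 42}, 1), ({'text': 's2', 'guess': 7}, 0)]
```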
During inference it samples 2, 8, or XX chains, keeps the highest score, and lifts AIME24 from XXXX% to XXXX% on a 1.5B model. The 32B version matches o3‑mini across math, coding, and Chinese tests, and even steers Monte Carlo Tree Search to 52.8%.
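The inference loop described above is best-of-N selection: sample several chains, let the built-in score head rate each, and return the top scorer. `sample_chain` and `score_chain` below are hypothetical stand-ins for the model's generation and its scoring head.

```python
# Sketch of best-of-N test-time scaling: draw n reasoning chains and
# keep the one the model's own score head rates highest.
import random


def best_of_n(prompt, n, sample_chain, score_chain):
    """Sample n chains for `prompt` and return the highest-scoring one."""
    chains = [sample_chain(prompt) for _ in range(n)]
    return max(chains, key=score_chain)


# Toy demo with a seeded RNG standing in for the sampler and scorer.
rng = random.Random(0)
sample = lambda prompt: {"text": prompt, "quality": rng.random()}
best = best_of_n("prove it", 8, sample, lambda c: c["quality"])
```

Raising n (2, 8, or more chains) trades extra sampling compute for a better chance that one chain scores well, which is the lever behind the AIME24 gains the post reports.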
Paper: arxiv.org/abs/2507.01951
Paper Title: "Test-Time Scaling with Reflective Generative Model"
XXXXX engagements
Post Link: https://x.com/rohanpaul_ai/status/1944699659823567216