Rohan Paul @rohanpaul_ai on X, 73.3K followers
Created: 2025-07-14 10:05:04 UTC
Shared layers mean reasoning and judging run together without extra latency.
MetaStone-S1 folds the reward model into the policy itself, so a 32B network scores its own reasoning and reaches o3‑mini‑level math with XX% fewer extra parameters.
Traditional test-time scaling uses a separate 7B‑72B process reward model that is costly to train, slow to run, and often misaligned with the policy it judges.
The paper introduces a reflective generative form in which all transformer layers stay shared. Two lightweight heads handle token prediction and a X to X step score, adding only 5‑26M weights.
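The shared-trunk idea can be sketched in a few lines: one forward pass through the shared layers feeds both a token-prediction head and a tiny scoring head, so judging costs no extra trunk compute. Everything below is illustrative (toy dimensions, stand-in math), not the paper's actual architecture or code.

```python
# Sketch: one shared trunk feeds two lightweight heads, so scoring a
# reasoning step reuses the same forward pass as token prediction.
# All names, shapes, and weights here are hypothetical stand-ins.
import random

HIDDEN = 16  # toy hidden size; the real backbones use thousands of dims


def shared_trunk(token_ids):
    """Stand-in for the shared transformer layers: one hidden vector
    per token (deterministic pseudo-features, seeded by the input)."""
    rng = random.Random(hash(tuple(token_ids)) & 0xFFFF)
    return [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in token_ids]


def lm_head(hidden_states, vocab_size=32):
    """Token-prediction head: logits over a toy vocabulary,
    computed from the last token's hidden state."""
    last = hidden_states[-1]
    return [sum(h * ((v + 1) / vocab_size) for h in last) for v in range(vocab_size)]


def score_head(hidden_states):
    """Step-scoring head: squashes the last hidden state to a (0, 1)
    score. The point is that this head is tiny relative to the trunk."""
    s = sum(hidden_states[-1]) / HIDDEN
    return 1.0 / (1.0 + 2.718281828459045 ** (-s))  # logistic squash


hidden = shared_trunk([3, 1, 4, 1, 5])
logits = lm_head(hidden)           # reasoning path
step_score = score_head(hidden)    # judging path: same trunk output, no extra pass
```

The design choice the tweet highlights is exactly this: because both heads read the same trunk activations, the "judge" adds only the head's parameters and essentially no latency.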
Step scores learn without manual traces. The model keeps a step only when its own prediction matches the final answer, filtering noise and producing an "aha moment" where correct and wrong paths pull apart.
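That self-supervised labeling rule is simple enough to state as code: a sampled step becomes a positive training example only when continuing from it reproduces the known final answer. The function and data shapes below are hypothetical, meant only to show the filter.

```python
# Sketch of the self-supervised step labels: keep a step (label 1) only
# when the model's own continuation from that step matches the known
# final answer; otherwise treat it as noise (label 0).
# `predict_answer_from` is a hypothetical stand-in for rolling out the model.

def label_steps(steps, final_answer, predict_answer_from):
    """Return (step, label) pairs; no human-annotated traces needed."""
    labeled = []
    for step in steps:
        keep = predict_answer_from(step) == final_answer
        labeled.append((step, 1 if keep else 0))
    return labeled


# Toy run: only the step whose rollout reaches the answer 42 is kept.
toy_predict = lambda step: step["guess"]
steps = [{"text": "s1", "guess": 42}, {"text": "s2", "guess": 7}]
labels = label_steps(steps, 42, toy_predict)
# labels -> [({'text': 's1', 'guess': 42}, 1), ({'text': 's2', 'guess': 7}, 0)]
```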
During inference it samples 2, 8, or XX chains, keeps the highest score, and lifts AIME24 from XXXX% to XXXX% on a 1.5B model. The 32B version matches o3‑mini across math, coding, and Chinese tests, and even steers Monte Carlo Tree Search to 52.8%.
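The inference loop described above is best-of-N selection: sample several chains, let the built-in score head rate each, and return the top scorer. `sample_chain` and `score_chain` below are hypothetical stand-ins for the model's generation and its scoring head.

```python
# Sketch of best-of-N test-time scaling: draw n reasoning chains and
# keep the one the model's own score head rates highest.
import random


def best_of_n(prompt, n, sample_chain, score_chain):
    """Sample n chains for `prompt` and return the highest-scoring one."""
    chains = [sample_chain(prompt) for _ in range(n)]
    return max(chains, key=score_chain)


# Toy demo with a seeded RNG standing in for the sampler and scorer.
rng = random.Random(0)
sample = lambda prompt: {"text": prompt, "quality": rng.random()}
best = best_of_n("prove it", 8, sample, lambda c: c["quality"])
```

Raising n (2, 8, or more chains) trades extra sampling compute for a better chance that one chain scores well, which is the lever behind the AIME24 gains the post reports.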
Paper: arxiv.org/abs/2507.01951
Paper Title: "Test-Time Scaling with Reflective Generative Model"
XXXXX engagements
Post Link: https://x.com/rohanpaul_ai/status/1944699659823567216