LunarCrush LLM | post/tweet::1943174583227682908

[GUEST ACCESS MODE: Data is scrambled or limited to provide examples. Make requests using your API key to unlock full data. Check https://lunarcrush.ai/auth for authentication information.]

![rohanpaul_ai Avatar](https://lunarcrush.com/gi/w:24/cr:twitter::2588345408.png) Rohan Paul [@rohanpaul_ai](/creator/twitter/rohanpaul_ai) on x 73.6K followers
Created: 2025-07-10 05:04:57 UTC

Grok X (Thinking) clocks XXXX% on ARC-AGI-2, grabbing the new SOTA. That score is almost 2x the last commercial best and now tops the Kaggle leaderboard.

---

What does ARC-AGI-2 try to measure?

The benchmark contains a larger, freshly curated set of grid-based puzzles that cannot be memorized, forcing any model to invent a rule on the fly from a handful of examples, then apply that rule to a held-out test grid.
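To make "invent a rule from a handful of examples, then apply it to a held-out grid" concrete, here is a toy sketch in Python. It is not the ARC-AGI-2 harness or data format: the grids, the three candidate transformations, and the names `CANDIDATE_RULES` and `infer_rule` are all made up for illustration, and real tasks are far harder than anything a fixed candidate list could cover.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]  # a grid is rows of small integers ("colors")


def flip_horizontal(g: Grid) -> Grid:
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in g]


def flip_vertical(g: Grid) -> Grid:
    """Reverse the order of the rows."""
    return [list(row) for row in reversed(g)]


def transpose(g: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(col) for col in zip(*g)]


# Hypothetical hypothesis space; a real solver cannot rely on a fixed list.
CANDIDATE_RULES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}


def infer_rule(train_pairs: List[Tuple[Grid, Grid]]) -> Optional[str]:
    """Return the first candidate consistent with every example pair."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(inp) == out for inp, out in train_pairs):
            return name
    return None


# A handful of demonstration pairs (invented for this sketch).
train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[5, 5, 0], [0, 7, 7]], [[0, 5, 5], [7, 7, 0]]),
]
test_input = [[4, 0, 0], [0, 0, 9]]  # held-out grid

rule_name = infer_rule(train_pairs)
prediction = CANDIDATE_RULES[rule_name](test_input) if rule_name else None
print(rule_name, prediction)
```

The point of the benchmark is that the rule is different for every task, so a solver has to construct it from the examples rather than look it up, which is exactly where this toy enumeration stops working.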

Unlike ARC-AGI-1, the new version adds an explicit cost axis, so a model must prove both adaptability and efficiency instead of relying on brute-force search with huge compute budgets.
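One way to read a "cost axis" is that a result only counts as an improvement if no other system matches its score for less money. The sketch below shows that idea as a simple Pareto filter. It is an assumption about how one might compare entries, not the leaderboard's actual method, and every name and number in it (`Entry`, `pareto_frontier`, `system_a` through `system_d`) is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entry:
    name: str
    score_pct: float          # accuracy on the task set, in percent
    cost_per_task_usd: float  # average spend per task


def pareto_frontier(entries: List[Entry]) -> List[Entry]:
    """Keep entries that no other entry beats on both score and cost."""
    frontier = []
    for e in entries:
        dominated = any(
            o.score_pct >= e.score_pct
            and o.cost_per_task_usd <= e.cost_per_task_usd
            and (o.score_pct > e.score_pct
                 or o.cost_per_task_usd < e.cost_per_task_usd)
            for o in entries
        )
        if not dominated:
            frontier.append(e)
    return sorted(frontier, key=lambda e: e.cost_per_task_usd)


# Invented numbers, for illustration only.
entries = [
    Entry("system_a", 3.0, 0.10),
    Entry("system_b", 8.0, 2.00),
    Entry("system_c", 8.0, 150.00),  # same score as system_b, far more compute
    Entry("system_d", 5.0, 5.00),
]

frontier_names = [e.name for e in pareto_frontier(entries)]
print(frontier_names)
```

Under this reading, `system_c` gains nothing from its larger budget because `system_b` reaches the same score at a fraction of the cost.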

Grok 4’s jump to XXXX% is the first clear break out of the single-digit “wall,” more than doubling the previous commercial high and overtaking the top Kaggle competition entry.

That also suggests xAI has improved its test-time reasoning techniques rather than simply scaling parameters, because the cost per task appears to remain in the low-dollar range, as shown on the public leaderboard.

![](https://pbs.twimg.com/media/GveJwoxaUAAgFbk.jpg)

XXXXX engagements

![Engagements Line Chart](https://lunarcrush.com/gi/w:600/p:tweet::1943174583227682908/c:line.svg)

**Related Topics**
[puzzles](/topic/puzzles)
[curated](/topic/curated)

[Post Link](https://x.com/rohanpaul_ai/status/1943174583227682908)
