Rohan Paul (@rohanpaul_ai) on X, 73.6K followers
Created: 2025-07-10 05:04:57 UTC
Grok X (Thinking) clocks XXXX% on ARC-AGI-2, grabbing the new SOTA. That score is almost 2x the last commercial best and now tops the Kaggle leaderboard.
What does ARC-AGI-2 try to measure?
The benchmark contains a larger, freshly curated set of grid-based puzzles that cannot be memorized, forcing any model to invent a rule on the fly from a handful of examples, then apply that rule to a held-out test grid.
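To make the "invent a rule from a handful of examples, then apply it" loop concrete, here is a minimal sketch in Python. It is a toy, not how ARC-AGI-2 is scored or how any model solves it: the tiny rule set in `CANDIDATE_RULES` and the helper names `infer_rule` and `solve` are assumptions made up for illustration, and real puzzles need far richer rules than flips and rotations.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]

# Hypothetical, deliberately tiny rule set; real ARC tasks are far more varied.
def identity(g: Grid) -> Grid:
    return [row[:] for row in g]

def flip_h(g: Grid) -> Grid:
    # Mirror each row left to right.
    return [row[::-1] for row in g]

def flip_v(g: Grid) -> Grid:
    # Mirror the grid top to bottom.
    return [row[:] for row in g[::-1]]

def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]

def rotate_cw(g: Grid) -> Grid:
    # Rotate 90 degrees clockwise.
    return [list(col) for col in zip(*g[::-1])]

CANDIDATE_RULES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "flip_h": flip_h,
    "flip_v": flip_v,
    "transpose": transpose,
    "rotate_cw": rotate_cw,
}

def infer_rule(train_pairs: List[Tuple[Grid, Grid]]) -> Optional[str]:
    """Return the first candidate rule consistent with every training pair."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(inp) == out for inp, out in train_pairs):
            return name
    return None

def solve(train_pairs: List[Tuple[Grid, Grid]], test_input: Grid) -> Optional[Grid]:
    """Infer a rule from the examples, then apply it to the held-out grid."""
    name = infer_rule(train_pairs)
    return None if name is None else CANDIDATE_RULES[name](test_input)

# Two demonstration pairs that both follow a left-right mirror.
train = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[1, 2, 3]], [[3, 2, 1]]),
]
prediction = solve(train, [[4, 5], [6, 7]])
```

The point of the sketch is the shape of the problem: nothing about the test grid is known in advance, so the solver has to commit to a rule using only the few pairs it is shown.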
Unlike ARC-AGI-1, the new version adds an explicit cost axis, so a model must prove both adaptability and efficiency instead of relying on brute-force search with huge compute budgets.
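A rough way to picture the cost axis is to report two numbers per system instead of one: the share of tasks solved and the average spend per task. The sketch below assumes a hypothetical `TaskResult` record and a `summarize` helper, and the dollar figures are invented purely for illustration; they are not taken from the leaderboard.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TaskResult:
    solved: bool      # did the system produce the correct test grid?
    cost_usd: float   # hypothetical compute/API spend for this task

def summarize(results: List[TaskResult]) -> Tuple[float, float]:
    """Return (score in percent, mean cost per task in USD)."""
    if not results:
        raise ValueError("need at least one task result")
    n = len(results)
    score_pct = 100.0 * sum(r.solved for r in results) / n
    cost_per_task = sum(r.cost_usd for r in results) / n
    return score_pct, cost_per_task

# Made-up results for four tasks, for illustration only.
results = [
    TaskResult(solved=True, cost_usd=2.0),
    TaskResult(solved=False, cost_usd=1.0),
    TaskResult(solved=False, cost_usd=3.0),
    TaskResult(solved=True, cost_usd=2.0),
]
score_pct, cost_per_task = summarize(results)
```

Plotting score against cost per task makes the comparison explicit: two systems with the same score are not equivalent if one of them spends far more per puzzle to get there.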
Grok 4’s jump to XXXX% is the first clear break through the single-digit “wall,” more than doubling the previous commercial high and overtaking the top Kaggle competition entry.
That also suggests xAI has improved its test-time reasoning techniques rather than simply scaling parameters, because the cost per task appears to stay in the low-dollar range, as shown on the public leaderboard.
XXXXX engagements
Post Link: https://x.com/rohanpaul_ai/status/1943174583227682908