Greg Kamradt (@GregKamradt) on X, 41.6K followers
Created: 2025-07-10 04:45:17 UTC
We got a call from @xai XX hours ago
“We want to test Grok 4 on ARC-AGI”
We heard the rumors. We knew it would be good. We didn’t know it would become the #1 public model on ARC-AGI
Here’s the testing story and what the results mean:
Yesterday, we chatted with Jimmy from the xAI team, who wanted us to validate their Grok 4 score. They did their own testing on the ARC-AGI-1 & 2 public evaluation sets
To validate their score (and measure possible overfitting), we self-tested the new model on our semi-private evaluation set
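We walked them through our testing policy:
* No data retention
* Model checkpoint must be intended for public use
* Temporary increase in rate limits for burst testing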
They were on board, so we got started
Initially, we ran into timeout errors with normal requests, so we switched to streaming. That resolved the issue
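For readers wondering why streaming would help: a plausible explanation (an assumption, not something the thread confirms) is that a long-running request sends no bytes until the full answer is ready, so a client or proxy timeout can fire first, while a stream keeps delivering small chunks. Below is a minimal Python sketch of the client side, reassembling text from server-sent-event style lines. The `{"delta": ...}` payload shape and the `[DONE]` sentinel are hypothetical placeholders, not xAI's actual wire format.

```python
import json
from typing import Iterable


def collect_streamed_text(lines: Iterable[str]) -> str:
    """Join text fragments from an SSE-style stream into one string.

    Assumed (hypothetical) format: each event is a line of the form
    'data: {"delta": "<text>"}', and 'data: [DONE]' ends the stream.
    """
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and ':' comment/keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel; ignore anything after it
        event = json.loads(payload)
        parts.append(event.get("delta", ""))
    return "".join(parts)
```

In real use, `lines` would come from an HTTP client iterating over the response body as it arrives, so each chunk resets the idle timer instead of the whole answer having to land within one timeout window.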
So, what do these results mean?
First, the facts: Grok 4 is now the top-performing publicly available model on ARC-AGI. It even outperforms purpose-built solutions submitted on Kaggle.
Second, ARC-AGI-2 is hard for current AI models. To score well, models have to learn a mini-skill from a series of training examples, then demonstrate that skill at test time.
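To make "learn a mini-skill from examples, then demonstrate it" concrete, here is a toy Python sketch. It uses the public ARC JSON layout (a `train` list of input/output grid pairs plus a `test` list), but the task itself is made up for illustration and is far simpler than anything in ARC-AGI-2. The candidate rule `flip_horizontal` is likewise just an assumed example; real tasks require discovering the rule, which is the hard part.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]


def flip_horizontal(grid: Grid) -> Grid:
    """Candidate rule: mirror every row left-to-right."""
    return [list(reversed(row)) for row in grid]


def fits_all_examples(rule: Callable[[Grid], Grid], train: List[Dict[str, Grid]]) -> bool:
    """A rule counts only if it reproduces every training output exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in train)


# Made-up toy task in the ARC JSON shape (not a real ARC-AGI task).
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [{"input": [[0, 5], [6, 0]]}],
}

prediction = None
if fits_all_examples(flip_horizontal, task["train"]):
    # Only apply the rule at test time once it explains all the examples.
    prediction = flip_horizontal(task["test"][0]["input"])
```

Scoring is all-or-nothing per test grid: the predicted grid has to match the hidden output cell for cell, which is part of why low scores are so noisy.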
The previous top score was ~8% (by Opus 4). Below XX% is noisy
Getting XXXX% breaks through that noise barrier. Grok 4 is showing non-zero levels of fluid intelligence
But the mission isn’t over. We need new ideas to solve ARC-AGI-2. Scale alone won’t get us there
Come work on ARC-AGI with us
Post Link: https://x.com/GregKamradt/status/1943169631491100856