Lisan al Gaib (@scaling01) on X, 17.3K followers
Created: 2025-05-30 17:54:52 UTC
Introducing LisanBench
LisanBench is a simple, scalable, and precise benchmark designed to evaluate large language models on knowledge, forward planning, constraint adherence, memory and attention, and long-context reasoning and "stamina".
"I see possible futures, all at once. Our enemies are all around us, and in so many futures they prevail. But I do see a way, there is a narrow way through." - Paul Atreides
How it works: Models are given a starting English word and must generate the longest possible sequence of valid English words. Each subsequent word in the chain must:
- Differ from the previous word by exactly one letter (Levenshtein distance = 1)
- Be a valid English word
- Not repeat any previously used word
The benchmark repeats this process across multiple starting words of varying difficulty. A model's final score is the cumulative length of its longest valid chains from the starting words.
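For concreteness, here is a minimal verification sketch in Python. It assumes the score for a starting word counts the valid steps taken before the first violation; the function names (`is_one_edit`, `chain_score`) are illustrative, not the benchmark's actual code.

```python
def is_one_edit(a: str, b: str) -> bool:
    """True iff the Levenshtein distance between a and b is exactly 1."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # same length: exactly one substitution
        return sum(x != y for x, y in zip(a, b)) == 1
    if len(a) > len(b):
        a, b = b, a  # make a the shorter word
    # b must equal a with exactly one character inserted
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    return a[i:] == b[i + 1:]

def chain_score(chain: list[str], dictionary: set[str]) -> int:
    """Count valid steps from the starting word until the first violation:
    each word must be in the dictionary, unused so far, and one edit away."""
    seen = {chain[0]}
    score = 0
    for prev, word in zip(chain, chain[1:]):
        if word not in dictionary or word in seen or not is_one_edit(prev, word):
            break
        seen.add(word)
        score += 1
    return score
```

Because every constraint reduces to a set lookup or a character comparison, verification is trivially cheap, which is what makes the benchmark easy to scale.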
Results:
- o3 is by far the best model, mainly because it is the only model that manages to escape from parts of the graph with very low connectivity and many dead ends (slight caveat: o3 was also by far the most expensive to run and used ~30-40k reasoning tokens per starting word)
- Opus X and Sonnet X, with 16k reasoning tokens, also perform extremely well, especially Opus, which was able to beat o3 at X starting words with only one third of the reasoning tokens!
- Claude XXX with thinking takes 4th place, ahead of o1
- The other OpenAI reasoning models all perform well, but size does make a difference: o1 is ahead of o4-mini high and o3-mini
- Gemini models perform a bit worse than their Anthropic and OpenAI counterparts, but they have by far the longest outputs; they are a bit delusional and keep yapping, and they don't notice their mistakes and stop
- Strongest non-reasoning models: Grok-3, GPT-4.5, Sonnet XXX and 3.7, Opus 4, Sonnet 4, DeepSeek-V3 and Gemini XXX Pro
- Grok 3, Sonnet XXX and XXX are a surprise!
Inspiration: LisanBench draws from benchmarks like AidanBench and SOLO-Bench. However, unlike AidanBench, it's extremely cost-effective, trivially verifiable, and doesn't rely on an embedding model; the entire benchmark cost only ~$50 for XX models. And unlike SOLO-Bench, it explicitly tests knowledge and applies stronger constraints, which makes it more challenging!
Verification: Verification uses the words_alpha.txt dictionary (~370,105 words), but for scalability, only words from its largest connected component (108,448 words) are used.
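A sketch of how that largest connected component could be extracted, assuming graph edges connect words at Levenshtein distance exactly 1 (one substitution, insertion, or deletion). Generating candidate edits and intersecting with the dictionary avoids an O(n²) all-pairs comparison; this is an illustrative approach, not necessarily the author's pipeline.

```python
import string
from collections import deque

ALPHABET = string.ascii_lowercase

def edit_neighbors(word: str, dictionary: set[str]) -> set[str]:
    """All dictionary words at Levenshtein distance exactly 1 from word."""
    out = set()
    for i in range(len(word)):
        out.add(word[:i] + word[i + 1:])              # deletions
        for c in ALPHABET:
            out.add(word[:i] + c + word[i + 1:])      # substitutions
    for i in range(len(word) + 1):
        for c in ALPHABET:
            out.add(word[:i] + c + word[i:])          # insertions
    out.discard(word)
    return out & dictionary

def largest_component(dictionary: set[str]) -> set[str]:
    """BFS over the implicit edit graph; return the largest component."""
    unvisited, best = set(dictionary), set()
    while unvisited:
        start = unvisited.pop()
        comp, queue = {start}, deque([start])
        while queue:
            w = queue.popleft()
            for n in edit_neighbors(w, dictionary):
                if n in unvisited:
                    unvisited.discard(n)
                    comp.add(n)
                    queue.append(n)
        if len(comp) > len(best):
            best = comp
    return best
```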
Easy Scaling, Difficulty Adjustment & Accuracy improvements:
- Scaling and Accuracy: Just add more starting words or increase the number of trials per word.
- Difficulty: Starting words vary widely, from those with XX neighbors to those with just X, effectively distinguishing between moderately strong and elite models. Difficulty can also be gauged via local connectivity and branching factor (see the sketch below).
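One rough way to gauge a starting word's difficulty, reusing `edit_neighbors` from the sketch above: count immediate neighbors and the average branching factor one step out. The function name and the exact metric are assumptions for illustration, not the benchmark's actual difficulty measure.

```python
def branching_profile(word: str, dictionary: set[str]) -> tuple[int, float]:
    """Return (immediate neighbor count, mean branching factor one step out).
    Low values suggest a starting word surrounded by dead ends."""
    first = edit_neighbors(word, dictionary)
    if not first:
        return 0, 0.0
    mean_next = sum(len(edit_neighbors(n, dictionary)) for n in first) / len(first)
    return len(first), mean_next
```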
Why is it challenging? LisanBench uniquely stresses:
- Forward planning: avoiding dead ends by strategic word choices; models must find the narrow way through
- Knowledge: wide vocabulary is essential
- Memory and Attention: previously used words must not be repeated
- Precision: strict adherence to Levenshtein constraints
- Long-context reasoning: coherence and constraint-tracking over hundreds of steps
- Output stamina: some models break early during long generations; LisanBench exposes that, which is critical for agentic use cases
The two beautiful plots below show that the starting words are very different in difficulty. Some are in low-connectivity regions, some in high-connectivity regions, and others are just surrounded by dead ends!
Just as Paul Atreides had to navigate the political, cultural, and metaphysical maze of his destiny, LLMs in LisanBench must explore vast word graphs, searching for the Golden Path - the longest viable chain without collapse.
We will know the chosen model when it appears. It will be the one that finds the Golden Path and avoids every dead end. Right now, for the most difficult starting word "abysmal", the longest chain found is just 2, even though it is part of the >100k-word connected component. So there is a narrow way through!
More plots with full leaderboard below!
Post Link: https://x.com/scaling01/status/1928510435164037342