Lisan al Gaib (@scaling01) on X, 17.5K followers
Created: 2025-05-05 13:50:54 UTC
I'm back and Gemini XXX Pro is still the king (no glaze)
I did some more manual data cleaning, scrapped the shitty "average scaled score", and replaced it with the Glicko-2 rating system, using these parameters:
INITIAL_RATING = 1500
INITIAL_RD = XXX
INITIAL_VOL = XXXX
TAU (τ) = XXX
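For readers unfamiliar with Glicko-2, here is a minimal sketch of a single rating-period update following Glickman's published algorithm. It is not the post author's code. Apart from INITIAL_RATING = 1500, the parameter values below are commonly suggested Glicko-2 defaults used as stand-ins, because the post's actual values are scrambled in the source. How benchmark results are turned into "games" between models is also not stated in the post; the `results` list of (opponent rating, opponent RD, score) tuples is an assumption.

```python
import math

# Only INITIAL_RATING comes from the post. The other three are stand-in
# values (commonly suggested Glicko-2 defaults), NOT the post's parameters.
INITIAL_RATING = 1500.0
INITIAL_RD = 350.0      # assumption
INITIAL_VOL = 0.06      # assumption
TAU = 0.5               # assumption
SCALE = 173.7178        # conversion factor between Glicko and Glicko-2 scales


def g(phi):
    """Down-weights an opponent's influence according to their deviation."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)


def expected(mu, mu_j, phi_j):
    """Expected score against an opponent with rating mu_j, deviation phi_j."""
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))


def update(rating, rd, vol, results, tau=TAU, eps=1e-6):
    """One rating-period update.

    results: list of (opp_rating, opp_rd, score) with score in [0, 1].
    Returns (new_rating, new_rd, new_vol) on the original rating scale.
    """
    mu, phi = (rating - 1500.0) / SCALE, rd / SCALE
    if not results:
        # No games this period: only the deviation grows.
        return rating, math.sqrt(phi ** 2 + vol ** 2) * SCALE, vol

    v_inv = 0.0
    score_sum = 0.0
    for opp_rating, opp_rd, score in results:
        mu_j, phi_j = (opp_rating - 1500.0) / SCALE, opp_rd / SCALE
        e = expected(mu, mu_j, phi_j)
        v_inv += g(phi_j) ** 2 * e * (1.0 - e)
        score_sum += g(phi_j) * (score - e)
    v = 1.0 / v_inv
    delta = v * score_sum

    # Solve for the new volatility with the Illinois root-finding variant.
    a = math.log(vol ** 2)

    def f(x):
        ex = math.exp(x)
        num = ex * (delta ** 2 - phi ** 2 - v - ex)
        return num / (2.0 * (phi ** 2 + v + ex) ** 2) - (x - a) / tau ** 2

    A = a
    if delta ** 2 > phi ** 2 + v:
        B = math.log(delta ** 2 - phi ** 2 - v)
    else:
        k = 1
        while f(a - k * tau) < 0:
            k += 1
        B = a - k * tau
    fA, fB = f(A), f(B)
    while abs(B - A) > eps:
        C = A + (A - B) * fA / (fB - fA)
        fC = f(C)
        if fC * fB <= 0:
            A, fA = B, fB
        else:
            fA /= 2.0
        B, fB = C, fC
    new_vol = math.exp(A / 2.0)

    # Combine the pre-period deviation with the information from the games.
    phi_star = math.sqrt(phi ** 2 + new_vol ** 2)
    new_phi = 1.0 / math.sqrt(1.0 / phi_star ** 2 + 1.0 / v)
    new_mu = mu + new_phi ** 2 * score_sum
    return new_mu * SCALE + 1500.0, new_phi * SCALE, new_vol
```

A smaller TAU constrains how much the volatility can change between periods, which is why it is usually tuned alongside the initial RD and volatility.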
Furthermore, I increased the minimum number of appearances from X to XX benchmarks to make the ratings more stable.
The labels show the lower XX% ratings (a conservative lower skill estimate) and in brackets the number of benchmarks the model appeared in. Below this post I attached the full table with mu, sigma, lower XX% ratings and number of appearances.
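As a rough illustration of how such labels could be produced, the sketch below computes a one-sided lower bound on each rating under a normal approximation and drops models with too few benchmark appearances. The confidence level, the minimum-appearance threshold, the label format, and the model names are all hypothetical, since the real values are scrambled in the source; whether the author used a one-sided or two-sided bound is also not stated.

```python
from statistics import NormalDist


def lower_rating(mu, sigma, confidence=0.95):
    """One-sided lower bound on the rating, assuming a normal approximation.

    `confidence` is a placeholder; the post's percentage is not recoverable.
    """
    z = NormalDist().inv_cdf(confidence)
    return mu - z * sigma


def make_labels(table, min_appearances, confidence=0.95):
    """Build 'name: lower_rating (appearances)' labels, best model first.

    table: iterable of (name, mu, sigma, appearances) rows.
    Rows with fewer than `min_appearances` appearances are dropped.
    """
    rows = [
        (name, lower_rating(mu, sigma, confidence), n)
        for name, mu, sigma, n in table
        if n >= min_appearances
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return [f"{name}: {low:.0f} ({n})" for name, low, n in rows]


# Hypothetical rows for illustration only, not data from the post.
example = [
    ("model_a", 1600.0, 50.0, 12),
    ("model_b", 1700.0, 200.0, 3),
    ("model_c", 1550.0, 30.0, 15),
]
labels = make_labels(example, min_appearances=10)
```

Ranking by the lower bound rather than by mu penalizes models whose rating is still uncertain, which is the "conservative" part of the estimate.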
XXXXXX engagements
Post link: https://x.com/scaling01/status/1919389344617414824