@sayashk
"Conducting rigorous evaluations required developing infrastructure to handle logging scaffold and benchmark support orchestration across VMs and integration with other tools in the ecosystem like Docent and Weave. We plan to conduct many more rigorous agent evaluations over the next year and continue sharing insights from our analysis. Follow @halevals for updates on HAL. I'm grateful to have a fantastic team in place working on HAL: @random_walker @benediktstroebl @PKirgis @nityndg @siegelz_ @wei_boyi @xue_tianci @RonZiruChen @flxcn @SaitejaUtpala @ndzfs Dheeraj Oruganty Sophie Luskin"
X Link @sayashk 2025-10-15T21:11Z 10.7K followers, XXX engagements
"Good question. OpenAI still supports reasoning models via the completions API. While the intermediate reasoning tokens are discarded after each call we can still set reasoning efforts for each individual call. Since we were comparing models in a standardized way (with the same API format) we opted to continue using OpenAI reasoning models with the completions API with this release. But we are in the process of using the responses API and comparing results against the completions API"
X Link @sayashk 2025-10-15T21:41Z 10.7K followers, XXX engagements
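The reply above contrasts the two OpenAI interfaces. As a minimal sketch of what it describes (the model name and prompt are placeholders, not taken from HAL), setting reasoning effort per call looks like this in the OpenAI Python SDK:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Chat Completions API: reasoning effort is set per call via `reasoning_effort`;
# intermediate reasoning tokens are not returned or carried across calls.
completion = client.chat.completions.create(
    model="o4-mini",               # placeholder choice of reasoning model
    reasoning_effort="high",       # "low" | "medium" | "high"
    messages=[{"role": "user", "content": "Plan the next step for this task."}],
)
print(completion.choices[0].message.content)

# Responses API: the same control is expressed as a `reasoning` object,
# and reasoning state can be carried across turns via previous_response_id.
response = client.responses.create(
    model="o4-mini",
    reasoning={"effort": "high"},
    input="Plan the next step for this task.",
)
print(response.output_text)
```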
"📣New paper: Rigorous AI agent evaluation is much harder than it seems. For the last year we have been working on infrastructure for fair agent evaluations on challenging benchmarks. Today we release a paper that condenses our insights from 20000+ agent rollouts on X challenging benchmarks spanning web coding science and customer service tasks. Our key insight: Benchmark accuracy hides many important details. Take claims of agents' accuracy with a huge grain of salt. 🧵"
X Link @sayashk 2025-10-15T20:54Z 10.7K followers, 25.9K engagements
"@TransluceAI 5) The most token-efficient models are not the cheapest. On comparisons of token cost vs. accuracy Opus XXX is on the Pareto frontier for X benchmarks. This matters because providers change model prices frequently (for example o3's price dropped by XX% soon after launch)"
X Link @sayashk 2025-10-15T21:03Z 10.7K followers, XXX engagements
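The point of the post above is that dollar cost moves with provider pricing while logged token counts do not. A small illustrative sketch (all token counts and per-million-token prices below are invented, not figures from the paper) shows how cost can be re-derived from token logs whenever prices change:

```python
# Illustrative only: if each rollout logs its token usage, dollar cost can be
# recomputed under any pricing, so rankings can be refreshed after price changes.

def dollar_cost(input_tokens: int, output_tokens: int,
                price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one rollout given per-million-token prices."""
    return input_tokens / 1e6 * price_in_per_m + output_tokens / 1e6 * price_out_per_m

rollout = {"input_tokens": 850_000, "output_tokens": 120_000}  # hypothetical usage

old_cost = dollar_cost(**rollout, price_in_per_m=10.0, price_out_per_m=40.0)
new_cost = dollar_cost(**rollout, price_in_per_m=2.0, price_out_per_m=8.0)  # after a price cut

print(f"same rollout, old pricing: ${old_cost:.2f}, new pricing: ${new_cost:.2f}")
```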
"4) We analyzed the tradeoffs between cost vs. accuracy. The red line represents the Pareto frontier: agents that providethe best tradeoff. Surprisingly the most expensive model (Opus 4.1) tops the leaderboard only once. The models most often on the Pareto frontier are Gemini Flash (7/9 benchmarks) GPT-5 and o4-mini (4/9 benchmarks)"
X Link @sayashk 2025-10-15T21:02Z 10.7K followers, XXX engagements
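The Pareto-frontier comparison in the two posts above can be made concrete with a short sketch. The agent names, costs, and accuracies are invented placeholders, not HAL results; the dominance rule is the standard one for a cost-accuracy plot.

```python
# Minimal sketch: identify agents on the cost-accuracy Pareto frontier.

def pareto_frontier(results: dict[str, tuple[float, float]]) -> list[str]:
    """Return agents not dominated by any other agent.

    An agent is dominated if some other agent is at least as cheap and at
    least as accurate, and differs from it on at least one of the two.
    """
    frontier = []
    for name, (cost, acc) in results.items():
        dominated = any(
            o_cost <= cost and o_acc >= acc and (o_cost, o_acc) != (cost, acc)
            for other, (o_cost, o_acc) in results.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

results = {  # (total cost in $, accuracy) -- illustrative values only
    "agent-a": (120.0, 0.62),
    "agent-b": (45.0, 0.58),
    "agent-c": (300.0, 0.61),  # dominated: costs more than agent-a, less accurate
    "agent-d": (45.0, 0.40),   # dominated by agent-b
}
print(pareto_frontier(results))  # ['agent-a', 'agent-b']
```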