
@sayashk Sayash Kapoor

Sayash Kapoor posts on X about infrastructure, OpenAI, and related agent-evaluation topics. They currently have XXXXXX followers and XX posts still getting attention, totaling XXXXX engagements in the last XX hours.

Engagements: XXXXX

Mentions: X

Followers: XXXXXX

CreatorRank: XXXXXXXXX

Social Influence

Social category influence: technology brands XXXX%

Social topic influence: infrastructure #670, over the 5.56%, open ai 5.56%, stuck in 5.56%, loops 5.56%, token 5.56%, most expensive XXXX%

Top accounts mentioned or mentioned by: @randomwalker @random_walker @rishibommasani @princetoncitp @bifarinthefifth @davidjalba94 @ofirpress @halevals @benediktstroebl @pkirgis @nityndg @siegelz @weiboyi @xuetianci @ronziruchen @flxcn @saitejautpala @ndzfs @khl53182440 @botaoyu24

Top Social Posts


Top posts by engagements in the last XX hours

"Conducting rigorous evaluations required developing infrastructure to handle logging scaffold and benchmark support orchestration across VMs and integration with other tools in the ecosystem like Docent and Weave. We plan to conduct many more rigorous agent evaluations over the next year and continue sharing insights from our analysis. Follow @halevals for updates on HAL. I'm grateful to have a fantastic team in place working on HAL: @random_walker @benediktstroebl @PKirgis @nityndg @siegelz_ @wei_boyi @xue_tianci @RonZiruChen @flxcn @SaitejaUtpala @ndzfs Dheeraj Oruganty Sophie Luskin"
X Link @sayashk 2025-10-15T21:11Z 10.8K followers, 1460 engagements

"Good question. OpenAI still supports reasoning models via the completions API. While the intermediate reasoning tokens are discarded after each call we can still set reasoning efforts for each individual call. Since we were comparing models in a standardized way (with the same API format) we opted to continue using OpenAI reasoning models with the completions API with this release. But we are in the process of using the responses API and comparing results against the completions API"
X Link @sayashk 2025-10-15T21:41Z 10.8K followers, 1329 engagements
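
The distinction the post draws, between the Chat Completions API (which accepts a per-call reasoning-effort setting but discards intermediate reasoning tokens) and the newer Responses API, looks roughly like this in the OpenAI Python SDK. This is a minimal sketch; the model name and prompt are illustrative placeholders, not the HAL configuration:

```python
# Minimal sketch of the two call styles discussed above, using the OpenAI
# Python SDK. Model name and prompt are placeholders, not HAL's setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Chat Completions API: reasoning effort is a per-call parameter, and the
# intermediate reasoning tokens are discarded after each call.
completion = client.chat.completions.create(
    model="o4-mini",                 # hypothetical choice of reasoning model
    reasoning_effort="high",         # "low" | "medium" | "high"
    messages=[{"role": "user", "content": "Plan the next agent step."}],
)
print(completion.choices[0].message.content)

# Responses API: the newer interface, where reasoning is configured as an
# object and reasoning state can be carried across calls.
response = client.responses.create(
    model="o4-mini",
    reasoning={"effort": "high"},
    input="Plan the next agent step.",
)
print(response.output_text)
```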

"Were still looking into it but have a few hypotheses. For example on some web benchmarks (e.g. AssistantBench) more capable models get stuck in loops trying to solve CAPTCHAs whereas less capable ones just open another tab and try again when they encounter one. But we're still verifying if this leads to the difference (+ testing similar hypotheses across benchmarks to see what explains the regressions). We typically evaluate each scaffold on each benchmark just once so I expect some differences are just because of the single-sample accuracy (we're working on fixing this). While not directly"
X Link @sayashk 2025-10-17T00:52Z 10.8K followers, XX engagements
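
The single-sample concern in this post can be made concrete: accuracy from one run is a binomial estimate, so its error bars are wide. A minimal sketch with hypothetical numbers (not figures from the HAL paper):

```python
# Why single-run accuracy is noisy: a benchmark score is a binomial estimate,
# so its standard error shrinks only with more tasks or more runs.
# All numbers below are hypothetical, not taken from the HAL paper.
import math

def accuracy_std_error(p: float, n: int) -> float:
    """Standard error of an observed accuracy p over n independent trials."""
    return math.sqrt(p * (1 - p) / n)

# A 60%-accurate agent on a 100-task benchmark, evaluated once:
se_single = accuracy_std_error(0.60, 100)
print(f"single run: 60% +/- {1.96 * se_single:.1%} (95% CI)")  # ~ +/- 9.6%

# Averaging 5 runs multiplies the trial count (treating runs as independent,
# which ignores per-task correlation, so this is an optimistic bound):
se_multi = accuracy_std_error(0.60, 100 * 5)
print(f"5-run mean: 60% +/- {1.96 * se_multi:.1%} (95% CI)")   # ~ +/- 4.3%
```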

"📣New paper: Rigorous AI agent evaluation is much harder than it seems. For the last year we have been working on infrastructure for fair agent evaluations on challenging benchmarks. Today we release a paper that condenses our insights from 20000+ agent rollouts on X challenging benchmarks spanning web coding science and customer service tasks. Our key insight: Benchmark accuracy hides many important details. Take claims of agents' accuracy with a huge grain of salt. 🧵"
X Link @sayashk 2025-10-15T20:54Z 10.8K followers, 65.2K engagements

"@TransluceAI 5) The most token-efficient models are not the cheapest. On comparisons of token cost vs. accuracy Opus XXX is on the Pareto frontier for X benchmarks. This matters because providers change model prices frequently (for example o3's price dropped by XX% soon after launch)"
X Link @sayashk 2025-10-15T21:03Z 10.8K followers, 1167 engagements

"4) We analyzed the tradeoffs between cost vs. accuracy. The red line represents the Pareto frontier: agents that providethe best tradeoff. Surprisingly the most expensive model (Opus 4.1) tops the leaderboard only once. The models most often on the Pareto frontier are Gemini Flash (7/9 benchmarks) GPT-5 and o4-mini (4/9 benchmarks)"
X Link @sayashk 2025-10-15T21:02Z 10.8K followers, 1266 engagements
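
The Pareto frontier these last two posts describe (agents for which no alternative is both cheaper and at least as accurate) is simple to compute. A minimal sketch over hypothetical (cost, accuracy) points, not the paper's data:

```python
# Compute a cost-vs-accuracy Pareto frontier: keep every agent for which no
# other agent is both cheaper and at least as accurate. Data is hypothetical.
from typing import List, Tuple

def pareto_frontier(points: List[Tuple[str, float, float]]) -> List[Tuple[str, float, float]]:
    """points: (name, cost_usd, accuracy). Returns the frontier sorted by cost."""
    frontier = []
    best_accuracy = float("-inf")
    # Sort by ascending cost; on ties, higher accuracy first so it dominates.
    for name, cost, acc in sorted(points, key=lambda p: (p[1], -p[2])):
        if acc > best_accuracy:  # strictly better than anything cheaper
            frontier.append((name, cost, acc))
            best_accuracy = acc
    return frontier

# Hypothetical agents, not results from the HAL paper:
agents = [
    ("model-a", 1.20, 0.42),
    ("model-b", 4.80, 0.55),
    ("model-c", 9.50, 0.54),   # dominated: costlier than model-b, less accurate
    ("model-d", 15.0, 0.61),
]
for name, cost, acc in pareto_frontier(agents):
    print(f"{name}: ${cost:.2f} -> {acc:.0%}")
```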