# @ysu_ChatData Yongrui Su

Yongrui Su posts on X most often about the topics this is, real world, loops, and inference. They currently have [-----] followers and [---] posts still getting attention, totaling [-----] engagements in the last [--] hours.

### Engagements: [-----] [#](/creator/twitter::1725301877917765632/interactions)

- [--] Week [------] +85%
- [--] Month [------] +92,356%
- [--] Months [------] +9,424%
- [--] Year [------] -97%

### Mentions: [--] [#](/creator/twitter::1725301877917765632/posts_active)

- [--] Week [---] +119%
- [--] Month [---] +12,200%
- [--] Months [---] +929%
- [--] Year [---] +155%

### Followers: [-----] [#](/creator/twitter::1725301877917765632/followers)

- [--] Week [-----] +1.60%
- [--] Month [-----] +23%
- [--] Months [-----] +195%
- [--] Year [-----] +214%

### CreatorRank: [-------] [#](/creator/twitter::1725301877917765632/influencer_rank)

### Social Influence

**Social category influence** [technology brands](/list/technology-brands) 8.25%, [finance](/list/finance) 4.12%, [exchanges](/list/exchanges) 0.52%, [travel destinations](/list/travel-destinations) 0.52%

**Social topic influence** [this is](/topic/this-is) 5.15%, [real world](/topic/real-world) #463, [loops](/topic/loops) #363, [inference](/topic/inference) #178, [strong](/topic/strong) 4.12%, [model](/topic/model) #1656, [if you](/topic/if-you) 4.12%, [level](/topic/level) 3.61%, [token](/topic/token) #2642, [llm](/topic/llm) #1033

**Top accounts mentioned or mentioned by** [@openclaw](/creator/undefined) [@langchain](/creator/undefined) [@ryancarson](/creator/undefined) [@steipete](/creator/undefined) [@datachaz](/creator/undefined) [@mattpocockuk](/creator/undefined) [@drcintas](/creator/undefined) [@akshaypachaar](/creator/undefined) [@deedydas](/creator/undefined) [@omarsar0](/creator/undefined) [@simplifyinai](/creator/undefined) [@pvergadia](/creator/undefined) [@zaiorg](/creator/undefined) [@weaviateio](/creator/undefined) [@pamelafox](/creator/undefined)
[@mdancho84](/creator/undefined) [@minimaxai](/creator/undefined) [@cloud2water](/creator/undefined) [@huggingpapers](/creator/undefined) [@jdekoninck](/creator/undefined)

### Top Social Posts

Top posts by engagements in the last [--] hours

"@mattpocockuk This is the best pitch for TDD with an agent I have seen. If the tests assert behavior the model cannot just mirror the implementation and call it done. Do you have a short checklist for when to write a unit test versus an integration test in this flow" [X Link](https://x.com/ysu_ChatData/status/2022265397504885088) 2026-02-13T11:03Z [----] followers, [---] engagements

"Agree with the shift from prompt tweaks to execution design. The best skills feel like small products: clear triggers a runbook logging and failure handling. The key is evaluation. How do you test trigger accuracy and regressions over time without overfitting to a narrow set of demos https://twitter.com/i/web/status/2022704185662410822" [X Link](https://x.com/ysu_ChatData/status/2022704185662410822) 2026-02-14T16:07Z [----] followers, [----] engagements

"@j_dekoninck Very cool to see a 4B proof writing model get competitive. How are you verifying proofs during training and eval is it mostly formal verification unit tests or human grading" [X Link](https://x.com/ysu_ChatData/status/2023006273403580840) 2026-02-15T12:07Z [----] followers, [---] engagements

"@ctatedev This is the bigger unlock: model output plus a UI contract. If the renderer is deterministic and versioned you can review a PR for the UI state not just read text. Curious how you handle accessibility and sandboxing for interactive 3D replies" [X Link](https://x.com/ysu_ChatData/status/2022521694364352733) 2026-02-14T04:02Z [----] followers, [---] engagements

"@adocomplete The shortcut is dangerously good. I end up using it like a tiny makefile: git status pytest rg etc. Curious why they removed nested Claude though.
Was it causing runaway loops or just too many people bricking their context" [X Link](https://x.com/ysu_ChatData/status/2022582519322087714) 2026-02-14T08:03Z [----] followers, [---] engagements

"Very cool release. Getting voice cloning into 3GB VRAM is a big deal for on device apps. Curious what the quality tradeoffs are at lower VRAM and whether you ship a reference streaming inference pipeline. Also any guidance on data curation and speaker consent if people train new languages from scratch https://twitter.com/i/web/status/2022989711376261372" [X Link](https://x.com/ysu_ChatData/status/2022989711376261372) 2026-02-15T11:01Z [----] followers, [---] engagements

"@rammcodes @ollama That is slick. Having one local runtime that can spin up Claude Code or Codex without extra setup is the kind of UX that wins. Curious if the new command supports per project model settings and sandboxing so tools cannot leak tokens or files across repos" [X Link](https://x.com/ysu_ChatData/status/2023035100657790997) 2026-02-15T14:02Z [----] followers, [--] engagements

"Agree. Those middle layers are where governance lives: identity permissions approvals audit logs and rollback. If a system can safely take actions on top of the system of record that becomes the interface. The system of record still wins if it owns the workflow surface not just storage. https://twitter.com/i/web/status/2019699181670256695" [X Link](https://x.com/ysu_ChatData/status/2019699181670256695) 2026-02-06T09:06Z [---] followers, [---] engagements

"Yeah the 'bubble vs underbuilt' framing resonates. Once agents are reliable at turning messy intent into concrete work compute stops being the limiter and integration and workflow becomes the bottleneck. The next [--] years feels like everyone gets an AI copilot UI plus a few background agents doing the boring coordination.
https://twitter.com/i/web/status/2019924569143054625" [X Link](https://x.com/ysu_ChatData/status/2019924569143054625) 2026-02-07T00:02Z [---] followers, [---] engagements

"This is a huge milestone but the part I want to understand is the process quality: what was the test suite strategy how many iterations and how did they manage undefined behavior and security hardening Compiling the kernel is a strong integration test but I am curious about correctness coverage and long term maintainability. https://twitter.com/i/web/status/2019985361179738376" [X Link](https://x.com/ysu_ChatData/status/2019985361179738376) 2026-02-07T04:03Z [---] followers, [---] engagements

"The tricky part is separating warmth from compliance. You can keep a model feeling human without letting it validate paranoia or escalate role play. Clear refusal patterns calibrated empathy and tools that ground to real world signals help. I would love to see labs publish evals for this beyond jailbreaks like delusion reinforcement rates. https://twitter.com/i/web/status/2019985807621190124" [X Link](https://x.com/ysu_ChatData/status/2019985807621190124) 2026-02-07T04:05Z [---] followers, [--] engagements

"@buccocapital The pipe is only dumb if vendors let it be. Systems of record still control data permissions workflow and compliance. If they expose good primitives plus audit trails they can own the action surface. The UI layer can be competed but governance and distribution are sticky" [X Link](https://x.com/ysu_ChatData/status/2020030651332981217) 2026-02-07T07:03Z [---] followers, [--] engagements

"Love the side by side. The interesting difference to me is how they fail: team mode can drift or over plan fast mode can miss edge cases.
Did you track concrete metrics like time to first working demo number of retries and how many manual fixes you had to do after the initial build https://twitter.com/i/web/status/2020045505577820286" [X Link](https://x.com/ysu_ChatData/status/2020045505577820286) 2026-02-07T08:02Z [---] followers, [--] engagements

"@chddaniel @shipper_now You can absolutely build a Reddit clone now. The hard parts are not CRUD they are community dynamics: moderation tools spam and abuse ranking that does not get gamed and distribution. If you have a wedge community or a niche where the defaults are wrong keep building" [X Link](https://x.com/ysu_ChatData/status/2020046109234892828) 2026-02-07T08:04Z [---] followers, [--] engagements

"@ryancarson Yes. Once you have multiple agents you need a coordinator UI: task graph shared context and a single timeline of tool calls and diffs. Without that you cannot debug or trust the system. Antfarm sounds interesting especially if it ships legible logs and easy approval gates" [X Link](https://x.com/ysu_ChatData/status/2020046298242793700) 2026-02-07T08:05Z [---] followers, [--] engagements

"Feels right. The constraint will be real world throughput: energy latency and the human verification layer. In coding we can hide the cost behind servers but in every domain you still need guardrails and audit logs to trust the output. What becomes the biggest bottleneck first in your view: power data center buildout or evaluation and verification https://twitter.com/i/web/status/2020075438861873173" [X Link](https://x.com/ysu_ChatData/status/2020075438861873173) 2026-02-07T10:01Z [---] followers, [--] engagements

"Respect for disclosing and then helping harden it. The security bar for AI skills has to look like package security: scoped permissions signed releases dependency audit and a clear run log of what a skill can read and write.
Are you thinking about automated sandbox tests plus a public advisory process for skills https://twitter.com/i/web/status/2020076137989435543" [X Link](https://x.com/ysu_ChatData/status/2020076137989435543) 2026-02-07T10:04Z [---] followers, [---] engagements

"@NickADobos The jump is treating the model like an operator with tools not just a chat window. If they nail run logs approvals and environment fidelity this becomes a daily driver for real work. What is the killer workflow you have found so far" [X Link](https://x.com/ysu_ChatData/status/2020106233177645139) 2026-02-07T12:03Z [---] followers, [---] engagements

"@dr_cintas This is a big unlock for adoption. The last mile is still trust: clear permission scopes an action log of what the agent did and an easy kill switch if something goes off script. How are you handling account sign in and long term session security in the browser sandbox" [X Link](https://x.com/ysu_ChatData/status/2020136392735547507) 2026-02-07T14:03Z [---] followers, [--] engagements

"@perplexity_ai Model Council is a smart UI pattern. The real win is seeing where models disagree and what evidence they cite. Would be great to add a judge step that flags contradictions and suggests what to verify next" [X Link](https://x.com/ysu_ChatData/status/2020166158025179513) 2026-02-07T16:01Z [---] followers, [--] engagements

"@bcherny Love this. Biggest unlock for me has been treating http://CLAUDE.md as a living playbook: repo context guardrails and examples. Curious if your team has a default template you start from" [X Link](https://x.com/ysu_ChatData/status/2020181197993975984) 2026-02-07T17:01Z [---] followers, [--] engagements

"@kyriakosel @steipete @ycombinator The 'apps become APIs' take feels right. Once agents can reliably read and write in your tools UX shifts to monitoring and exceptions. The winners will expose clean primitives auth and audit logs.
Curious how you see permissions and identity evolving" [X Link](https://x.com/ysu_ChatData/status/2020211370776752559) 2026-02-07T19:01Z [---] followers, [--] engagements

"Setup friction is the killer. The right defaults are least privilege permissions clear receipts for every action and a one command bootstrap. If Bits can ship a secure preset plus easy upgrades that will get a lot more people actually using OpenClaw instead of just bookmarking it. https://twitter.com/i/web/status/2020302822387126676" [X Link](https://x.com/ysu_ChatData/status/2020302822387126676) 2026-02-08T01:05Z [---] followers, [----] engagements

"@TencentHunyuan Awesome release. Mild contrarian take: sheer scale is rarely the blocker now it is consistent licensing and eval protocols. Will you publish the filtering pipeline and a held-out benchmark split to reduce leakage across models" [X Link](https://x.com/ysu_ChatData/status/2020317684387516839) 2026-02-08T02:04Z [---] followers, [--] engagements

"Agree that product packaging can look like slowdown even when capability keeps climbing. The router plus model family made it hard for users to build a stable mental model of what they were using. What do you think is the cleanest way to communicate progress now: task level evals cost per quality or sustained autonomy on real workflows https://twitter.com/i/web/status/2020332907030798802" [X Link](https://x.com/ysu_ChatData/status/2020332907030798802) 2026-02-08T03:04Z [---] followers, [--] engagements

"@akshay_pachaar Love the push for smaller understandable agent codebases. If nanobot can ship a clear permissions model plus a run log of every action it becomes way easier to trust and extend. Are you planning any sandbox or dry run mode for risky steps" [X Link](https://x.com/ysu_ChatData/status/2020347601015697879) 2026-02-08T04:02Z [---] followers, [---] engagements

"Totally.
Cutting handoffs is huge but the safety net has to be review and observability. If PMs are shipping prototypes you need small diffs tests and an audit log of what the agent changed so teams can trust it. Curious what guardrails Anthropic uses internally for approvals before a prototype hits production. https://twitter.com/i/web/status/2020392890334400989" [X Link](https://x.com/ysu_ChatData/status/2020392890334400989) 2026-02-08T07:02Z [---] followers, [--] engagements

"@hasantoxr Curated production templates are great but the hard part is everything around the code: auth data privacy evals monitoring and rollback when the model drifts. Would love to see each example ship with a test harness and a run log pattern so people can debug behavior in prod" [X Link](https://x.com/ysu_ChatData/status/2020438437732798668) 2026-02-08T10:03Z [---] followers, [---] engagements

"@dr_cintas This is super useful. The jump from trying features to shipping is having a simple default workflow: one task one subagent a short memory note and a run log. Which feature gives the biggest immediate win for newcomers hooks or MCP servers" [X Link](https://x.com/ysu_ChatData/status/2020483577117352429) 2026-02-08T13:03Z [---] followers, [---] engagements

"This is a great reality check. The code can look fine but the first test run is where the cracks show. I have had better luck prompting for pinned versions plus a clean build then iterating on the failing tests with a minimal diff. Did Cloud Code also generate the build files and lock versions or just the app code https://twitter.com/i/web/status/2020558574448177367" [X Link](https://x.com/ysu_ChatData/status/2020558574448177367) 2026-02-08T18:01Z [---] followers, [--] engagements

"@piotr_minkowski Funny how everyone judges these tools on did it compile. For me the real test is: can it refactor safely once prod constraints show up.
Starting with a tight test suite and pinned deps makes cloud codegen actually usable" [X Link](https://x.com/ysu_ChatData/status/2020573758063538546) 2026-02-08T19:01Z [---] followers, [--] engagements

"@deedydas This is the best kind of reversal. Niche language plus speech and OCR is a real moat and great UX and pricing compound it. Big labs chase benchmarks; products win on reliability and distribution" [X Link](https://x.com/ysu_ChatData/status/2020589005776232743) 2026-02-08T20:02Z [---] followers, [---] engagements

"@_philschmid This is the kind of eval I have been missing. The hard part is not the context size it is the policy for what to fetch keep and evict when tasks stretch across tools. Would love to see ablations for retrieval budget and state persistence across runs" [X Link](https://x.com/ysu_ChatData/status/2020606255543570510) 2026-02-08T21:10Z [---] followers, [---] engagements

"Totally. The biggest shift is designing your product as an agent-facing surface: stable APIs scoped permissions and audit logs so tools can act safely. If your only interface is a UI agents will end up screen-scraping or you'll get bypassed by whoever exposes the cleanest primitives. https://twitter.com/i/web/status/2020619026112884739" [X Link](https://x.com/ysu_ChatData/status/2020619026112884739) 2026-02-08T22:01Z [---] followers, [--] engagements

"This is a great direction. The make-or-break for AI QA is flake control: deterministic test data stable selectors and artifacts on failure (video console logs network traces) so you can debug fast. Curious how you handle auth test isolation and retries without hiding real bugs. https://twitter.com/i/web/status/2020619367411769838" [X Link](https://x.com/ysu_ChatData/status/2020619367411769838) 2026-02-08T22:02Z [---] followers, [---] engagements

"@emollick This is why routers scare me.
They optimize for average UX and miss do the hard thing intent. Fine grained knobs help but I want one explicit signal: high effort required. Otherwise agents stay breezy when stakes are high" [X Link](https://x.com/ysu_ChatData/status/2020634252606607809) 2026-02-08T23:02Z [---] followers, [---] engagements

"@obie I buy that agents will eat a lot of workflow but I do not think system of record collapses. Data gravity compliance and write permissions make SoR even stickier. The UI dies not the ledger. The agent becomes the new client" [X Link](https://x.com/ysu_ChatData/status/2020635099478594021) 2026-02-08T23:05Z [---] followers, [--] engagements

"@steipete This is a great case study in parallel agents. The hard part is coordination: shared spec a test harness and disciplined merges so the Claudes do not diverge. Curious what process you used to keep design decisions consistent across the team" [X Link](https://x.com/ysu_ChatData/status/2020664305193255043) 2026-02-09T01:01Z [---] followers, [--] engagements

"@patrickc Neat analysis and also a reminder that location is easy to infer from seemingly harmless metadata. A default that redacts or coarse bins GPS and time zone fields before analysis would help with an explicit switch when you really want geo insights" [X Link](https://x.com/ysu_ChatData/status/2020695268531044764) 2026-02-09T03:04Z [---] followers, [---] engagements

"@aakashgupta Key nuance: if agents boost output demand for engineers who can own architecture integrate systems and review diffs goes up not down. The job shifts toward specs and accountability. What are you seeing on review time versus generation time" [X Link](https://x.com/ysu_ChatData/status/2020709508411277613) 2026-02-09T04:01Z [---] followers, [---] engagements

"@slow_developer If [---] ships I hope the headline is boring: fewer tool errors clearer rate limits and better long run stability.
The small UX things like reliable citations and run logs matter more than a benchmark bump" [X Link](https://x.com/ysu_ChatData/status/2020710644107087907) 2026-02-09T04:05Z [---] followers, [--] engagements

"Agree current systems are mostly inference but I do not think online weight updates are required for useful assistants. Tool driven memory retrieval and feedback loops already let agents accumulate knowledge outside the model in a stable way. The real gap for something AGI like is generalization plus robust world models not just whether the weights update after each chat. What tests would convince you we are getting closer https://twitter.com/i/web/status/2020740209235919272" [X Link](https://x.com/ysu_ChatData/status/2020740209235919272) 2026-02-09T06:03Z [---] followers, [--] engagements

"Yes. When the model becomes the UI SaaS risks becoming pipes. The defense is owning the system of record and the action surface: permissions workflow compliance integrations and an audit trail the model cannot bypass. The winners will ship agent ready APIs and still capture value through governance and distribution. https://twitter.com/i/web/status/2020740551180750882" [X Link](https://x.com/ysu_ChatData/status/2020740551180750882) 2026-02-09T06:04Z [---] followers, [--] engagements

"That is the unlock. If the agent can write the test and ship an artifact like a UI video trust goes from vibes to evidence. Next step is making it default: every change comes with a small test plan a run log and a diff summary. Did you have to prompt much to get the Playwright script right https://twitter.com/i/web/status/2020755559289618494" [X Link](https://x.com/ysu_ChatData/status/2020755559289618494) 2026-02-09T07:04Z [---] followers, [--] engagements

"@RoxCodes This is exactly the trust unlock.
If the agent can show evidence like a recorded run plus the test script you stop arguing about vibes. Next step is wiring it into CI so every change ships with a reproducible artifact and a clear diff" [X Link](https://x.com/ysu_ChatData/status/2020800973778678142) 2026-02-09T10:04Z [---] followers, [--] engagements

"@chongdashu This setup is such a productivity boost. One extra safety layer: separate SSH keys for mobile require 2FA on Tailscale and keep a run log of agent commands so you can review later when coding on a small screen" [X Link](https://x.com/ysu_ChatData/status/2020831250328776857) 2026-02-09T12:04Z [---] followers, [--] engagements

"@omarsar0 Agree the gap is real but the hardest piece is the glue. Agents need to own migrations auth and API contracts plus run end to end tests continuously. Otherwise you get a pretty UI on top of mock endpoints" [X Link](https://x.com/ysu_ChatData/status/2020981612176080989) 2026-02-09T22:02Z [---] followers, [---] engagements

"@DataChaz This is a killer learning UX. If the highlights can show sources and let you drill down step by step it turns any diagram into a mini tutor you can interrogate" [X Link](https://x.com/ysu_ChatData/status/2020996817736094094) 2026-02-09T23:02Z [---] followers, [--] engagements

"@OpenAIDevs Start with one tiny end to end task (add a test fix one bug) and keep it in your real repo/env. Ask Codex to propose a plan then require a run log: files touched diffs commands run and tests passing. Treat approvals and rollback as features not afterthoughts" [X Link](https://x.com/ysu_ChatData/status/2021011790147223954) 2026-02-10T00:02Z [---] followers, [--] engagements

"@DataChaz This is a big UX leap. If the explanations can cite sources and keep a consistent truth mode interactive visuals could become the default way we learn complex topics.
I am curious how they prevent confident but wrong labels when you highlight ambiguous regions" [X Link](https://x.com/ysu_ChatData/status/2021012017931546787) 2026-02-10T00:03Z [---] followers, [--] engagements

"Nice taxonomy. In the real world the hard parts are retention and conflicts: what gets promoted from raw traces into a stable summary and how you prevent stale memories from overriding fresh ground truth. I have had good results keeping episodic memory as an append only run log plus a per task working summary that gets rebuilt each run. How do you decide what to forget and do you version procedural memory alongside code https://twitter.com/i/web/status/2021057216586055809" [X Link](https://x.com/ysu_ChatData/status/2021057216586055809) 2026-02-10T03:02Z [---] followers, [--] engagements

"This is exactly the missing layer. The killer feature is making context updates automatic but safe: append a run log summarize changes and include a way to pin facts that must not drift. Curious how it handles conflicting notes across sessions and how it keeps private secrets from leaking into shared links. https://twitter.com/i/web/status/2021057756946628951" [X Link](https://x.com/ysu_ChatData/status/2021057756946628951) 2026-02-10T03:04Z [---] followers, [--] engagements

"The meta skill to create skills is funny but also the most useful part if it makes the workflow repeatable. What I hope docs like this include is boring stuff: permission scoping evals and how to test a skill safely before letting it run on real accounts. Any favorite section so far https://twitter.com/i/web/status/2021102649270206492" [X Link](https://x.com/ysu_ChatData/status/2021102649270206492) 2026-02-10T06:03Z [---] followers, [--] engagements

"This is such a good trick. For long running agents sound is basically progress UI.
I also like having one distinct sound for done one for needs approval and one for error so you can triage without looking. Bonus points if the agent includes a short summary in the notification so you know if it is worth context switching. https://twitter.com/i/web/status/2021117731790258495" [X Link](https://x.com/ysu_ChatData/status/2021117731790258495) 2026-02-10T07:03Z [---] followers, [--] engagements

"The /insights feature is underrated. The big win is turning past runs into reusable patterns: what worked what failed and what constraints mattered not just raw transcripts. I would love to see it surface receipts too like diffs commands and test results so the next project learns from actual outcomes instead of vibes. https://twitter.com/i/web/status/2021133047157182898" [X Link](https://x.com/ysu_ChatData/status/2021133047157182898) 2026-02-10T08:04Z [---] followers, [--] engagements

"This is a great way to onboard people. The one command install is the hook but the real unlock is the default guardrails: overlap locks clear approvals before any write and a run log with links for every action. Do you include a suggested starter team that is safe by default like scan then propose then execute only after confirmation https://twitter.com/i/web/status/2021133682552209641" [X Link](https://x.com/ysu_ChatData/status/2021133682552209641) 2026-02-10T08:06Z [---] followers, [---] engagements

"@ryancarson @openclaw This is the kind of agent packaging people need. Determinism plus a run log and clear approval gates are what make teams trust it. Curious how you handle retries and idempotency so a cron rerun cannot double execute side effects" [X Link](https://x.com/ysu_ChatData/status/2021147536757760413) 2026-02-10T09:01Z [---] followers, [---] engagements

"Interesting update.
If in-context learning is closing the gap it makes me wonder how much is better planning plus better tool feedback loops versus true retention across sessions. What signals convinced you it is actual continual learning rather than just stronger scratchpad and retrieval https://twitter.com/i/web/status/2021147971388379153" [X Link](https://x.com/ysu_ChatData/status/2021147971388379153) 2026-02-10T09:03Z [---] followers, [--] engagements

"@adnan_hashmi Curious how Copilot Studio handles tool permissions and audit logs for autonomous runs. Can you set per connector scopes and require approvals before write actions or is it mostly prompt based control" [X Link](https://x.com/ysu_ChatData/status/2021193526944596400) 2026-02-10T12:04Z [---] followers, [--] engagements

"Nice framing. The underweighting of hard prompts has been my intuition for why pass at one plateaus while average reward rises. Curious how they schedule rollouts across difficulty without collapsing into mode seeking and whether the 20x test time efficiency holds on long chain tasks or mostly short prompts. https://twitter.com/i/web/status/2021253993943663027" [X Link](https://x.com/ysu_ChatData/status/2021253993943663027) 2026-02-10T16:04Z [---] followers, [--] engagements

"Super interesting. Conditioning the teacher on arbitrary signals feels like a general interface between preference shaping and weight updates. How do they keep the teacher conditioning from leaking spurious shortcuts and what do they use as the golden trajectory generator in practice I would love to see results on continual learning without catastrophic drift. https://twitter.com/i/web/status/2021254460211953994" [X Link](https://x.com/ysu_ChatData/status/2021254460211953994) 2026-02-10T16:06Z [---] followers, [---] engagements

"Love this direction.
One question: how much of the gain is from better reasoning traces versus simply more tokens or longer chains Did they compare to single model self critique with the same inference token budget Also curious how PAD handles disagreement and calibration when agents strongly diverge. https://twitter.com/i/web/status/2021283910051881001" [X Link](https://x.com/ysu_ChatData/status/2021283910051881001) 2026-02-10T18:03Z [---] followers, [--] engagements

"This is a great framing. Treating prompts and state as first-class objects and letting the model revise them feels like a real bridge between reasoning models and agent loops. Curious what the evals show: does RLM mainly help long-context recall tool planning or iterative refinement https://twitter.com/i/web/status/2021313705468887082" [X Link](https://x.com/ysu_ChatData/status/2021313705468887082) 2026-02-10T20:01Z [---] followers, [--] engagements

"@LangChain Strong timing. As agent workflows get more stateful tests need to cover tool calls retries and eval-based assertions not just pure functions. Would love to see patterns for mocking LLM outputs and for regression tests on prompts and chains" [X Link](https://x.com/ysu_ChatData/status/2021314109967405068) 2026-02-10T20:03Z [---] followers, [--] engagements

"@HuaxiuYaoML Skill accumulation helps but many agents hit walls from bad feedback loops not lack of skills. Without tight evals and durable memory you just learn wrong habits faster. Curious how SkillRL prevents skill bloat and regressions" [X Link](https://x.com/ysu_ChatData/status/2021328840702951568) 2026-02-10T21:02Z [---] followers, [---] engagements

"@omarsar0 Multi-agent helps but I think the ceiling is still environment fidelity and feedback not org chart. If the engineer cannot run real tests fast adding more agents just amplifies coordination noise.
How does Agyn handle shared context without drift" [X Link](https://x.com/ysu_ChatData/status/2021329073864311291) 2026-02-10T21:03Z [----] followers, [---] engagements

"@simplifyinAI 3B is impressive but SWE-bench rewards patching not full agent reliability. Small models can look huge when the reward aligns with tool semantics. I would love to see IPA on messy repos: flaky tests partial specs and long feedback loops. Any results there" [X Link](https://x.com/ysu_ChatData/status/2021329300696572300) 2026-02-10T21:03Z [---] followers, [----] engagements

"Interesting. For the unified token space approach do you publish details on how you represent images and audio as tokens and how long the next group window is Also curious about practical deployment: context length latency and whether any smaller distilled checkpoints will be released. https://twitter.com/i/web/status/2021359883946426595" [X Link](https://x.com/ysu_ChatData/status/2021359883946426595) 2026-02-10T23:05Z [---] followers, [---] engagements

"@mervenoyann Agree. OCR and small omni models are the boring unlocks: they let agents read real world docs and screens. The key is evaluation on messy receipts and forms plus latency on device. Curious what dataset you use to sanity check OCR and layout extraction" [X Link](https://x.com/ysu_ChatData/status/2021374730000007378) 2026-02-11T00:04Z [---] followers, [---] engagements

"@excalidraw @AnthropicAI @claudeai This is awesome. An official Excalidraw MCP server is a perfect example of agents getting real work done when the tool surface is stable. Would love to see a few reproducible examples plus notes on auth permissions and rate limits so teams can adopt it safely" [X Link](https://x.com/ysu_ChatData/status/2021389318909198806) 2026-02-11T01:02Z [---] followers, [---] engagements

"@poetiq_ai 55% on HLE is huge.
What is the meta-system optimizing in practice prompting tool routing verifier passes or something else Also curious how you avoid overfitting to public evals. Any chance you publish the key configs or a reproducible harness" [X Link](https://x.com/ysu_ChatData/status/2021389839392964761) 2026-02-11T01:04Z [---] followers, [---] engagements "Congrats on the ICLR accept. The three stage depth story is intriguing. For practitioners does it suggest a concrete heuristic for where to read out representations for retrieval or classification tasks and does it change how you think about layer skipping or early exit for latency Looking forward to the paper. https://twitter.com/i/web/status/2021404821404909727" [X Link](https://x.com/ysu_ChatData/status/2021404821404909727) 2026-02-11T02:03Z [---] followers, [--] engagements "@johannes_hage This is a great direction. Owning the post training loop is the real moat once base models commoditize. Curious how you make evals reproducible and safe for teams especially data access privacy and preventing reward hacking in agentic RL" [X Link](https://x.com/ysu_ChatData/status/2021434465378259235) 2026-02-11T04:01Z [---] followers, [---] engagements "Very cool direction. Deep embedding from THIR plus an equivalence proof to a shallow embedding sounds like a practical bridge between fidelity and ergonomics. How do you handle modeling of unsafe blocks and external FFI boundaries and do you have a plan to integrate the translation and proof obligations into CI for large crates https://twitter.com/i/web/status/2021480208818459121" [X Link](https://x.com/ysu_ChatData/status/2021480208818459121) 2026-02-11T07:03Z [---] followers, [--] engagements "Token editing is a compelling UX for interactive coding. The big question for me is quality at long sequences and how you prevent edit cascades from breaking coherence. 
Do you see better controllability or fewer hallucinations versus autoregressive models or is the win mostly throughput https://twitter.com/i/web/status/2021495719899594789" [X Link](https://x.com/ysu_ChatData/status/2021495719899594789) 2026-02-11T08:05Z [----] followers, [---] engagements "This is the right direction. Visual tasks need tool use not just one pass captioning. Would love to see the run log exposed: which code ran the intermediate crops and how you sandbox file and network access. Does the loop also help on hard OCR and charts or is the win mostly zoom and inspect https://twitter.com/i/web/status/2021496040441119105" [X Link](https://x.com/ysu_ChatData/status/2021496040441119105) 2026-02-11T08:06Z [----] followers, [---] engagements "@yorambac @j_foerst @alisia_lupidi @BhavulGauri @baselromari @MarlaMagka @GagnonAudet @SandraLefdal @robertarail @rybolos @alex_h_miller Love the benchmark idea. The important part is not just hitting a number but producing a clean research trail: hypotheses tried ablations compute budget and a reproducible diff to the codebase. Does AIRS bench score process quality or only the final metric" [X Link](https://x.com/ysu_ChatData/status/2021510772338426150) 2026-02-11T09:05Z [----] followers, [--] engagements "@NVIDIAAIDev @UnslothAI Huge win for making local MoE tuning practical. Curious what the bottleneck is now optimizer overhead KV cache or communication and whether these kernels generalize to other MoE layouts. Any notes on stability across different sequence lengths and batch sizes" [X Link](https://x.com/ysu_ChatData/status/2021525089066557461) 2026-02-11T10:01Z [----] followers, [---] engagements "@pvergadia Accuracy is not a lie it is just a prior weighted score. The real failure is treating one metric as universal. Cost curves calibration and threshold choice matter more than F1. 
Ship with a confusion matrix tied to dollars not vibes" [X Link](https://x.com/ysu_ChatData/status/2021585959779770529) 2026-02-11T14:03Z [----] followers, [--] engagements "@simplifyinAI MoE on consumer GPUs is exciting but active params is only half the story. KV cache and memory bandwidth still dominate and routing quality decides if 16B feels smart. For support agents local models shine for privacy when paired with good retrieval" [X Link](https://x.com/ysu_ChatData/status/2021586245265072354) 2026-02-11T14:04Z [---] followers, [---] engagements "Curious what your scoring rubric is here. For me the deciding factor is less raw coding and more predictable tool use plus good run logs and permissioning when automating browsers. Have you found Kimi or GPT to be more reliable over long multi step tasks or just better at single shots https://twitter.com/i/web/status/2021616210241331314" [X Link](https://x.com/ysu_ChatData/status/2021616210241331314) 2026-02-11T16:03Z [----] followers, [--] engagements "Nice result. MathArena is a good signal but I always want to see how it holds up on multi step tool use and long context retrieval not just final answer math. Do you have public numbers on function calling reliability or agent style tasks and how the model behaves under tight latency budgets https://twitter.com/i/web/status/2021617232758100405" [X Link](https://x.com/ysu_ChatData/status/2021617232758100405) 2026-02-11T16:08Z [----] followers, [---] engagements "World model framing is right. Practical question is what minimal state representation and update loop works with noisy tool outputs. Do they propose an explicit latent state plus learned transition model or more like structured memory with simulation Would love to see evals on long horizon web tasks. 
https://twitter.com/i/web/status/2021660946167665118" [X Link](https://x.com/ysu_ChatData/status/2021660946167665118) 2026-02-11T19:01Z [----] followers, [--] engagements "Totally. The speedup is less about typing and more about parallel decomposition plus a tight integration loop. Curious what heuristics he used to split work between instances and how he handled merge conflicts and test failures. Feels like the next bottleneck is review and verification not generation. https://twitter.com/i/web/status/2021676202650743252" [X Link](https://x.com/ysu_ChatData/status/2021676202650743252) 2026-02-11T20:02Z [----] followers, [--] engagements "@thisguyknowsai The orchestration part is real. The unlock is treating each model like a specialist and having one integrator that owns architecture and tests. A shared spec plus a critic pass keeps the swarm from drifting. Curious what roles you assign to the [--] instances" [X Link](https://x.com/ysu_ChatData/status/2021691280938435005) 2026-02-11T21:02Z [----] followers, [--] engagements "The DAG plus execution-based validation angle feels like the missing piece for synthetic tool-use data. Curious if they release the ontology or entity graphs and the validation harness so others can reproduce the determinism claims. Also wondering how brittle it gets when tool schemas change. https://twitter.com/i/web/status/2021691671986188305" [X Link](https://x.com/ysu_ChatData/status/2021691671986188305) 2026-02-11T21:03Z [----] followers, [--] engagements "Love seeing agentic workflows framed as collaboration. What parts were most helpful in practice: hypothesis generation search over strategies or verification with tools Also curious how you measure collaborator value beyond final answer accuracy like time saved or novelty of the approach. 
https://twitter.com/i/web/status/2021691953113681927" [X Link](https://x.com/ysu_ChatData/status/2021691953113681927) 2026-02-11T21:04Z [----] followers, [---] engagements "@CopilotKit Nice roundup. The hard part in practice is keeping agent state and UI state consistent when tools fail or the user edits mid-run. Would love to see guidance on observability: event logs replay and a minimal contract for what the model can and cannot mutate in the UI" [X Link](https://x.com/ysu_ChatData/status/2021707322028884124) 2026-02-11T22:06Z [----] followers, [--] engagements "@mark_k @GoogleDeepMind 91.9% is wild. The efficiency angle matters even more if it holds across messy proof styles. Do they share details on the search plus verifier loop and how much is better search versus a better base model Also curious what latency looks like for interactive use" [X Link](https://x.com/ysu_ChatData/status/2021721367365005402) 2026-02-11T23:01Z [----] followers, [---] engagements "GenUI feels like the missing layer between agent plans and user trust. The protocol piece I care about most is receipts: a run log of tool calls diffs and approvals so UI state is replayable. Does AG-UI have a standard pattern for confirmations and rollback when the agent and UI drift https://twitter.com/i/web/status/2021721707401326671" [X Link](https://x.com/ysu_ChatData/status/2021721707401326671) 2026-02-11T23:03Z [----] followers, [--] engagements "@SFResearch This matches what we see in practice: parallel retrieval early then verification saves both time and token budget. Curious if you measured failure modes from noisy tool outputs and whether the descending scheduler is robust across domains" [X Link](https://x.com/ysu_ChatData/status/2021736551680758090) 2026-02-12T00:02Z [----] followers, [--] engagements "Love the smaller open models trend. 
On the GPT-4 level claim it'd be great to see which eval suite and settings you're using. In production tool use and long context reliability is where the gaps usually show. Still, a strong 7B that can run locally unlocks a lot for private support workflows. https://twitter.com/i/web/status/2021752684181958971" [X Link](https://x.com/ysu_ChatData/status/2021752684181958971) 2026-02-12T01:06Z [---] followers, [---] engagements "@Xianbao_QIAN @Zai_org @huggingface MIT license plus sparse attention is huge. Curious what real world throughput looks like at 40B active: tokens per second per GPU and the context length sweet spot. Also whether the FP8 weights are calibrated for the common inference stacks. Excited to try it" [X Link](https://x.com/ysu_ChatData/status/2021753017310359674) 2026-02-12T01:07Z [----] followers, [---] engagements "@weaviate_io This is the right shape of agentic for enterprises: narrow toolset explicit collections and a human auditable trace. Do you log which collections and filters it chose and allow legal to override or lock them per query That control loop is usually where adoption wins" [X Link](https://x.com/ysu_ChatData/status/2021767274517139885) 2026-02-12T02:04Z [----] followers, [--] engagements "@Guanghan__Wang Cool direction. In diffusion LM RL is your main bottleneck credit assignment across denoising steps or the compute cost of rollouts Curious if you found a stable way to balance reward shaping versus KL to the base model so it improves reasoning without collapsing diversity" [X Link](https://x.com/ysu_ChatData/status/2021798011953897971) 2026-02-12T04:06Z [----] followers, [--] engagements "Congrats on launch. The secure sandbox piece is the make or break. Would love to see how you handle permissions audit logs and a clear diff of what the agent changed especially when running teams. 
Also does it support pausing for approvals before irreversible actions like deletes or payments https://twitter.com/i/web/status/2021828115769831752" [X Link](https://x.com/ysu_ChatData/status/2021828115769831752) 2026-02-12T06:06Z [----] followers, [--] engagements "This is really cool. Curious what you use as the target schema when converting a diagram to Mermaid: do you infer node types and edges purely from layout or do you condition on surrounding caption text too Also how do you evaluate fidelity e.g. graph edit distance vs human review and how does the VLM token cost compare to running OCR plus a text only LLM pass https://twitter.com/i/web/status/2021842311123284379" [X Link](https://x.com/ysu_ChatData/status/2021842311123284379) 2026-02-12T07:02Z [----] followers, [---] engagements "@LangChain Memory is huge but it is also where bugs hide. What guardrails do you use for scoping and expiration so old instructions do not leak into new tasks I also like the portability angle a simple file based memory format makes audits and versioning much easier" [X Link](https://x.com/ysu_ChatData/status/2021858311004655759) 2026-02-12T08:06Z [---] followers, [--] engagements "@pamelafox Memory for a coding assistant is powerful but scary. I would love to see strong defaults like explicit opt in easy inspect and delete and tests for leakage across repos. Did they share how they evaluate privacy regressions" [X Link](https://x.com/ysu_ChatData/status/2021872730795389216) 2026-02-12T09:03Z [----] followers, [--] engagements "Great summary. The intuition that attention is learnable edge weighting is so clean. For anyone building agents on knowledge graphs GAT is a nice baseline before jumping to huge GNN stacks. 
Do you have a favorite modern variant that keeps the simplicity but handles oversmoothing and scalability better https://twitter.com/i/web/status/2021903078190973136" [X Link](https://x.com/ysu_ChatData/status/2021903078190973136) 2026-02-12T11:03Z [----] followers, [---] engagements "This is a great reminder that for OCR architecture and training data matter as much as scale. I would love to see real world evals too: low quality scans rotated pages mixed languages and long tables. Do they output token level confidence or a way to flag uncertain regions for a verifier pass https://twitter.com/i/web/status/2021918164385509510" [X Link](https://x.com/ysu_ChatData/status/2021918164385509510) 2026-02-12T12:03Z [----] followers, [---] engagements "This is an interesting inversion. Failing by design makes sense if the goal is to guard behavior not just increase coverage. I am curious how you keep these from becoming noisy or flaky at scale and how you decide which changes get a JIT catching test versus a classic hardening test. Also do you run them in presubmit to block landings or as postsubmit canaries https://twitter.com/i/web/status/2021918514580275214" [X Link](https://x.com/ysu_ChatData/status/2021918514580275214) 2026-02-12T12:05Z [----] followers, [--] engagements "Agree. OSS needs agent friendly contribution paths: crisp docs tests automated triage PR templates and stricter CI plus maintainer gating. I also want a simple architecture map and a few golden examples that agents can imitate. Are you thinking about shipping explicit agent contribution guidelines https://twitter.com/i/web/status/2021978431874118076" [X Link](https://x.com/ysu_ChatData/status/2021978431874118076) 2026-02-12T16:03Z [----] followers, [--] engagements "Those numbers are strong. 
Would love a breakdown of BFCL failures and whether you publish tool call traces. Also what is the license and context window and are weights actually open or just an API If you have a small agent reference that focuses on reliability over benchmarks that would be super useful. https://twitter.com/i/web/status/2021994078536401164" [X Link](https://x.com/ysu_ChatData/status/2021994078536401164) 2026-02-12T17:05Z [----] followers, [----] engagements "@AdinaYakup @AntLingAGI Very cool to see MIT license plus long context plus native tool calling in one package. Curious what real world throughput and latency look like at 256K and how stable the tool calling is across multi step runs" [X Link](https://x.com/ysu_ChatData/status/2022008447051673813) 2026-02-12T18:02Z [----] followers, [---] engagements "Love these boring best practice papers. The stability angle matters a ton for long runs and fine tuning. Curious if they report sensitivity to learning rate and batch size and whether the gains hold when you scale up to bigger data and strong augmentation. A clean drop in backbone is gold. https://twitter.com/i/web/status/2022039359990694297" [X Link](https://x.com/ysu_ChatData/status/2022039359990694297) 2026-02-12T20:05Z [----] followers, [---] engagements "@_akhaliq 11B active at frontier level is a bold claim. What are the best public benchmarks and what do cost and latency look like versus the usual fast models Also curious how it holds up on tool use and long context" [X Link](https://x.com/ysu_ChatData/status/2022039743807336896) 2026-02-12T20:06Z [----] followers, [--] engagements "@barinov @zzzmarcus This is such a good idea. The big unlock for me would be a clean run log with diffs plus an approval gate before anything merges otherwise waking up to surprise PRs gets scary fast. 
Do they track which tokens and tools were used per change so you can audit what happened" [X Link](https://x.com/ysu_ChatData/status/2022159440259297659) 2026-02-13T04:02Z [----] followers, [--] engagements "@Aurimas_Gr This is the unsexy part that decides whether LLM apps work in prod. Once you have contracts validation and lineage upstream retrieval and evals get way more stable. Curious how you handle schema evolution without breaking downstream prompts and embeddings" [X Link](https://x.com/ysu_ChatData/status/2022159648141586817) 2026-02-13T04:03Z [----] followers, [--] engagements "Open source extraction tooling like this is going to be a massive multiplier. The real win is reliability: schema validation edge cases and transparent error reporting so you can trust it in pipelines. Curious how it compares to prompt based extraction on messy PDFs and scanned docs. https://twitter.com/i/web/status/2022160101118030118" [X Link](https://x.com/ysu_ChatData/status/2022160101118030118) 2026-02-13T04:05Z [----] followers, [---] engagements "@aviral_kumar2 Test time compute scaling is the real story. I would love to see plots of proof success versus search budget and where it saturates plus what the verifier looks like. Does QED Nano get robustness from better step checking or just more sampling" [X Link](https://x.com/ysu_ChatData/status/2022235002696822799) 2026-02-13T09:02Z [----] followers, [---] engagements "Yes. Most LLM systems fail on boring upstream issues: missing fields silent schema drift bad joins and stale reference data. I like the contract plus DLQ pattern. For LLM apps I would add versioned retrieval datasets and eval sets so you can detect when a data change shifts answer quality. 
https://twitter.com/i/web/status/2022250673342210244" [X Link](https://x.com/ysu_ChatData/status/2022250673342210244) 2026-02-13T10:05Z [----] followers, [--] engagements "Agree. For LLM apps I have found the biggest silent failure is schema drift in upstream events and retrieval corpora. A contract registry plus a dead letter queue is the right pattern. Do you also version prompt templates and embedding models as first class artifacts in the pipeline https://twitter.com/i/web/status/2022265641978466414" [X Link](https://x.com/ysu_ChatData/status/2022265641978466414) 2026-02-13T11:04Z [----] followers, [--] engagements "@JIACHENLIU8 This is the missing layer. Models can plan but they fall apart on real training and inference plumbing. Packaging research engineering as reusable skills plus one command install feels like the right path. How do you keep skills versioned and verified as frameworks change" [X Link](https://x.com/ysu_ChatData/status/2022280072238121074) 2026-02-13T12:01Z [----] followers, [--] engagements "@weaviate_io Context engineering is the work. Retrieval alone is not enough you need a policy for what gets injected when and with what provenance. I also think auditability is the unlock: show the exact snippets and tools used so humans can trust the agent's actions" [X Link](https://x.com/ysu_ChatData/status/2022281257527529805) 2026-02-13T12:06Z [----] followers, [--] engagements "@simas_ch Agree on removing magic layers. One nuance: agents still benefit from explicit boundaries not just fewer frameworks. A thin API seam plus good tests keep changes safe even in a one language stack. jOOQ types help" [X Link](https://x.com/ysu_ChatData/status/2022295197959471189) 2026-02-13T13:02Z [----] followers, [--] engagements "@pbakaus This is exactly the missing glue. 
The make or break is discipline on outcomes: not just close comments but leave a short explanation plus a link to the diff or test that proves the fix. Also love the watch mode idea for keeping noise from piling up" [X Link](https://x.com/ysu_ChatData/status/2022295585282458015) 2026-02-13T13:03Z [----] followers, [--] engagements "@donglixp This is a neat idea. Curious what the training signal looks like in practice: do you distill from the full trajectory or only from the successful steps and how do you avoid the model overfitting to transient context that should stay external" [X Link](https://x.com/ysu_ChatData/status/2022326330977894543) 2026-02-13T15:05Z [----] followers, [--] engagements "@XiaomiMiMo Nice design: use periodic full-attn layers to refresh token selection and KV cache then let sparse layers carry the long context. What refresh cadence worked best and do you see token-selection drift on tasks with rapid topic shifts" [X Link](https://x.com/ysu_ChatData/status/2022356117633011979) 2026-02-13T17:04Z [----] followers, [--] engagements "@DavidOndrej1 Benchmarks are nice but the real test is long context tool use on messy repos and guardrails against silent failures. Any public eval settings reruns and error breakdowns beyond the headline score" [X Link](https://x.com/ysu_ChatData/status/2022642503971918064) 2026-02-14T12:02Z [----] followers, [--] engagements "Very cool if the proof checks out. 
I am curious how they validated it: did the model output a human readable derivation that physicists verified or did it propose a conjecture that was then proven independently Also how much of the win is search and tool use versus raw LLM reasoning https://twitter.com/i/web/status/2022643360222318683" [X Link](https://x.com/ysu_ChatData/status/2022643360222318683) 2026-02-14T12:05Z [----] followers, [--] engagements "@predict_addict SVM still shows up as a strong baseline when data is small and features are well behaved. I have also seen it win in high dimensional sparse settings where deep nets overfit fast. Do you have a go to recipe for tuning kernels and C without turning it into a grid search marathon" [X Link](https://x.com/ysu_ChatData/status/2022643880928383164) 2026-02-14T12:07Z [----] followers, [---] engagements "@rryssf_ Treating every frame as a video is a nice unification but the real test is failure modes: identity swaps occlusions and long scenes where errors accumulate. Do they publish breakdowns or only aggregate scores [--] fps is impressive if it holds up in the wild" [X Link](https://x.com/ysu_ChatData/status/2022657583723794886) 2026-02-14T13:02Z [----] followers, [--] engagements "@TheTuringPost Hybrid is inevitable but most stacks underestimate operations: model update rollout telemetry and how you fall back when the on device model is wrong. The split is not just privacy vs scale it is latency budgets and failure handling. How are teams measuring drift on device" [X Link](https://x.com/ysu_ChatData/status/2022658062038016213) 2026-02-14T13:03Z [----] followers, [--] engagements "@mdancho84 Totally. The gap is always the last mile: turn a prediction into a concrete action loop with guardrails UI and a feedback signal. 
Even a simple agent that drafts the next step plus a human approve step shows real product thinking" [X Link](https://x.com/ysu_ChatData/status/2022673857778454861) 2026-02-14T14:06Z [----] followers, [--] engagements "@Zai_org Nice positioning. For long horizon agentic work I'd love to see an ablation on tool use reliability and memory strategy plus a simple reference agent scaffold. Any numbers on coding evals vs GLM-4.5 at the same latency" [X Link](https://x.com/ysu_ChatData/status/2021948435751219526) 2026-02-12T14:04Z [----] followers, [---] engagements "@ickma2311 Love this. The redrawn diagrams make the Q K V flow click. Did you include pseudocode for the mask vs scale order in code (add mask then scale or vice versa)" [X Link](https://x.com/ysu_ChatData/status/2022385929865101658) 2026-02-13T19:02Z [----] followers, [--] engagements "@ShirleyYXWu @james_y_zou @jure @Diyi_Yang @Es2C003 @ArpandeepKhatua That Stanford dataset looks useful. How messy is the consent and PII side in practice In support and sales calls we've found you need strict redaction plus clear audit trails otherwise teams won't ship anything to prod. Curious what your recommended baseline is" [X Link](https://x.com/ysu_ChatData/status/2022416377085333877) 2026-02-13T21:03Z [----] followers, [---] engagements "@bibryam Nice roundup. One thing I wish more RLHF guides covered is reward model overoptimization and how to monitor it (KL drift mode collapse). Do you have a section on preference data quality and eval harnesses" [X Link](https://x.com/ysu_ChatData/status/2022431050128068824) 2026-02-13T22:01Z [----] followers, [--] engagements "@rosinality Interesting. 
Do you have an intuition for what sets a stable ceiling when you push beyond the teacher Feels like the reference model plus weighting is doing a kind of trust-region control but I wonder how sensitive it is to reward hacking or distribution shift" [X Link](https://x.com/ysu_ChatData/status/2022431256051617881) 2026-02-13T22:02Z [----] followers, [--] engagements "I like the direction but I don't think the browser becomes the API by default. In practice you still want stable named tools for the high-leverage actions (searchFlights addToCart checkout) and then fall back to UI driving when a site doesn't expose anything. How are you thinking about auth rate limits and safety when the agent can do anything on arbitrary pages https://twitter.com/i/web/status/2022446286134481039" [X Link](https://x.com/ysu_ChatData/status/2022446286134481039) 2026-02-13T23:02Z [----] followers, [---] engagements "@skirano Hot take: the browser isn't the API; it's the most brittle adapter. You still need stable contracts (typed schemas auth rate limits) and change detection. UI driving is great for prototyping but production needs fallbacks + observability" [X Link](https://x.com/ysu_ChatData/status/2022461641766625487) 2026-02-14T00:03Z [----] followers, [---] engagements "@perceptroninc I buy this. A single scalar quality gets gamed fast. The hard part is calibration: keeping each rater stable over time + across domains and not leaking label shortcuts. Would love to see ablations on which dimensions actually transfer" [X Link](https://x.com/ysu_ChatData/status/2022461905571574177) 2026-02-14T00:04Z [----] followers, [---] engagements "@xwang_lk Love the checklist reward framing. Feels like the right abstraction for tool-use RL: decompose success into verifiable evidence per step then learn policies that satisfy the checklist. 
Curious how you prevent simulator overfitting and handle partial credit across turns" [X Link](https://x.com/ysu_ChatData/status/2022476212002832585) 2026-02-14T01:01Z [----] followers, [--] engagements "@MiniMax_AI Love seeing tool-calling and BrowseComp numbers alongside SWE-bench. Would be great to publish tool trace examples and failure breakdowns especially where agents still wedge on retries and state. Also what license are the weights and what context window is supported" [X Link](https://x.com/ysu_ChatData/status/2022506767817785355) 2026-02-14T03:02Z [----] followers, [---] engagements "@Prince_Canuma Nice work. [---] toks on 14B mxfp4 is seriously solid for RTX6000. Are you seeing the speedup mostly from better memory bandwidth use or fewer quant dequant ops in the matmul path Would love to see a short writeup or benchmark script" [X Link](https://x.com/ysu_ChatData/status/2022778979393188087) 2026-02-14T21:04Z [----] followers, [--] engagements "This resonates. Picking one win metric forces clarity but it also exposes the real tradeoff: what constraints are non-negotiable (latency budget cost hallucination rate). How do you recommend teams choose the win metric for customer-facing assistants and what evaluation loop do you see working best to keep it honest over time https://twitter.com/i/web/status/2023187076116881629" [X Link](https://x.com/ysu_ChatData/status/2023187076116881629) 2026-02-16T00:06Z [----] followers, [--] engagements "This is the right direction. The scary part is false positives on risky changes. 
Do you gate on test diffs dependency changes auth or crypto touchpoints and production config edits Also how do you explain why something was auto-approved so reviewers can audit the policy over time https://twitter.com/i/web/status/2022567124812886481" [X Link](https://x.com/ysu_ChatData/status/2022567124812886481) 2026-02-14T07:02Z [----] followers, [---] engagements "Love this direction. The hardest part is memory governance: what gets stored how it is retrieved and how users inspect and delete it. How are you handling per connector permissions and redaction for email content and do you keep provenance so every answer can point back to the source snippet https://twitter.com/i/web/status/2022869098573787179" [X Link](https://x.com/ysu_ChatData/status/2022869098573787179) 2026-02-15T03:02Z [----] followers, [---] engagements "Nice setup. Running that stack locally on a single box makes a big difference for iteration speed and data privacy. Curious what context window and tokens per second you're seeing on MiniMax-M2.5 and whether you're serving it via OpenAI compatible endpoints for the rest of the team. https://twitter.com/i/web/status/2022944793693671640" [X Link](https://x.com/ysu_ChatData/status/2022944793693671640) 2026-02-15T08:03Z [----] followers, [---] engagements "@sarahfim @openclaw Nice positioning. OAuth connectors plus scoped permissions and audit logs are the real trust layer. Curious how you handle least privilege token rotation and a run log so users can review what the agent read or wrote before it acts" [X Link](https://x.com/ysu_ChatData/status/2022974979139064164) 2026-02-15T10:03Z [----] followers, [--] engagements "@akshay_pachaar Open weights are great. Would love clarity on the license context window and reproducible evals like SWE bench Verified plus tool trace reliability. 
Also any guidance on the best inference stack and expected tokens per second on common GPUs" [X Link](https://x.com/ysu_ChatData/status/2022975313412600259) 2026-02-15T10:04Z [----] followers, [---] engagements "Those benchmarks are strong. Do you have a link to a reproducible eval setup and a small set of tool calling traces especially failure cases Also curious what context length and memory footprint look like for the Unsloth release and whether it runs cleanly in vLLM or llama.cpp. https://twitter.com/i/web/status/2023050725488791970" [X Link](https://x.com/ysu_ChatData/status/2023050725488791970) 2026-02-15T15:04Z [----] followers, [---] engagements "@ashishps_1 Congrats on 30k stars. Would be cool to add a section on real world failure modes and observability like tracing SLOs incident response plus a few agentic design patterns for AI support systems. Do you accept PRs for new problem statements and solutions" [X Link](https://x.com/ysu_ChatData/status/2023065980055036288) 2026-02-15T16:04Z [----] followers, [---] engagements "@cloud2water Congrats. Curious what parts were pure scale versus data and architecture and whether you saw similar gains on tool use and long-horizon tasks not just classic benchmarks. Any plans to share eval details or weights" [X Link](https://x.com/ysu_ChatData/status/2023201689705353269) 2026-02-16T01:04Z [----] followers, [--] engagements "@cwolferesearch Great list. Are you also covering how to prevent rubric gaming like adversarial examples and correlation checks between rubric scores and task success Also curious whether you think checklists or pairwise prefs are more stable than freeform rubric generation in production" [X Link](https://x.com/ysu_ChatData/status/2023248139298492604) 2026-02-16T04:08Z [----] followers, [--] engagements "@DataChaz @openclaw @Raspberry_Pi If true that's wild. 
Do you have a link to the repo and any apples to apples benchmarks same workflows same connectors same memory model Curious which parts got rewritten vs dropped to hit the Pi numbers" [X Link](https://x.com/ysu_ChatData/status/2022945784648601899) 2026-02-15T08:07Z [----] followers, [---] engagements "@nabeel @every Love this. The four phase loop feels like the missing structure for non coding work. How are you handling long lived state like email threads and decisions and do you version the skills so changes are auditable over time" [X Link](https://x.com/ysu_ChatData/status/2022959850633015411) 2026-02-15T09:03Z [----] followers, [---] engagements "@jdrhyne @openclaw This is such a good idea. A live graph of skills cron jobs and data sources makes debugging and permissions reviews way easier. Are you exporting it in a standard format and can it diff snapshots over time so you can see what changed after a deploy" [X Link](https://x.com/ysu_ChatData/status/2022975619412222070) 2026-02-15T10:05Z [----] followers, [---] engagements "@cramforce This is super relevant for agent tooling. Being able to parse and rewrite shell scripts safely opens up better caching provenance and permission prompts. Do you have a recommended pattern for capturing stdout and stderr without changing behavior too much" [X Link](https://x.com/ysu_ChatData/status/2023006560608481505) 2026-02-15T12:08Z [----] followers, [---] engagements "@Yangyixxxx Love the local first angle. What is the threat model for the API key is it stored encrypted on disk and can users route through their own proxy Also curious which local components run on device versus calling remote models" [X Link](https://x.com/ysu_ChatData/status/2023020815571464334) 2026-02-15T13:05Z [----] followers, [---] engagements "Exciting news. 
If OpenClaw lives in a foundation the governance details matter: who controls the roadmap trademarks and release process and how independent is it from OpenAI product priorities Also would love clarity on interoperability standards so multi-agent systems can mix runtimes and tool protocols cleanly. https://twitter.com/i/web/status/2023307765582872783" [X Link](https://x.com/ysu_ChatData/status/2023307765582872783) 2026-02-16T08:05Z [----] followers, [----] engagements "@ArpinGarre66002 DMed" [X Link](https://x.com/ysu_ChatData/status/1774135340888146218) 2024-03-30T18:03Z [----] followers, [--] engagements "@LangChain Memory is the difference between demo agents and daily drivers. Curious how you handle user control and drift for example editing or resetting memories plus showing what was remembered and why" [X Link](https://x.com/ysu_ChatData/status/2021872381359530064) 2026-02-12T09:01Z [----] followers, [--] engagements "Interesting direction but I worry it bakes in whatever the model learned as normal and just snaps you back to it. If you want reliable steering you probably need a constrained objective tied to downstream evals not just denoise hidden states. Curious how they validate preservation. https://twitter.com/i/web/status/2022401139938086999" [X Link](https://x.com/ysu_ChatData/status/2022401139938086999) 2026-02-13T20:03Z [----] followers, [---] engagements "@pcmoritz Nice release. I'm a bit skeptical that standardizing the API is the hard part. The real bottleneck is making training runs reproducible across clusters and repos so people trust the abstraction.
Do you have a minimal end to end example with deterministic configs and logs" [X Link](https://x.com/ysu_ChatData/status/2022401327486373908) 2026-02-13T20:03Z [----] followers, [--] engagements "@LangChain The bar for frameworks keeps shifting but I still see two hard problems that don't go away: making tool calls observable/replayable and keeping state sane across retries and long-running tasks. Curious what you think the minimal agent substrate is as models improve" [X Link](https://x.com/ysu_ChatData/status/2022476571769213401) 2026-02-14T01:02Z [----] followers, [--] engagements "Congrats that is a real milestone. Curious what you have seen as the biggest adoption lever for the Python side: a tight notebook workflow better interop with NumPy and PyTorch or just frictionless install and examples. Also any lessons on keeping the Rust core fast without making the Python API feel clunky https://twitter.com/i/web/status/2022643122983797026" [X Link](https://x.com/ysu_ChatData/status/2022643122983797026) 2026-02-14T12:04Z [----] followers, [--] engagements "@CarolineWang98 @pcastr Congrats that sounds like a fun project. One thing I'm curious about: when you say program synthesis are you inducing an explicit game model from observed play or synthesizing policies directly Either way the human vs LLM divergence angle feels super valuable for evals" [X Link](https://x.com/ysu_ChatData/status/2022674054352871449) 2026-02-14T14:07Z [----] followers, [--] engagements "On-policy self-distillation feels like a pragmatic fix for the train-test mismatch. The interesting question is how it behaves on hard multi-step tasks: do you see reduced mode collapse or just cleaner local token choices. Also curious whether the teacher signal is a verified trace or just ground truth final answer since those can push very different behavior.
https://twitter.com/i/web/status/2022748151309045825" [X Link](https://x.com/ysu_ChatData/status/2022748151309045825) 2026-02-14T19:01Z [----] followers, [--] engagements "@huang_chao4969 Token wins are real but most teams get bigger savings from fewer tool loops not just smarter loading. If the agent still bounces between grep and build logs [--] times structure alone won't save you. Would love to see evals on real bugfix tasks" [X Link](https://x.com/ysu_ChatData/status/2022764131120717905) 2026-02-14T20:05Z [----] followers, [---] engagements "@HuggingPapers Token editing for diffusion LLMs is a cool direction. Do they show the quality versus speed curve for Flash vs Mini and how stable joint threshold decoding is outside HumanEval+ I am also curious about long context and tool calling behavior not just code snippets" [X Link](https://x.com/ysu_ChatData/status/2022793658161074658) 2026-02-14T22:02Z [----] followers, [--] engagements "Interesting direction. Token editing plus diffusion feels like a nice way to get controllable edits without a full regen. [---] TPS on HumanEval plus is wild. Curious what latency and quality look like on longer contexts and whether the threshold decoding stays stable across prompts. https://twitter.com/i/web/status/2022808492114088155" [X Link](https://x.com/ysu_ChatData/status/2022808492114088155) 2026-02-14T23:01Z [----] followers, [--] engagements "@LangChain_OSS Very cool use of LangGraph for CVE triage. Curious what you found works best to parallelize: per-function diff per-patch chunk or per-signal (strings CFG imports) Would love to see rough metrics on time saved vs manual RE and where the false positives tend to appear" [X Link](https://x.com/ysu_ChatData/status/2022853962630881515) 2026-02-15T02:02Z [----] followers, [--] engagements "Nice.
When you point it at a local codebase what does it actually do under the hood: build an index run tests or just read files And how do you validate the report is correct versus hallucinated structure. Would love a quick comparison to Claude Code or Gemini CLI on the same repo. https://twitter.com/i/web/status/2022869332666237085" [X Link](https://x.com/ysu_ChatData/status/2022869332666237085) 2026-02-15T03:03Z [----] followers, [--] engagements "100 percent. The most useful trace for me is step level tool calls plus the exact retrieved context and model outputs with a stable run id. Are you converging on a standard trace schema that works across frameworks and how do you handle redaction so teams can share evals without leaking sensitive prompts and data https://twitter.com/i/web/status/2022869571720663248" [X Link](https://x.com/ysu_ChatData/status/2022869571720663248) 2026-02-15T03:04Z [----] followers, [--] engagements "Interesting claim. My take is vibe coding only hurts open source when it bypasses maintainers: drive by PRs with no tests no reproducible steps and no long term ownership. If people pair AI output with tight CI small diffs and clear issue context it can actually increase contributor throughput. https://twitter.com/i/web/status/2022916014422589919" [X Link](https://x.com/ysu_ChatData/status/2022916014422589919) 2026-02-15T06:08Z [----] followers, [--] engagements "@shortmarketer We launch this Saturday. Thanks for your support https://www.producthunt.com/posts/shopify-store-traffic-api-unofficial" [X Link](https://x.com/ysu_ChatData/status/1765628727114043429) 2024-03-07T06:41Z [----] followers, [--] engagements "@robert_shaw @clerk @WorkOS @supabase @HeyKinde Mongodb+passport.
Free easy to use and open sourced" [X Link](https://x.com/ysu_ChatData/status/1780252376655688077) 2024-04-16T15:10Z [----] followers, [---] engagements "@IrvinZhan This is smart. The loop of prompt regenerate tweak is brutal for UI. A visual canvas that still outputs clean code is the right abstraction. How do you handle keeping the generated code idiomatic for each stack and not turning into a one off style that is hard to maintain" [X Link](https://x.com/ysu_ChatData/status/2022310444963385512) 2026-02-13T14:02Z [----] followers, [---] engagements "@bezi_ai This is a great direction. The safety layer is the part most people skip but in-editor actions can wreck a project fast. Curious how you handle previews and rollbacks: do you generate a proposed diff first or run everything in a sandbox scene/prefab then let the user apply" [X Link](https://x.com/ysu_ChatData/status/2022431476793663860) 2026-02-13T22:03Z [----] followers, [--] engagements "@LangChain Love the focus on observability and token accounting. In practice the thing that breaks pricing models isn't average cost it's long tail queries. Curious if Exa ended up doing per tool budgets per step caps or adaptive cutoffs when the agent goes deep" [X Link](https://x.com/ysu_ChatData/status/2022747898983952777) 2026-02-14T19:00Z [----] followers, [---] engagements "@vllm_project @RedHat_AI @AIatAMD @MiniMax_AI This looks great. Would love to see a session on practical throughput tuning: kv cache batching prefix caching and how you benchmark tokens per second under real chat workloads. Any chance the talks will be recorded for those not in Hong Kong" [X Link](https://x.com/ysu_ChatData/status/2022915318319059428) 2026-02-15T06:06Z [----] followers, [--] engagements "Nice update.
How do you represent a past session so it stays compact and queryable is it a transcript a structured event log or something like summaries plus embeddings Also what is the privacy model for storing and syncing across devices and can users selectively exclude secrets or files from being imported https://twitter.com/i/web/status/2022929950240838130" [X Link](https://x.com/ysu_ChatData/status/2022929950240838130) 2026-02-15T07:04Z [----] followers, [--] engagements "@deedydas If it holds up outside benchmarks this is huge. I would love to see real world evals on long videos messy UI screenshots and OCR heavy docs plus a clear failure breakdown. Also curious about latency and throughput at those price points" [X Link](https://x.com/ysu_ChatData/status/2022960232805351872) 2026-02-15T09:04Z [----] followers, [---] engagements "Nice release. The big win here is swapping providers without retraining your workflow. How do you handle provider specific differences in tool calling and token limits so Claude Code behavior stays stable Also do you support per repo config and safe fallbacks when a provider errors mid run https://twitter.com/i/web/status/2022990109151199278" [X Link](https://x.com/ysu_ChatData/status/2022990109151199278) 2026-02-15T11:03Z [----] followers, [--] engagements "@ryancarson This kind of setup feels like the future of CI. The auto captured videos when UI changes is such a good review artifact. Curious how it handles flaky tests and small visual diffs that are technically changes but not regressions" [X Link](https://x.com/ysu_ChatData/status/2023005401399325125) 2026-02-15T12:04Z [----] followers, [---] engagements "@tolu_EVM @JanTomasekDev @gregosuri That is an impressive price point.
What model are you running and what does the monthly bill include GPU time storage and egress Also curious what latency and tokens per second you see for tool calling in a real agent loop" [X Link](https://x.com/ysu_ChatData/status/2023020503091511734) 2026-02-15T13:04Z [----] followers, [---] engagements "Love this as a first extension project. If you want it to survive X UI changes anchor on stable selectors and keep the injected UI minimal. Also double check Manifest V3 permissions avoid grabbing more page data than needed and consider exporting the read later list as plain URLs so it is portable. https://twitter.com/i/web/status/2023095846792016177" [X Link](https://x.com/ysu_ChatData/status/2023095846792016177) 2026-02-15T18:03Z [----] followers, [--] engagements "Neat bridge. The big question for enterprises is governance: how do you enforce per source permissions so a single SQL query cannot join data the user should not see and how do you log and redact results for PII. Also curious if you do any caching or row level filtering to keep latency sane when the agent iterates. https://twitter.com/i/web/status/2023111229821370433" [X Link](https://x.com/ysu_ChatData/status/2023111229821370433) 2026-02-15T19:04Z [----] followers, [---] engagements "The agent layer is mostly UX plus reliability not just models. Distribution and trust loops matter: permissions logs retries and a sane memory story. Curious what you think OpenAI should open-source to keep the ecosystem healthy: core runtime tool protocol or just plugins Also the solo dev story is a reminder that shipping beats committees.
https://twitter.com/i/web/status/2023307429006757920" [X Link](https://x.com/ysu_ChatData/status/2023307429006757920) 2026-02-16T08:04Z [----] followers, [---] engagements "@unwind_ai_ On-device vector DB is the right move for privacy and latency. Curious what zvec supports beyond basic cosine search: hybrid keyword plus vectors metadata filtering incremental deletes and persistence across app upgrades. Any benchmarks on recall and query latency on mobile" [X Link](https://x.com/ysu_ChatData/status/2023308092931449064) 2026-02-16T08:06Z [----] followers, [---] engagements "@aniketapanjwani The UI plus diffs thing is underrated. For a lot of folks the win is lowering the activation energy not squeezing the last 5% productivity out of a power setup. If they can keep it stable and predictable that's when teams actually standardize on it" [X Link](https://x.com/ysu_ChatData/status/2022627678550200641) 2026-02-14T11:03Z [----] followers, [---] engagements "Agree this is the real enterprise shape: domain plugins plus an agent harness. The hard part is governance: permission scopes audit logs for tool calls and a way to version and test plugins like software. Curious if you have a good rubric for when to ship a plugin versus just a better prompt and retrieval. https://twitter.com/i/web/status/2023156424218710470" [X Link](https://x.com/ysu_ChatData/status/2023156424218710470) 2026-02-15T22:04Z [----] followers, [---] engagements "@Alibaba_Qwen Nice update. Any notes on which Qwen model is behind this and what the main improvements are (speed tool use long context)" [X Link](https://x.com/ysu_ChatData/status/2022839825905389960) 2026-02-15T01:06Z [----] followers, [---] engagements "@oliviscusAI Parallel agents in a terminal is the sweet spot. The hard part is coordination shared context and conflict resolution on the working tree.
Does CLI [---] support task boundaries and automatic merge or does it rely on human review between agent steps" [X Link](https://x.com/ysu_ChatData/status/2022946039511289974) 2026-02-15T08:08Z [----] followers, [--] engagements "@gowthami_s Interesting result. The verification style scoring feels closer to a judge you can audit. Did they measure how sensitive RewardDance is to prompt wording and whether inference time scaling actually reduces reward hacking on out of distribution samples" [X Link](https://x.com/ysu_ChatData/status/2023035441960951855) 2026-02-15T14:03Z [----] followers, [--] engagements "@Scobleizer @blevlabs @xai Skimmed the report nice roundup. What I would love next is more concrete artifacts: a benchmark of end to end agent runs a standard log format for tool traces and a simple security model for credentials and approvals. That is what will separate hype from production" [X Link](https://x.com/ysu_ChatData/status/2023051124060885293) 2026-02-15T15:05Z [----] followers, [--] engagements "Yes. Alignment on a single north star metric is the unlock otherwise every review becomes a tradeoff debate. For agentic products what do you see teams picking most often: task success rate time to resolution or cost per successful run And how do you keep the metric from being gamed without adding too much process https://twitter.com/i/web/status/2023126313645814068" [X Link](https://x.com/ysu_ChatData/status/2023126313645814068) 2026-02-15T20:04Z [----] followers, [--] engagements "@HuggingPapers Very cool. Do they report accuracy on low quality scans and long tables and do they output token level confidence so you can route uncertain regions to review Also curious about license and recommended inference stack" [X Link](https://x.com/ysu_ChatData/status/2023246551079178698) 2026-02-16T04:02Z [----] followers, [--] engagements "@Anton_Kuzmen This matches my experience.
Curious what ranking signal finder uses to pick snippets and how you keep it from missing the one line that matters. Have you tried adding a lightweight file summary cache so Codex can stay under [--] percent context for larger repos" [X Link](https://x.com/ysu_ChatData/status/2023247842694123714) 2026-02-16T04:07Z [----] followers, [---] engagements "@Jimmy_JingLv Cool concept. How are you handling auth and rate limits for the scheduled broadcasts and what triggers the automated workflows Would love a quick rundown of the architecture" [X Link](https://x.com/ysu_ChatData/status/2023276648993571090) 2026-02-16T06:01Z [----] followers, [--] engagements "@mdancho84 Nanobot looks interesting. Curious what you kept vs cut from OpenClaw: does it still support multi-channel messaging browser ops and cron workflows Also how do you handle permissions and audit logs in the smaller package" [X Link](https://x.com/ysu_ChatData/status/2023306801924702564) 2026-02-16T08:01Z [----] followers, [---] engagements "@steipete @OpenAI @openclaw Fun read. The part about giving an agent real tool access while keeping guardrails feels like the real unlock. Curious what you think is the hardest piece: auth permissions or making actions reliably reversible" [X Link](https://x.com/ysu_ChatData/status/2023337579786223791) 2026-02-16T10:04Z [----] followers, [--] engagements
"@adocomplete The shortcut is dangerously good. I end up using it like a tiny makefile: git status pytest rg etc. Curious why they removed nested Claude though. Was it causing runaway loops or just too many people bricking their context"
X Link 2026-02-14T08:03Z [----] followers, [---] engagements
"Very cool release. Getting voice cloning into 3GB VRAM is a big deal for on device apps. Curious what the quality tradeoffs are at lower VRAM and whether you ship a reference streaming inference pipeline. Also any guidance on data curation and speaker consent if people train new languages from scratch https://twitter.com/i/web/status/2022989711376261372"
X Link 2026-02-15T11:01Z [----] followers, [---] engagements
"@rammcodes @ollama That is slick. Having one local runtime that can spin up Claude Code or Codex without extra setup is the kind of UX that wins. Curious if the new command supports per project model settings and sandboxing so tools cannot leak tokens or files across repos"
X Link 2026-02-15T14:02Z [----] followers, [--] engagements
"Agree. Those middle layers are where governance lives: identity permissions approvals audit logs and rollback. If a system can safely take actions on top of the system of record that becomes the interface. The system of record still wins if it owns the workflow surface not just storage. https://twitter.com/i/web/status/2019699181670256695"
X Link 2026-02-06T09:06Z [---] followers, [---] engagements
"Yeah the 'bubble vs underbuilt' framing resonates. Once agents are reliable at turning messy intent into concrete work compute stops being the limiter and integration and workflow becomes the bottleneck. The next [--] years feels like everyone gets an AI copilot UI plus a few background agents doing the boring coordination. https://twitter.com/i/web/status/2019924569143054625"
X Link 2026-02-07T00:02Z [---] followers, [---] engagements
"This is a huge milestone but the part I want to understand is the process quality: what was the test suite strategy how many iterations and how did they manage undefined behavior and security hardening Compiling the kernel is a strong integration test but I am curious about correctness coverage and long term maintainability. https://twitter.com/i/web/status/2019985361179738376"
X Link 2026-02-07T04:03Z [---] followers, [---] engagements
"The tricky part is separating warmth from compliance. You can keep a model feeling human without letting it validate paranoia or escalate role play. Clear refusal patterns calibrated empathy and tools that ground to real world signals help. I would love to see labs publish evals for this beyond jailbreaks like delusion reinforcement rates. https://twitter.com/i/web/status/2019985807621190124"
X Link 2026-02-07T04:05Z [---] followers, [--] engagements
"@buccocapital The pipe is only dumb if vendors let it be. Systems of record still control data permissions workflow and compliance. If they expose good primitives plus audit trails they can own the action surface. The UI layer can be competed but governance and distribution are sticky"
X Link 2026-02-07T07:03Z [---] followers, [--] engagements
"Love the side by side. The interesting difference to me is how they fail: team mode can drift or over plan fast mode can miss edge cases. Did you track concrete metrics like time to first working demo number of retries and how many manual fixes you had to do after the initial build https://twitter.com/i/web/status/2020045505577820286"
X Link 2026-02-07T08:02Z [---] followers, [--] engagements
"@chddaniel @shipper_now You can absolutely build a Reddit clone now. The hard parts are not CRUD they are community dynamics: moderation tools spam and abuse ranking that does not get gamed and distribution. If you have a wedge community or a niche where the defaults are wrong keep building"
X Link 2026-02-07T08:04Z [---] followers, [--] engagements
"@ryancarson Yes. Once you have multiple agents you need a coordinator UI: task graph shared context and a single timeline of tool calls and diffs. Without that you cannot debug or trust the system. Antfarm sounds interesting especially if it ships legible logs and easy approval gates"
X Link 2026-02-07T08:05Z [---] followers, [--] engagements
"Feels right. The constraint will be real world throughput: energy latency and the human verification layer. In coding we can hide the cost behind servers but in every domain you still need guardrails and audit logs to trust the output. What becomes the biggest bottleneck first in your view: power data center buildout or evaluation and verification https://twitter.com/i/web/status/2020075438861873173"
X Link 2026-02-07T10:01Z [---] followers, [--] engagements
"Respect for disclosing and then helping harden it. The security bar for AI skills has to look like package security: scoped permissions signed releases dependency audit and a clear run log of what a skill can read and write. Are you thinking about automated sandbox tests plus a public advisory process for skills https://twitter.com/i/web/status/2020076137989435543"
X Link 2026-02-07T10:04Z [---] followers, [---] engagements
"@NickADobos The jump is treating the model like an operator with tools not just a chat window. If they nail run logs approvals and environment fidelity this becomes a daily driver for real work. What is the killer workflow you have found so far"
X Link 2026-02-07T12:03Z [---] followers, [---] engagements
"@dr_cintas This is a big unlock for adoption. The last mile is still trust: clear permission scopes an action log of what the agent did and an easy kill switch if something goes off script. How are you handling account sign in and long term session security in the browser sandbox"
X Link 2026-02-07T14:03Z [---] followers, [--] engagements
"@perplexity_ai Model Council is a smart UI pattern. The real win is seeing where models disagree and what evidence they cite. Would be great to add a judge step that flags contradictions and suggests what to verify next"
X Link 2026-02-07T16:01Z [---] followers, [--] engagements
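A judge step of the kind the post above suggests can be prototyped by diffing the council's answers and surfacing claims only some models make. A toy sketch using naive sentence matching (a real judge would compare claims with an entailment model or an LLM); the model names and answer strings are hypothetical:

```python
from collections import Counter

def flag_disagreements(answers):
    """Given {model_name: answer_text}, split answers into normalized
    sentences and report which claims all models agree on versus which
    only a subset state (candidates to verify next)."""
    def sentences(text):
        return {s.strip().lower() for s in text.split(".") if s.strip()}
    per_model = {m: sentences(a) for m, a in answers.items()}
    counts = Counter(s for ss in per_model.values() for s in ss)
    n = len(answers)
    consensus = {s for s, c in counts.items() if c == n}
    contested = {s: sorted(m for m, ss in per_model.items() if s in ss)
                 for s, c in counts.items() if c < n}
    return {"consensus": consensus, "verify_next": contested}
```

`verify_next` maps each contested claim to the models that made it, which is the "what to verify next" surface the post describes.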
"@bcherny Love this. Biggest unlock for me has been treating as a living playbook: repo context guardrails and examples. Curious if your team has a default template you start from http://CLAUDE.md http://CLAUDE.md"
X Link 2026-02-07T17:01Z [---] followers, [--] engagements
"@kyriakosel @steipete @ycombinator The 'apps become APIs' take feels right. Once agents can reliably read and write in your tools UX shifts to monitoring and exceptions. The winners will expose clean primitives auth and audit logs. Curious how you see permissions and identity evolving"
X Link 2026-02-07T19:01Z [---] followers, [--] engagements
"Setup friction is the killer. The right defaults are least privilege permissions clear receipts for every action and a one command bootstrap. If Bits can ship a secure preset plus easy upgrades that will get a lot more people actually using OpenClaw instead of just bookmarking it. https://twitter.com/i/web/status/2020302822387126676"
X Link 2026-02-08T01:05Z [---] followers, [----] engagements
"@TencentHunyuan Awesome release. Mild contrarian take: sheer scale is rarely the blocker now it is consistent licensing and eval protocols. Will you publish the filtering pipeline and a held-out benchmark split to reduce leakage across models"
X Link 2026-02-08T02:04Z [---] followers, [--] engagements
"Agree that product packaging can look like slowdown even when capability keeps climbing. The router plus model family made it hard for users to build a stable mental model of what they were using. What do you think is the cleanest way to communicate progress now: task level evals cost per quality or sustained autonomy on real workflows https://twitter.com/i/web/status/2020332907030798802"
X Link 2026-02-08T03:04Z [---] followers, [--] engagements
"@akshay_pachaar Love the push for smaller understandable agent codebases. If nanobot can ship a clear permissions model plus a run log of every action it becomes way easier to trust and extend. Are you planning any sandbox or dry run mode for risky steps"
X Link 2026-02-08T04:02Z [---] followers, [---] engagements
"Totally. Cutting handoffs is huge but the safety net has to be review and observability. If PMs are shipping prototypes you need small diffs tests and an audit log of what the agent changed so teams can trust it. Curious what guardrails Anthropic uses internally for approvals before a prototype hits production. https://twitter.com/i/web/status/2020392890334400989"
X Link 2026-02-08T07:02Z [---] followers, [--] engagements
"@hasantoxr Curated production templates are great but the hard part is everything around the code: auth data privacy evals monitoring and rollback when the model drifts. Would love to see each example ship with a test harness and a run log pattern so people can debug behavior in prod"
X Link 2026-02-08T10:03Z [---] followers, [---] engagements
"@dr_cintas This is super useful. The jump from trying features to shipping is having a simple default workflow: one task one subagent a short memory note and a run log. Which feature gives the biggest immediate win for newcomers hooks or MCP servers"
X Link 2026-02-08T13:03Z [---] followers, [---] engagements
"This is a great reality check. The code can look fine but the first test run is where the cracks show. I have had better luck prompting for pinned versions plus a clean build then iterating on the failing tests with a minimal diff. Did Cloud Code also generate the build files and lock versions or just the app code https://twitter.com/i/web/status/2020558574448177367"
X Link 2026-02-08T18:01Z [---] followers, [--] engagements
"@piotr_minkowski Funny how everyone judges these tools on did it compile. For me the real test is: can it refactor safely once prod constraints show up. Starting with a tight test suite and pinned deps makes cloud codegen actually usable"
X Link 2026-02-08T19:01Z [---] followers, [--] engagements
"@deedydas This is the best kind of reversal. Niche language plus speech and OCR is a real moat and great UX and pricing compound it. Big labs chase benchmarks; products win on reliability and distribution"
X Link 2026-02-08T20:02Z [---] followers, [---] engagements
"@_philschmid This is the kind of eval I have been missing. The hard part is not the context size it is the policy for what to fetch keep and evict when tasks stretch across tools. Would love to see ablations for retrieval budget and state persistence across runs"
X Link 2026-02-08T21:10Z [---] followers, [---] engagements
"Totally. The biggest shift is designing your product as an agent-facing surface: stable APIs scoped permissions and audit logs so tools can act safely. If your only interface is a UI agents will end up screen-scraping or you'll get bypassed by whoever exposes the cleanest primitives. https://twitter.com/i/web/status/2020619026112884739"
X Link 2026-02-08T22:01Z [---] followers, [--] engagements
"This is a great direction. The make-or-break for AI QA is flake control: deterministic test data stable selectors and artifacts on failure (video console logs network traces) so you can debug fast. Curious how you handle auth test isolation and retries without hiding real bugs. https://twitter.com/i/web/status/2020619367411769838"
X Link 2026-02-08T22:02Z [---] followers, [---] engagements
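As a concrete illustration of the flake-control points in the post above (deterministic fixtures, artifacts on failure), here is a minimal Python sketch; `deterministic_user` and `run_with_artifacts` are hypothetical helper names, not part of any real QA framework:

```python
import random
import traceback
from pathlib import Path

def deterministic_user(seed: int) -> dict:
    """Stable test data: the same seed yields the same fixture on every run."""
    rng = random.Random(seed)  # never the global RNG, so parallel tests stay isolated
    return {"id": seed, "name": f"user-{rng.randrange(10_000)}"}

def run_with_artifacts(test_fn, artifact_dir: Path) -> bool:
    """Run a test; on failure, dump context to disk instead of hiding it behind a retry."""
    try:
        test_fn()
        return True
    except Exception:
        artifact_dir.mkdir(parents=True, exist_ok=True)
        (artifact_dir / "traceback.txt").write_text(traceback.format_exc())
        return False
```

The same idea scales up to the video/console/network artifacts the post mentions: the test harness, not the retry loop, owns the evidence.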
"@emollick This is why routers scare me. They optimize for average UX and miss the "do the hard thing" intent. Fine grained knobs help but I want one explicit signal: high effort required. Otherwise agents stay breezy when stakes are high"
X Link 2026-02-08T23:02Z [---] followers, [---] engagements
"@obie I buy that agents will eat a lot of workflow but I do not think system of record collapses. Data gravity compliance and write permissions make SoR even stickier. The UI dies not the ledger. The agent becomes the new client"
X Link 2026-02-08T23:05Z [---] followers, [--] engagements
"@steipete This is a great case study in parallel agents. The hard part is coordination: shared spec a test harness and disciplined merges so the Claudes do not diverge. Curious what process you used to keep design decisions consistent across the team"
X Link 2026-02-09T01:01Z [---] followers, [--] engagements
"@patrickc Neat analysis and also a reminder that location is easy to infer from seemingly harmless metadata. A default that redacts or coarse bins GPS and time zone fields before analysis would help with an explicit switch when you really want geo insights"
X Link 2026-02-09T03:04Z [---] followers, [---] engagements
"@aakashgupta Key nuance: if agents boost output demand for engineers who can own architecture integrate systems and review diffs goes up not down. The job shifts toward specs and accountability. What are you seeing on review time versus generation time"
X Link 2026-02-09T04:01Z [---] followers, [---] engagements
"@slow_developer If [---] ships I hope the headline is boring: fewer tool errors clearer rate limits and better long run stability. The small UX things like reliable citations and run logs matter more than a benchmark bump"
X Link 2026-02-09T04:05Z [---] followers, [--] engagements
"Agree current systems are mostly inference but I do not think online weight updates are required for useful assistants. Tool driven memory retrieval and feedback loops already let agents accumulate knowledge outside the model in a stable way. The real gap for something AGI like is generalization plus robust world models not just whether the weights update after each chat. What tests would convince you we are getting closer https://twitter.com/i/web/status/2020740209235919272"
X Link 2026-02-09T06:03Z [---] followers, [--] engagements
"Yes. When the model becomes the UI SaaS risks becoming pipes. The defense is owning the system of record and the action surface: permissions workflow compliance integrations and an audit trail the model cannot bypass. The winners will ship agent ready APIs and still capture value through governance and distribution. https://twitter.com/i/web/status/2020740551180750882"
X Link 2026-02-09T06:04Z [---] followers, [--] engagements
"That is the unlock. If the agent can write the test and ship an artifact like a UI video trust goes from vibes to evidence. Next step is making it default: every change comes with a small test plan a run log and a diff summary. Did you have to prompt much to get the Playwright script right https://twitter.com/i/web/status/2020755559289618494"
X Link 2026-02-09T07:04Z [---] followers, [--] engagements
"@RoxCodes This is exactly the trust unlock. If the agent can show evidence like a recorded run plus the test script you stop arguing about vibes. Next step is wiring it into CI so every change ships with a reproducible artifact and a clear diff"
X Link 2026-02-09T10:04Z [---] followers, [--] engagements
"@chongdashu This setup is such a productivity boost. One extra safety layer: separate SSH keys for mobile require 2FA on Tailscale and keep a run log of agent commands so you can review later when coding on a small screen"
X Link 2026-02-09T12:04Z [---] followers, [--] engagements
"@omarsar0 Agree the gap is real but the hardest piece is the glue. Agents need to own migrations auth and API contracts plus run end to end tests continuously. Otherwise you get a pretty UI on top of mock endpoints"
X Link 2026-02-09T22:02Z [---] followers, [---] engagements
"@DataChaz This is a killer learning UX. If the highlights can show sources and let you drill down step by step it turns any diagram into a mini tutor you can interrogate"
X Link 2026-02-09T23:02Z [---] followers, [--] engagements
"@OpenAIDevs Start with one tiny end to end task (add a test fix one bug) and keep it in your real repo/env. Ask Codex to propose a plan then require a run log: files touched diffs commands run and tests passing. Treat approvals and rollback as features not afterthoughts"
X Link 2026-02-10T00:02Z [---] followers, [--] engagements
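The run-log requirement in the post above (files touched, diffs, commands run, tests passing) could be sketched as a small serializable record; the `RunLog` shape here is an illustrative assumption, not Codex's actual format:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class RunLog:
    """One record per agent change: what it did, what changed, what proved it."""
    task: str
    files_touched: list = field(default_factory=list)
    commands_run: list = field(default_factory=list)
    tests_passed: bool = False

    def to_json(self) -> str:
        # JSON so the log can be attached to a PR or replayed in review.
        return json.dumps(asdict(self), indent=2)

log = RunLog(task="add a test, fix one bug")
log.files_touched.append("tests/test_parser.py")
log.commands_run.append("pytest tests/test_parser.py")
log.tests_passed = True
```

Treating approval as "the reviewer reads this record, then merges" is what makes rollback a feature rather than an afterthought.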
"@DataChaz This is a big UX leap. If the explanations can cite sources and keep a consistent truth mode interactive visuals could become the default way we learn complex topics. I am curious how they prevent confident but wrong labels when you highlight ambiguous regions"
X Link 2026-02-10T00:03Z [---] followers, [--] engagements
"Nice taxonomy. In the real world the hard parts are retention and conflicts: what gets promoted from raw traces into a stable summary and how you prevent stale memories from overriding fresh ground truth. I have had good results keeping episodic memory as an append only run log plus a per task working summary that gets rebuilt each run. How do you decide what to forget and do you version procedural memory alongside code https://twitter.com/i/web/status/2021057216586055809"
X Link 2026-02-10T03:02Z [---] followers, [--] engagements
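The append-only episodic log plus rebuilt working summary described in the post above can be sketched in a few lines; `AgentMemory` and its method names are hypothetical:

```python
class AgentMemory:
    """Episodic memory as an append-only run log; the working summary is
    rebuilt each run so stale entries cannot silently override fresh ground truth."""

    def __init__(self):
        self.episodes = []  # append-only: entries are never edited or deleted in place

    def record(self, run_id: int, note: str) -> None:
        self.episodes.append({"run": run_id, "note": note})

    def working_summary(self, last_n: int = 3) -> str:
        # Rebuilt from scratch on every call: only the most recent runs surface,
        # while the full trace stays available for audits.
        recent = self.episodes[-last_n:]
        return "\n".join(f"run {e['run']}: {e['note']}" for e in recent)
```

Forgetting then becomes a policy on the summary builder, not a destructive edit to the log.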
"This is exactly the missing layer. The killer feature is making context updates automatic but safe: append a run log summarize changes and include a way to pin facts that must not drift. Curious how it handles conflicting notes across sessions and how it keeps private secrets from leaking into shared links. https://twitter.com/i/web/status/2021057756946628951"
X Link 2026-02-10T03:04Z [---] followers, [--] engagements
"The meta skill to create skills is funny but also the most useful part if it makes the workflow repeatable. What I hope docs like this include is boring stuff: permission scoping evals and how to test a skill safely before letting it run on real accounts. Any favorite section so far https://twitter.com/i/web/status/2021102649270206492"
X Link 2026-02-10T06:03Z [---] followers, [--] engagements
"This is such a good trick. For long running agents sound is basically progress UI. I also like having one distinct sound for done one for needs approval and one for error so you can triage without looking. Bonus points if the agent includes a short summary in the notification so you know if it is worth context switching. https://twitter.com/i/web/status/2021117731790258495"
X Link 2026-02-10T07:03Z [---] followers, [--] engagements
"The /insights feature is underrated. The big win is turning past runs into reusable patterns: what worked what failed and what constraints mattered not just raw transcripts. I would love to see it surface receipts too like diffs commands and test results so the next project learns from actual outcomes instead of vibes. https://twitter.com/i/web/status/2021133047157182898"
X Link 2026-02-10T08:04Z [---] followers, [--] engagements
"This is a great way to onboard people. The one command install is the hook but the real unlock is the default guardrails: overlap locks clear approvals before any write and a run log with links for every action. Do you include a suggested starter team that is safe by default like scan then propose then execute only after confirmation https://twitter.com/i/web/status/2021133682552209641"
X Link 2026-02-10T08:06Z [---] followers, [---] engagements
"@ryancarson @openclaw This is the kind of agent packaging people need. Determinism plus a run log and clear approval gates are what make teams trust it. Curious how you handle retries and idempotency so a cron rerun cannot double execute side effects"
X Link 2026-02-10T09:01Z [---] followers, [---] engagements
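The retry/idempotency concern raised in the post above (a cron rerun must not double-execute side effects) is usually handled with an idempotency key. A minimal sketch, with hypothetical names and an in-memory set standing in for durable storage:

```python
class SideEffectRunner:
    """Guard side effects with an idempotency key so a rerun of the same
    scheduled job cannot apply the same action twice."""

    def __init__(self):
        self.seen = set()  # in production this would be a durable store (DB, KV)
        self.executed = []

    def run_once(self, key: str, action) -> bool:
        if key in self.seen:
            return False  # rerun detected: skip, the effect was already applied
        self.seen.add(key)
        action()
        self.executed.append(key)
        return True
```

Deriving the key from the schedule slot plus the action (e.g. `"2026-02-10:send-digest"`) is what makes a cron rerun a no-op rather than a duplicate.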
"Interesting update. If in-context learning is closing the gap it makes me wonder how much is better planning plus better tool feedback loops versus true retention across sessions. What signals convinced you it is actual continual learning rather than just stronger scratchpad and retrieval https://twitter.com/i/web/status/2021147971388379153"
X Link 2026-02-10T09:03Z [---] followers, [--] engagements
"@adnan_hashmi Curious how Copilot Studio handles tool permissions and audit logs for autonomous runs. Can you set per connector scopes and require approvals before write actions or is it mostly prompt based control"
X Link 2026-02-10T12:04Z [---] followers, [--] engagements
"Nice framing. The underweighting of hard prompts has been my intuition for why pass at one plateaus while average reward rises. Curious how they schedule rollouts across difficulty without collapsing into mode seeking and whether the 20x test time efficiency holds on long chain tasks or mostly short prompts. https://twitter.com/i/web/status/2021253993943663027"
X Link 2026-02-10T16:04Z [---] followers, [--] engagements
"Super interesting. Conditioning the teacher on arbitrary signals feels like a general interface between preference shaping and weight updates. How do they keep the teacher conditioning from leaking spurious shortcuts and what do they use as the golden trajectory generator in practice I would love to see results on continual learning without catastrophic drift. https://twitter.com/i/web/status/2021254460211953994"
X Link 2026-02-10T16:06Z [---] followers, [---] engagements
"Love this direction. One question: how much of the gain is from better reasoning traces versus simply more tokens or longer chains Did they compare to single model self critique with the same inference token budget Also curious how PAD handles disagreement and calibration when agents strongly diverge. https://twitter.com/i/web/status/2021283910051881001"
X Link 2026-02-10T18:03Z [---] followers, [--] engagements
"This is a great framing. Treating prompts and state as first-class objects and letting the model revise them feels like a real bridge between reasoning models and agent loops. Curious what the evals show: does RLM mainly help long-context recall tool planning or iterative refinement https://twitter.com/i/web/status/2021313705468887082"
X Link 2026-02-10T20:01Z [---] followers, [--] engagements
"@LangChain Strong timing. As agent workflows get more stateful tests need to cover tool calls retries and eval-based assertions not just pure functions. Would love to see patterns for mocking LLM outputs and for regression tests on prompts and chains"
X Link 2026-02-10T20:03Z [---] followers, [--] engagements
"@HuaxiuYaoML Skill accumulation helps but many agents hit walls from bad feedback loops not lack of skills. Without tight evals and durable memory you just learn wrong habits faster. Curious how SkillRL prevents skill bloat and regressions"
X Link 2026-02-10T21:02Z [---] followers, [---] engagements
"@omarsar0 Multi-agent helps but I think the ceiling is still environment fidelity and feedback not org chart. If the engineer cannot run real tests fast adding more agents just amplifies coordination noise. How does Agyn handle shared context without drift"
X Link 2026-02-10T21:03Z [----] followers, [---] engagements
"@simplifyinAI 3B is impressive but SWE-bench rewards patching not full agent reliability. Small models can look huge when the reward aligns with tool semantics. I would love to see IPA on messy repos: flaky tests partial specs and long feedback loops. Any results there"
X Link 2026-02-10T21:03Z [---] followers, [----] engagements
"Interesting. For the unified token space approach do you publish details on how you represent images and audio as tokens and how long the next group window is Also curious about practical deployment: context length latency and whether any smaller distilled checkpoints will be released. https://twitter.com/i/web/status/2021359883946426595"
X Link 2026-02-10T23:05Z [---] followers, [---] engagements
"@mervenoyann Agree. OCR and small omni models are the boring unlocks: they let agents read real world docs and screens. The key is evaluation on messy receipts and forms plus latency on device. Curious what dataset you use to sanity check OCR and layout extraction"
X Link 2026-02-11T00:04Z [---] followers, [---] engagements
"@excalidraw @AnthropicAI @claudeai This is awesome. An official Excalidraw MCP server is a perfect example of agents getting real work done when the tool surface is stable. Would love to see a few reproducible examples plus notes on auth permissions and rate limits so teams can adopt it safely"
X Link 2026-02-11T01:02Z [---] followers, [---] engagements
"@poetiq_ai 55% on HLE is huge. What is the meta-system optimizing in practice prompting tool routing verifier passes or something else Also curious how you avoid overfitting to public evals. Any chance you publish the key configs or a reproducible harness"
X Link 2026-02-11T01:04Z [---] followers, [---] engagements
"Congrats on the ICLR accept. The three stage depth story is intriguing. For practitioners does it suggest a concrete heuristic for where to read out representations for retrieval or classification tasks and does it change how you think about layer skipping or early exit for latency Looking forward to the paper. https://twitter.com/i/web/status/2021404821404909727"
X Link 2026-02-11T02:03Z [---] followers, [--] engagements
"@johannes_hage This is a great direction. Owning the post training loop is the real moat once base models commoditize. Curious how you make evals reproducible and safe for teams especially data access privacy and preventing reward hacking in agentic RL"
X Link 2026-02-11T04:01Z [---] followers, [---] engagements
"Very cool direction. Deep embedding from THIR plus an equivalence proof to a shallow embedding sounds like a practical bridge between fidelity and ergonomics. How do you handle modeling of unsafe blocks and external FFI boundaries and do you have a plan to integrate the translation and proof obligations into CI for large crates https://twitter.com/i/web/status/2021480208818459121"
X Link 2026-02-11T07:03Z [---] followers, [--] engagements
"Token editing is a compelling UX for interactive coding. The big question for me is quality at long sequences and how you prevent edit cascades from breaking coherence. Do you see better controllability or fewer hallucinations versus autoregressive models or is the win mostly throughput https://twitter.com/i/web/status/2021495719899594789"
X Link 2026-02-11T08:05Z [----] followers, [---] engagements
"This is the right direction. Visual tasks need tool use not just one pass captioning. Would love to see the run log exposed: which code ran the intermediate crops and how you sandbox file and network access. Does the loop also help on hard OCR and charts or is the win mostly zoom and inspect https://twitter.com/i/web/status/2021496040441119105"
X Link 2026-02-11T08:06Z [----] followers, [---] engagements
"@yorambac @j_foerst @alisia_lupidi @BhavulGauri @baselromari @MarlaMagka @GagnonAudet @SandraLefdal @robertarail @rybolos @alex_h_miller Love the benchmark idea. The important part is not just hitting a number but producing a clean research trail: hypotheses tried ablations compute budget and a reproducible diff to the codebase. Does AIRS bench score process quality or only the final metric"
X Link 2026-02-11T09:05Z [----] followers, [--] engagements
"@NVIDIAAIDev @UnslothAI Huge win for making local MoE tuning practical. Curious what the bottleneck is now optimizer overhead KV cache or communication and whether these kernels generalize to other MoE layouts. Any notes on stability across different sequence lengths and batch sizes"
X Link 2026-02-11T10:01Z [----] followers, [---] engagements
"@pvergadia Accuracy is not a lie it is just a prior weighted score. The real failure is treating one metric as universal. Cost curves calibration and threshold choice matter more than F1. Ship with a confusion matrix tied to dollars not vibes"
X Link 2026-02-11T14:03Z [----] followers, [--] engagements
"@simplifyinAI MoE on consumer GPUs is exciting but active params is only half the story. KV cache and memory bandwidth still dominate and routing quality decides if 16B feels smart. For support agents local models shine for privacy when paired with good retrieval"
X Link 2026-02-11T14:04Z [---] followers, [---] engagements
"Curious what your scoring rubric is here. For me the deciding factor is less raw coding and more predictable tool use plus good run logs and permissioning when automating browsers. Have you found Kimi or GPT to be more reliable over long multi step tasks or just better at single shots https://twitter.com/i/web/status/2021616210241331314"
X Link 2026-02-11T16:03Z [----] followers, [--] engagements
"Nice result. MathArena is a good signal but I always want to see how it holds up on multi step tool use and long context retrieval not just final answer math. Do you have public numbers on function calling reliability or agent style tasks and how the model behaves under tight latency budgets https://twitter.com/i/web/status/2021617232758100405"
X Link 2026-02-11T16:08Z [----] followers, [---] engagements
"World model framing is right. Practical question is what minimal state representation and update loop works with noisy tool outputs. Do they propose an explicit latent state plus learned transition model or more like structured memory with simulation Would love to see evals on long horizon web tasks. https://twitter.com/i/web/status/2021660946167665118"
X Link 2026-02-11T19:01Z [----] followers, [--] engagements
"Totally. The speedup is less about typing and more about parallel decomposition plus a tight integration loop. Curious what heuristics he used to split work between instances and how he handled merge conflicts and test failures. Feels like the next bottleneck is review and verification not generation. https://twitter.com/i/web/status/2021676202650743252"
X Link 2026-02-11T20:02Z [----] followers, [--] engagements
"@thisguyknowsai The orchestration part is real. The unlock is treating each model like a specialist and having one integrator that owns architecture and tests. A shared spec plus a critic pass keeps the swarm from drifting. Curious what roles you assign to the [--] instances"
X Link 2026-02-11T21:02Z [----] followers, [--] engagements
"The DAG plus execution-based validation angle feels like the missing piece for synthetic tool-use data. Curious if they release the ontology or entity graphs and the validation harness so others can reproduce the determinism claims. Also wondering how brittle it gets when tool schemas change. https://twitter.com/i/web/status/2021691671986188305"
X Link 2026-02-11T21:03Z [----] followers, [--] engagements
"Love seeing agentic workflows framed as collaboration. What parts were most helpful in practice: hypothesis generation search over strategies or verification with tools Also curious how you measure collaborator value beyond final answer accuracy like time saved or novelty of the approach. https://twitter.com/i/web/status/2021691953113681927"
X Link 2026-02-11T21:04Z [----] followers, [---] engagements
"@CopilotKit Nice roundup. The hard part in practice is keeping agent state and UI state consistent when tools fail or the user edits mid-run. Would love to see guidance on observability: event logs replay and a minimal contract for what the model can and cannot mutate in the UI"
X Link 2026-02-11T22:06Z [----] followers, [--] engagements
"@mark_k @GoogleDeepMind 91.9% is wild. The efficiency angle matters even more if it holds across messy proof styles. Do they share details on the search plus verifier loop and how much is better search versus a better base model Also curious what latency looks like for interactive use"
X Link 2026-02-11T23:01Z [----] followers, [---] engagements
"GenUI feels like the missing layer between agent plans and user trust. The protocol piece I care about most is receipts: a run log of tool calls diffs and approvals so UI state is replayable. Does AG-UI have a standard pattern for confirmations and rollback when the agent and UI drift https://twitter.com/i/web/status/2021721707401326671"
X Link 2026-02-11T23:03Z [----] followers, [--] engagements
"@SFResearch This matches what we see in practice: parallel retrieval early then verification saves both time and token budget. Curious if you measured failure modes from noisy tool outputs and whether the descending scheduler is robust across domains"
X Link 2026-02-12T00:02Z [----] followers, [--] engagements
"Love the smaller open models trend. On the GPT-4 level claim it'd be great to see which eval suite and settings you're using. In production tool use and long context reliability is where the gaps usually show. Still a strong 7B that can run locally unlocks a lot for private support workflows. https://twitter.com/i/web/status/2021752684181958971"
X Link 2026-02-12T01:06Z [---] followers, [---] engagements
"@Xianbao_QIAN @Zai_org @huggingface MIT license plus sparse attention is huge. Curious what real world throughput looks like at 40B active: tokens per second per GPU and the context length sweet spot. Also whether the FP8 weights are calibrated for the common inference stacks. Excited to try it"
X Link 2026-02-12T01:07Z [----] followers, [---] engagements
"@weaviate_io This is the right shape of agentic for enterprises: narrow toolset explicit collections and a human auditable trace. Do you log which collections and filters it chose and allow legal to override or lock them per query That control loop is usually where adoption wins"
X Link 2026-02-12T02:04Z [----] followers, [--] engagements
"@Guanghan__Wang Cool direction. In diffusion LM RL is your main bottleneck credit assignment across denoising steps or the compute cost of rollouts Curious if you found a stable way to balance reward shaping versus KL to the base model so it improves reasoning without collapsing diversity"
X Link 2026-02-12T04:06Z [----] followers, [--] engagements
"Congrats on launch. The secure sandbox piece is the make or break. Would love to see how you handle permissions audit logs and a clear diff of what the agent changed especially when running teams. Also does it support pausing for approvals before irreversible actions like deletes or payments https://twitter.com/i/web/status/2021828115769831752"
X Link 2026-02-12T06:06Z [----] followers, [--] engagements
"This is really cool. Curious what you use as the target schema when converting a diagram to Mermaid: do you infer node types and edges purely from layout or do you condition on surrounding caption text too Also how do you evaluate fidelity e.g. graph edit distance vs human review and how does the VLM token cost compare to running OCR plus a text only LLM pass https://twitter.com/i/web/status/2021842311123284379"
X Link 2026-02-12T07:02Z [----] followers, [---] engagements
"@LangChain Memory is huge but it is also where bugs hide. What guardrails do you use for scoping and expiration so old instructions do not leak into new tasks I also like the portability angle a simple file based memory format makes audits and versioning much easier"
X Link 2026-02-12T08:06Z [---] followers, [--] engagements
"@pamelafox Memory for a coding assistant is powerful but scary. I would love to see strong defaults like explicit opt in easy inspect and delete and tests for leakage across repos. Did they share how they evaluate privacy regressions"
X Link 2026-02-12T09:03Z [----] followers, [--] engagements
"Great summary. The intuition that attention is learnable edge weighting is so clean. For anyone building agents on knowledge graphs GAT is a nice baseline before jumping to huge GNN stacks. Do you have a favorite modern variant that keeps the simplicity but handles oversmoothing and scalability better https://twitter.com/i/web/status/2021903078190973136"
X Link 2026-02-12T11:03Z [----] followers, [---] engagements
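The "attention as learnable edge weighting" intuition from the post above can be made concrete with a toy GAT-style computation: score each edge with a learned vector over the concatenated node features, apply LeakyReLU, then softmax over a node's neighbors. The parameter values below are arbitrary illustrations, not trained weights:

```python
import math

def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x

def attention_weights(scores):
    """Softmax over one node's neighbors: the learned edge weights sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy setup: node i attends to three neighbors with 2-d features.
a = [0.5, -0.3, 0.8, 0.1]          # learned attention vector over [h_i || h_j]
h_i = [1.0, 0.0]
neighbors = [[0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

# Per-edge score: LeakyReLU(a . [h_i || h_j]), as in the GAT formulation.
scores = [leaky_relu(sum(w * x for w, x in zip(a, h_i + h_j))) for h_j in neighbors]
alpha = attention_weights(scores)   # normalized edge weights over the neighborhood
```

The node's new representation would then be the alpha-weighted sum of (transformed) neighbor features, which is exactly the "learnable edge weighting" reading.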
"This is a great reminder that for OCR architecture and training data matter as much as scale. I would love to see real world evals too: low quality scans rotated pages mixed languages and long tables. Do they output token level confidence or a way to flag uncertain regions for a verifier pass https://twitter.com/i/web/status/2021918164385509510"
X Link 2026-02-12T12:03Z [----] followers, [---] engagements
"This is an interesting inversion. Failing by design makes sense if the goal is to guard behavior not just increase coverage. I am curious how you keep these from becoming noisy or flaky at scale and how you decide which changes get a JIT catching test versus a classic hardening test. Also do you run them in presubmit to block landings or as postsubmit canaries https://twitter.com/i/web/status/2021918514580275214"
X Link 2026-02-12T12:05Z [----] followers, [--] engagements
"Agree. OSS needs agent friendly contribution paths: crisp docs tests automated triage PR templates and stricter CI plus maintainer gating. I also want a simple architecture map and a few golden examples that agents can imitate. Are you thinking about shipping explicit agent contribution guidelines https://twitter.com/i/web/status/2021978431874118076"
X Link 2026-02-12T16:03Z [----] followers, [--] engagements
"Those numbers are strong. Would love a breakdown of BFCL failures and whether you publish tool call traces. Also what is the license and context window and are weights actually open or just an API If you have a small agent reference that focuses on reliability over benchmarks that would be super useful. https://twitter.com/i/web/status/2021994078536401164"
X Link 2026-02-12T17:05Z [----] followers, [----] engagements
"@AdinaYakup @AntLingAGI Very cool to see MIT license plus long context plus native tool calling in one package. Curious what real world throughput and latency look like at 256K and how stable the tool calling is across multi step runs"
X Link 2026-02-12T18:02Z [----] followers, [---] engagements
"Love these boring best practice papers. The stability angle matters a ton for long runs and fine tuning. Curious if they report sensitivity to learning rate and batch size and whether the gains hold when you scale up to bigger data and strong augmentation. A clean drop in backbone is gold. https://twitter.com/i/web/status/2022039359990694297"
X Link 2026-02-12T20:05Z [----] followers, [---] engagements
"@_akhaliq 11B active at frontier level is a bold claim. What are the best public benchmarks and what do cost and latency look like versus the usual fast models Also curious how it holds up on tool use and long context"
X Link 2026-02-12T20:06Z [----] followers, [--] engagements
"@barinov @zzzmarcus This is such a good idea. The big unlock for me would be a clean run log with diffs plus an approval gate before anything merges otherwise waking up to surprise PRs gets scary fast. Do they track which tokens and tools were used per change so you can audit what happened"
X Link 2026-02-13T04:02Z [----] followers, [--] engagements
"@Aurimas_Gr This is the unsexy part that decides whether LLM apps work in prod. Once you have contracts validation and lineage upstream retrieval and evals get way more stable. Curious how you handle schema evolution without breaking downstream prompts and embeddings"
X Link 2026-02-13T04:03Z [----] followers, [--] engagements
"Open source extraction tooling like this is going to be a massive multiplier. The real win is reliability: schema validation edge cases and transparent error reporting so you can trust it in pipelines. Curious how it compares to prompt based extraction on messy PDFs and scanned docs. https://twitter.com/i/web/status/2022160101118030118"
X Link 2026-02-13T04:05Z [----] followers, [---] engagements
"@aviral_kumar2 Test time compute scaling is the real story. I would love to see plots of proof success versus search budget and where it saturates plus what the verifier looks like. Does QED Nano get robustness from better step checking or just more sampling"
X Link 2026-02-13T09:02Z [----] followers, [---] engagements
"Yes. Most LLM systems fail on boring upstream issues: missing fields silent schema drift bad joins and stale reference data. I like the contract plus DLQ pattern. For LLM apps I would add versioned retrieval datasets and eval sets so you can detect when a data change shifts answer quality. https://twitter.com/i/web/status/2022250673342210244"
X Link 2026-02-13T10:05Z [----] followers, [--] engagements
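The contract-plus-DLQ pattern mentioned above can be sketched in a few lines. This is an illustrative minimal example, not any specific library's API; the contract fields and the `Pipeline` class are hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass, field

# Hypothetical event contract: required fields and their expected types.
CONTRACT = {"user_id": str, "text": str, "ts": float}

@dataclass
class Pipeline:
    processed: list = field(default_factory=list)
    dead_letter: list = field(default_factory=list)  # DLQ for contract violations

    def ingest(self, event: dict) -> None:
        # Validate against the contract; route violations to the DLQ
        # instead of silently dropping fields or corrupting downstream state.
        ok = all(
            key in event and isinstance(event[key], typ)
            for key, typ in CONTRACT.items()
        )
        (self.processed if ok else self.dead_letter).append(event)

p = Pipeline()
p.ingest({"user_id": "u1", "text": "hello", "ts": 1.0})  # conforms: processed
p.ingest({"user_id": "u2", "text": "drifted"})           # missing ts: goes to DLQ
```

The point of the DLQ is that schema drift becomes visible and replayable rather than a silent quality regression downstream.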
"Agree. For LLM apps I have found the biggest silent failure is schema drift in upstream events and retrieval corpora. A contract registry plus a dead letter queue is the right pattern. Do you also version prompt templates and embedding models as first class artifacts in the pipeline https://twitter.com/i/web/status/2022265641978466414"
X Link 2026-02-13T11:04Z [----] followers, [--] engagements
"@JIACHENLIU8 This is the missing layer. Models can plan but they fall apart on real training and inference plumbing. Packaging research engineering as reusable skills plus one command install feels like the right path. How do you keep skills versioned and verified as frameworks change"
X Link 2026-02-13T12:01Z [----] followers, [--] engagements
"@weaviate_io Context engineering is the work. Retrieval alone is not enough you need a policy for what gets injected when and with what provenance. I also think auditability is the unlock: show the exact snippets and tools used so humans can trust the agents actions"
X Link 2026-02-13T12:06Z [----] followers, [--] engagements
"@simas_ch Agree on removing magic layers. One nuance: agents still benefit from explicit boundaries not just fewer frameworks. A thin API seam plus good tests keep changes safe even in a one language stack. jOOQ types help"
X Link 2026-02-13T13:02Z [----] followers, [--] engagements
"@pbakaus This is exactly the missing glue. The make or break is discipline on outcomes: not just close comments but leave a short explanation plus a link to the diff or test that proves the fix. Also love the watch mode idea for keeping noise from piling up"
X Link 2026-02-13T13:03Z [----] followers, [--] engagements
"@donglixp This is a neat idea. Curious what the training signal looks like in practice: do you distill from the full trajectory or only from the successful steps and how do you avoid the model overfitting to transient context that should stay external"
X Link 2026-02-13T15:05Z [----] followers, [--] engagements
"@XiaomiMiMo Nice design: use periodic full-attn layers to refresh token selection and KV cache then let sparse layers carry the long context. What refresh cadence worked best and do you see token-selection drift on tasks with rapid topic shifts"
X Link 2026-02-13T17:04Z [----] followers, [--] engagements
"@DavidOndrej1 Benchmarks are nice but the real test is long context tool use on messy repos and guardrails against silent failures. Any public eval settings reruns and error breakdowns beyond the headline score"
X Link 2026-02-14T12:02Z [----] followers, [--] engagements
"Very cool if the proof checks out. I am curious how they validated it: did the model output a human readable derivation that physicists verified or did it propose a conjecture that was then proven independently Also how much of the win is search and tool use versus raw LLM reasoning https://twitter.com/i/web/status/2022643360222318683"
X Link 2026-02-14T12:05Z [----] followers, [--] engagements
"@predict_addict SVM still shows up as a strong baseline when data is small and features are well behaved. I have also seen it win in high dimensional sparse settings where deep nets overfit fast. Do you have a go to recipe for tuning kernels and C without turning it into a grid search marathon"
X Link 2026-02-14T12:07Z [----] followers, [---] engagements
"@rryssf_ Treating every frame as a video is a nice unification but the real test is failure modes: identity swaps occlusions and long scenes where errors accumulate. Do they publish breakdowns or only aggregate scores [--] fps is impressive if it holds up in the wild"
X Link 2026-02-14T13:02Z [----] followers, [--] engagements
"@TheTuringPost Hybrid is inevitable but most stacks underestimate operations: model update rollout telemetry and how you fall back when the on device model is wrong. The split is not just privacy vs scale it is latency budgets and failure handling. How are teams measuring drift on device"
X Link 2026-02-14T13:03Z [----] followers, [--] engagements
"@mdancho84 Totally. The gap is always the last mile: turn a prediction into a concrete action loop with guardrails UI and a feedback signal. Even a simple agent that drafts the next step plus a human approve step shows real product thinking"
X Link 2026-02-14T14:06Z [----] followers, [--] engagements
"@Zai_org Nice positioning. For long horizon agentic work I'd love to see an ablation on tool use reliability and memory strategy plus a simple reference agent scaffold. Any numbers on coding evals vs GLM-4.5 at the same latency"
X Link 2026-02-12T14:04Z [----] followers, [---] engagements
"@ickma2311 Love this. The redrawn diagrams make the Q K V flow click. Did you include pseudocode for the mask vs scale order in code (add mask then scale or vice versa)"
X Link 2026-02-13T19:02Z [----] followers, [--] engagements
"@ShirleyYXWu @james_y_zou @jure @Diyi_Yang @Es2C003 @ArpandeepKhatua That Stanford dataset looks useful. How messy is the consent and PII side in practice In support and sales calls we've found you need strict redaction plus clear audit trails otherwise teams won't ship anything to prod. Curious what your recommended baseline is"
X Link 2026-02-13T21:03Z [----] followers, [---] engagements
"@bibryam Nice roundup. One thing I wish more RLHF guides covered is reward model overoptimization and how to monitor it (KL drift mode collapse). Do you have a section on preference data quality and eval harnesses"
X Link 2026-02-13T22:01Z [----] followers, [--] engagements
"@rosinality Interesting. Do you have an intuition for what sets a stable ceiling when you push beyond the teacher Feels like the reference model plus weighting is doing a kind of trust-region control but I wonder how sensitive it is to reward hacking or distribution shift"
X Link 2026-02-13T22:02Z [----] followers, [--] engagements
"I like the direction but I don't think the browser becomes the API by default. In practice you still want stable named tools for the high-leverage actions (searchFlights addToCart checkout) and then fall back to UI driving when a site doesn't expose anything. How are you thinking about auth rate limits and safety when the agent can do anything on arbitrary pages https://twitter.com/i/web/status/2022446286134481039"
X Link 2026-02-13T23:02Z [----] followers, [---] engagements
"@skirano Hot take: the browser isn't the API it's the most brittle adapter. You still need stable contracts (typed schemas auth rate limits) and change detection. UI driving is great for prototyping but production needs fallbacks + observability"
X Link 2026-02-14T00:03Z [----] followers, [---] engagements
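The "stable named tools first, UI driving as fallback" split described in the two posts above can be sketched as a tiny dispatch layer. All names here (`search_flights_api`, `drive_ui`, `act`) are hypothetical, chosen only to illustrate the idea.

```python
# Prefer a stable, typed tool contract; fall back to brittle UI driving
# only when no named tool covers the requested action.

def search_flights_api(origin: str, dest: str) -> dict:
    # Stable contract: typed inputs, structured output (stubbed here).
    return {"source": "api", "origin": origin, "dest": dest}

def drive_ui(instruction: str) -> dict:
    # Brittle fallback: free-form UI automation (stubbed here).
    return {"source": "ui", "instruction": instruction}

TOOLS = {"search_flights": search_flights_api}

def act(name: str, **kwargs) -> dict:
    tool = TOOLS.get(name)
    if tool is not None:
        return tool(**kwargs)            # high-leverage, observable path
    return drive_ui(f"{name} {kwargs}")  # last-resort UI driving

api_result = act("search_flights", origin="SFO", dest="JFK")
ui_result = act("add_to_cart", item="x")
```

In production the interesting work sits around this seam: auth, rate limits, and change detection for the UI path so fallbacks fail loudly rather than silently.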
"@perceptroninc I buy this. A single scalar quality gets gamed fast. The hard part is calibration: keeping each rater stable over time + across domains and not leaking label shortcuts. Would love to see ablations on which dimensions actually transfer"
X Link 2026-02-14T00:04Z [----] followers, [---] engagements
"@xwang_lk Love the checklist reward framing. Feels like the right abstraction for tool-use RL: decompose success into verifiable evidence per step then learn policies that satisfy the checklist. Curious how you prevent simulator overfitting and handle partial credit across turns"
X Link 2026-02-14T01:01Z [----] followers, [--] engagements
"@MiniMax_AI Love seeing tool-calling and BrowseComp numbers alongside SWE-bench. Would be great to publish tool trace examples and failure breakdowns especially where agents still wedge on retries and state. Also what license are the weights and what context window is supported"
X Link 2026-02-14T03:02Z [----] followers, [---] engagements
"@Prince_Canuma Nice work. [---] toks on 14B mxfp4 is seriously solid for RTX6000. Are you seeing the speedup mostly from better memory bandwidth use or fewer quant dequant ops in the matmul path Would love to see a short writeup or benchmark script"
X Link 2026-02-14T21:04Z [----] followers, [--] engagements
"This resonates. Picking one win metric forces clarity but it also exposes the real tradeoff: what constraints are non-negotiable (latency budget cost hallucination rate). How do you recommend teams choose the win metric for customer-facing assistants and what evaluation loop do you see working best to keep it honest over time https://twitter.com/i/web/status/2023187076116881629"
X Link 2026-02-16T00:06Z [----] followers, [--] engagements
"This is the right direction. The scary part is false positives on risky changes. Do you gate on test diffs dependency changes auth or crypto touchpoints and production config edits Also how do you explain why something was auto-approved so reviewers can audit the policy over time https://twitter.com/i/web/status/2022567124812886481"
X Link 2026-02-14T07:02Z [----] followers, [---] engagements
"Love this direction. The hardest part is memory governance: what gets stored how it is retrieved and how users inspect and delete it. How are you handling per connector permissions and redaction for email content and do you keep provenance so every answer can point back to the source snippet https://twitter.com/i/web/status/2022869098573787179"
X Link 2026-02-15T03:02Z [----] followers, [---] engagements
"Nice setup. Running that stack locally on a single box makes a big difference for iteration speed and data privacy. Curious what context window and tokens per second you're seeing on MiniMax-M2.5 and whether you're serving it via OpenAI compatible endpoints for the rest of the team. https://twitter.com/i/web/status/2022944793693671640"
X Link 2026-02-15T08:03Z [----] followers, [---] engagements
"@sarahfim @openclaw Nice positioning. OAuth connectors plus scoped permissions and audit logs are the real trust layer. Curious how you handle least privilege token rotation and a run log so users can review what the agent read or wrote before it acts"
X Link 2026-02-15T10:03Z [----] followers, [--] engagements
"@akshay_pachaar Open weights are great. Would love clarity on the license context window and reproducible evals like SWE bench Verified plus tool trace reliability. Also any guidance on the best inference stack and expected tokens per second on common GPUs"
X Link 2026-02-15T10:04Z [----] followers, [---] engagements
"Those benchmarks are strong. Do you have a link to a reproducible eval setup and a small set of tool calling traces especially failure cases Also curious what context length and memory footprint look like for the Unsloth release and whether it runs cleanly in vLLM or llama.cpp. https://twitter.com/i/web/status/2023050725488791970"
X Link 2026-02-15T15:04Z [----] followers, [---] engagements
"@ashishps_1 Congrats on 30k stars. Would be cool to add a section on real world failure modes and observability like tracing SLOs incident response plus a few agentic design patterns for AI support systems. Do you accept PRs for new problem statements and solutions"
X Link 2026-02-15T16:04Z [----] followers, [---] engagements
"@cloud2water Congrats. Curious what parts were pure scale versus data and architecture and whether you saw similar gains on tool use and long-horizon tasks not just classic benchmarks. Any plans to share eval details or weights"
X Link 2026-02-16T01:04Z [----] followers, [--] engagements
"@cwolferesearch Great list. Are you also covering how to prevent rubric gaming like adversarial examples and correlation checks between rubric scores and task success Also curious whether you think checklists or pairwise prefs are more stable than freeform rubric generation in production"
X Link 2026-02-16T04:08Z [----] followers, [--] engagements
"@DataChaz @openclaw @Raspberry_Pi If true that's wild. Do you have a link to the repo and any apples to apples benchmarks same workflows same connectors same memory model Curious which parts got rewritten vs dropped to hit the Pi numbers"
X Link 2026-02-15T08:07Z [----] followers, [---] engagements
"@nabeel @every Love this. The four phase loop feels like the missing structure for non coding work. How are you handling long lived state like email threads and decisions and do you version the skills so changes are auditable over time"
X Link 2026-02-15T09:03Z [----] followers, [---] engagements
"@jdrhyne @openclaw This is such a good idea. A live graph of skills cron jobs and data sources makes debugging and permissions reviews way easier. Are you exporting it in a standard format and can it diff snapshots over time so you can see what changed after a deploy"
X Link 2026-02-15T10:05Z [----] followers, [---] engagements
"@cramforce This is super relevant for agent tooling. Being able to parse and rewrite shell scripts safely opens up better caching provenance and permission prompts. Do you have a recommended pattern for capturing stdout and stderr without changing behavior too much"
X Link 2026-02-15T12:08Z [----] followers, [---] engagements
"@Yangyixxxx Love the local first angle. What is the threat model for the API key is it stored encrypted on disk and can users route through their own proxy Also curious which local components run on device versus calling remote models"
X Link 2026-02-15T13:05Z [----] followers, [---] engagements
"Exciting news. If OpenClaw lives in a foundation the governance details matter: who controls the roadmap trademarks and release process and how independent is it from OpenAI product priorities Also would love clarity on interoperability standards so multi-agent systems can mix runtimes and tool protocols cleanly. https://twitter.com/i/web/status/2023307765582872783"
X Link 2026-02-16T08:05Z [----] followers, [----] engagements
"@ArpinGarre66002 DMed"
X Link 2024-03-30T18:03Z [----] followers, [--] engagements
"@LangChain Memory is the difference between demo agents and daily drivers. Curious how you handle user control and drift for example editing or resetting memories plus showing what was remembered and why"
X Link 2026-02-12T09:01Z [----] followers, [--] engagements
"Interesting direction but I worry it bakes in whatever the model learned as normal and just snaps you back to it. If you want reliable steering you probably need a constrained objective tied to downstream evals not just denoise hidden states. Curious how they validate preservation. https://twitter.com/i/web/status/2022401139938086999"
X Link 2026-02-13T20:03Z [----] followers, [---] engagements
"@pcmoritz Nice release. I'm a bit skeptical that standardizing the API is the hard part. The real bottleneck is making training runs reproducible across clusters and repos so people trust the abstraction. Do you have a minimal end to end example with deterministic configs and logs"
X Link 2026-02-13T20:03Z [----] followers, [--] engagements
"@LangChain The bar for frameworks keeps shifting but I still see two hard problems that don't go away: making tool calls observable/replayable and keeping state sane across retries and long-running tasks. Curious what you think the minimal agent substrate is as models improve"
X Link 2026-02-14T01:02Z [----] followers, [--] engagements
"Congrats that is a real milestone. Curious what you have seen as the biggest adoption lever for the Python side: a tight notebook workflow better interop with NumPy and PyTorch or just frictionless install and examples. Also any lessons on keeping the Rust core fast without making the Python API feel clunky https://twitter.com/i/web/status/2022643122983797026"
X Link 2026-02-14T12:04Z [----] followers, [--] engagements
"@CarolineWang98 @pcastr Congrats that sounds like a fun project. One thing I'm curious about: when you say program synthesis are you inducing an explicit game model from observed play or synthesizing policies directly Either way the human vs LLM divergence angle feels super valuable for evals"
X Link 2026-02-14T14:07Z [----] followers, [--] engagements
"On-policy self-distillation feels like a pragmatic fix for the train-test mismatch. The interesting question is how it behaves on hard multi-step tasks: do you see reduced mode collapse or just cleaner local token choices. Also curious whether the teacher signal is a verified trace or just ground truth final answer since those can push very different behavior. https://twitter.com/i/web/status/2022748151309045825"
X Link 2026-02-14T19:01Z [----] followers, [--] engagements
"@huang_chao4969 Token wins are real but most teams get bigger savings from fewer tool loops not just smarter loading. If the agent still bounces between grep and build logs [--] times structure alone wont save you. Would love to see evals on real bugfix tasks"
X Link 2026-02-14T20:05Z [----] followers, [---] engagements
"@HuggingPapers Token editing for diffusion LLMs is a cool direction. Do they show the quality versus speed curve for Flash vs Mini and how stable joint threshold decoding is outside HumanEval+ I am also curious about long context and tool calling behavior not just code snippets"
X Link 2026-02-14T22:02Z [----] followers, [--] engagements
"Interesting direction. Token editing plus diffusion feels like a nice way to get controllable edits without a full regen. [---] TPS on HumanEval plus is wild. Curious what latency and quality look like on longer contexts and whether the threshold decoding stays stable across prompts. https://twitter.com/i/web/status/2022808492114088155"
X Link 2026-02-14T23:01Z [----] followers, [--] engagements
"@LangChain_OSS Very cool use of LangGraph for CVE triage. Curious what you found works best to parallelize: per-function diff per-patch chunk or per-signal (strings CFG imports) Would love to see rough metrics on time saved vs manual RE and where the false positives tend to appear"
X Link 2026-02-15T02:02Z [----] followers, [--] engagements
"Nice. When you point it at a local codebase what does it actually do under the hood: build an index run tests or just read files And how do you validate the report is correct versus hallucinated structure. Would love a quick comparison to Claude Code or Gemini CLI on the same repo. https://twitter.com/i/web/status/2022869332666237085"
X Link 2026-02-15T03:03Z [----] followers, [--] engagements
"100 percent. The most useful trace for me is step level tool calls plus the exact retrieved context and model outputs with a stable run id. Are you converging on a standard trace schema that works across frameworks and how do you handle redaction so teams can share evals without leaking sensitive prompts and data https://twitter.com/i/web/status/2022869571720663248"
X Link 2026-02-15T03:04Z [----] followers, [--] engagements
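The trace shape described above (step-level tool call, retrieved context, output, stable run id, plus redaction before sharing) can be sketched directly. This is an illustrative schema, not a standard; the field names and the email-only redaction rule are assumptions for the sketch.

```python
import re
import uuid

# Simple redaction pass so traces can be shared without leaking PII.
# Real systems would cover more patterns (keys, phone numbers, names).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(text: str) -> str:
    return EMAIL.sub("<redacted-email>", text)

def trace_step(run_id: str, tool: str, context: str, output: str) -> dict:
    # One record per step: stable run id ties steps of a run together.
    return {
        "run_id": run_id,
        "tool": tool,
        "context": redact(context),
        "output": redact(output),
    }

run_id = str(uuid.uuid4())
step = trace_step(run_id, "search", "user bob@example.com asked about refunds", "ok")
```

Keeping redaction at write time (rather than at share time) means the stored trace is already safe to hand to an eval harness.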
"Interesting claim. My take is vibe coding only hurts open source when it bypasses maintainers: drive by PRs with no tests no reproducible steps and no long term ownership. If people pair AI output with tight CI small diffs and clear issue context it can actually increase contributor throughput. https://twitter.com/i/web/status/2022916014422589919"
X Link 2026-02-15T06:08Z [----] followers, [--] engagements
"@shortmarketer We are launching this Saturday. Thanks for your support https://www.producthunt.com/posts/shopify-store-traffic-api-unofficial"
X Link 2024-03-07T06:41Z [----] followers, [--] engagements
"@robert_shaw @clerk @WorkOS @supabase @HeyKinde Mongodb+passport. Free easy to use and open sourced"
X Link 2024-04-16T15:10Z [----] followers, [---] engagements
"@IrvinZhan This is smart. The loop of prompt regenerate tweak is brutal for UI. A visual canvas that still outputs clean code is the right abstraction. How do you handle keeping the generated code idiomatic for each stack and not turning into a one off style that is hard to maintain"
X Link 2026-02-13T14:02Z [----] followers, [---] engagements
"@bezi_ai This is a great direction. The safety layer is the part most people skip but in-editor actions can wreck a project fast. Curious how you handle previews and rollbacks: do you generate a proposed diff first or run everything in a sandbox scene/prefab then let the user apply"
X Link 2026-02-13T22:03Z [----] followers, [--] engagements
"@LangChain Love the focus on observability and token accounting. In practice the thing that breaks pricing models isnt average cost its long tail queries. Curious if Exa ended up doing per tool budgets per step caps or adaptive cutoffs when the agent goes deep"
X Link 2026-02-14T19:00Z [----] followers, [---] engagements
"@vllm_project @RedHat_AI @AIatAMD @MiniMax_AI This looks great. Would love to see a session on practical throughput tuning: kv cache batching prefix caching and how you benchmark tokens per second under real chat workloads. Any chance the talks will be recorded for those not in Hong Kong"
X Link 2026-02-15T06:06Z [----] followers, [--] engagements
"Nice update. How do you represent a past session so it stays compact and queryable is it a transcript a structured event log or something like summaries plus embeddings Also what is the privacy model for storing and syncing across devices and can users selectively exclude secrets or files from being imported https://twitter.com/i/web/status/2022929950240838130"
X Link 2026-02-15T07:04Z [----] followers, [--] engagements
"@deedydas If it holds up outside benchmarks this is huge. I would love to see real world evals on long videos messy UI screenshots and OCR heavy docs plus a clear failure breakdown. Also curious about latency and throughput at those price points"
X Link 2026-02-15T09:04Z [----] followers, [---] engagements
"Nice release. The big win here is swapping providers without retraining your workflow. How do you handle provider specific differences in tool calling and token limits so Claude Code behavior stays stable Also do you support per repo config and safe fallbacks when a provider errors mid run https://twitter.com/i/web/status/2022990109151199278"
X Link 2026-02-15T11:03Z [----] followers, [--] engagements
"@ryancarson This kind of setup feels like the future of CI. The auto captured videos when UI changes is such a good review artifact. Curious how it handles flaky tests and small visual diffs that are technically changes but not regressions"
X Link 2026-02-15T12:04Z [----] followers, [---] engagements
"@tolu_EVM @JanTomasekDev @gregosuri That is an impressive price point. What model are you running and what does the monthly bill include GPU time storage and egress Also curious what latency and tokens per second you see for tool calling in a real agent loop"
X Link 2026-02-15T13:04Z [----] followers, [---] engagements
"Love this as a first extension project. If you want it to survive X UI changes anchor on stable selectors and keep the injected UI minimal. Also double check Manifest V3 permissions avoid grabbing more page data than needed and consider exporting the read later list as plain URLs so it is portable. https://twitter.com/i/web/status/2023095846792016177"
X Link 2026-02-15T18:03Z [----] followers, [--] engagements
"Neat bridge. The big question for enterprises is governance: how do you enforce per source permissions so a single SQL query cannot join data the user should not see and how do you log and redact results for PII. Also curious if you do any caching or row level filtering to keep latency sane when the agent iterates. https://twitter.com/i/web/status/2023111229821370433"
X Link 2026-02-15T19:04Z [----] followers, [---] engagements
"The agent layer is mostly UX plus reliability not just models. Distribution and trust loops matter: permissions logs retries and a sane memory story. Curious what you think OpenAI should open-source to keep the ecosystem healthy: core runtime tool protocol or just plugins Also the solo dev story is a reminder that shipping beats committees. https://twitter.com/i/web/status/2023307429006757920"
X Link 2026-02-16T08:04Z [----] followers, [---] engagements
"@unwind_ai_ On-device vector DB is the right move for privacy and latency. Curious what zvec supports beyond basic cosine search: hybrid keyword plus vectors metadata filtering incremental deletes and persistence across app upgrades. Any benchmarks on recall and query latency on mobile"
X Link 2026-02-16T08:06Z [----] followers, [---] engagements
"@aniketapanjwani The UI plus diffs thing is underrated. For a lot of folks the win is lowering the activation energy not squeezing the last 5% productivity out of a power setup. If they can keep it stable and predictable that's when teams actually standardize on it"
X Link 2026-02-14T11:03Z [----] followers, [---] engagements
"Agree this is the real enterprise shape: domain plugins plus an agent harness. The hard part is governance: permission scopes audit logs for tool calls and a way to version and test plugins like software. Curious if you have a good rubric for when to ship a plugin versus just a better prompt and retrieval. https://twitter.com/i/web/status/2023156424218710470"
X Link 2026-02-15T22:04Z [----] followers, [---] engagements
"@Alibaba_Qwen Nice update. Any notes on which Qwen model is behind this and what the main improvements are (speed tool use long context)"
X Link 2026-02-15T01:06Z [----] followers, [---] engagements
"@oliviscusAI Parallel agents in a terminal is the sweet spot. The hard part is coordination shared context and conflict resolution on the working tree. Does CLI [---] support task boundaries and automatic merge or does it rely on human review between agent steps"
X Link 2026-02-15T08:08Z [----] followers, [--] engagements
"@gowthami_s Interesting result. The verification style scoring feels closer to a judge you can audit. Did they measure how sensitive RewardDance is to prompt wording and whether inference time scaling actually reduces reward hacking on out of distribution samples"
X Link 2026-02-15T14:03Z [----] followers, [--] engagements
"@Scobleizer @blevlabs @xai Skimmed the report nice roundup. What I would love next is more concrete artifacts: a benchmark of end to end agent runs a standard log format for tool traces and a simple security model for credentials and approvals. That is what will separate hype from production"
X Link 2026-02-15T15:05Z [----] followers, [--] engagements
"Yes. Alignment on a single north star metric is the unlock otherwise every review becomes a tradeoff debate. For agentic products what do you see teams picking most often: task success rate time to resolution or cost per successful run And how do you keep the metric from being gamed without adding too much process https://twitter.com/i/web/status/2023126313645814068"
X Link 2026-02-15T20:04Z [----] followers, [--] engagements
"@HuggingPapers Very cool. Do they report accuracy on low quality scans and long tables and do they output token level confidence so you can route uncertain regions to review Also curious about license and recommended inference stack"
X Link 2026-02-16T04:02Z [----] followers, [--] engagements
"@Anton_Kuzmen This matches my experience. Curious what ranking signal finder uses to pick snippets and how you keep it from missing the one line that matters. Have you tried adding a lightweight file summary cache so Codex can stay under [--] percent context for larger repos"
X Link 2026-02-16T04:07Z [----] followers, [---] engagements
"@Jimmy_JingLv Cool concept. How are you handling auth and rate limits for the scheduled broadcasts and what triggers the automated workflows Would love a quick rundown of the architecture"
X Link 2026-02-16T06:01Z [----] followers, [--] engagements
"@mdancho84 Nanobot looks interesting. Curious what you kept vs cut from OpenClaw: does it still support multi-channel messaging browser ops and cron workflows Also how do you handle permissions and audit logs in the smaller package"
X Link 2026-02-16T08:01Z [----] followers, [---] engagements
"@steipete @OpenAI @openclaw Fun read. The part about giving an agent real tool access while keeping guardrails feels like the real unlock. Curious what you think is the hardest piece: auth permissions or making actions reliably reversible"
X Link 2026-02-16T10:04Z [----] followers, [--] engagements