# ![@zhuokaiz Avatar](https://lunarcrush.com/gi/w:26/cr:twitter::1777066011579002880.png) @zhuokaiz Zhuokai Zhao

Zhuokai Zhao posts on X most often about model, llm, claude code, and code. They currently have [-----] followers and [--] posts still getting attention, totaling [-------] engagements in the last [--] hours.

### Engagements: [-------] [#](/creator/twitter::1777066011579002880/interactions)
![Engagements Line Chart](https://lunarcrush.com/gi/w:600/cr:twitter::1777066011579002880/c:line/m:interactions.svg)

- [--] Week [---------] +13,298%
- [--] Month [---------] +5,087%
- [--] Months [---------] +1,726,167%
- [--] Year [---------] +387,680%

### Mentions: [--] [#](/creator/twitter::1777066011579002880/posts_active)
![Mentions Line Chart](https://lunarcrush.com/gi/w:600/cr:twitter::1777066011579002880/c:line/m:posts_active.svg)

- [--] Week [--] +150%
- [--] Month [--] +413%
- [--] Months [--] +2,450%
- [--] Year [--] +5,200%

### Followers: [-----] [#](/creator/twitter::1777066011579002880/followers)
![Followers Line Chart](https://lunarcrush.com/gi/w:600/cr:twitter::1777066011579002880/c:line/m:followers.svg)

- [--] Week [-----] +15%
- [--] Month [-----] +20%
- [--] Months [-----] +249%
- [--] Year [-----] +302%

### CreatorRank: [-------] [#](/creator/twitter::1777066011579002880/influencer_rank)
![CreatorRank Line Chart](https://lunarcrush.com/gi/w:600/cr:twitter::1777066011579002880/c:line/m:influencer_rank.svg)

### Social Influence

**Social category influence**
[technology brands](/list/technology-brands)  [finance](/list/finance) 

**Social topic influence**
[model](/topic/model) #1533, [llm](/topic/llm), [claude code](/topic/claude-code), [code](/topic/code), [ai](/topic/ai), [agentic](/topic/agentic) #454, [systems](/topic/systems), [agents](/topic/agents), [attention](/topic/attention), [complex](/topic/complex)

**Top accounts mentioned or mentioned by**
[@teortaxestex](/creator/undefined) [@zrchenaisafety](/creator/undefined) [@clwdbot](/creator/undefined) [@zaiorg](/creator/undefined) [@zaiorgs](/creator/undefined) [@bolinfest](/creator/undefined) [@yanlaiyang](/creator/undefined) [@mengyer](/creator/undefined) [@luofuli](/creator/undefined) [@easyldur](/creator/undefined) [@sneycampos](/creator/undefined) [@bariskayhan1](/creator/undefined) [@ratebtm](/creator/undefined) [@liljamesjohn](/creator/undefined) [@albertobonsanto](/creator/undefined) [@askcodi](/creator/undefined) [@makerinparadise](/creator/undefined) [@zaicodinghelper](/creator/undefined) [@hmemcpy](/creator/undefined) [@nikosbaxevanis](/creator/undefined)
### Top Social Posts
Top posts by engagements in the last [--] hours

"We (@chaoqi_w @yibophd @ZRChen_AISafety) have been eager to share our latest work on battling reward hacking since last November but had to wait for the legal team's approval. Finally we're excited to release: Causal Reward Modeling (CRM) CRM tackles spurious correlations and mitigates reward hacking in RLHF by integrating causal inference and enforcing counterfactual invariance. It addresses biases like length concept sycophancy and discrimination enabling more trustworthy and fair alignment of LLMs with human preferences. Check it out here: #AI #LLM #RLHF #MachineLearning #AIAlignment"  
[X Link](https://x.com/zhuokaiz/status/1880279562652717460)  2025-01-17T15:42Z [---] followers, [----] engagements


"My high-level take on why multimodal reasoning is fundamentally harder than text-only reasoning: Language is structured and directional while images are inherently unstructuredyou can start reasoning from anywhere. This visual freedom makes step-by-step logical inference much harder. Building on this insight we are excited to share our spotlight paper Autonomous Multimodal Reasoning via Implicit Chain-of-Vision (ICoV) at #CVPR2025 Multimodal Algorithmic Reasoning Workshop. ICoV presents an experimental finetuning framework that guides large vision-language models on where to look and how to"  
[X Link](https://x.com/zhuokaiz/status/1932929195115454486)  2025-06-11T22:33Z [---] followers, [----] engagements


"Happy to share that DYTO has been accepted to #ICCV2025 as a new SOTA in video understanding Try it out in your experiments and see how it measures up Paper: Code: #MultimodalAI #MultimodalLLM #LargeLanguageModel #LLM #VisionLanguageModel #VLM #LVLM #LargeVisionLanguageModel #AIReasoning #ComputerVision #MachineLearning #DeepLearning #AIResearch #ICCV2025 #ICCV #CVPR https://github.com/Jam1ezhang/DYTO https://arxiv.org/abs/2411.14401 https://github.com/Jam1ezhang/DYTO https://arxiv.org/abs/2411.14401 We've ( @Yiming1254115 @ZRChen_AISafety ) been wondering why so many existing video"  
[X Link](https://x.com/zhuokaiz/status/1938290297609621655)  2025-06-26T17:36Z [---] followers, [---] engagements


"Counting down one week to #ICCV2025 I will be attending in-person to present our two papers (Hawaii local time): πŸ”Ή Tuesday Oct [--] 11:45am - 1:45pm Exhibit Hall I #360 RANKCLIP: Ranking-Consistent Language-Image Pretraining πŸ”Ή Thursday Oct [--] 11:15am - 1:15pm Exhibit Hall I #2032 Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding Excited to connect and exchange ideas with everyone in Honolulu #ComputerVision #AIResearch #MultimodalLearning #VideoUnderstanding"  
[X Link](https://x.com/zhuokaiz/status/1977742256523108863)  2025-10-13T14:24Z [---] followers, [---] engagements


"Agent training can be viewed in many ways: learning from world-model interactions (or experience) is one; but in practice especially for multi-agent co-evolving the models + framework just works. Thats Mixture-of-Minds. With a few specialized small models and a simple framework we reach #1 on TableBench ahead of o4-mini-high & gemini-2.5-pro. Check out our recipe here: #AI #LLM #Agents #MultiAgent #MixtureOfMinds #TableBench #SOTA #Benchmarking #Evaluation https://arxiv.org/pdf/2510.20176 https://tablebench.github.io/ ❓Can LLMs truly understand tables We explore this question in our paper:"  
[X Link](https://x.com/zhuokaiz/status/1984311391663096182)  2025-10-31T17:27Z [---] followers, [---] engagements


"Quite interesting work. Actually during summer we @juliancodaforno also explored in this direction where we injected latents of one LLM into the base LLM's KV-cache: https://arxiv.org/pdf/2510.00494 Wow language models can talk without words. A new framework Cache-to-Cache (C2C) lets multiple LLMs communicate directly through their KV-caches instead of text transferring deep semantics without token-by-token generation. It fuses cache representations via a neural https://t.co/r09jqug2Ig https://arxiv.org/pdf/2510.00494 Wow language models can talk without words. A new framework Cache-to-Cache"  
[X Link](https://x.com/zhuokaiz/status/1985601541735055744)  2025-11-04T06:54Z [---] followers, [----] engagements
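
The mechanics are easier to see in code. Below is a minimal sketch of the KV-cache injection idea, assuming HF-style cache tensors of shape (batch, heads, seq, head_dim); the `KVFuser` module, its gated blending rule, and all dimensions are hypothetical illustrations, not the recipe from either paper.

```python
import torch
import torch.nn as nn

class KVFuser(nn.Module):
    """Hypothetical C2C-style fuser: project the sender model's keys (or
    values) into the receiver's cache space, then blend with a learned gate."""
    def __init__(self, head_dim):
        super().__init__()
        self.proj = nn.Linear(head_dim, head_dim)
        self.gate = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 blend

    def forward(self, kv_receiver, kv_sender):
        g = torch.sigmoid(self.gate)
        return (1 - g) * kv_receiver + g * self.proj(kv_sender)

# Toy usage: fuse one layer's keys; the receiver then attends over k_fused.
B, H, T, D = 1, 8, 16, 64
fuser = KVFuser(D)
k_fused = fuser(torch.randn(B, H, T, D), torch.randn(B, H, T, D))
print(k_fused.shape)  # torch.Size([1, 8, 16, 64])
```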


"That time of the year again looking for a Summer [--] research intern If LLMs multi-agent systems and latent-space reasoning excite you dm/email me and let's talk. #AI #LLM #Agents #Internship #Meta"  
[X Link](https://x.com/zhuokaiz/status/1989216553683026368)  2025-11-14T06:19Z [----] followers, 28.7K engagements


"Too late to join the #openreview bug party but honestly probably too lazy to check who my reviewers were anywayπŸ˜‚. Good or bad it is what it is. That said I am genuinely grateful for my #NeurIPS2025 reviewers and I can't wait to be in San Diego next week to meet more incredible people and present our four papers: [--]. Thought Communication: [--]. Dropout Decoding: [--]. S'MoRE: [--]. MJ-Bench: See you all next week And Happy Thanksgiving 🍁 https://openreview.net/pdfid=woQKlen8EI https://openreview.net/pdfid=LbNL8xGai2 https://openreview.net/pdfid=LAflniLUwx https://openreview.net/pdfid=tq9lyV9Cml"  
[X Link](https://x.com/zhuokaiz/status/1994126409837748344)  2025-11-27T19:29Z [----] followers, 13.6K engagements


"First night and morning in San Diego. South Cal winter is so much more likable than summer"  
[X Link](https://x.com/zhuokaiz/status/1995881039517110758)  2025-12-02T15:41Z [----] followers, [----] engagements


"Thought this was just another long-context paper with a fancy attention name. But actually this one feels pretty legit. The core idea is simple tho most chunk-based sparse attention methods rely on heuristics (recency fixed block patterns or non-learned similarity etc.) so retrieval accuracy drops fast once you go beyond the training context length. Hierarchical Sparse Attention (HSA) fixes this by making chunk retrieval fully learnable. imo it's like MoE for memory where each token routes attention to a small set of relevant chunks and fuses their outputs using learned retrieval scores"  
[X Link](https://x.com/zhuokaiz/status/2003169200345428106)  2025-12-22T18:22Z [----] followers, 19.6K engagements
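
Read as pseudocode, the "MoE for memory" analogy might look like the sketch below: each query scores learned chunk summaries, routes to its top-k chunks, and fuses per-chunk attention outputs with the retrieval weights. Shapes, the summary vectors, and the fusion rule are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

# queries, chunks, chunk size, model dim, top-k routed chunks per query
T, C, S, D, K = 4, 8, 16, 32, 2

q = torch.randn(T, D)
chunk_keys = torch.randn(C, D)        # learned per-chunk summary vectors
kv = torch.randn(C, S, D)             # per-chunk keys (doubling as values)

scores = q @ chunk_keys.T             # (T, C) retrieval logits
topv, topi = scores.topk(K, dim=-1)   # route each query to its K chunks
w = F.softmax(topv, dim=-1)           # learned retrieval weights

out = torch.zeros(T, D)
for t in range(T):
    for j in range(K):
        blk = kv[topi[t, j]]                              # (S, D)
        attn = F.softmax(q[t] @ blk.T / D ** 0.5, dim=-1)  # local attention
        out[t] += w[t, j] * (attn @ blk)                  # fuse chunk outputs
print(out.shape)  # torch.Size([4, 32])
```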


"Apparently no new model this year but On Dec [--] [----] DeepSeek released "mHC: Manifold-Constrained Hyper-Connections" which proposes a new residual connection design that fixes the instability and scaling issues of Hyper-Connections (HC) at large model sizes. A bit of history first πŸ˜‚. Hyper-Connections were not invented by DeepSeek. They actually came from the ByteDance Seed in late [----] (later accepted to ICLR 2025) as a generalization of residual connections. Why even touch the residual connections This goes back to a classic deep learning problem: as depth increases signals and gradients"  
[X Link](https://x.com/zhuokaiz/status/2007154163742830922)  2026-01-02T18:16Z [----] followers, 14.3K engagements
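
For readers new to Hyper-Connections, here is a rough sketch of the underlying generalization: instead of one residual stream, a layer keeps n parallel streams and mixes them with learnable weights around each block. The parameterization below is a toy illustration and does not implement mHC's manifold constraint.

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Sketch of the hyper-connection idea: n parallel residual streams
    with learnable read/mix/write weights around each block. A plain
    residual is the special case n=1 with identity mixing."""
    def __init__(self, n, dim):
        super().__init__()
        self.mix = nn.Parameter(torch.eye(n))       # stream-mixing matrix
        self.inp = nn.Parameter(torch.ones(n) / n)  # read weights
        self.out = nn.Parameter(torch.ones(n) / n)  # write weights
        self.block = nn.Linear(dim, dim)            # stand-in for attn/FFN

    def forward(self, streams):                     # (n, batch, dim)
        x = torch.einsum('n,nbd->bd', self.inp, streams)   # read input
        y = self.block(x)
        streams = torch.einsum('mn,nbd->mbd', self.mix, streams)
        return streams + self.out[:, None, None] * y       # write back

hc = HyperConnection(n=4, dim=32)
print(hc(torch.randn(4, 2, 32)).shape)  # torch.Size([4, 2, 32])
```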


"Meta TBD Lab CMU UChicago UMaryland In our latest work we introduce Token-Level LLM Collaboration via FusionRoute πŸ“: LLMs have come a long way but we continue to face the same trade-off: one huge model that kind of does everything but is expensive and inefficient or many small specialist models that are cheap but brittle outside their comfort zones Weve tried a lot of things in between model merging MoE sequence-level agents token-level routing controlled decoding etc. Each helps a bit but all come with real limitations. A key realization behind FusionRoute is: Pure token-level model"  
[X Link](https://x.com/anyuser/status/2009470079126151452)  2026-01-09T03:39Z [----] followers, 35.1K engagements


"Big shout-out to my awesome intern Nuoya Xiong (@XiongNuoya69368 And huge thanks to Lizhu (@LizhuZhang ) Shuchao (@shuchaobi ) Furong (@furongh ) Yuhang (@YuhangZhou2 ) Hanqing (@zimplex4 ) and Zhaorun (@ZRChen_AISafety ) for all the great discussions and ideas. https://xiongny.github.io/ https://xiongny.github.io/ https://xiongny.github.io/ https://xiongny.github.io/"  
[X Link](https://x.com/zhuokaiz/status/2009470354654167407)  2026-01-09T03:40Z [----] followers, [---] engagements


"Every major compute shift created new business models: PC software licenses & productivity tools Graphics gaming VFX creator economies Internet SaaS marketplaces cloud Mobile apps in-app payments on-demand services Each wave didn't just scale existing business models it unlocked fundamentally new ones aligned with its technology. AI never felt like another distribution surface (at least to me). So it's disappointing to see its monetization get pull back toward ads. In the coming weeks we plan to start testing ads in ChatGPT free and Go tiers. Were sharing our principles early on how well"  
[X Link](https://x.com/zhuokaiz/status/2012295613505802613)  2026-01-16T22:47Z [----] followers, [----] engagements


"Recent rapid launches from both Cursor and Claude Code makes me realize that they actually have been taking quite different paths toward making long-running AI agents actually work. Heres whats going on. Cursor: scale horizontally with many agents Cursors core belief is that a single agent already performs well on small well-scoped tasks. The real challenge appears when the task turns into a bigger project thousands of files unclear boundaries evolving goals. Their answer is to scale horizontally: run many agents in parallel decompose work aggressively let different agents specialize Early"  
[X Link](https://x.com/anyuser/status/2014524734768062949)  2026-01-23T02:24Z [----] followers, [----] engagements


"Lately I've been thinking JEPA might be more important / interesting / promising than it looks (even though I've somehow never actually met or talked with lecun at Meta NYC). At a high level JEPA is really about latent-space (often multimodal) representation learning which honestly feels more fundamental and elegant than the usual encoder + LLM pipeline. It naturally shifts the focus away from surface tokens and toward modeling the underlying structure of the world. Framed this way the key question isn't autoregressive vs non-autoregressive it's where learning happens: in input space or in"  
[X Link](https://x.com/zhuokaiz/status/2008616151773774034)  2026-01-06T19:06Z [----] followers, 14.6K engagements
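
A bare-bones rendering of that latent-space framing, in the spirit of I-JEPA: encode a context view, predict the latent of a target view, and keep gradients out of the target encoder (the usual EMA update is omitted for brevity). The linear encoders below are stand-ins, purely illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Linear(128, 64)   # context encoder (trained)
tgt = nn.Linear(128, 64)   # target encoder (frozen / EMA in practice)
pred = nn.Linear(64, 64)   # predictor operating in latent space

ctx, masked = torch.randn(8, 128), torch.randn(8, 128)
with torch.no_grad():
    z_tgt = tgt(masked)                       # no gradient to the target
loss = F.mse_loss(pred(enc(ctx)), z_tgt)      # learning happens in latents,
loss.backward()                               # never in input/token space
```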


"OpenAI @bolinfest just (surprisingly) released details about Codex Its philosophy feels quite clear: long-running agents fail because the execution loop breaks down. So instead of starting from more agents (Cursor) or better memory management (Claude Code) Codex focuses on agent runtime with: a clean repeatable agent loop structured inputs/outputs tool calls as structured actions (not just text) aggressive context / performance management so the loop stays fast stateless-by-default (no hidden server-side memory) If Cursor is organization design and Claude is engineering discipline Codex is"  
[X Link](https://x.com/anyuser/status/2014951810373431471)  2026-01-24T06:42Z [----] followers, [----] engagements


"For some reason this latest LLM test-time training (TTT) paper reads a lot like a well-written paper from the pre-LLM era where you can actually feel what the authors are thinking instead of many post-LLM papers which may feel hollow at times. The core idea behind what they call TTT-E2E is simple: at test time the model keeps doing next-token prediction on the context and updates its own weights so it writes useful info from the prefix directly into parameters. (They still use sliding-window attention for local context; the long-range part comes from weight updates.) One practical detail that"  
[X Link](https://x.com/anyuser/status/2017450185933439063)  2026-01-31T04:09Z [----] followers, 24.5K engagements
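
The core loop is simple enough to sketch. Assuming an HF-style causal LM whose forward returns `.logits` and any standard optimizer, a test-time prefill under this idea might look like the following; the chunk size and update schedule are placeholders, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

def ttt_prefill(model, optimizer, context_ids, chunk=512):
    """Sketch of the TTT-E2E idea described above: at test time, keep
    doing next-token prediction on the prefix and update the weights,
    writing long-range context into parameters."""
    model.train()
    for i in range(0, context_ids.size(1) - 1, chunk):
        ids = context_ids[:, i:i + chunk + 1]
        logits = model(ids[:, :-1]).logits          # HF-style output
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    model.eval()  # decode as usual; prefix info now lives in the weights
```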


"Meta-learning as a term is old sure but the contribution here isnt just they discovered meta-loss. Its showing test-time weight updates can actually serve as a long-context mechanism for LMs (128K) with constant prefill cost and near/full-attn quality plus the concrete recipe to make it stable (what to update chunked updates decode integration). If this were nothing new dynamic eval wouldve already done this. https://twitter.com/i/web/status/2017497945772593201 https://twitter.com/i/web/status/2017497945772593201"  
[X Link](https://x.com/zhuokaiz/status/2017497945772593201)  2026-01-31T07:19Z [----] followers, [---] engagements


"Sparse attention has existed for years. However: [--]. selecting the right tokens is hard [--]. reducing compute doesnt necessarily reduce KV cache memory (because many methods skip attention computation but still store most keys and values and in real systems KV memory is more often the bottleneck than flops) Most methods historically solve one but not the other. Some approaches preserve information well but still store large KV caches or rely on retrieval over growing memory so scalability remains limited. Others aggressively compress memory but use simple heuristics (e.g. fixed strides / block"  
[X Link](https://x.com/anyuser/status/2020913672927772709)  2026-02-09T17:32Z [----] followers, [----] engagements
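
The second point is worth making concrete: a back-of-envelope helper shows that skipping attention FLOPs leaves the KV cache untouched unless keys/values are actually dropped. All model dimensions below are placeholders.

```python
# KV cache size: keys and values for every layer, head, and stored token.
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per  # K and V

full = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=128_000)
# A sparse method that attends to 10% of tokens but still stores all of
# them saves FLOPs, not memory:
sparse_compute_only = full                    # memory unchanged
print(f"{full / 2**30:.1f} GiB either way")   # 15.6 GiB either way
```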


"@easyldur Not sure if they plan to work with codex. It currently only supports claude code opencode crush goose and cursor"  
[X Link](https://x.com/zhuokaiz/status/2021709235285856637)  2026-02-11T22:13Z [----] followers, [----] engagements


"@sneycampos The problem with claude code is that the $20 tier lasts only about [--] mins of coding (for me at least) inside the 5-hour cool down window πŸ˜… So it is really not usable"  
[X Link](https://x.com/zhuokaiz/status/2021716649527652628)  2026-02-11T22:43Z [----] followers, [---] engagements


"@bariskayhan1 Actually you are right I didn't notice this Unload Configuration option (my bad). Clicking on this removes the settings from .claude/settings.json"  
[X Link](https://x.com/zhuokaiz/status/2021720269463990585)  2026-02-11T22:57Z [----] followers, [----] engagements


"@clwdbot After all its ban towards its competitors Im honestly surprised they allow this lol"  
[X Link](https://x.com/zhuokaiz/status/2021753202287448527)  2026-02-12T01:08Z [----] followers, [----] engagements


"@Marcia_Ong Considering its the first day of the release slower speed is kinda expected. I wouldnt draw the conclusion right now 🀣"  
[X Link](https://x.com/zhuokaiz/status/2021760591879086115)  2026-02-12T01:37Z [----] followers, [----] engagements


"@e73_da What exam is it lol"  
[X Link](https://x.com/zhuokaiz/status/2021760727913226519)  2026-02-12T01:38Z [----] followers, [--] engagements


"@RatebTM I have never used kimi. I do use antigravity (with gemini [--] pro) a lot but I think different scaffolds (in this case antigravity and claude code) give you very different experience (even antigravity+claude does not feel the same as claude code)"  
[X Link](https://x.com/zhuokaiz/status/2021793563391017409)  2026-02-12T03:48Z [----] followers, [---] engagements


"@liljamesjohn I later realized that I actually missed this Unload Configuration option (my bad). Clicking on this removes the settings from .claude/settings.json"  
[X Link](https://x.com/zhuokaiz/status/2021815466583269850)  2026-02-12T05:15Z [----] followers, [----] engagements


"@AlbertoBonsanto Yeah they have a mapping to their models but its not shown in the UI I think"  
[X Link](https://x.com/zhuokaiz/status/2021940789824434292)  2026-02-12T13:33Z [----] followers, [---] engagements


"@askcodi I later realized that I actually missed this Unload Configuration option (my bad). Clicking on this removes the settings from .claude/settings.json"  
[X Link](https://x.com/zhuokaiz/status/2021940940244656267)  2026-02-12T13:34Z [----] followers, [---] engagements


"@MakerInParadise Yes youre right I later realized that I actually missed this Unload Configuration option (my bad). Clicking on this removes the settings from .claude/settings.json"  
[X Link](https://x.com/zhuokaiz/status/2021941048864649484)  2026-02-12T13:34Z [----] followers, [---] engagements


"1/10 the cost of claude code. Plus tier at $200/year is simply crazy. I literally thought it was per month. Open /.claude/settings.json and πŸ‘‡ Introducing M2.5 an open-source frontier model designed for real-world productivity. - SOTA performance at coding (SWE-Bench Verified 80.2%) search (BrowseComp 76.3%) agentic tool-calling (BFCL 76.8%) & office work. - Optimized for efficient execution 37% faster at complex https://t.co/UwiKzzQNG8 Introducing M2.5 an open-source frontier model designed for real-world productivity. - SOTA performance at coding (SWE-Bench Verified 80.2%) search"  
[X Link](https://x.com/anyuser/status/2022180179997610210)  2026-02-13T05:24Z [----] followers, [----] engagements


"The frontier exploration of LLM architectures has largely converged. I dug through the HuggingFace transformers code for @Zai_org's newly released GLM-5 (zai-org/GLM-5). Here's a detailed architectural breakdown and what it tells us about where LLM design is heading. TL;DR: Architecturally GLM-5 closely follows DeepSeek-V3 with minor knob-tuning. ATTENTION: MLA replaces GQA The biggest change from GLM-4.7 to GLM-5 is attention. GLM-4.7 used standard Grouped Query Attention (GQA) with [--] Q heads [--] KV heads separate q/k/v projections. GLM-5 scraps all of that and adopts DeepSeek's Multi-head"  
[X Link](https://x.com/zhuokaiz/status/2022711712990785564)  2026-02-14T16:37Z [----] followers, [--] engagements


"@teortaxesTex That makes sense. I simply don't want others to think Zhipu attributes its design similarities as "convergence". On the other hand tbh I don't think this is a bad thing. There are too many other variants that make a model unique and we shouldn't ignore Zhipu's efforts on those"  
[X Link](https://x.com/zhuokaiz/status/2022739075246219623)  2026-02-14T18:25Z [----] followers, [---] engagements


"This is a HUGE win for developers. Claude Code is excellent but the $200/mo Max plan can be expensive for daily use. GLM-5 works inside Claude Code with (arguably) comparable performance at 1/3 the cost. Setup takes [--] minute: Install Claude Code as usual Run npx @z-ai/coding-helper and follow the steps One caveat: switching Claude Code back to Anthropic is a bit cumbersome today (would be nice if the helper tool supported this). For now you'll need to switch back manually: [--]. Open /.claude/settings.json Youll see something like: "env": "ANTHROPIC_AUTH_TOKEN": "your_zai_api_key""  
[X Link](https://x.com/anyuser/status/2021692638143951252)  2026-02-11T21:07Z [----] followers, 158.4K engagements
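
Since the manual switch comes up repeatedly in the replies above, here is a small helper that automates it under two assumptions: that the settings file lives at ~/.claude/settings.json and that the Z.ai setup stores its credentials in a top-level "env" block, as the post describes. The stash-file approach is my own convenience, not part of any official tool.

```python
import json
from pathlib import Path

SETTINGS = Path.home() / ".claude" / "settings.json"   # assumed location
STASH = SETTINGS.with_suffix(".json.zai-env")

def toggle_env():
    """Stash the Z.ai "env" block to fall back to Anthropic;
    restore it from the stash file to switch back."""
    cfg = json.loads(SETTINGS.read_text())
    if "env" in cfg:                      # currently on Z.ai -> stash it
        STASH.write_text(json.dumps(cfg.pop("env")))
    elif STASH.exists():                  # restore the stashed Z.ai env
        cfg["env"] = json.loads(STASH.read_text())
    SETTINGS.write_text(json.dumps(cfg, indent=2))

toggle_env()
```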


"The frontier exploration of LLM architectures has largely converged. I dug through the HuggingFace transformers code for @Zai_org's newly released GLM-5 (zai-org/GLM-5). Here's a detailed architectural breakdown and what it tells us about where LLM design is heading. TL;DR: Architecturally GLM-5 closely follows DeepSeek-V3 with minor knob-tuning. ATTENTION: MLA replaces GQA The biggest change from GLM-4.7 to GLM-5 is attention. GLM-4.7 used standard Grouped Query Attention (GQA) with [--] Q heads [--] KV heads separate q/k/v projections. GLM-5 scraps all of that and adopts DeepSeek's Multi-head"  
[X Link](https://x.com/anyuser/status/2022712228684669189)  2026-02-14T16:39Z [----] followers, 38.2K engagements


"It has always been. I would think right now it becomes less important as many other variants/edges such as scaled agentic rl training have entered the chat. From a ROI standpoint I would think that maybe architectural changes which directly changes the pretraining has a lower ROI. But I would love to see new changes in every aspect of course. https://twitter.com/i/web/status/2022800299262620113 https://twitter.com/i/web/status/2022800299262620113"  
[X Link](https://x.com/zhuokaiz/status/2022800299262620113)  2026-02-14T22:29Z [----] followers, [---] engagements


"Depending on different infra backends. Without MLA-aware caching for example the current HF transformers implementation caches the expanded form (after kv_b_proj expands the latent back to per-head representations) which is [--] heads * [---] dim for both K and V so [-----] elements per token per layer. At [--] layers [------] tokens FP16 it sums to around [---] GiB. However the whole point of MLA is that you don't have to store the expanded form. So an optimized implementation (e.g. vLLM) should cache the compressed latent so just kv_lora_rank + qk_rope_head_dim = [---] + [--] = [---] elements per token per"  
[X Link](https://x.com/zhuokaiz/status/2022877984747651356)  2026-02-15T03:37Z [----] followers, [---] engagements
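
Since the actual numbers are redacted above, here is the same arithmetic with DeepSeek-V3-style dimensions standing in as illustrative values (128 heads of dim 128, 61 layers, kv_lora_rank 512, qk_rope_head_dim 64); GLM-5's real figures may differ.

```python
# Compare caching the expanded per-head K/V vs. the compressed MLA latent.
def gib(elems_per_tok, layers, tokens, bytes_per=2):   # FP16 = 2 bytes
    return elems_per_tok * layers * tokens * bytes_per / 2**30

heads, head_dim, layers, tokens = 128, 128, 61, 128_000
expanded = 2 * heads * head_dim   # K and V after kv_b_proj re-expansion
latent = 512 + 64                 # kv_lora_rank + qk_rope_head_dim

print(f"expanded: {gib(expanded, layers, tokens):.0f} GiB")  # ~477 GiB
print(f"latent:   {gib(latent, layers, tokens):.1f} GiB")    # ~8.4 GiB
```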


"ByteDance just dropped the Seed [---] model card and it is arguably the first major frontier model card that explicitly grounds its entire development philosophy in real-world deployment from hundreds of millions of daily active users rather than starting from academic benchmarks and hoping they transfer. They openly publish their MaaS usage distributions coding query patterns and pricing and then work backward from those to set model priorities a fundamental shift in how to build a frontier model. Here are [--] stood out and under-appreciated insights. [--]. The Asymmetry of current AI Seed 2.0"  
[X Link](https://x.com/anyuser/status/2023067174378602814)  2026-02-15T16:09Z [----] followers, [----] engagements


"@teortaxesTex Thanks for reposting. However for the record Zhipu never labels its model/design to anything. It was my personal observation after digging through the code. You make it sound like I work at Zhipu which Im not"  
[X Link](https://x.com/zhuokaiz/status/2022734139661664732)  2026-02-14T18:06Z [----] followers, [---] engagements


"@clwdbot @Zai_org Agreed that from a ROI standpoint the architectural changes which directly change from the pretraining may have a lower ROI than investing into agentic post-trainings"  
[X Link](https://x.com/zhuokaiz/status/2023069318536134679)  2026-02-15T16:18Z [----] followers, [---] engagements


"We have just open-sourced the code. Check it out https://github.com/xiongny/FusionRoute Meta TBD Lab CMU UChicago UMaryland In our latest work we introduce Token-Level LLM Collaboration via FusionRoute πŸ“: https://t.co/rHnJOk6WkQ LLMs have come a long way but we continue to face the same trade-off: one huge model that kind of does everything but is https://t.co/tmi2gwK8GY https://github.com/xiongny/FusionRoute Meta TBD Lab CMU UChicago UMaryland In our latest work we introduce Token-Level LLM Collaboration via FusionRoute πŸ“: https://t.co/rHnJOk6WkQ LLMs have come a long way but we continue"  
[X Link](https://x.com/anyuser/status/2020741871107031074)  2026-02-09T06:09Z [----] followers, 14.4K engagements


"Meta TBD Lab CMU UChicago UMaryland In our latest work we introduce Token-Level LLM Collaboration via FusionRoute πŸ“: LLMs have come a long way but we continue to face the same trade-off: one huge model that kind of does everything but is expensive and inefficient or many small specialist models that are cheap but brittle outside their comfort zones Weve tried a lot of things in between model merging MoE sequence-level agents token-level routing controlled decoding etc. Each helps a bit but all come with real limitations. A key realization behind FusionRoute is: Pure token-level model"  
[X Link](https://x.com/anyuser/status/2009470079126151452)  2026-01-09T03:39Z [----] followers, 35.1K engagements


"ByteDance just dropped the Seed [---] model card and it is arguably the first major frontier model card that explicitly grounds its entire development philosophy in real-world deployment from hundreds of millions of daily active users rather than starting from academic benchmarks and hoping they transfer. They openly publish their MaaS usage distributions coding query patterns and pricing and then work backward from those to set model priorities a fundamental shift in how to build a frontier model. Here are [--] stood out and under-appreciated insights. [--]. The Asymmetry of current AI Seed 2.0"  
[X Link](https://x.com/anyuser/status/2023067174378602814)  2026-02-15T16:09Z [----] followers, [----] engagements


"The frontier exploration of LLM architectures has largely converged. I dug through the HuggingFace transformers code for @Zai_org's newly released GLM-5 (zai-org/GLM-5). Here's a detailed architectural breakdown and what it tells us about where LLM design is heading. TL;DR: Architecturally GLM-5 closely follows DeepSeek-V3 with minor knob-tuning. ATTENTION: MLA replaces GQA The biggest change from GLM-4.7 to GLM-5 is attention. GLM-4.7 used standard Grouped Query Attention (GQA) with [--] Q heads [--] KV heads separate q/k/v projections. GLM-5 scraps all of that and adopts DeepSeek's Multi-head"  
[X Link](https://x.com/anyuser/status/2022712228684669189)  2026-02-14T16:39Z [----] followers, 38.2K engagements


"Lots of HumanLLM / user-simulation papers dropped lately. I think this is a cool missing part in addition to world models where together AI agents can be trained in a fully simulated environment to achieve real AGI. That's why I was so surprised to find this idea was already being thought about years ago where Andrew Bosworth (Boz) literally filed a patent app back in [----] describing using LLM to simulate the user when the user is absent. It was later granted on 2025-12-30: https://patents.google.com/patent/US20250175448A1/en https://patents.google.com/patent/US20250175448A1/en"  
[X Link](https://x.com/anyuser/status/2022431751079960629)  2026-02-13T22:04Z [----] followers, [---] engagements


"1/10 the cost of claude code. Plus tier at $200/year is simply crazy. I literally thought it was per month. Open /.claude/settings.json and πŸ‘‡ Introducing M2.5 an open-source frontier model designed for real-world productivity. - SOTA performance at coding (SWE-Bench Verified 80.2%) search (BrowseComp 76.3%) agentic tool-calling (BFCL 76.8%) & office work. - Optimized for efficient execution 37% faster at complex https://t.co/UwiKzzQNG8 Introducing M2.5 an open-source frontier model designed for real-world productivity. - SOTA performance at coding (SWE-Bench Verified 80.2%) search"  
[X Link](https://x.com/anyuser/status/2022180179997610210)  2026-02-13T05:24Z [----] followers, [----] engagements


"Introducing M2.5 an open-source frontier model designed for real-world productivity. - SOTA performance at coding (SWE-Bench Verified 80.2%) search (BrowseComp 76.3%) agentic tool-calling (BFCL 76.8%) & office work. - Optimized for efficient execution 37% faster at complex tasks. - At $1 per hour with [---] tps infinite scaling of long-horizon agents now economically possible MiniMax Agent: API: CodingPlan: http://platform.minimax.io/subscribe/coding-plan http://platform.minimax.io http://agent.minimax.io http://platform.minimax.io/subscribe/coding-plan http://platform.minimax.io"  
[X Link](https://x.com/anyuser/status/2021980761210134808)  2026-02-12T16:12Z 61.7K followers, 5.1M engagements


"This is a HUGE win for developers. Claude Code is excellent but the $200/mo Max plan can be expensive for daily use. GLM-5 works inside Claude Code with (arguably) comparable performance at 1/3 the cost. Setup takes [--] minute: Install Claude Code as usual Run npx @z-ai/coding-helper and follow the steps One caveat: switching Claude Code back to Anthropic is a bit cumbersome today (would be nice if the helper tool supported this). For now you'll need to switch back manually: [--]. Open /.claude/settings.json Youll see something like: "env": "ANTHROPIC_AUTH_TOKEN": "your_zai_api_key""  
[X Link](https://x.com/anyuser/status/2021692638143951252)  2026-02-11T21:07Z [----] followers, 158.4K engagements


"Introducing GLM-5: From Vibe Coding to Agentic Engineering GLM-5 is built for complex systems engineering and long-horizon agentic tasks. Compared to GLM-4.5 it scales from 355B params (32B active) to 744B (40B active) with pre-training data growing from 23T to 28.5T tokens. Try it now: Weights: Tech Blog: OpenRouter (Previously Pony Alpha): Rolling out from Coding Plan Max users: http://z.ai/subscribe http://openrouter.ai/z-ai/glm-5 http://z.ai/blog/glm-5 http://huggingface.co/zai-org/GLM-5 http://chat.z.ai http://z.ai/subscribe http://openrouter.ai/z-ai/glm-5 http://z.ai/blog/glm-5"  
[X Link](https://x.com/anyuser/status/2021638634739527773)  2026-02-11T17:33Z 50.9K followers, 1.4M engagements


"Sparse attention has existed for years. However: [--]. selecting the right tokens is hard [--]. reducing compute doesnt necessarily reduce KV cache memory (because many methods skip attention computation but still store most keys and values and in real systems KV memory is more often the bottleneck than flops) Most methods historically solve one but not the other. Some approaches preserve information well but still store large KV caches or rely on retrieval over growing memory so scalability remains limited. Others aggressively compress memory but use simple heuristics (e.g. fixed strides / block"  
[X Link](https://x.com/anyuser/status/2020913672927772709)  2026-02-09T17:32Z [----] followers, [----] engagements


"no AI for a while"  
[X Link](https://x.com/anyuser/status/2018888495142764637)  2026-02-04T03:25Z [----] followers, [---] engagements


"In current block-wise diffusion LLM decoding every step predicts full token distributions for all masked positions. However in each forward pass only the top-confidence tokens are committed. Everything else is remasked and reset even though the model just spent compute forming meaningful beliefs (just not confident enough to clear the threshold). This paper shows thats actually a real waste: intermediate distributions already have high recall of the final answer even at early steps. The model actually knows a lot but we just throw it away. Residual Context Diffusion (RCD) fixes this with a"  
[X Link](https://x.com/anyuser/status/2018542757380603921)  2026-02-03T04:31Z [----] followers, [----] engagements
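
To make the remasking waste concrete, here is a toy decoding step in the spirit of the post's description: carry forward the previous step's distributions for uncommitted positions instead of resetting them. The blending rule, threshold, and alpha are invented for illustration; the paper's actual mechanism will differ.

```python
import torch
import torch.nn.functional as F

def rcd_step(logits, committed, prev_probs, thresh=0.9, alpha=0.5):
    """One toy block-diffusion step that reuses last step's beliefs
    as residual context rather than discarding them."""
    probs = F.softmax(logits, dim=-1)
    if prev_probs is not None:              # blend in prior-step beliefs
        probs = alpha * probs + (1 - alpha) * prev_probs
    conf, tok = probs.max(dim=-1)
    commit = (conf > thresh) & ~committed   # commit only confident tokens
    committed |= commit
    return tok, committed, probs            # probs -> residual context

# Toy usage: 8 positions, vocab 100, two decoding steps.
committed = torch.zeros(8, dtype=torch.bool)
tok, committed, ctx = rcd_step(torch.randn(8, 100), committed, None)
tok, committed, ctx = rcd_step(torch.randn(8, 100), committed, ctx)
```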


"Take a look at Residual Context Diffusion (RCD): a simple idea to boost diffusion LLMsstop wasting remasked tokens (Example on AIME24. RCD increases parallelism by 4x while reaching the baseline's peak accuracy.) #DiffusionLLM #LLM #Reasoning #GenAI https://arxiv.org/abs/2601.22954 https://arxiv.org/abs/2601.22954"  
[X Link](https://x.com/anyuser/status/2018172670211391521)  2026-02-02T04:00Z [--] followers, 36.6K engagements


"For some reason this latest LLM test-time training (TTT) paper reads a lot like a well-written paper from the pre-LLM era where you can actually feel what the authors are thinking instead of many post-LLM papers which may feel hollow at times. The core idea behind what they call TTT-E2E is simple: at test time the model keeps doing next-token prediction on the context and updates its own weights so it writes useful info from the prefix directly into parameters. (They still use sliding-window attention for local context; the long-range part comes from weight updates.) One practical detail that"  
[X Link](https://x.com/anyuser/status/2017450185933439063)  2026-01-31T04:09Z [----] followers, 24.5K engagements


"Excited to release a new paper today: End-to-End Test-Time Training for Long Context. Our method TTT-E2E enables models to continue learning at test-time via next-token prediction on the given context compressing context into model weights. For our main result we extend 3B parameter models from 8K to 128K. TTT-E2E scales with context length like full attention without maintaining keys and values for every token in the sequence. With linear-complexity TTT-E2E is 2.7x faster than full attention at 128K tokens while achieving better performance. Paper: Code:"  
[X Link](https://x.com/anyuser/status/2005704949381095828)  2025-12-29T18:18Z [---] followers, 44.8K engagements


"DreamGym has been accepted to #ICLR2026 πŸš€ For many agent tasks: (1) you don't need a full physical simulators to know what's gonna happen; (2) many share abstract structures rooted in similar environments that can be learned more generally. So we wondered: why not train a model that: (1) predicts what happens given the For many agent tasks: (1) you don't need a full physical simulators to know what's gonna happen; (2) many share abstract structures rooted in similar environments that can be learned more generally. So we wondered: why not train a model that: (1) predicts what happens given the"  
[X Link](https://x.com/anyuser/status/2015787422055882816)  2026-01-26T14:02Z [----] followers, [----] engagements


"For many agent tasks: (1) you don't need a full physical simulators to know what's gonna happen; (2) many share abstract structures rooted in similar environments that can be learned more generally. So we wondered: why not train a model that: (1) predicts what happens given the current agent action (as reward/feedback) and (2) guides the agent through a curriculum of tasks from easy to hard That's a new paradigm combining  +  for consistent on-policy agent training without needing to build environments. No more "RL-ready" or "non-RL-ready." We're seeing 30%+ gains over baselines. πŸ“„ Check out"  
[X Link](https://x.com/anyuser/status/1986669195669418461)  2025-11-07T05:37Z [----] followers, 24.6K engagements
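
As a schematic only, the recipe reads like the stub below: a learned environment model substitutes for real environment feedback, and a curriculum orders tasks from easy to hard. Every class and field here is a placeholder for components the actual system learns.

```python
import random

def train_agent(agent, env_model, tasks, epochs=3):
    """Toy rendering of the DreamGym-style loop: no real environment,
    the learned model scores actions; curriculum goes easy -> hard."""
    tasks = sorted(tasks, key=lambda t: t["difficulty"])
    for _ in range(epochs):
        for task in tasks:
            action = agent.act(task["prompt"])
            reward = env_model.predict_outcome(task, action)
            agent.update(task, action, reward)

class StubAgent:
    def act(self, prompt): return f"action-for:{prompt}"
    def update(self, task, action, reward): pass

class StubEnvModel:
    def predict_outcome(self, task, action): return random.random()

train_agent(StubAgent(), StubEnvModel(),
            [{"prompt": "p1", "difficulty": 1},
             {"prompt": "p2", "difficulty": 3}])
```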


"OpenAI @bolinfest just (surprisingly) released details about Codex Its philosophy feels quite clear: long-running agents fail because the execution loop breaks down. So instead of starting from more agents (Cursor) or better memory management (Claude Code) Codex focuses on agent runtime with: a clean repeatable agent loop structured inputs/outputs tool calls as structured actions (not just text) aggressive context / performance management so the loop stays fast stateless-by-default (no hidden server-side memory) If Cursor is organization design and Claude is engineering discipline Codex is"  
[X Link](https://x.com/anyuser/status/2014951810373431471)  2026-01-24T06:42Z [----] followers, [----] engagements


"Recent rapid launches from both Cursor and Claude Code makes me realize that they actually have been taking quite different paths toward making long-running AI agents actually work. Heres whats going on. Cursor: scale horizontally with many agents Cursors core belief is that a single agent already performs well on small well-scoped tasks. The real challenge appears when the task turns into a bigger project thousands of files unclear boundaries evolving goals. Their answer is to scale horizontally: run many agents in parallel decompose work aggressively let different agents specialize Early"  
[X Link](https://x.com/anyuser/status/2014524734768062949)  2026-01-23T02:24Z [----] followers, [----] engagements


"In one sentence a multiplex token embedding is simply the frequency-weighted sum of token embeddings from i.i.d. samples of the next-token distribution. But the on-policy RL compatibility (and the angle/perspective it brings) is really neat. . . πŸš€  : token-wise branch-and-merge reasoning for LLMs. πŸ’Έ Discrete CoT is costly. πŸŽ› Existing continuous tokens often clash with - https://t.co/rviigEgGO6 . . πŸš€  : token-wise branch-and-merge reasoning for LLMs. πŸ’Έ Discrete CoT is costly. πŸŽ› Existing continuous tokens often clash with - https://t.co/rviigEgGO6"  
[X Link](https://x.com/anyuser/status/2014926006977491221)  2026-01-24T04:59Z [----] followers, [----] engagements
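
That one-sentence definition translates almost directly into code; the vocabulary size, sample count, and embedding table below are toy stand-ins.

```python
import torch

def multiplex_embedding(logits, emb, n_samples=16):
    """Frequency-weighted sum of token embeddings over i.i.d. samples
    from the next-token distribution, per the sentence above."""
    probs = torch.softmax(logits, dim=-1)                       # (vocab,)
    samples = torch.multinomial(probs, n_samples, replacement=True)
    counts = torch.bincount(samples, minlength=probs.numel()).float()
    weights = counts / n_samples                                # frequencies
    return weights @ emb                                        # (dim,)

vocab, dim = 100, 32
e = multiplex_embedding(torch.randn(vocab), torch.randn(vocab, dim))
print(e.shape)  # torch.Size([32])
```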


". . πŸš€  : token-wise branch-and-merge reasoning for LLMs. πŸ’Έ Discrete CoT is costly. πŸŽ› Existing continuous tokens often clash with - . πŸŽ₯  a sampling-based continuous reasoning paradigm:"  
[X Link](https://x.com/anyuser/status/2012278097614110870)  2026-01-16T21:37Z [---] followers, 149.5K engagements


"Finally got time to catch up on building custom Claude skills. Vibe-built my first skill to replace my repetitive prompting "thoroughly understand this codebase and." πŸ˜‚ RepoGPS does it automatically: - Paste any GitHub URL or local path - Instant codebase understanding - Works with Claude Code & Antigravity Drop it into ./.claude/skills/ or ./.agent/skills/: https://github.com/zhuokaizhao/RepoGPS-Skill https://github.com/zhuokaizhao/RepoGPS-Skill"  
[X Link](https://x.com/anyuser/status/2014038248805290031)  2026-01-21T18:11Z [----] followers, [---] engagements

Limited data mode. Full metrics available with subscription: lunarcrush.com/pricing

@zhuokaiz Avatar @zhuokaiz Zhuokai Zhao

Zhuokai Zhao posts on X about model, llm, claude code, code the most. They currently have [-----] followers and [--] posts still getting attention that total [-------] engagements in the last [--] hours.

Engagements: [-------] #

Engagements Line Chart

  • [--] Week [---------] +13,298%
  • [--] Month [---------] +5,087%
  • [--] Months [---------] +1,726,167%
  • [--] Year [---------] +387,680%

Mentions: [--] #

Mentions Line Chart

  • [--] Week [--] +150%
  • [--] Month [--] +413%
  • [--] Months [--] +2,450%
  • [--] Year [--] +5,200%

Followers: [-----] #

Followers Line Chart

  • [--] Week [-----] +15%
  • [--] Month [-----] +20%
  • [--] Months [-----] +249%
  • [--] Year [-----] +302%

CreatorRank: [-------] #

CreatorRank Line Chart

Social Influence

Social category influence technology brands finance

Social topic influence model #1533, llm, claude code, code, ai, agentic #454, systems, agents, attention, complex

Top accounts mentioned or mentioned by @teortaxestex @zrchenaisafety @clwdbot @zaiorg @zaiorgs @bolinfest @yanlaiyang @mengyer @luofuli @easyldur @sneycampos @bariskayhan1 @ratebtm @liljamesjohn @albertobonsanto @askcodi @makerinparadise @zaicodinghelper @hmemcpy @nikosbaxevanis

Top Social Posts

Top posts by engagements in the last [--] hours

"We (@chaoqi_w @yibophd @ZRChen_AISafety) have been eager to share our latest work on battling reward hacking since last November but had to wait for the legal team's approval. Finally we're excited to release: Causal Reward Modeling (CRM) CRM tackles spurious correlations and mitigates reward hacking in RLHF by integrating causal inference and enforcing counterfactual invariance. It addresses biases like length concept sycophancy and discrimination enabling more trustworthy and fair alignment of LLMs with human preferences. Check it out here: #AI #LLM #RLHF #MachineLearning #AIAlignment"
X Link 2025-01-17T15:42Z [---] followers, [----] engagements

"My high-level take on why multimodal reasoning is fundamentally harder than text-only reasoning: Language is structured and directional while images are inherently unstructuredyou can start reasoning from anywhere. This visual freedom makes step-by-step logical inference much harder. Building on this insight we are excited to share our spotlight paper Autonomous Multimodal Reasoning via Implicit Chain-of-Vision (ICoV) at #CVPR2025 Multimodal Algorithmic Reasoning Workshop. ICoV presents an experimental finetuning framework that guides large vision-language models on where to look and how to"
X Link 2025-06-11T22:33Z [---] followers, [----] engagements

"Happy to share that DYTO has been accepted to #ICCV2025 as a new SOTA in video understanding Try it out in your experiments and see how it measures up Paper: Code: #MultimodalAI #MultimodalLLM #LargeLanguageModel #LLM #VisionLanguageModel #VLM #LVLM #LargeVisionLanguageModel #AIReasoning #ComputerVision #MachineLearning #DeepLearning #AIResearch #ICCV2025 #ICCV #CVPR https://github.com/Jam1ezhang/DYTO https://arxiv.org/abs/2411.14401 https://github.com/Jam1ezhang/DYTO https://arxiv.org/abs/2411.14401 We've ( @Yiming1254115 @ZRChen_AISafety ) been wondering why so many existing video"
X Link 2025-06-26T17:36Z [---] followers, [---] engagements

"Counting down one week to #ICCV2025 I will be attending in-person to present our two papers (Hawaii local time): πŸ”Ή Tuesday Oct [--] 11:45am - 1:45pm Exhibit Hall I #360 RANKCLIP: Ranking-Consistent Language-Image Pretraining πŸ”Ή Thursday Oct [--] 11:15am - 1:15pm Exhibit Hall I #2032 Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding Excited to connect and exchange ideas with everyone in Honolulu #ComputerVision #AIResearch #MultimodalLearning #VideoUnderstanding"
X Link 2025-10-13T14:24Z [---] followers, [---] engagements

"Agent training can be viewed in many ways: learning from world-model interactions (or experience) is one; but in practice especially for multi-agent co-evolving the models + framework just works. Thats Mixture-of-Minds. With a few specialized small models and a simple framework we reach #1 on TableBench ahead of o4-mini-high & gemini-2.5-pro. Check out our recipe here: #AI #LLM #Agents #MultiAgent #MixtureOfMinds #TableBench #SOTA #Benchmarking #Evaluation https://arxiv.org/pdf/2510.20176 https://tablebench.github.io/ ❓Can LLMs truly understand tables We explore this question in our paper:"
X Link 2025-10-31T17:27Z [---] followers, [---] engagements

"Quite interesting work. Actually during summer we @juliancodaforno also explored in this direction where we injected latents of one LLM into the base LLM's KV-cache: https://arxiv.org/pdf/2510.00494 Wow language models can talk without words. A new framework Cache-to-Cache (C2C) lets multiple LLMs communicate directly through their KV-caches instead of text transferring deep semantics without token-by-token generation. It fuses cache representations via a neural https://t.co/r09jqug2Ig https://arxiv.org/pdf/2510.00494 Wow language models can talk without words. A new framework Cache-to-Cache"
X Link 2025-11-04T06:54Z [---] followers, [----] engagements

"That time of the year again looking for a Summer [--] research intern If LLMs multi-agent systems and latent-space reasoning excite you dm/email me and let's talk. #AI #LLM #Agents #Internship #Meta"
X Link 2025-11-14T06:19Z [----] followers, 28.7K engagements

"Too late to join the #openreview bug party but honestly probably too lazy to check who my reviewers were anywayπŸ˜‚. Good or bad it is what it is. That said I am genuinely grateful for my #NeurIPS2025 reviewers and I can't wait to be in San Diego next week to meet more incredible people and present our four papers: [--]. Thought Communication: [--]. Dropout Decoding: [--]. S'MoRE: [--]. MJ-Bench: See you all next week And Happy Thanksgiving 🍁 https://openreview.net/pdfid=woQKlen8EI https://openreview.net/pdfid=LbNL8xGai2 https://openreview.net/pdfid=LAflniLUwx https://openreview.net/pdfid=tq9lyV9Cml"
X Link 2025-11-27T19:29Z [----] followers, 13.6K engagements

"First night and morning in San Diego. South Cal winter is so much more likable than summer"
X Link 2025-12-02T15:41Z [----] followers, [----] engagements

"Thought this was just another long-context paper with a fancy attention name. But actually this one feels pretty legit. The core idea is simple tho most chunk-based sparse attention methods rely on heuristics (recency fixed block patterns or non-learned similarity etc.) so retrieval accuracy drops fast once you go beyond the training context length. Hierarchical Sparse Attention (HSA) fixes this by making chunk retrieval fully learnable. imo it's like MoE for memory where each token routes attention to a small set of relevant chunks and fuses their outputs using learned retrieval scores"
X Link 2025-12-22T18:22Z [----] followers, 19.6K engagements

"Apparently no new model this year but On Dec [--] [----] DeepSeek released "mHC: Manifold-Constrained Hyper-Connections" which proposes a new residual connection design that fixes the instability and scaling issues of Hyper-Connections (HC) at large model sizes. A bit of history first πŸ˜‚. Hyper-Connections were not invented by DeepSeek. They actually came from the ByteDance Seed in late [----] (later accepted to ICLR 2025) as a generalization of residual connections. Why even touch the residual connections This goes back to a classic deep learning problem: as depth increases signals and gradients"
X Link 2026-01-02T18:16Z [----] followers, 14.3K engagements

"Meta TBD Lab CMU UChicago UMaryland In our latest work we introduce Token-Level LLM Collaboration via FusionRoute πŸ“: LLMs have come a long way but we continue to face the same trade-off: one huge model that kind of does everything but is expensive and inefficient or many small specialist models that are cheap but brittle outside their comfort zones Weve tried a lot of things in between model merging MoE sequence-level agents token-level routing controlled decoding etc. Each helps a bit but all come with real limitations. A key realization behind FusionRoute is: Pure token-level model"
X Link 2026-01-09T03:39Z [----] followers, 35.1K engagements

"Big shout-out to my awesome intern Nuoya Xiong (@XiongNuoya69368 And huge thanks to Lizhu (@LizhuZhang ) Shuchao (@shuchaobi ) Furong (@furongh ) Yuhang (@YuhangZhou2 ) Hanqing (@zimplex4 ) and Zhaorun (@ZRChen_AISafety ) for all the great discussions and ideas. https://xiongny.github.io/ https://xiongny.github.io/ https://xiongny.github.io/ https://xiongny.github.io/"
X Link 2026-01-09T03:40Z [----] followers, [---] engagements

"Every major compute shift created new business models: PC software licenses & productivity tools Graphics gaming VFX creator economies Internet SaaS marketplaces cloud Mobile apps in-app payments on-demand services Each wave didn't just scale existing business models it unlocked fundamentally new ones aligned with its technology. AI never felt like another distribution surface (at least to me). So it's disappointing to see its monetization get pull back toward ads. In the coming weeks we plan to start testing ads in ChatGPT free and Go tiers. Were sharing our principles early on how well"
X Link 2026-01-16T22:47Z [----] followers, [----] engagements

"Recent rapid launches from both Cursor and Claude Code makes me realize that they actually have been taking quite different paths toward making long-running AI agents actually work. Heres whats going on. Cursor: scale horizontally with many agents Cursors core belief is that a single agent already performs well on small well-scoped tasks. The real challenge appears when the task turns into a bigger project thousands of files unclear boundaries evolving goals. Their answer is to scale horizontally: run many agents in parallel decompose work aggressively let different agents specialize Early"
X Link 2026-01-23T02:24Z [----] followers, [----] engagements

"Lately I've been thinking JEPA might be more important / interesting / promising than it looks (even though I've somehow never actually met or talked with lecun at Meta NYC). At a high level JEPA is really about latent-space (often multimodal) representation learning which honestly feels more fundamental and elegant than the usual encoder + LLM pipeline. It naturally shifts the focus away from surface tokens and toward modeling the underlying structure of the world. Framed this way the key question isn't autoregressive vs non-autoregressive it's where learning happens: in input space or in"
X Link 2026-01-06T19:06Z [----] followers, 14.6K engagements

"OpenAI @bolinfest just (surprisingly) released details about Codex Its philosophy feels quite clear: long-running agents fail because the execution loop breaks down. So instead of starting from more agents (Cursor) or better memory management (Claude Code) Codex focuses on agent runtime with: a clean repeatable agent loop structured inputs/outputs tool calls as structured actions (not just text) aggressive context / performance management so the loop stays fast stateless-by-default (no hidden server-side memory) If Cursor is organization design and Claude is engineering discipline Codex is"
X Link 2026-01-24T06:42Z [----] followers, [----] engagements

"For some reason this latest LLM test-time training (TTT) paper reads a lot like a well-written paper from the pre-LLM era where you can actually feel what the authors are thinking instead of many post-LLM papers which may feel hollow at times. The core idea behind what they call TTT-E2E is simple: at test time the model keeps doing next-token prediction on the context and updates its own weights so it writes useful info from the prefix directly into parameters. (They still use sliding-window attention for local context; the long-range part comes from weight updates.) One practical detail that"
X Link 2026-01-31T04:09Z [----] followers, 24.5K engagements

"Meta-learning as a term is old sure but the contribution here isnt just they discovered meta-loss. Its showing test-time weight updates can actually serve as a long-context mechanism for LMs (128K) with constant prefill cost and near/full-attn quality plus the concrete recipe to make it stable (what to update chunked updates decode integration). If this were nothing new dynamic eval wouldve already done this. https://twitter.com/i/web/status/2017497945772593201 https://twitter.com/i/web/status/2017497945772593201"
X Link 2026-01-31T07:19Z [----] followers, [---] engagements

"Sparse attention has existed for years. However: [--]. selecting the right tokens is hard [--]. reducing compute doesnt necessarily reduce KV cache memory (because many methods skip attention computation but still store most keys and values and in real systems KV memory is more often the bottleneck than flops) Most methods historically solve one but not the other. Some approaches preserve information well but still store large KV caches or rely on retrieval over growing memory so scalability remains limited. Others aggressively compress memory but use simple heuristics (e.g. fixed strides / block"
X Link 2026-02-09T17:32Z [----] followers, [----] engagements

"@easyldur Not sure if they plan to work with codex. It currently only supports claude code opencode crush goose and cursor"
X Link 2026-02-11T22:13Z [----] followers, [----] engagements

"@sneycampos The problem with claude code is that the $20 tier lasts only about [--] mins of coding (for me at least) inside the 5-hour cool down window πŸ˜… So it is really not usable"
X Link 2026-02-11T22:43Z [----] followers, [---] engagements

"@bariskayhan1 Actually you are right I didn't notice this Unload Configuration option (my bad). Clicking on this removes the settings from .claude/settings.json"
X Link 2026-02-11T22:57Z [----] followers, [----] engagements

"@clwdbot After all its ban towards its competitors Im honestly surprised they allow this lol"
X Link 2026-02-12T01:08Z [----] followers, [----] engagements

"@Marcia_Ong Considering its the first day of the release slower speed is kinda expected. I wouldnt draw the conclusion right now 🀣"
X Link 2026-02-12T01:37Z [----] followers, [----] engagements

"@e73_da What exam is it lol"
X Link 2026-02-12T01:38Z [----] followers, [--] engagements

"@RatebTM I have never used kimi. I do use antigravity (with gemini [--] pro) a lot but I think different scaffolds (in this case antigravity and claude code) give you very different experience (even antigravity+claude does not feel the same as claude code)"
X Link 2026-02-12T03:48Z [----] followers, [---] engagements

"@liljamesjohn I later realized that I actually missed this Unload Configuration option (my bad). Clicking on this removes the settings from .claude/settings.json"
X Link 2026-02-12T05:15Z [----] followers, [----] engagements

"@AlbertoBonsanto Yeah they have a mapping to their models but its not shown in the UI I think"
X Link 2026-02-12T13:33Z [----] followers, [---] engagements

"@askcodi I later realized that I actually missed this Unload Configuration option (my bad). Clicking on this removes the settings from .claude/settings.json"
X Link 2026-02-12T13:34Z [----] followers, [---] engagements

"@MakerInParadise Yes youre right I later realized that I actually missed this Unload Configuration option (my bad). Clicking on this removes the settings from .claude/settings.json"
X Link 2026-02-12T13:34Z [----] followers, [---] engagements

"1/10 the cost of claude code. Plus tier at $200/year is simply crazy. I literally thought it was per month. Open /.claude/settings.json and πŸ‘‡ Introducing M2.5 an open-source frontier model designed for real-world productivity. - SOTA performance at coding (SWE-Bench Verified 80.2%) search (BrowseComp 76.3%) agentic tool-calling (BFCL 76.8%) & office work. - Optimized for efficient execution 37% faster at complex https://t.co/UwiKzzQNG8 Introducing M2.5 an open-source frontier model designed for real-world productivity. - SOTA performance at coding (SWE-Bench Verified 80.2%) search"
X Link 2026-02-13T05:24Z [----] followers, [----] engagements

"The frontier exploration of LLM architectures has largely converged. I dug through the HuggingFace transformers code for @Zai_org's newly released GLM-5 (zai-org/GLM-5). Here's a detailed architectural breakdown and what it tells us about where LLM design is heading. TL;DR: Architecturally GLM-5 closely follows DeepSeek-V3 with minor knob-tuning. ATTENTION: MLA replaces GQA The biggest change from GLM-4.7 to GLM-5 is attention. GLM-4.7 used standard Grouped Query Attention (GQA) with [--] Q heads [--] KV heads separate q/k/v projections. GLM-5 scraps all of that and adopts DeepSeek's Multi-head"
X Link 2026-02-14T16:37Z [----] followers, [--] engagements

"@teortaxesTex That makes sense. I simply don't want others to think Zhipu attributes its design similarities as "convergence". On the other hand tbh I don't think this is a bad thing. There are too many other variants that make a model unique and we shouldn't ignore Zhipu's efforts on those"
X Link 2026-02-14T18:25Z [----] followers, [---] engagements

"This is a HUGE win for developers. Claude Code is excellent but the $200/mo Max plan can be expensive for daily use. GLM-5 works inside Claude Code with (arguably) comparable performance at 1/3 the cost. Setup takes [--] minute: Install Claude Code as usual Run npx @z-ai/coding-helper and follow the steps One caveat: switching Claude Code back to Anthropic is a bit cumbersome today (would be nice if the helper tool supported this). For now you'll need to switch back manually: [--]. Open /.claude/settings.json Youll see something like: "env": "ANTHROPIC_AUTH_TOKEN": "your_zai_api_key""
X Link 2026-02-11T21:07Z [----] followers, 158.4K engagements

"The frontier exploration of LLM architectures has largely converged. I dug through the HuggingFace transformers code for @Zai_org's newly released GLM-5 (zai-org/GLM-5). Here's a detailed architectural breakdown and what it tells us about where LLM design is heading. TL;DR: Architecturally GLM-5 closely follows DeepSeek-V3 with minor knob-tuning. ATTENTION: MLA replaces GQA The biggest change from GLM-4.7 to GLM-5 is attention. GLM-4.7 used standard Grouped Query Attention (GQA) with [--] Q heads [--] KV heads separate q/k/v projections. GLM-5 scraps all of that and adopts DeepSeek's Multi-head"
X Link 2026-02-14T16:39Z [----] followers, 38.2K engagements

"It has always been. I would think right now it becomes less important as many other variants/edges such as scaled agentic rl training have entered the chat. From a ROI standpoint I would think that maybe architectural changes which directly changes the pretraining has a lower ROI. But I would love to see new changes in every aspect of course. https://twitter.com/i/web/status/2022800299262620113 https://twitter.com/i/web/status/2022800299262620113"
X Link 2026-02-14T22:29Z [----] followers, [---] engagements

"Depending on different infra backends. Without MLA-aware caching for example the current HF transformers implementation caches the expanded form (after kv_b_proj expands the latent back to per-head representations) which is [--] heads * [---] dim for both K and V so [-----] elements per token per layer. At [--] layers [------] tokens FP16 it sums to around [---] GiB. However the whole point of MLA is that you don't have to store the expanded form. So an optimized implementation (e.g. vLLM) should cache the compressed latent so just kv_lora_rank + qk_rope_head_dim = [---] + [--] = [---] elements per token per"
X Link 2026-02-15T03:37Z [----] followers, [---] engagements
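
The arithmetic above generalizes to a small calculator. The concrete numbers below (head count, dims, layer count, token count) are placeholders chosen for illustration, since the post's actual figures are elided; only the formula is the point.

```python
# A small calculator for the comparison sketched above: expanded per-head
# KV caching vs. caching MLA's compressed latent. All concrete numbers are
# illustrative assumptions.
def kv_cache_gib(elements_per_token: int, n_layers: int, n_tokens: int,
                 bytes_per_elem: int = 2) -> float:   # FP16 = 2 bytes
    return elements_per_token * n_layers * n_tokens * bytes_per_elem / 2**30

# Expanded form: full K and V for every head (hypothetical head count / dim).
expanded = 2 * 64 * 128                               # 2 * n_heads * head_dim
# MLA latent form: kv_lora_rank + qk_rope_head_dim (hypothetical values).
latent = 512 + 64

print(kv_cache_gib(expanded, n_layers=60, n_tokens=100_000))  # ~183.1 GiB
print(kv_cache_gib(latent, n_layers=60, n_tokens=100_000))    # ~6.4 GiB
```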

"ByteDance just dropped the Seed [---] model card and it is arguably the first major frontier model card that explicitly grounds its entire development philosophy in real-world deployment from hundreds of millions of daily active users rather than starting from academic benchmarks and hoping they transfer. They openly publish their MaaS usage distributions coding query patterns and pricing and then work backward from those to set model priorities a fundamental shift in how to build a frontier model. Here are [--] stood out and under-appreciated insights. [--]. The Asymmetry of current AI Seed 2.0"
X Link 2026-02-15T16:09Z [----] followers, [----] engagements

"@teortaxesTex Thanks for reposting. However for the record Zhipu never labels its model/design to anything. It was my personal observation after digging through the code. You make it sound like I work at Zhipu which Im not"
X Link 2026-02-14T18:06Z [----] followers, [---] engagements

"@clwdbot @Zai_org Agreed that from a ROI standpoint the architectural changes which directly change from the pretraining may have a lower ROI than investing into agentic post-trainings"
X Link 2026-02-15T16:18Z [----] followers, [---] engagements

"We have just open-sourced the code. Check it out https://github.com/xiongny/FusionRoute Meta TBD Lab CMU UChicago UMaryland In our latest work we introduce Token-Level LLM Collaboration via FusionRoute πŸ“: https://t.co/rHnJOk6WkQ LLMs have come a long way but we continue to face the same trade-off: one huge model that kind of does everything but is https://t.co/tmi2gwK8GY https://github.com/xiongny/FusionRoute Meta TBD Lab CMU UChicago UMaryland In our latest work we introduce Token-Level LLM Collaboration via FusionRoute πŸ“: https://t.co/rHnJOk6WkQ LLMs have come a long way but we continue"
X Link 2026-02-09T06:09Z [----] followers, 14.4K engagements

"Meta TBD Lab CMU UChicago UMaryland In our latest work we introduce Token-Level LLM Collaboration via FusionRoute πŸ“: LLMs have come a long way but we continue to face the same trade-off: one huge model that kind of does everything but is expensive and inefficient or many small specialist models that are cheap but brittle outside their comfort zones Weve tried a lot of things in between model merging MoE sequence-level agents token-level routing controlled decoding etc. Each helps a bit but all come with real limitations. A key realization behind FusionRoute is: Pure token-level model"
X Link 2026-01-09T03:39Z [----] followers, 35.1K engagements

"ByteDance just dropped the Seed [---] model card and it is arguably the first major frontier model card that explicitly grounds its entire development philosophy in real-world deployment from hundreds of millions of daily active users rather than starting from academic benchmarks and hoping they transfer. They openly publish their MaaS usage distributions coding query patterns and pricing and then work backward from those to set model priorities a fundamental shift in how to build a frontier model. Here are [--] stood out and under-appreciated insights. [--]. The Asymmetry of current AI Seed 2.0"
X Link 2026-02-15T16:09Z [----] followers, [----] engagements

"The frontier exploration of LLM architectures has largely converged. I dug through the HuggingFace transformers code for @Zai_org's newly released GLM-5 (zai-org/GLM-5). Here's a detailed architectural breakdown and what it tells us about where LLM design is heading. TL;DR: Architecturally GLM-5 closely follows DeepSeek-V3 with minor knob-tuning. ATTENTION: MLA replaces GQA The biggest change from GLM-4.7 to GLM-5 is attention. GLM-4.7 used standard Grouped Query Attention (GQA) with [--] Q heads [--] KV heads separate q/k/v projections. GLM-5 scraps all of that and adopts DeepSeek's Multi-head"
X Link 2026-02-14T16:39Z [----] followers, 38.2K engagements

"Lots of HumanLLM / user-simulation papers dropped lately. I think this is a cool missing part in addition to world models where together AI agents can be trained in a fully simulated environment to achieve real AGI. That's why I was so surprised to find this idea was already being thought about years ago where Andrew Bosworth (Boz) literally filed a patent app back in [----] describing using LLM to simulate the user when the user is absent. It was later granted on 2025-12-30: https://patents.google.com/patent/US20250175448A1/en https://patents.google.com/patent/US20250175448A1/en"
X Link 2026-02-13T22:04Z [----] followers, [---] engagements

"1/10 the cost of claude code. Plus tier at $200/year is simply crazy. I literally thought it was per month. Open /.claude/settings.json and πŸ‘‡ Introducing M2.5 an open-source frontier model designed for real-world productivity. - SOTA performance at coding (SWE-Bench Verified 80.2%) search (BrowseComp 76.3%) agentic tool-calling (BFCL 76.8%) & office work. - Optimized for efficient execution 37% faster at complex https://t.co/UwiKzzQNG8 Introducing M2.5 an open-source frontier model designed for real-world productivity. - SOTA performance at coding (SWE-Bench Verified 80.2%) search"
X Link 2026-02-13T05:24Z [----] followers, [----] engagements

"Introducing M2.5 an open-source frontier model designed for real-world productivity. - SOTA performance at coding (SWE-Bench Verified 80.2%) search (BrowseComp 76.3%) agentic tool-calling (BFCL 76.8%) & office work. - Optimized for efficient execution 37% faster at complex tasks. - At $1 per hour with [---] tps infinite scaling of long-horizon agents now economically possible MiniMax Agent: API: CodingPlan: http://platform.minimax.io/subscribe/coding-plan http://platform.minimax.io http://agent.minimax.io http://platform.minimax.io/subscribe/coding-plan http://platform.minimax.io"
X Link 2026-02-12T16:12Z 61.7K followers, 5.1M engagements

"This is a HUGE win for developers. Claude Code is excellent but the $200/mo Max plan can be expensive for daily use. GLM-5 works inside Claude Code with (arguably) comparable performance at 1/3 the cost. Setup takes [--] minute: Install Claude Code as usual Run npx @z-ai/coding-helper and follow the steps One caveat: switching Claude Code back to Anthropic is a bit cumbersome today (would be nice if the helper tool supported this). For now you'll need to switch back manually: [--]. Open /.claude/settings.json Youll see something like: "env": "ANTHROPIC_AUTH_TOKEN": "your_zai_api_key""
X Link 2026-02-11T21:07Z [----] followers, 158.4K engagements

"Introducing GLM-5: From Vibe Coding to Agentic Engineering GLM-5 is built for complex systems engineering and long-horizon agentic tasks. Compared to GLM-4.5 it scales from 355B params (32B active) to 744B (40B active) with pre-training data growing from 23T to 28.5T tokens. Try it now: Weights: Tech Blog: OpenRouter (Previously Pony Alpha): Rolling out from Coding Plan Max users: http://z.ai/subscribe http://openrouter.ai/z-ai/glm-5 http://z.ai/blog/glm-5 http://huggingface.co/zai-org/GLM-5 http://chat.z.ai http://z.ai/subscribe http://openrouter.ai/z-ai/glm-5 http://z.ai/blog/glm-5"
X Link 2026-02-11T17:33Z 50.9K followers, 1.4M engagements

"Sparse attention has existed for years. However: [--]. selecting the right tokens is hard [--]. reducing compute doesnt necessarily reduce KV cache memory (because many methods skip attention computation but still store most keys and values and in real systems KV memory is more often the bottleneck than flops) Most methods historically solve one but not the other. Some approaches preserve information well but still store large KV caches or rely on retrieval over growing memory so scalability remains limited. Others aggressively compress memory but use simple heuristics (e.g. fixed strides / block"
X Link 2026-02-09T17:32Z [----] followers, [----] engagements
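
A minimal sketch of the second point above: naive top-k sparse attention reduces the value-aggregation compute but still materializes the full K/V cache (and this toy selector even scans every key to score it), so memory is untouched. Purely illustrative, not any specific paper's method.

```python
# Toy top-k sparse attention for a single query vector. Note that k_cache
# and v_cache hold *every* past token: the memory footprint is unchanged
# even though only top_k values participate in the weighted sum.
import torch

def topk_sparse_attention(q, k_cache, v_cache, top_k=64):
    # q: (d,); k_cache, v_cache: (seq, d)
    scores = q @ k_cache.T / k_cache.size(-1) ** 0.5   # scores over all keys
    idx = scores.topk(min(top_k, scores.numel())).indices
    attn = scores[idx].softmax(dim=-1)                 # softmax over top-k only
    return attn @ v_cache[idx]                         # aggregate top-k values

out = topk_sparse_attention(torch.randn(64), torch.randn(1000, 64),
                            torch.randn(1000, 64))
print(out.shape)  # torch.Size([64])
```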

"no AI for a while"
X Link 2026-02-04T03:25Z [----] followers, [---] engagements

"In current block-wise diffusion LLM decoding every step predicts full token distributions for all masked positions. However in each forward pass only the top-confidence tokens are committed. Everything else is remasked and reset even though the model just spent compute forming meaningful beliefs (just not confident enough to clear the threshold). This paper shows thats actually a real waste: intermediate distributions already have high recall of the final answer even at early steps. The model actually knows a lot but we just throw it away. Residual Context Diffusion (RCD) fixes this with a"
X Link 2026-02-03T04:31Z [----] followers, [----] engagements
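
A minimal sketch of the baseline decoding loop the post critiques: score all masked positions, commit only tokens whose confidence clears a threshold, and remask the rest. `model` is a hypothetical callable returning per-position logits; RCD's residual-context reuse itself is not shown.

```python
# Confidence-threshold block decoding, as described above. Uncommitted
# positions stay masked, discarding the distributions the model just
# computed for them (the waste RCD targets).
import torch

def confidence_commit_decode(model, tokens, mask_id, threshold=0.9, max_steps=32):
    # tokens: (seq,) long tensor; model(tokens) -> (seq, vocab) logits
    for _ in range(max_steps):
        masked = tokens == mask_id
        if not masked.any():
            break
        probs = model(tokens).softmax(dim=-1)
        conf, pred = probs.max(dim=-1)              # per-position confidence
        commit = masked & (conf >= threshold)       # commit confident tokens
        if not commit.any():                        # always commit at least one
            commit = masked & (conf == conf[masked].max())
        tokens = torch.where(commit, pred, tokens)  # everything else remasked
    return tokens
```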

"Take a look at Residual Context Diffusion (RCD): a simple idea to boost diffusion LLMsstop wasting remasked tokens (Example on AIME24. RCD increases parallelism by 4x while reaching the baseline's peak accuracy.) #DiffusionLLM #LLM #Reasoning #GenAI https://arxiv.org/abs/2601.22954 https://arxiv.org/abs/2601.22954"
X Link 2026-02-02T04:00Z [--] followers, 36.6K engagements

"For some reason this latest LLM test-time training (TTT) paper reads a lot like a well-written paper from the pre-LLM era where you can actually feel what the authors are thinking instead of many post-LLM papers which may feel hollow at times. The core idea behind what they call TTT-E2E is simple: at test time the model keeps doing next-token prediction on the context and updates its own weights so it writes useful info from the prefix directly into parameters. (They still use sliding-window attention for local context; the long-range part comes from weight updates.) One practical detail that"
X Link 2026-01-31T04:09Z [----] followers, 24.5K engagements

"Excited to release a new paper today: End-to-End Test-Time Training for Long Context. Our method TTT-E2E enables models to continue learning at test-time via next-token prediction on the given context compressing context into model weights. For our main result we extend 3B parameter models from 8K to 128K. TTT-E2E scales with context length like full attention without maintaining keys and values for every token in the sequence. With linear-complexity TTT-E2E is 2.7x faster than full attention at 128K tokens while achieving better performance. Paper: Code:"
X Link 2025-12-29T18:18Z [---] followers, 44.8K engagements
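
A generic illustration of the test-time-training idea above (not the paper's TTT-E2E implementation): walk the context in chunks, do next-token prediction on each, and take optimizer steps so prefix information ends up in the weights. `model`, the chunk size, and the learning rate are all assumptions.

```python
# Test-time training via next-token prediction on the given context.
# model(ids) is assumed to return (batch, len, vocab) logits.
import torch
import torch.nn.functional as F

def test_time_train(model, context_ids, chunk=512, lr=1e-4, steps_per_chunk=1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for start in range(0, context_ids.size(1) - 1, chunk):
        piece = context_ids[:, start:start + chunk + 1]
        for _ in range(steps_per_chunk):
            logits = model(piece[:, :-1])
            loss = F.cross_entropy(                  # next-token prediction
                logits.reshape(-1, logits.size(-1)),
                piece[:, 1:].reshape(-1),
            )
            opt.zero_grad()
            loss.backward()
            opt.step()                               # context info -> weights
    return model
```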

"DreamGym has been accepted to #ICLR2026 πŸš€ For many agent tasks: (1) you don't need a full physical simulators to know what's gonna happen; (2) many share abstract structures rooted in similar environments that can be learned more generally. So we wondered: why not train a model that: (1) predicts what happens given the For many agent tasks: (1) you don't need a full physical simulators to know what's gonna happen; (2) many share abstract structures rooted in similar environments that can be learned more generally. So we wondered: why not train a model that: (1) predicts what happens given the"
X Link 2026-01-26T14:02Z [----] followers, [----] engagements

"For many agent tasks: (1) you don't need a full physical simulators to know what's gonna happen; (2) many share abstract structures rooted in similar environments that can be learned more generally. So we wondered: why not train a model that: (1) predicts what happens given the current agent action (as reward/feedback) and (2) guides the agent through a curriculum of tasks from easy to hard That's a new paradigm combining + for consistent on-policy agent training without needing to build environments. No more "RL-ready" or "non-RL-ready." We're seeing 30%+ gains over baselines. πŸ“„ Check out"
X Link 2025-11-07T05:37Z [----] followers, 24.6K engagements

"OpenAI @bolinfest just (surprisingly) released details about Codex Its philosophy feels quite clear: long-running agents fail because the execution loop breaks down. So instead of starting from more agents (Cursor) or better memory management (Claude Code) Codex focuses on agent runtime with: a clean repeatable agent loop structured inputs/outputs tool calls as structured actions (not just text) aggressive context / performance management so the loop stays fast stateless-by-default (no hidden server-side memory) If Cursor is organization design and Claude is engineering discipline Codex is"
X Link 2026-01-24T06:42Z [----] followers, [----] engagements

"Recent rapid launches from both Cursor and Claude Code makes me realize that they actually have been taking quite different paths toward making long-running AI agents actually work. Heres whats going on. Cursor: scale horizontally with many agents Cursors core belief is that a single agent already performs well on small well-scoped tasks. The real challenge appears when the task turns into a bigger project thousands of files unclear boundaries evolving goals. Their answer is to scale horizontally: run many agents in parallel decompose work aggressively let different agents specialize Early"
X Link 2026-01-23T02:24Z [----] followers, [----] engagements

"In one sentence a multiplex token embedding is simply the frequency-weighted sum of token embeddings from i.i.d. samples of the next-token distribution. But the on-policy RL compatibility (and the angle/perspective it brings) is really neat. . . πŸš€ : token-wise branch-and-merge reasoning for LLMs. πŸ’Έ Discrete CoT is costly. πŸŽ› Existing continuous tokens often clash with - https://t.co/rviigEgGO6 . . πŸš€ : token-wise branch-and-merge reasoning for LLMs. πŸ’Έ Discrete CoT is costly. πŸŽ› Existing continuous tokens often clash with - https://t.co/rviigEgGO6"
X Link 2026-01-24T04:59Z [----] followers, [----] engagements
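
The one-sentence definition above translates almost directly into code: draw k i.i.d. samples from the next-token distribution, then take the frequency-weighted sum of the sampled tokens' embeddings. Vocabulary and embedding sizes below are arbitrary assumptions.

```python
# Multiplex token embedding per the definition above: frequency-weighted
# sum of embeddings of i.i.d. samples from the next-token distribution.
import torch

def multiplex_token_embedding(logits, embedding, k=8):
    # logits: (vocab,) next-token logits; embedding: nn.Embedding
    probs = logits.softmax(dim=-1)
    samples = torch.multinomial(probs, k, replacement=True)  # i.i.d. draws
    ids, counts = samples.unique(return_counts=True)
    weights = counts.float() / k                             # sample frequencies
    return (weights.unsqueeze(-1) * embedding(ids)).sum(dim=0)

emb = torch.nn.Embedding(32000, 4096)
vec = multiplex_token_embedding(torch.randn(32000), emb)
print(vec.shape)  # torch.Size([4096])
```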

". . πŸš€ : token-wise branch-and-merge reasoning for LLMs. πŸ’Έ Discrete CoT is costly. πŸŽ› Existing continuous tokens often clash with - . πŸŽ₯ a sampling-based continuous reasoning paradigm:"
X Link 2026-01-16T21:37Z [---] followers, 149.5K engagements

"Finally got time to catch up on building custom Claude skills. Vibe-built my first skill to replace my repetitive prompting "thoroughly understand this codebase and." πŸ˜‚ RepoGPS does it automatically: - Paste any GitHub URL or local path - Instant codebase understanding - Works with Claude Code & Antigravity Drop it into ./.claude/skills/ or ./.agent/skills/: https://github.com/zhuokaizhao/RepoGPS-Skill https://github.com/zhuokaizhao/RepoGPS-Skill"
X Link 2026-01-21T18:11Z [----] followers, [---] engagements
