#  @YouJiacheng You Jiacheng You Jiacheng posts on X about ai, if you, open ai, in the the most. They currently have [------] followers and [---] posts still getting attention that total [------] engagements in the last [--] hours. ### Engagements: [------] [#](/creator/twitter::3426190841/interactions)  - [--] Week [---------] +781% - [--] Month [---------] +464% - [--] Months [----------] +250% - [--] Year [----------] +378% ### Mentions: [--] [#](/creator/twitter::3426190841/posts_active)  - [--] Week [--] +106% - [--] Month [---] +99% - [--] Months [---] +206% - [--] Year [---] +7.20% ### Followers: [------] [#](/creator/twitter::3426190841/followers)  - [--] Week [------] +1.70% - [--] Month [------] +5.50% - [--] Months [------] +22% - [--] Year [------] +91% ### CreatorRank: [-------] [#](/creator/twitter::3426190841/influencer_rank)  ### Social Influence **Social category influence** [technology brands](/list/technology-brands) 14% [finance](/list/finance) 11% [stocks](/list/stocks) 5% [currencies](/list/currencies) 4% [countries](/list/countries) 3% [social networks](/list/social-networks) 2% [gaming](/list/gaming) 1% [travel destinations](/list/travel-destinations) 1% **Social topic influence** [ai](/topic/ai) 9%, [if you](/topic/if-you) 6%, [open ai](/topic/open-ai) 5%, [in the](/topic/in-the) 5%, [money](/topic/money) 5%, [core](/topic/core) #1555, [token](/topic/token) 3%, [inference](/topic/inference) 3%, [this is](/topic/this-is) 3%, [paradigm](/topic/paradigm) #56 **Top accounts mentioned or mentioned by** [@fahimtajwar10](/creator/undefined) [@damekdavis](/creator/undefined) [@teortaxestex](/creator/undefined) [@allvods](/creator/undefined) [@zaiorg](/creator/undefined) [@jietang](/creator/undefined) [@bigglesworthx](/creator/undefined) [@minimaxai](/creator/undefined) [@hallerite](/creator/undefined) [@iclrconf](/creator/undefined) [@mike64t](/creator/undefined) [@changjonathanc](/creator/undefined) [@jeremyphoward](/creator/undefined) [@zephyrz9](/creator/undefined) [@agarwl](/creator/undefined) [@eify](/creator/undefined) [@mike64_t](/creator/undefined) [@srashedll](/creator/undefined) [@openai](/creator/undefined) [@chhillee](/creator/undefined) **Top assets mentioned** [Alphabet Inc Class A (GOOGL)](/topic/$googl) [Shopify Inc (SHOP)](/topic/$shop) ### Top Social Posts Top posts by engagements in the last [--] hours "Modern network traffic is mostly video because humans can only consume text at a very slow rate. However AI agents can read text [----] faster (if not [-----] faster) than humans. What will happen" [X Link](https://x.com/YouJiacheng/status/2022417246195429384) 2026-02-13T21:07Z 10.1K followers, [----] engagements "I think this change shouldn't be interpreted in this way. logits are shift invariant. imo it has two effects: [--]. guide lm_head's output towards negative. baseline is already biased towards [--] this change explicitly guides it. [--]. change gradient at [--] (init) New NanoGPT Speedrun WR at [-----] (-2.9s) by simple [--] line update to rescale and shift the logit softcapping. Since logits are not zero-centered and prediction only cares about the right tail asymmetric logit rescaling was explored. https://t.co/RKRatppRR6 https://t.co/rzGTGQpNlX New NanoGPT Speedrun WR at [-----] (-2.9s) by simple [--] line" [X Link](https://x.com/YouJiacheng/status/2004881585263518013) 2025-12-27T11:46Z 10K followers, [----] engagements "since the optimization dynamics will be different it seems not working well (maybe it can work with some tuning e.g. take care of initialization range of logits etc.; but I don't think that is worthwhile.) Yes spot on For n=4 any doubly stochastic matrix W has an exact decomposition (Birkhoff-von Neumann): W = sum_i=124 alpha_i P_i with alpha_i geq [--] sum alpha_i = [--] and P_i the [--] permutation matrices. Just set alpha = textsoftmax(textlogits) over [--] Yes spot on For n=4 any doubly stochastic matrix W has an exact decomposition (Birkhoff-von Neumann): W = sum_i=124 alpha_i P_i with alpha_i geq" [X Link](https://x.com/YouJiacheng/status/2007731779936104803) 2026-01-04T08:32Z 10K followers, [----] engagements "specifically you might need to re-design the static "bias" term. simply adding bias to logits might not work (if you want to init the matrix close to identity the bias term of logits will make the gradient to weights on other permutation matrices very small)" [X Link](https://x.com/YouJiacheng/status/2007733063338602568) 2026-01-04T08:37Z 10K followers, [---] engagements "I love Engram's context-aware gating design. Its symmetry is beautiful. It's a mixture of embeddings where routers are also embeddings. Regular MoE is "mixture of FFNs where routers are also FFNs" but KeyFFNs are bias only. k_i=KeyFFN_i(x) v_i=ValFFN_i(x) o=sum_i (qk_i)v_i" [X Link](https://x.com/YouJiacheng/status/2010869081629769815) 2026-01-13T00:18Z 10K followers, 16.7K engagements "interesting journey: originally: mimetic init for V-O after experiments: just small random init for O (so all QKVO share the same init now no special zero init for O). New NanoGPT Speedrun WR at [----] (-1.2s) from @srashedll with an update to the attention initialization. Motivated by mimetic initialization techniques experiments uncovered that small random init outperformed zero init on attention out projection. https://t.co/cWPtOwRyMV New NanoGPT Speedrun WR at [----] (-1.2s) from @srashedll with an update to the attention initialization. Motivated by mimetic initialization techniques" [X Link](https://x.com/YouJiacheng/status/2017942335850774675) 2026-02-01T12:45Z [----] followers, [----] engagements "interesting. it's well known that you can derive a steepest decent method with a local quadratic model (instead of a linearization). we know that Muon is already a steepest decent method but we can further combine other penalty terms. Today I learned from GPT-5.2 that spectral clipping can be seen as replacing the linearized loss assumption in Muon by a local quadratic model with isotropic Hessian instead https://t.co/Qe135eGFu0 Today I learned from GPT-5.2 that spectral clipping can be seen as replacing the linearized loss assumption in Muon by a local quadratic model with isotropic Hessian" [X Link](https://x.com/YouJiacheng/status/2017944005837987921) 2026-02-01T12:51Z [----] followers, [----] engagements "btw this is their HQ (under construction). ๐Introducing Intern-S1-Pro an advanced 1T MoE open-source multimodal scientific reasoning model. 1SOTA scientific reasoning competitive with leading closed-source models across AI4Science tasks. 2Top-tier performance on advanced reasoning benchmarks strong general https://t.co/cKni28WwQT ๐Introducing Intern-S1-Pro an advanced 1T MoE open-source multimodal scientific reasoning model. 1SOTA scientific reasoning competitive with leading closed-source models across AI4Science tasks. 2Top-tier performance on advanced reasoning benchmarks strong general" [X Link](https://x.com/YouJiacheng/status/2019114041504067999) 2026-02-04T18:21Z [----] followers, 160.1K engagements "New Sparse Attention from Xiaomi mimo. I love this design except the block(64) level selection. HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing https://arxiv.org/abs/2602.03560 https://arxiv.org/abs/2602.03560" [X Link](https://x.com/YouJiacheng/status/2019154300975579505) 2026-02-04T21:01Z [----] followers, [----] engagements "5.3 only uses 48% tokens of [---] (both xhigh) and has 25% higher TPS. the wallclock speedup is +160% (2.6 speed) crazy. GPT-5.3-Codex's much better token efficiency *AND* faster inference is the biggest story of this release. Folks at @OpenAI worked hard to improve this and it will only get better from here. https://t.co/vlOmyxIJmv GPT-5.3-Codex's much better token efficiency *AND* faster inference is the biggest story of this release. Folks at @OpenAI worked hard to improve this and it will only get better from here. https://t.co/vlOmyxIJmv" [X Link](https://x.com/YouJiacheng/status/2019481809638289800) 2026-02-05T18:42Z [----] followers, 28.4K engagements "I just checked Appendix D and found this experiment was done with SGD. I originally assumed Adam (with small eps) so I didn't expect it's that bad. @FahimTajwar10 maybe try Adam so this is what will happen if you train a classifier with -p(label) instead of cross-entropy loss -log p(label) didn't know it's that bad. https://t.co/G8jADLb0NG so this is what will happen if you train a classifier with -p(label) instead of cross-entropy loss -log p(label) didn't know it's that bad. https://t.co/G8jADLb0NG" [X Link](https://x.com/YouJiacheng/status/2019533395496431891) 2026-02-05T22:07Z [----] followers, 11.2K engagements "@hallerite but you can only choose one y* (for a given x/question) and maximize log p(y=y*) if you add label smoothing it means the desirable reward is not deterministic -- this doesn't make sense" [X Link](https://x.com/YouJiacheng/status/2019897245450285215) 2026-02-06T22:13Z [----] followers, [--] engagements "wtf FlexAttention can be repurposed for MoE computation cc @cHHillee Introducing Multi-Head LatentMoE ๐ Turns out making NVIDIA's LatentMoE [--] multi-head further unlocks O(1) balanced and deterministic communication. Our insight: Head Parallel; Move routing from before all-to-all to after. Token duplication happens locally. Always uniform https://t.co/29P4Oz1BDm Introducing Multi-Head LatentMoE ๐ Turns out making NVIDIA's LatentMoE [--] multi-head further unlocks O(1) balanced and deterministic communication. Our insight: Head Parallel; Move routing from before all-to-all to after. Token" [X Link](https://x.com/YouJiacheng/status/2020555070816002519) 2026-02-08T17:47Z [----] followers, 32.7K engagements "the experiments feel strange. (both end2end training and per-component runtime benchmark) [----] tokens [---] experts wallclock time but the number of GPUs is not mentioned (dense MFU is 3% if [--] GPUs) [----] steps warmup [----] steps stable [----] steps decay (10B/0.66M=15151) Introducing Multi-Head LatentMoE ๐ Turns out making NVIDIA's LatentMoE [--] multi-head further unlocks O(1) balanced and deterministic communication. Our insight: Head Parallel; Move routing from before all-to-all to after. Token duplication happens locally. Always uniform https://t.co/29P4Oz1BDm Introducing Multi-Head LatentMoE ๐" [X Link](https://x.com/YouJiacheng/status/2020560173950202096) 2026-02-08T18:07Z [----] followers, [----] engagements "it took me [--] minutes to understand what does "counterfactuality" mean in this paper. then I tried [--] non-frontier LLMs they all one-shot it even without thinking. โจTwo single author papers accepted to ICLR 2026โจ Truly excited to present these results at #ICLR2026 @iclr_conf https://t.co/Ptf6oCEblP โจTwo single author papers accepted to ICLR 2026โจ Truly excited to present these results at #ICLR2026 @iclr_conf https://t.co/Ptf6oCEblP" [X Link](https://x.com/YouJiacheng/status/2020879683219542053) 2026-02-09T15:17Z [----] followers, 62.5K engagements ""deterministic training methods"๐ค Meta's next model is codenamed Avocado and it already beats the best open-source models before any fine-tuning or RLHF. Just pretraining. Internal docs say it's 10x more efficient than Maverick and 100x more efficient than Behemoth the unreleased LLaMA [--] flagship. They attribute Meta's next model is codenamed Avocado and it already beats the best open-source models before any fine-tuning or RLHF. Just pretraining. Internal docs say it's 10x more efficient than Maverick and 100x more efficient than Behemoth the unreleased LLaMA [--] flagship. They attribute" [X Link](https://x.com/YouJiacheng/status/2020891806288998547) 2026-02-09T16:05Z [----] followers, [----] engagements "OpenRouter stat shows K2.5 has a 100:1 in-out ratio. Assuming 98% cache hit rate 1.7M output tokens corresponding to 3.4M cache miss & 166.6M cache hit. Using Moonshot's pricing we get $23.8/d $714/mo. BUT: prefill on Mac is not free. You need [--] Mac Studios with M3 Ultra to run Kimi K2.5 locally at [--] TPS. At that speed you can barely run a single agent not a "swarm" or an "army." Kimi K2.5 costs $1.5/M output via OpenRouter. Even if you max out [--] TPS 24/7 that's 1.7M tokens/day which is equivalent of You need [--] Mac Studios with M3 Ultra to run Kimi K2.5 locally at [--] TPS. At that speed you" [X Link](https://x.com/YouJiacheng/status/2020896107182817550) 2026-02-09T16:22Z [----] followers, [----] engagements "does this hold on hyperball optimizers Derivation of optimal LR schedule based on the regime of capacity and task difficulty. In the hard regime WSD is optimal. Very similar conclusion to https://t.co/S3iFaUofXp. https://t.co/FmYu7YsD54 Derivation of optimal LR schedule based on the regime of capacity and task difficulty. In the hard regime WSD is optimal. Very similar conclusion to https://t.co/S3iFaUofXp. https://t.co/FmYu7YsD54" [X Link](https://x.com/YouJiacheng/status/2020904167607501010) 2026-02-09T16:54Z [----] followers, [----] engagements "who designed this UI/UX WHY zai use more space to show a meaningless animation than the terminal AND I can't adjust the height of terminal AND I can't scroll up the terminal when the agent is running (it will be immediately scrolled down) @Zai_org @jietang" [X Link](https://x.com/YouJiacheng/status/2021682289319792804) 2026-02-11T20:26Z 10K followers, [----] engagements "@Zai_org @jietang" [X Link](https://x.com/YouJiacheng/status/2021682974685856041) 2026-02-11T20:29Z 10K followers, [----] engagements "@Zai_org @jietang the result is pretty good but UI/UX is terrible and wtf the agent uses heredoc to directly write python script in terminal https://chat.z.ai/s/0972127f-73e6-47be-973b-daed97175899 https://chat.z.ai/s/0972127f-73e6-47be-973b-daed97175899" [X Link](https://x.com/YouJiacheng/status/2021685328734826971) 2026-02-11T20:38Z 10K followers, [---] engagements "@mike64_t i found that your built a lot of custom DL "libraries" what's your favorite way to implement autograd i kinda want to build one for myself" [X Link](https://x.com/YouJiacheng/status/2021700248213881048) 2026-02-11T21:37Z 10K followers, [---] engagements "@mike64_t yup the backward can still be done in a layer-by-layer manner we merely need to maintain a larger "state" when we iterate over layers. but it might be more prone to error (typo / forget to update etc.) maybe an integration test is needed" [X Link](https://x.com/YouJiacheng/status/2021711635384349158) 2026-02-11T22:23Z [----] followers, [---] engagements "but it's significantly more expensive than deepseek-v3.2 with similar or lower TPS why GLM [---] was 32B active while GLM [--] is 40B active Inference is also cheaper due to DSA Meanwhile they have increased the price substantially to increase gross margins As a proud Zhipu shareholder since IPO I approve https://t.co/jlf3zz2SKQ GLM [---] was 32B active while GLM [--] is 40B active Inference is also cheaper due to DSA Meanwhile they have increased the price substantially to increase gross margins As a proud Zhipu shareholder since IPO I approve https://t.co/jlf3zz2SKQ" [X Link](https://x.com/YouJiacheng/status/2021719674577441158) 2026-02-11T22:55Z 10K followers, [----] engagements "Codeforces results is "no tools" So Gemini [---] Deep Think cannot write test cases to test its solution before submission I guess even the top1 human can't get [----] under this condition. An updated & faster Gemini [--] Deep Think is taking off ๐ Our smartest mode to date PhD-level reasoning to the most rigorous STEM challenges (models' gotta think harder). Gold medal-level results on Physics & Chemistry Olympiads. ๐งช๐ป Full details: https://t.co/9XBQwHSCYW https://t.co/aOlbjI8RKo An updated & faster Gemini [--] Deep Think is taking off ๐ Our smartest mode to date PhD-level reasoning to the most" [X Link](https://x.com/YouJiacheng/status/2021985843074994534) 2026-02-12T16:32Z 10K followers, 23.5K engagements "hi could you copy&paste the [---] lines core to a standalone gist i kinda don't want to dig around in the repo. I'm extremely proud to share that the @symbolica research team has achieved a monumental result in program synthesis. We have been able to reach SOTA on ARC-AGI-2 (85.28% @ $6.94/task) using @agenticasdk as a neurosymbolic program synthesis engine. This engine is not ARC I'm extremely proud to share that the @symbolica research team has achieved a monumental result in program synthesis. We have been able to reach SOTA on ARC-AGI-2 (85.28% @ $6.94/task) using @agenticasdk as a" [X Link](https://x.com/YouJiacheng/status/2022036229844611141) 2026-02-12T19:52Z 10K followers, [----] engagements "๐AI professor (male) on XHS be like: https://www.xiaohongshu.com/user/profile/5f9ce0c10000000001004d49 https://www.xiaohongshu.com/user/profile/5f9ce0c10000000001004d49" [X Link](https://x.com/YouJiacheng/status/2021991658360021367) 2026-02-12T16:55Z 10.1K followers, 130.9K engagements "@BigglesworthX haha didn't analyze it at all (whether it's AI or not I can only see picture so it doesn't matter). coincidentally it's literally *AI* professor๐ (didn't realize it's AI when I shared the pictures I merely thought it's filter-maxx cuz photos in other posts are different)" [X Link](https://x.com/YouJiacheng/status/2022491674216399279) 2026-02-14T02:02Z 10.1K followers, [---] engagements "it's still climbing higher. 459B input and 2.6B output 176:1.๐ค Wow M2.5 set a new record on this metric they processed 430B input tokens and 2.64B output tokens on OpenRouter yesterday the in:out ratio is astonishing 163:1. Wow M2.5 set a new record on this metric they processed 430B input tokens and 2.64B output tokens on OpenRouter yesterday the in:out ratio is astonishing 163:1" [X Link](https://x.com/YouJiacheng/status/2023066287321325921) 2026-02-15T16:06Z 10.1K followers, 10.9K engagements "MLA works well in single turn scenario prefill once in MHA mode then switch to MQA mode and decode. However in agent era you need to prefill many times. Switching back to MHA mode is expensive but prefilling in MQA mode will use tons of FLOPs. Will Kimi still use MLA The prefill computation cost is [--] higher. Will Kimi still use MLA The prefill computation cost is [--] higher" [X Link](https://x.com/YouJiacheng/status/2023155193576308981) 2026-02-15T21:59Z 10.1K followers, 12.6K engagements "but this will introduce train-serve mismatch: your actual users use your API in a stateless way they won't say "hey this message belongs to this trajectory/session" so you can't record actions" [X Link](https://x.com/YouJiacheng/status/2022398926331220233) 2026-02-13T19:54Z 10.1K followers, [---] engagements "why it can't be accessed directly (needs vercel account and request access) https://inferencemax.ai/ $1000 for whoever comes up with the best name replacement for InferenceMAX InferenceMAX [---] dropping soon but we have to rename it because HBO MAX sent us a cease and desist. We have all NVIDIA GPUs from h100 to GB300 on large MoEs with SOTA optimizations like Disagg PD tested https://inferencemax.ai/ $1000 for whoever comes up with the best name replacement for InferenceMAX InferenceMAX [---] dropping soon but we have to rename it because HBO MAX sent us a cease and desist. We have all NVIDIA" [X Link](https://x.com/YouJiacheng/status/2023252745617371165) 2026-02-16T04:26Z 10.1K followers, [----] engagements "why because I think this is muon-adam done right. muon is msign-sgd. adam is sign-sgd with SNR adaptive step-size (see or https://arxiv.org/abs/2505.21829 https://kexue.fm/archives/11593 this one looks perfectly correct. https://t.co/7msI3ISAaE https://t.co/DWpdMFo2LE https://arxiv.org/abs/2505.21829 https://kexue.fm/archives/11593 this one looks perfectly correct. https://t.co/7msI3ISAaE https://t.co/DWpdMFo2LE" [X Link](https://x.com/YouJiacheng/status/2019161069391081927) 2026-02-04T21:28Z 10K followers, 17K engagements "No we still can't estimate. 32B dense and 70B dense have a similar TPS. Maverick (400B@17B) [----] vs. 235B@22B [----]. It seems to be KV read bound (Qwen full attn Maverick hybrid w/ chunked attn). GPT-5.3-Codex-Spark size: 700B@30B OpenAI's new GPT-5.3-Codex-Spark is the first model for which we can somewhat reliably estimate its size. Cerebras inference: [----] tokens/s - GLM-4.7 is 355@32B [--] layers [----] tokens/s - Qwen3-235B is 235@22B [--] layers [----] tokens/s - https://t.co/VXmL1eppAs GPT-5.3-Codex-Spark size: 700B@30B OpenAI's new GPT-5.3-Codex-Spark is the first model for which we can" [X Link](https://x.com/YouJiacheng/status/2022034996614275534) 2026-02-12T19:48Z 10.1K followers, [----] engagements "obviously deepseek's new model behaves like GPT-5.x making role-play/companion/writing users really upset. deepseek should not do this. SCOOP: OpenAI warned lawmakers in a memo sent today to the House Select Committee on China that DeepSeek is using new obfuscated methods to continue to distill its AI models as well as those of other US frontier AI labs https://t.co/OsWxPRMF28 w/ @eastland_maggie SCOOP: OpenAI warned lawmakers in a memo sent today to the House Select Committee on China that DeepSeek is using new obfuscated methods to continue to distill its AI models as well as those of other" [X Link](https://x.com/YouJiacheng/status/2022110369721028788) 2026-02-13T00:47Z 10.1K followers, 10.5K engagements "and Kuaishou" [X Link](https://x.com/YouJiacheng/status/2022274703470301599) 2026-02-13T11:40Z 10.1K followers, [----] engagements "theoretically if the agent scaffold itself is append-only we can achieve a consistent tokenization by: [--]. record action tokens [--]. split input string by actions [--]. tokenize non-action substrings [--]. concat with action tokens This is nonintrusive (can be done with a middleware). Why MM needs prefix merge Because they adopt a completion-level training paradigm process each (s a) separately. Traj-level training process all actions in a traj at once but need to tokenize the traj (may diff from rollout) which introduce a mismatch. https://t.co/IVal48PVAy Why MM needs prefix merge Because they adopt" [X Link](https://x.com/YouJiacheng/status/2022369817823719553) 2026-02-13T17:58Z 10.1K followers, [----] engagements "amazing. [--] is enough 7/ Second payoff: once [--] heads handle retrieval the SSM backbone doesn't need to compensate. SSM state dimension can shrink [--] from [--] to [--] with limited loss. Large SSM states were partly a crutch for missing retrieval. https://t.co/yo7WFxgTga 7/ Second payoff: once [--] heads handle retrieval the SSM backbone doesn't need to compensate. SSM state dimension can shrink [--] from [--] to [--] with limited loss. Large SSM states were partly a crutch for missing retrieval. https://t.co/yo7WFxgTga" [X Link](https://x.com/YouJiacheng/status/2022404367639535973) 2026-02-13T20:15Z 10.1K followers, [----] engagements "One caveat: Llama-3.2-1B has 1632=512 attn heads but only 168=128 kv heads; Qwen2.5-1.5B has 2812=336 attn heads but only 282=56 kv heads. ๐งต A Transformer-SSM Hybrid with a single layers worth of attention heads Most hybrids use 25-50% of attention heads. We show that 2% is enough scattered across the network to recover 95%+ of the full Transformer's performance on recall math and more. Here's how ๐ https://t.co/1UqcVaSCzi ๐งต A Transformer-SSM Hybrid with a single layers worth of attention heads Most hybrids use 25-50% of attention heads. We show that 2% is enough scattered across the" [X Link](https://x.com/YouJiacheng/status/2022407889466626558) 2026-02-13T20:29Z 10.1K followers, [----] engagements "@ChangJonathanC IIUC in agent RL the prompt itself is small compared to tool call results" [X Link](https://x.com/YouJiacheng/status/2022408522353508551) 2026-02-13T20:32Z 10K followers, [---] engagements "@damekdavis @FahimTajwar10 oh I mean Figure [--] shows that ln(t) + [--] and t have similar derivatives around t=1 which t(RL) is a first-order approximation of ln(t) + 1(MaxRL). ofc the math here is common sense but the caption matters: "REINFORCE loss vs. log loss"" [X Link](https://x.com/YouJiacheng/status/2022445320249262142) 2026-02-13T22:58Z 10K followers, [---] engagements "@damekdavis @FahimTajwar10 so Davis *vaguely* showed this first-order approx relationship exists; Tajwar interpreted log loss as Maximum Likelihood RL (with great motivating exps on ImageNet) and clearly pointed out the approx relationship and empirically shows the advantage of log loss" [X Link](https://x.com/YouJiacheng/status/2022446046308376953) 2026-02-13T23:01Z 10K followers, [---] engagements "nice. to be precise @andrewbriand8 merge lm_head into the FusedSoftcapCE autograd function so that the triton kernel of SoftcapCEBackward can directly store quantized fp8 gradient w.r.t. logits to gmem i.e. the quantization is fused into CE's backward kernel. New NanoGPT Speedrun WR at [----] (-3.3s) from the kernel sorcerer @andrewbriand8 with a triton kernel to fuse the fp8 quantization of the gradient into the backward kernel. (Lm_head calc is ran in fp8) On pace for 120% MFU by July. https://t.co/QXtT8kCMuC New NanoGPT Speedrun WR at [----] (-3.3s) from the kernel sorcerer @andrewbriand8 with" [X Link](https://x.com/YouJiacheng/status/2021030853271486589) 2026-02-10T01:17Z 10.1K followers, [----] engagements "@jeremyphoward isn't it just the consistency of chain rule and i think if it were not done in an elementwise manner it will be obviously inefficient to store the local jacobian compared to storing a VJP lambda (with ctx captured)" [X Link](https://x.com/YouJiacheng/status/2022081392612667836) 2026-02-12T22:52Z 10.1K followers, [----] engagements "which one is better anon Why MM needs prefix merge Because they adopt a completion-level training paradigm process each (s a) separately. Traj-level training process all actions in a traj at once but need to tokenize the traj (may diff from rollout) which introduce a mismatch. https://t.co/IVal48PVAy Why MM needs prefix merge Because they adopt a completion-level training paradigm process each (s a) separately. Traj-level training process all actions in a traj at once but need to tokenize the traj (may diff from rollout) which introduce a mismatch. https://t.co/IVal48PVAy" [X Link](https://x.com/YouJiacheng/status/2022402991807172984) 2026-02-13T20:10Z 10.1K followers, [----] engagements "interesting. can we extend it to agents with the "edit" action (i.e. the agent can progressively improve its final deliverable -- not only internal belief matters external files also matter) ๐ก Key idea: ๐ Use the change in the agents belief about the correct answer as a dense intrinsic reward. If an action increases: log p(target history) reward it. We call this Belief-RL. No critic. No process reward model. Just the agent judging its own progress. (2/8) https://t.co/9qYYQxZRRh ๐ก Key idea: ๐ Use the change in the agents belief about the correct answer as a dense intrinsic reward. If an" [X Link](https://x.com/YouJiacheng/status/2022414580358709542) 2026-02-13T20:56Z 10.1K followers, [----] engagements "standalone pic 1:" [X Link](https://x.com/YouJiacheng/status/2022432307559461103) 2026-02-13T22:06Z 10.1K followers, [----] engagements "are you sure the index url is instead of https://wheels.vllm.ai/rocm/0.15.1/rocm700 http://lnkd.in/gJdnn3kJ โ Day [--] support of MiniMax-2.5 on AMD GPU [--] MI300X GPUs are all you need instead of [--] Hopper GPUs to run it in full context. uv pip install vllm --extra-index-url https://t.co/EKXrFnZV5P VLLM_ROCM_USE_AITER=1 vllm serve MiniMaxAI/MiniMax-M2.5 --tensor-parallel-size [--] https://t.co/uVIh2f5hHv https://wheels.vllm.ai/rocm/0.15.1/rocm700 http://lnkd.in/gJdnn3kJ โ Day [--] support of MiniMax-2.5 on AMD GPU [--] MI300X GPUs are all you need instead of [--] Hopper GPUs to run it in full context. uv pip" [X Link](https://x.com/YouJiacheng/status/2022432882086809816) 2026-02-13T22:09Z 10.1K followers, [----] engagements "It takes more time so if Aristotle runs on @cerebras it will be probably faster than reading the NL proof. I've recently started handing off llm generated proofs to Aristotle with some success. It takes more time than reading the proof myself but I can at least avoid the effort / disappointment of reading something incorrect and do something else in the meantime. I've recently started handing off llm generated proofs to Aristotle with some success. It takes more time than reading the proof myself but I can at least avoid the effort / disappointment of reading something incorrect and do" [X Link](https://x.com/YouJiacheng/status/2022452060139188403) 2026-02-13T23:25Z 10.1K followers, [----] engagements "LOL I just realized that Minimax actually released the English version of this blog earlier https://x.com/MiniMax_AI/status/2022175400093462661 Blog about @MiniMax_AI 's Forge RL system. Core takeaways: [--]. still CISPO [--]. process reward completion time reward [--]. multi-level prefix cache [--]. rollout uses 60% compute [--]. millions of trajectories per day https://t.co/IrKDOoiKAB cc @teortaxesTex https://x.com/MiniMax_AI/status/2022175400093462661 Blog about @MiniMax_AI 's Forge RL system. Core takeaways: [--]. still CISPO [--]. process reward completion time reward [--]. multi-level prefix cache [--]. rollout" [X Link](https://x.com/YouJiacheng/status/2022457717261275385) 2026-02-13T23:47Z 10.1K followers, [----] engagements "shocked Minimax's mobile app has no chat functionality now you can only use agent. Wow M2.5 set a new record on this metric they processed 430B input tokens and 2.64B output tokens on OpenRouter yesterday the in:out ratio is astonishing 163:1. Wow M2.5 set a new record on this metric they processed 430B input tokens and 2.64B output tokens on OpenRouter yesterday the in:out ratio is astonishing 163:1" [X Link](https://x.com/YouJiacheng/status/2022786508097552881) 2026-02-14T21:34Z 10.1K followers, [----] engagements "in:out ratio of K2.5 is 100:1 out:cache-read price ratio is merely 30:1 so Moonshot makes [--] money from cache-read compared to output. Opus [---] 60:1 and 50:1. GPT-5.2-Codex 95:1 and 80:1. Think about it" [X Link](https://x.com/YouJiacheng/status/2020550157268590998) 2026-02-08T17:27Z 10.1K followers, 16.8K engagements "wow Meituan requires at least 50% code needs to be generated by AI" [X Link](https://x.com/YouJiacheng/status/2022274167593369681) 2026-02-13T11:38Z 10.1K followers, [----] engagements "Blog about @MiniMax_AI 's Forge RL system. Core takeaways: [--]. still CISPO [--]. process reward completion time reward [--]. multi-level prefix cache [--]. rollout uses 60% compute [--]. millions of trajectories per day cc @teortaxesTex https://zhuanlan.zhihu.com/p/2005742716252861435 https://zhuanlan.zhihu.com/p/2005742716252861435" [X Link](https://x.com/YouJiacheng/status/2022339475049947576) 2026-02-13T15:57Z 10.1K followers, 31.5K engagements "Why MM needs prefix merge Because they adopt a completion-level training paradigm process each (s a) separately. Traj-level training process all actions in a traj at once but need to tokenize the traj (may diff from rollout) which introduce a mismatch. Blog about @MiniMax_AI 's Forge RL system. Core takeaways: [--]. still CISPO [--]. process reward completion time reward [--]. multi-level prefix cache [--]. rollout uses 60% compute [--]. millions of trajectories per day https://t.co/IrKDOoiKAB cc @teortaxesTex Blog about @MiniMax_AI 's Forge RL system. Core takeaways: [--]. still CISPO [--]. process reward" [X Link](https://x.com/YouJiacheng/status/2022364214116192723) 2026-02-13T17:36Z 10.1K followers, [----] engagements "the idea is straightforward but the result is jaw-dropping. Respect to the team. Salute.๐ซก ๐งต A Transformer-SSM Hybrid with a single layers worth of attention heads Most hybrids use 25-50% of attention heads. We show that 2% is enough scattered across the network to recover 95%+ of the full Transformer's performance on recall math and more. Here's how ๐ https://t.co/1UqcVaSCzi ๐งต A Transformer-SSM Hybrid with a single layers worth of attention heads Most hybrids use 25-50% of attention heads. We show that 2% is enough scattered across the network to recover 95%+ of the full Transformer's" [X Link](https://x.com/YouJiacheng/status/2022405973705732556) 2026-02-13T20:22Z 10.1K followers, 13.4K engagements "I doubt that the Trump administration has a closer to pro-human value than Xi. The @DarioAmodei interview. 0:00:00 - What exactly are we scaling 0:12:36 - Is diffusion cope 0:29:42 - Is continual learning necessary 0:46:20 - If AGI is imminent why not buy more compute 0:58:49 - How will AI labs actually make profit 1:31:19 - Will regulations destroy https://t.co/qsFoNMAy2t The @DarioAmodei interview. 0:00:00 - What exactly are we scaling 0:12:36 - Is diffusion cope 0:29:42 - Is continual learning necessary 0:46:20 - If AGI is imminent why not buy more compute 0:58:49 - How will AI labs" [X Link](https://x.com/YouJiacheng/status/2022435191831277631) 2026-02-13T22:18Z 10.1K followers, [----] engagements "Wow M2.5 set a new record on this metric they processed 430B input tokens and 2.64B output tokens on OpenRouter yesterday the in:out ratio is astonishing 163:1. in:out ratio of K2.5 is 100:1 out:cache-read price ratio is merely 30:1 so Moonshot makes [--] money from cache-read compared to output. Opus [---] 60:1 and 50:1. GPT-5.2-Codex 95:1 and 80:1. Think about it. https://t.co/lwjxLKb08O in:out ratio of K2.5 is 100:1 out:cache-read price ratio is merely 30:1 so Moonshot makes [--] money from cache-read compared to output. Opus [---] 60:1 and 50:1. GPT-5.2-Codex 95:1 and 80:1. Think about it." [X Link](https://x.com/YouJiacheng/status/2022751248861274520) 2026-02-14T19:14Z 10.1K followers, 21.2K engagements "trademark issue" [X Link](https://x.com/YouJiacheng/status/2023258193863045282) 2026-02-16T04:48Z 10.1K followers, [---] engagements "Zai just disclosed its revenue and R&D cost. In 2025H1 Zai spent RMB 1.145B renting compute for R&D. Using $2/GPU/h H100 for estimation the cost is the same as 18k H100s. https://www1.hkexnews.hk/app/sehk/2025/107977/documents/sehk25121901699_c.pdf https://www1.hkexnews.hk/app/sehk/2025/107977/documents/sehk25121901699_c.pdf" [X Link](https://x.com/YouJiacheng/status/2002126695562948665) 2025-12-19T21:19Z [----] followers, 29.8K engagements "checked comments in the PR it seems that "not zero-centered" refer to logits before rescaling that make sense" [X Link](https://x.com/YouJiacheng/status/2004883910547894498) 2025-12-27T11:55Z 10K followers, [---] engagements "@zephyr_z9 in their 2025H1 report "silver-free metallization" is not in the "" (used in mass production) category" [X Link](https://x.com/YouJiacheng/status/2005670792814903560) 2025-12-29T16:02Z [----] followers, [---] engagements "oh google is actually earlier than Bytedance lol. MHC - AltUp Engram - Per Layer Embedding + Ngrammer Both from Gemma-3n MHC - AltUp Engram - Per Layer Embedding + Ngrammer Both from Gemma-3n" [X Link](https://x.com/YouJiacheng/status/2010770607005483177) 2026-01-12T17:47Z [----] followers, 17.4K engagements "Infra is the most important thing. Faster iteration and less bugs. n+e shared the same thought in this podcast hosted by WhyNotTV. https://www.bilibili.com/video/BV1darmBcE4A ๐ถ Some (perhaps) spicy thoughts. Its been a while since my last tweet but I wanted to write about how disorienting it has been from academia to an LLM lab ๐ The kind of research I was trained to do during my PhD almost doesnt exist here. The obsession with mathematical https://www.bilibili.com/video/BV1darmBcE4A ๐ถ Some (perhaps) spicy thoughts. Its been a while since my last tweet but I wanted to write about how" [X Link](https://x.com/YouJiacheng/status/2012528722306277547) 2026-01-17T14:13Z [----] followers, 16K engagements "@jiayi_pirate ๐๐ฎ๐บ" [X Link](https://x.com/YouJiacheng/status/2012569032726028571) 2026-01-17T16:53Z [----] followers, [---] engagements "$1.2 per H100 hour is not a bad business if you don't need to upgrade GPUs. This means you can make $8k per year per H100 after considering $0.28 per H100 hour OpEx. If @realDonaldTrump can lower the interest rate it's easy to cover CapEx in the lifetime of GPUs. Hi everyone who says AI compute is getting cheaper. OpenAI has made about $1.20 per kWh of compute each year for the last [--] years. Zero cost reduction in [--] years. Conveniently takes about [--] kW to run an H100 so they make $1.20 per GPU hour. Reserved instances run anywhere Hi everyone who says AI compute is getting cheaper. OpenAI has" [X Link](https://x.com/YouJiacheng/status/2013183305454354846) 2026-01-19T09:34Z [----] followers, [----] engagements "cuz we have so many embed params now I want to share what I previously found about the bwd of the embed. It is split into three steps all incur significant memory access. [--]. zero a fp32 buffer of size (V d) [--]. atomic add to this buffer [--]. cast it to bf16. maybe we can do sth New NanoGPT Speedrun WR at 99.3s (-5.6s) with a bigram hash embedding that is added to the residual stream before every layer. Inspiration from Svenstrup et al [----] paper on Hash Embeddings and Deepseek's Engram. Modded-NanoGPT now uses fewer training tokens than its parameter https://t.co/ykGVKtrOM0 New NanoGPT Speedrun" [X Link](https://x.com/YouJiacheng/status/2013530159647990042) 2026-01-20T08:32Z [----] followers, [----] engagements "btw you can order food via QwenApp and Alibaba provides special coupons for the Qwen Agent. Two days ago OpenAI clarified the transaction fee it will charge Shopify merchants with its Instant Checkout product: 4%. This reinforces my argument that (independent) agentic commerce is a mirage: many Shopify merchants run on incredibly thin margins (3-8% net) and simply https://t.co/F0ZEBVqTfH Two days ago OpenAI clarified the transaction fee it will charge Shopify merchants with its Instant Checkout product: 4%. This reinforces my argument that (independent) agentic commerce is a mirage: many" [X Link](https://x.com/YouJiacheng/status/2015119249824731322) 2026-01-24T17:47Z [----] followers, [----] engagements "this is classified as an attack New research: When open-source models are fine-tuned on seemingly benign chemical synthesis information generated by frontier models they become much better at chemical weapons tasks. We call this an elicitation attack. https://t.co/44mYnxFKzr New research: When open-source models are fine-tuned on seemingly benign chemical synthesis information generated by frontier models they become much better at chemical weapons tasks. We call this an elicitation attack. https://t.co/44mYnxFKzr" [X Link](https://x.com/anyuser/status/2015991036804464989) 2026-01-27T03:31Z [----] followers, [----] engagements "interesting Another strategy infers meaning using sets. We have seen models keep track of "positive" and "negative" sets that let it narrow its understanding of a symbol using Sudoku-style cancellation. Red bars (a) show the positive set and blue boxes (b) show the negative. https://t.co/iVgl13fMUo Another strategy infers meaning using sets. We have seen models keep track of "positive" and "negative" sets that let it narrow its understanding of a symbol using Sudoku-style cancellation. Red bars (a) show the positive set and blue boxes (b) show the negative. https://t.co/iVgl13fMUo" [X Link](https://x.com/YouJiacheng/status/2015992635325087808) 2026-01-27T03:37Z [----] followers, [----] engagements "I searched and found a review mentioning "FlexPrune" but the text is different. It's How does this differ concretely from FlexPrune or ShearedLLaMA beyond empirical tuning instead of Missing concrete comparison to FlexPrune. https://openreview.net/forumid=DOLF9TUBfa¬eId=Wsd2aIEOvW Reviewer: hallucinates a baseline that doesn't exist. Meta-reviewer: cites the hallucination as the paper's fatal flaw. Decision: reject. @iclr_conf is amazing https://t.co/Z0oWxerqiD https://openreview.net/forumid=DOLF9TUBfa¬eId=Wsd2aIEOvW Reviewer: hallucinates a baseline that doesn't exist. Meta-reviewer:" [X Link](https://x.com/YouJiacheng/status/2015997060357750959) 2026-01-27T03:55Z [----] followers, [----] engagements "So Missing concrete comparison to FlexPrune is from AC Bruh" [X Link](https://x.com/YouJiacheng/status/2015997610650435966) 2026-01-27T03:57Z [----] followers, [---] engagements "Some negative results: [--]. It still cannot answer some hard knowledge queries that K2 can't answer (while Opus [---] can). Its pretraining is probably the same as K2. [--]. It has severe hallucinations in advanced vision use cases" [X Link](https://x.com/YouJiacheng/status/2016022034510496133) 2026-01-27T05:34Z [----] followers, [---] engagements "Interesting price change. cc @zephyr_z9 ๐ฅ Meet Kimi K2.5 Open-Source Visual Agentic Intelligence. ๐น Global SOTA on Agentic Benchmarks: HLE full set (50.2%) BrowseComp (74.9%) ๐น Open-source SOTA on Vision and Coding: MMMU Pro (78.5%) VideoMMMU (86.6%) SWE-bench Verified (76.8%) ๐น Code with Taste: turn chats https://t.co/wp6JZS47bN ๐ฅ Meet Kimi K2.5 Open-Source Visual Agentic Intelligence. ๐น Global SOTA on Agentic Benchmarks: HLE full set (50.2%) BrowseComp (74.9%) ๐น Open-source SOTA on Vision and Coding: MMMU Pro (78.5%) VideoMMMU (86.6%) SWE-bench Verified (76.8%) ๐น Code with Taste:" [X Link](https://x.com/YouJiacheng/status/2016028878763130884) 2026-01-27T06:01Z [----] followers, 28.3K engagements "Engineering VP of Dify" [X Link](https://x.com/YouJiacheng/status/2016047442790384038) 2026-01-27T07:15Z [----] followers, 31.3K engagements "oh the actual title should be "Product & R&D VP"" [X Link](https://x.com/YouJiacheng/status/2016051665506545975) 2026-01-27T07:32Z [----] followers, [----] engagements "dude the TV price changes are mainly due to quality adjustment. I wonder if U.S. can design a suitable quality adjustment for healthcare so that President Trump can eliminate inflation in healthcare. Whoever is in charge of TV prices should be put in charge of healthcare education and housing prices https://t.co/Npk4LEuaQK Whoever is in charge of TV prices should be put in charge of healthcare education and housing prices https://t.co/Npk4LEuaQK" [X Link](https://x.com/YouJiacheng/status/2016272450292744410) 2026-01-27T22:09Z [----] followers, [----] engagements "RT @samsja19: Today were releasing Trinity Large a 400B MoE LLM with 13B active parameters trained over 17T tokens The base model is on" [X Link](https://x.com/YouJiacheng/status/2016320255141019662) 2026-01-28T01:19Z [----] followers, [--] engagements "super clean it's live leading the pre-train on this was the most ambitious thing i've ever had the chance to do. we've got a full tech report and a brilliant loss graph to go with it https://t.co/Ail5Rxp711 it's live leading the pre-train on this was the most ambitious thing i've ever had the chance to do. we've got a full tech report and a brilliant loss graph to go with it https://t.co/Ail5Rxp711" [X Link](https://x.com/YouJiacheng/status/2016416293159895421) 2026-01-28T07:41Z [----] followers, [----] engagements "I thought trinity means Arcee Prime Intellect Datology but it seems that Datology is not in the author list lol. how can you not be romantic about open models https://t.co/weBpiRB2i3 how can you not be romantic about open models https://t.co/weBpiRB2i3" [X Link](https://x.com/YouJiacheng/status/2016421269793866236) 2026-01-28T08:01Z 10K followers, [----] engagements "Wow this is uncommon for a humanoid research. Usually the reproduced performance is inferior (and some research just do not provide deployment code.) but in this video the motion is very smooth. Looks like some friends already made it happen in their labs ๐๐ค https://t.co/Sh9eru79y1 Looks like some friends already made it happen in their labs ๐๐ค https://t.co/Sh9eru79y1" [X Link](https://x.com/YouJiacheng/status/2016439199877238975) 2026-01-28T09:12Z [----] followers, [----] engagements "TIL: CSGO players should use golang and VALORANT players should use tilelang. Because golang is go tilelang is " [X Link](https://x.com/YouJiacheng/status/2016454652586602941) 2026-01-28T10:13Z [----] followers, [----] engagements "money problem is more severe than GPUs problem in China. OpenAI burnt like 200B RMB in [----] while Kimi proudly talked about they have 10B RMB cash. China will solve GPUs before the US solves energy I think people are a bit confused on this front. The US will *not* hit the energy bottleneck for AI until very late in the game if ever. Energy-wise Speciale RL (300K H800-hours) cost like [---] MWh (assuming mediocre PUE) China will solve GPUs before the US solves energy I think people are a bit confused on this front. The US will *not* hit the energy bottleneck for AI until very late in the game if" [X Link](https://x.com/YouJiacheng/status/2016498630346334221) 2026-01-28T13:08Z [----] followers, 47.5K engagements "@teortaxesTex it's true that ByteDance has enough money. but I don't think startups can have enough money even there are enough GPUs" [X Link](https://x.com/YouJiacheng/status/2016506786216570996) 2026-01-28T13:40Z [----] followers, [---] engagements "1) What Kimi [---] tops DesignArena overall beating the likes of Gemini [--] Pro and Claude Opus [---] by quite some margin. The individual charts have not been updated as yet so cannot tell what categories it excels out but it tops [--] of them. https://t.co/wqqxZSwiCJ Kimi [---] tops DesignArena overall beating the likes of Gemini [--] Pro and Claude Opus [---] by quite some margin. The individual charts have not been updated as yet so cannot tell what categories it excels out but it tops [--] of them. https://t.co/wqqxZSwiCJ" [X Link](https://x.com/YouJiacheng/status/2016507151263613039) 2026-01-28T13:42Z [----] followers, [----] engagements "RT @damekdavis: We've now reached 30+ problems in the repo. Problem 11b has been solved by @PI010101 and collaborators I've learned a l" [X Link](https://x.com/YouJiacheng/status/2016837304288256227) 2026-01-29T11:34Z [----] followers, [--] engagements "RT @bigeagle_xd: ๐Congrats Training Trinity Large on 2k B300 GPUs I envy you American guys ๐ฅฒ" [X Link](https://x.com/YouJiacheng/status/2016905456808317337) 2026-01-29T16:05Z [----] followers, [--] engagements "๐Learning by cheating + DAgger (a.k.a. OPD) is kinda standard in robot learning (include AD). #TBThursday https://t.co/792y2vK7oX #TBThursday https://t.co/792y2vK7oX" [X Link](https://x.com/YouJiacheng/status/2016911979580195238) 2026-01-29T16:31Z [----] followers, [----] engagements "yep the backward part is solidly novel. @YangYou1991 My immediate point is that there actually is meaningful novelty in FlashAttention that was more non-obvious than online softmax and tiling in the backwards (the delta computation). Just as a frame of reference I know of 2(3) other groups who discoveres similar-ish ideas for the @YangYou1991 My immediate point is that there actually is meaningful novelty in FlashAttention that was more non-obvious than online softmax and tiling in the backwards (the delta computation). Just as a frame of reference I know of 2(3) other groups who discoveres" [X Link](https://x.com/YouJiacheng/status/2017105545854169419) 2026-01-30T05:20Z [----] followers, [----] engagements "lol true. if our gov give me 10k I will save at least 9k. 100k I will save 95k. The PRC blocks smuggling subsidized cars to Russia. A funny feature of the Chinese economy is that the Party does *everything* to boost consoomption everything except give the people more money that is. I see their angle. Give a Chinese free money he'll just go save it. https://t.co/3Nq1gUMDmj The PRC blocks smuggling subsidized cars to Russia. A funny feature of the Chinese economy is that the Party does *everything* to boost consoomption everything except give the people more money that is. I see their angle." [X Link](https://x.com/YouJiacheng/status/2017479814245155188) 2026-01-31T06:07Z [----] followers, [----] engagements "๐ฅฒI found that K2.5 is basically unusable after 3-5 turns of error-feedback loop. If it can't fix a problem within [--] turns it basically can't fix it" [X Link](https://x.com/YouJiacheng/status/2017699907906502965) 2026-01-31T20:41Z [----] followers, 53K engagements "this rumor looks technically wrong to me. you simply can't provide the bandwidth with disaggregated memory even connected by optics. Rumor: Starting with TPU v8 Google will no longer use HBM The incident was triggered by the global capacity shortage of HBM which will be unable to meet AI growth demands over the next [--] to [--] years. At the same time traditional HBM is limited by its design of being fixed on Rumor: Starting with TPU v8 Google will no longer use HBM The incident was triggered by the global capacity shortage of HBM which will be unable to meet AI growth demands over the next [--] to 3" [X Link](https://x.com/YouJiacheng/status/2018564757058699308) 2026-02-03T05:58Z [----] followers, [----] engagements "@rupspace @Kimi_Moonshot tbf grok-4's ECI is also [---] and was released [--] months earlier" [X Link](https://x.com/YouJiacheng/status/2019163440892506555) 2026-02-04T21:37Z [----] followers, [---] engagements "@zeroXmusashi it's a national lab. Central government and Shanghai government fund it" [X Link](https://x.com/YouJiacheng/status/2019181844667986106) 2026-02-04T22:50Z [----] followers, [----] engagements "@agarwl_ imagenet has [----] classes and MaxRL is a MC estimation of log p so it should approximate CE reasonably well it would be interesting if they also show the curve of the analytic version of Ep loss in addition to its MC estimation a.k.a. REINFORCE" [X Link](https://x.com/YouJiacheng/status/2019517867352535267) 2026-02-05T21:05Z [----] followers, [----] engagements "so this is what will happen if you train a classifier with -p(label) instead of cross-entropy loss -log p(label) didn't know it's that bad. @agarwl_ imagenet has [----] classes and MaxRL is a MC estimation of log p so it should approximate CE reasonably well it would be interesting if they also show the curve of the analytic version of Ep loss in addition to its MC estimation a.k.a. REINFORCE. @agarwl_ imagenet has [----] classes and MaxRL is a MC estimation of log p so it should approximate CE reasonably well it would be interesting if they also show the curve of the analytic version of Ep loss" [X Link](https://x.com/YouJiacheng/status/2019519888768004106) 2026-02-05T21:13Z [----] followers, 13.5K engagements "@FahimTajwar10 @AllVods I considered the vanishing gradient problem but I thought that as long as we use adam (with small enough eps) only relative gradient size matters the absolute gradient size doesn't. Just checked appendix D I found it's optimized by SGD okay that makes sense" [X Link](https://x.com/YouJiacheng/status/2019532350791774272) 2026-02-05T22:03Z [----] followers, [--] engagements "@FahimTajwar10 @AllVods I said "relative gradient size matters". but it should not be *that bad*" [X Link](https://x.com/YouJiacheng/status/2019534564394430777) 2026-02-05T22:12Z [----] followers, [--] engagements "@FahimTajwar10 @AllVods the whole point is we can't apply SGD with same HP to a problem with [--] OOMs smaller gradient norm. (maybe even Adam can't work without tuning) I know p loss is worse than log p (otherwise we won't use CE). But we should compare them with tuned optimizer HPs" [X Link](https://x.com/YouJiacheng/status/2019536008346497079) 2026-02-05T22:18Z [----] followers, [--] engagements "@FahimTajwar10 my suggestion is that you can keep using SGD for other methods (especially your own methods). if a reasonably large search space and amount of tuning effort can't make p-loss/RL work in image classification then we can safely conclude that it's not good" [X Link](https://x.com/YouJiacheng/status/2019540482347745747) 2026-02-05T22:35Z [----] followers, [---] engagements Limited data mode. Full metrics available with subscription: lunarcrush.com/pricing
@YouJiacheng You JiachengYou Jiacheng posts on X about ai, if you, open ai, in the the most. They currently have [------] followers and [---] posts still getting attention that total [------] engagements in the last [--] hours.
Social category influence technology brands 14% finance 11% stocks 5% currencies 4% countries 3% social networks 2% gaming 1% travel destinations 1%
Social topic influence ai 9%, if you 6%, open ai 5%, in the 5%, money 5%, core #1555, token 3%, inference 3%, this is 3%, paradigm #56
Top accounts mentioned or mentioned by @fahimtajwar10 @damekdavis @teortaxestex @allvods @zaiorg @jietang @bigglesworthx @minimaxai @hallerite @iclrconf @mike64t @changjonathanc @jeremyphoward @zephyrz9 @agarwl @eify @mike64_t @srashedll @openai @chhillee
Top assets mentioned Alphabet Inc Class A (GOOGL) Shopify Inc (SHOP)
Top posts by engagements in the last [--] hours
"Modern network traffic is mostly video because humans can only consume text at a very slow rate. However AI agents can read text [----] faster (if not [-----] faster) than humans. What will happen"
X Link 2026-02-13T21:07Z 10.1K followers, [----] engagements
"I think this change shouldn't be interpreted in this way. logits are shift invariant. imo it has two effects: [--]. guide lm_head's output towards negative. baseline is already biased towards [--] this change explicitly guides it. [--]. change gradient at [--] (init) New NanoGPT Speedrun WR at [-----] (-2.9s) by simple [--] line update to rescale and shift the logit softcapping. Since logits are not zero-centered and prediction only cares about the right tail asymmetric logit rescaling was explored. https://t.co/RKRatppRR6 https://t.co/rzGTGQpNlX New NanoGPT Speedrun WR at [-----] (-2.9s) by simple [--] line"
X Link 2025-12-27T11:46Z 10K followers, [----] engagements
"since the optimization dynamics will be different it seems not working well (maybe it can work with some tuning e.g. take care of initialization range of logits etc.; but I don't think that is worthwhile.) Yes spot on For n=4 any doubly stochastic matrix W has an exact decomposition (Birkhoff-von Neumann): W = sum_i=124 alpha_i P_i with alpha_i geq [--] sum alpha_i = [--] and P_i the [--] permutation matrices. Just set alpha = textsoftmax(textlogits) over [--] Yes spot on For n=4 any doubly stochastic matrix W has an exact decomposition (Birkhoff-von Neumann): W = sum_i=124 alpha_i P_i with alpha_i geq"
X Link 2026-01-04T08:32Z 10K followers, [----] engagements
"specifically you might need to re-design the static "bias" term. simply adding bias to logits might not work (if you want to init the matrix close to identity the bias term of logits will make the gradient to weights on other permutation matrices very small)"
X Link 2026-01-04T08:37Z 10K followers, [---] engagements
"I love Engram's context-aware gating design. Its symmetry is beautiful. It's a mixture of embeddings where routers are also embeddings. Regular MoE is "mixture of FFNs where routers are also FFNs" but KeyFFNs are bias only. k_i=KeyFFN_i(x) v_i=ValFFN_i(x) o=sum_i (qk_i)v_i"
X Link 2026-01-13T00:18Z 10K followers, 16.7K engagements
"interesting journey: originally: mimetic init for V-O after experiments: just small random init for O (so all QKVO share the same init now no special zero init for O). New NanoGPT Speedrun WR at [----] (-1.2s) from @srashedll with an update to the attention initialization. Motivated by mimetic initialization techniques experiments uncovered that small random init outperformed zero init on attention out projection. https://t.co/cWPtOwRyMV New NanoGPT Speedrun WR at [----] (-1.2s) from @srashedll with an update to the attention initialization. Motivated by mimetic initialization techniques"
X Link 2026-02-01T12:45Z [----] followers, [----] engagements
"interesting. it's well known that you can derive a steepest decent method with a local quadratic model (instead of a linearization). we know that Muon is already a steepest decent method but we can further combine other penalty terms. Today I learned from GPT-5.2 that spectral clipping can be seen as replacing the linearized loss assumption in Muon by a local quadratic model with isotropic Hessian instead https://t.co/Qe135eGFu0 Today I learned from GPT-5.2 that spectral clipping can be seen as replacing the linearized loss assumption in Muon by a local quadratic model with isotropic Hessian"
X Link 2026-02-01T12:51Z [----] followers, [----] engagements
"btw this is their HQ (under construction). ๐Introducing Intern-S1-Pro an advanced 1T MoE open-source multimodal scientific reasoning model. 1SOTA scientific reasoning competitive with leading closed-source models across AI4Science tasks. 2Top-tier performance on advanced reasoning benchmarks strong general https://t.co/cKni28WwQT ๐Introducing Intern-S1-Pro an advanced 1T MoE open-source multimodal scientific reasoning model. 1SOTA scientific reasoning competitive with leading closed-source models across AI4Science tasks. 2Top-tier performance on advanced reasoning benchmarks strong general"
X Link 2026-02-04T18:21Z [----] followers, 160.1K engagements
"New Sparse Attention from Xiaomi mimo. I love this design except the block(64) level selection. HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing https://arxiv.org/abs/2602.03560 https://arxiv.org/abs/2602.03560"
X Link 2026-02-04T21:01Z [----] followers, [----] engagements
"5.3 only uses 48% tokens of [---] (both xhigh) and has 25% higher TPS. the wallclock speedup is +160% (2.6 speed) crazy. GPT-5.3-Codex's much better token efficiency AND faster inference is the biggest story of this release. Folks at @OpenAI worked hard to improve this and it will only get better from here. https://t.co/vlOmyxIJmv GPT-5.3-Codex's much better token efficiency AND faster inference is the biggest story of this release. Folks at @OpenAI worked hard to improve this and it will only get better from here. https://t.co/vlOmyxIJmv"
X Link 2026-02-05T18:42Z [----] followers, 28.4K engagements
"I just checked Appendix D and found this experiment was done with SGD. I originally assumed Adam (with small eps) so I didn't expect it's that bad. @FahimTajwar10 maybe try Adam so this is what will happen if you train a classifier with -p(label) instead of cross-entropy loss -log p(label) didn't know it's that bad. https://t.co/G8jADLb0NG so this is what will happen if you train a classifier with -p(label) instead of cross-entropy loss -log p(label) didn't know it's that bad. https://t.co/G8jADLb0NG"
X Link 2026-02-05T22:07Z [----] followers, 11.2K engagements
"@hallerite but you can only choose one y* (for a given x/question) and maximize log p(y=y*) if you add label smoothing it means the desirable reward is not deterministic -- this doesn't make sense"
X Link 2026-02-06T22:13Z [----] followers, [--] engagements
"wtf FlexAttention can be repurposed for MoE computation cc @cHHillee Introducing Multi-Head LatentMoE ๐ Turns out making NVIDIA's LatentMoE [--] multi-head further unlocks O(1) balanced and deterministic communication. Our insight: Head Parallel; Move routing from before all-to-all to after. Token duplication happens locally. Always uniform https://t.co/29P4Oz1BDm Introducing Multi-Head LatentMoE ๐ Turns out making NVIDIA's LatentMoE [--] multi-head further unlocks O(1) balanced and deterministic communication. Our insight: Head Parallel; Move routing from before all-to-all to after. Token"
X Link 2026-02-08T17:47Z [----] followers, 32.7K engagements
"the experiments feel strange. (both end2end training and per-component runtime benchmark) [----] tokens [---] experts wallclock time but the number of GPUs is not mentioned (dense MFU is 3% if [--] GPUs) [----] steps warmup [----] steps stable [----] steps decay (10B/0.66M=15151) Introducing Multi-Head LatentMoE ๐ Turns out making NVIDIA's LatentMoE [--] multi-head further unlocks O(1) balanced and deterministic communication. Our insight: Head Parallel; Move routing from before all-to-all to after. Token duplication happens locally. Always uniform https://t.co/29P4Oz1BDm Introducing Multi-Head LatentMoE ๐"
X Link 2026-02-08T18:07Z [----] followers, [----] engagements
"it took me [--] minutes to understand what does "counterfactuality" mean in this paper. then I tried [--] non-frontier LLMs they all one-shot it even without thinking. โจTwo single author papers accepted to ICLR 2026โจ Truly excited to present these results at #ICLR2026 @iclr_conf https://t.co/Ptf6oCEblP โจTwo single author papers accepted to ICLR 2026โจ Truly excited to present these results at #ICLR2026 @iclr_conf https://t.co/Ptf6oCEblP"
X Link 2026-02-09T15:17Z [----] followers, 62.5K engagements
""deterministic training methods"๐ค Meta's next model is codenamed Avocado and it already beats the best open-source models before any fine-tuning or RLHF. Just pretraining. Internal docs say it's 10x more efficient than Maverick and 100x more efficient than Behemoth the unreleased LLaMA [--] flagship. They attribute Meta's next model is codenamed Avocado and it already beats the best open-source models before any fine-tuning or RLHF. Just pretraining. Internal docs say it's 10x more efficient than Maverick and 100x more efficient than Behemoth the unreleased LLaMA [--] flagship. They attribute"
X Link 2026-02-09T16:05Z [----] followers, [----] engagements
"OpenRouter stat shows K2.5 has a 100:1 in-out ratio. Assuming 98% cache hit rate 1.7M output tokens corresponding to 3.4M cache miss & 166.6M cache hit. Using Moonshot's pricing we get $23.8/d $714/mo. BUT: prefill on Mac is not free. You need [--] Mac Studios with M3 Ultra to run Kimi K2.5 locally at [--] TPS. At that speed you can barely run a single agent not a "swarm" or an "army." Kimi K2.5 costs $1.5/M output via OpenRouter. Even if you max out [--] TPS 24/7 that's 1.7M tokens/day which is equivalent of You need [--] Mac Studios with M3 Ultra to run Kimi K2.5 locally at [--] TPS. At that speed you"
X Link 2026-02-09T16:22Z [----] followers, [----] engagements
"does this hold on hyperball optimizers Derivation of optimal LR schedule based on the regime of capacity and task difficulty. In the hard regime WSD is optimal. Very similar conclusion to https://t.co/S3iFaUofXp. https://t.co/FmYu7YsD54 Derivation of optimal LR schedule based on the regime of capacity and task difficulty. In the hard regime WSD is optimal. Very similar conclusion to https://t.co/S3iFaUofXp. https://t.co/FmYu7YsD54"
X Link 2026-02-09T16:54Z [----] followers, [----] engagements
"who designed this UI/UX WHY zai use more space to show a meaningless animation than the terminal AND I can't adjust the height of terminal AND I can't scroll up the terminal when the agent is running (it will be immediately scrolled down) @Zai_org @jietang"
X Link 2026-02-11T20:26Z 10K followers, [----] engagements
"@Zai_org @jietang"
X Link 2026-02-11T20:29Z 10K followers, [----] engagements
"@Zai_org @jietang the result is pretty good but UI/UX is terrible and wtf the agent uses heredoc to directly write python script in terminal https://chat.z.ai/s/0972127f-73e6-47be-973b-daed97175899 https://chat.z.ai/s/0972127f-73e6-47be-973b-daed97175899"
X Link 2026-02-11T20:38Z 10K followers, [---] engagements
"@mike64_t i found that your built a lot of custom DL "libraries" what's your favorite way to implement autograd i kinda want to build one for myself"
X Link 2026-02-11T21:37Z 10K followers, [---] engagements
"@mike64_t yup the backward can still be done in a layer-by-layer manner we merely need to maintain a larger "state" when we iterate over layers. but it might be more prone to error (typo / forget to update etc.) maybe an integration test is needed"
X Link 2026-02-11T22:23Z [----] followers, [---] engagements
"but it's significantly more expensive than deepseek-v3.2 with similar or lower TPS why GLM [---] was 32B active while GLM [--] is 40B active Inference is also cheaper due to DSA Meanwhile they have increased the price substantially to increase gross margins As a proud Zhipu shareholder since IPO I approve https://t.co/jlf3zz2SKQ GLM [---] was 32B active while GLM [--] is 40B active Inference is also cheaper due to DSA Meanwhile they have increased the price substantially to increase gross margins As a proud Zhipu shareholder since IPO I approve https://t.co/jlf3zz2SKQ"
X Link 2026-02-11T22:55Z 10K followers, [----] engagements
"Codeforces results is "no tools" So Gemini [---] Deep Think cannot write test cases to test its solution before submission I guess even the top1 human can't get [----] under this condition. An updated & faster Gemini [--] Deep Think is taking off ๐ Our smartest mode to date PhD-level reasoning to the most rigorous STEM challenges (models' gotta think harder). Gold medal-level results on Physics & Chemistry Olympiads. ๐งช๐ป Full details: https://t.co/9XBQwHSCYW https://t.co/aOlbjI8RKo An updated & faster Gemini [--] Deep Think is taking off ๐ Our smartest mode to date PhD-level reasoning to the most"
X Link 2026-02-12T16:32Z 10K followers, 23.5K engagements
"hi could you copy&paste the [---] lines core to a standalone gist i kinda don't want to dig around in the repo. I'm extremely proud to share that the @symbolica research team has achieved a monumental result in program synthesis. We have been able to reach SOTA on ARC-AGI-2 (85.28% @ $6.94/task) using @agenticasdk as a neurosymbolic program synthesis engine. This engine is not ARC I'm extremely proud to share that the @symbolica research team has achieved a monumental result in program synthesis. We have been able to reach SOTA on ARC-AGI-2 (85.28% @ $6.94/task) using @agenticasdk as a"
X Link 2026-02-12T19:52Z 10K followers, [----] engagements
"๐AI professor (male) on XHS be like: https://www.xiaohongshu.com/user/profile/5f9ce0c10000000001004d49 https://www.xiaohongshu.com/user/profile/5f9ce0c10000000001004d49"
X Link 2026-02-12T16:55Z 10.1K followers, 130.9K engagements
"@BigglesworthX haha didn't analyze it at all (whether it's AI or not I can only see picture so it doesn't matter). coincidentally it's literally AI professor๐ (didn't realize it's AI when I shared the pictures I merely thought it's filter-maxx cuz photos in other posts are different)"
X Link 2026-02-14T02:02Z 10.1K followers, [---] engagements
"it's still climbing higher. 459B input and 2.6B output 176:1.๐ค Wow M2.5 set a new record on this metric they processed 430B input tokens and 2.64B output tokens on OpenRouter yesterday the in:out ratio is astonishing 163:1. Wow M2.5 set a new record on this metric they processed 430B input tokens and 2.64B output tokens on OpenRouter yesterday the in:out ratio is astonishing 163:1"
X Link 2026-02-15T16:06Z 10.1K followers, 10.9K engagements
"MLA works well in single turn scenario prefill once in MHA mode then switch to MQA mode and decode. However in agent era you need to prefill many times. Switching back to MHA mode is expensive but prefilling in MQA mode will use tons of FLOPs. Will Kimi still use MLA The prefill computation cost is [--] higher. Will Kimi still use MLA The prefill computation cost is [--] higher"
X Link 2026-02-15T21:59Z 10.1K followers, 12.6K engagements
"but this will introduce train-serve mismatch: your actual users use your API in a stateless way they won't say "hey this message belongs to this trajectory/session" so you can't record actions"
X Link 2026-02-13T19:54Z 10.1K followers, [---] engagements
"why it can't be accessed directly (needs vercel account and request access) https://inferencemax.ai/ $1000 for whoever comes up with the best name replacement for InferenceMAX InferenceMAX [---] dropping soon but we have to rename it because HBO MAX sent us a cease and desist. We have all NVIDIA GPUs from h100 to GB300 on large MoEs with SOTA optimizations like Disagg PD tested https://inferencemax.ai/ $1000 for whoever comes up with the best name replacement for InferenceMAX InferenceMAX [---] dropping soon but we have to rename it because HBO MAX sent us a cease and desist. We have all NVIDIA"
X Link 2026-02-16T04:26Z 10.1K followers, [----] engagements
"why because I think this is muon-adam done right. muon is msign-sgd. adam is sign-sgd with SNR adaptive step-size (see or https://arxiv.org/abs/2505.21829 https://kexue.fm/archives/11593 this one looks perfectly correct. https://t.co/7msI3ISAaE https://t.co/DWpdMFo2LE https://arxiv.org/abs/2505.21829 https://kexue.fm/archives/11593 this one looks perfectly correct. https://t.co/7msI3ISAaE https://t.co/DWpdMFo2LE"
X Link 2026-02-04T21:28Z 10K followers, 17K engagements
"No we still can't estimate. 32B dense and 70B dense have a similar TPS. Maverick (400B@17B) [----] vs. 235B@22B [----]. It seems to be KV read bound (Qwen full attn Maverick hybrid w/ chunked attn). GPT-5.3-Codex-Spark size: 700B@30B OpenAI's new GPT-5.3-Codex-Spark is the first model for which we can somewhat reliably estimate its size. Cerebras inference: [----] tokens/s - GLM-4.7 is 355@32B [--] layers [----] tokens/s - Qwen3-235B is 235@22B [--] layers [----] tokens/s - https://t.co/VXmL1eppAs GPT-5.3-Codex-Spark size: 700B@30B OpenAI's new GPT-5.3-Codex-Spark is the first model for which we can"
X Link 2026-02-12T19:48Z 10.1K followers, [----] engagements
"obviously deepseek's new model behaves like GPT-5.x making role-play/companion/writing users really upset. deepseek should not do this. SCOOP: OpenAI warned lawmakers in a memo sent today to the House Select Committee on China that DeepSeek is using new obfuscated methods to continue to distill its AI models as well as those of other US frontier AI labs https://t.co/OsWxPRMF28 w/ @eastland_maggie SCOOP: OpenAI warned lawmakers in a memo sent today to the House Select Committee on China that DeepSeek is using new obfuscated methods to continue to distill its AI models as well as those of other"
X Link 2026-02-13T00:47Z 10.1K followers, 10.5K engagements
"and Kuaishou"
X Link 2026-02-13T11:40Z 10.1K followers, [----] engagements
"theoretically if the agent scaffold itself is append-only we can achieve a consistent tokenization by: [--]. record action tokens [--]. split input string by actions [--]. tokenize non-action substrings [--]. concat with action tokens This is nonintrusive (can be done with a middleware). Why MM needs prefix merge Because they adopt a completion-level training paradigm process each (s a) separately. Traj-level training process all actions in a traj at once but need to tokenize the traj (may diff from rollout) which introduce a mismatch. https://t.co/IVal48PVAy Why MM needs prefix merge Because they adopt"
X Link 2026-02-13T17:58Z 10.1K followers, [----] engagements
"amazing. [--] is enough 7/ Second payoff: once [--] heads handle retrieval the SSM backbone doesn't need to compensate. SSM state dimension can shrink [--] from [--] to [--] with limited loss. Large SSM states were partly a crutch for missing retrieval. https://t.co/yo7WFxgTga 7/ Second payoff: once [--] heads handle retrieval the SSM backbone doesn't need to compensate. SSM state dimension can shrink [--] from [--] to [--] with limited loss. Large SSM states were partly a crutch for missing retrieval. https://t.co/yo7WFxgTga"
X Link 2026-02-13T20:15Z 10.1K followers, [----] engagements
"One caveat: Llama-3.2-1B has 1632=512 attn heads but only 168=128 kv heads; Qwen2.5-1.5B has 2812=336 attn heads but only 282=56 kv heads. ๐งต A Transformer-SSM Hybrid with a single layers worth of attention heads Most hybrids use 25-50% of attention heads. We show that 2% is enough scattered across the network to recover 95%+ of the full Transformer's performance on recall math and more. Here's how ๐ https://t.co/1UqcVaSCzi ๐งต A Transformer-SSM Hybrid with a single layers worth of attention heads Most hybrids use 25-50% of attention heads. We show that 2% is enough scattered across the"
X Link 2026-02-13T20:29Z 10.1K followers, [----] engagements
"@ChangJonathanC IIUC in agent RL the prompt itself is small compared to tool call results"
X Link 2026-02-13T20:32Z 10K followers, [---] engagements
"@damekdavis @FahimTajwar10 oh I mean Figure [--] shows that ln(t) + [--] and t have similar derivatives around t=1 which t(RL) is a first-order approximation of ln(t) + 1(MaxRL). ofc the math here is common sense but the caption matters: "REINFORCE loss vs. log loss""
X Link 2026-02-13T22:58Z 10K followers, [---] engagements
"@damekdavis @FahimTajwar10 so Davis vaguely showed this first-order approx relationship exists; Tajwar interpreted log loss as Maximum Likelihood RL (with great motivating exps on ImageNet) and clearly pointed out the approx relationship and empirically shows the advantage of log loss"
X Link 2026-02-13T23:01Z 10K followers, [---] engagements
"nice. to be precise @andrewbriand8 merge lm_head into the FusedSoftcapCE autograd function so that the triton kernel of SoftcapCEBackward can directly store quantized fp8 gradient w.r.t. logits to gmem i.e. the quantization is fused into CE's backward kernel. New NanoGPT Speedrun WR at [----] (-3.3s) from the kernel sorcerer @andrewbriand8 with a triton kernel to fuse the fp8 quantization of the gradient into the backward kernel. (Lm_head calc is ran in fp8) On pace for 120% MFU by July. https://t.co/QXtT8kCMuC New NanoGPT Speedrun WR at [----] (-3.3s) from the kernel sorcerer @andrewbriand8 with"
X Link 2026-02-10T01:17Z 10.1K followers, [----] engagements
"@jeremyphoward isn't it just the consistency of chain rule and i think if it were not done in an elementwise manner it will be obviously inefficient to store the local jacobian compared to storing a VJP lambda (with ctx captured)"
X Link 2026-02-12T22:52Z 10.1K followers, [----] engagements
"which one is better anon Why MM needs prefix merge Because they adopt a completion-level training paradigm process each (s a) separately. Traj-level training process all actions in a traj at once but need to tokenize the traj (may diff from rollout) which introduce a mismatch. https://t.co/IVal48PVAy Why MM needs prefix merge Because they adopt a completion-level training paradigm process each (s a) separately. Traj-level training process all actions in a traj at once but need to tokenize the traj (may diff from rollout) which introduce a mismatch. https://t.co/IVal48PVAy"
X Link 2026-02-13T20:10Z 10.1K followers, [----] engagements
"interesting. can we extend it to agents with the "edit" action (i.e. the agent can progressively improve its final deliverable -- not only internal belief matters external files also matter) ๐ก Key idea: ๐ Use the change in the agents belief about the correct answer as a dense intrinsic reward. If an action increases: log p(target history) reward it. We call this Belief-RL. No critic. No process reward model. Just the agent judging its own progress. (2/8) https://t.co/9qYYQxZRRh ๐ก Key idea: ๐ Use the change in the agents belief about the correct answer as a dense intrinsic reward. If an"
X Link 2026-02-13T20:56Z 10.1K followers, [----] engagements
"standalone pic 1:"
X Link 2026-02-13T22:06Z 10.1K followers, [----] engagements
"are you sure the index url is instead of https://wheels.vllm.ai/rocm/0.15.1/rocm700 http://lnkd.in/gJdnn3kJ โ
Day [--] support of MiniMax-2.5 on AMD GPU [--] MI300X GPUs are all you need instead of [--] Hopper GPUs to run it in full context. uv pip install vllm --extra-index-url https://t.co/EKXrFnZV5P VLLM_ROCM_USE_AITER=1 vllm serve MiniMaxAI/MiniMax-M2.5 --tensor-parallel-size [--] https://t.co/uVIh2f5hHv https://wheels.vllm.ai/rocm/0.15.1/rocm700 http://lnkd.in/gJdnn3kJ โ
Day [--] support of MiniMax-2.5 on AMD GPU [--] MI300X GPUs are all you need instead of [--] Hopper GPUs to run it in full context. uv pip"
X Link 2026-02-13T22:09Z 10.1K followers, [----] engagements
"It takes more time so if Aristotle runs on @cerebras it will be probably faster than reading the NL proof. I've recently started handing off llm generated proofs to Aristotle with some success. It takes more time than reading the proof myself but I can at least avoid the effort / disappointment of reading something incorrect and do something else in the meantime. I've recently started handing off llm generated proofs to Aristotle with some success. It takes more time than reading the proof myself but I can at least avoid the effort / disappointment of reading something incorrect and do"
X Link 2026-02-13T23:25Z 10.1K followers, [----] engagements
"LOL I just realized that Minimax actually released the English version of this blog earlier https://x.com/MiniMax_AI/status/2022175400093462661 Blog about @MiniMax_AI 's Forge RL system. Core takeaways: [--]. still CISPO [--]. process reward completion time reward [--]. multi-level prefix cache [--]. rollout uses 60% compute [--]. millions of trajectories per day https://t.co/IrKDOoiKAB cc @teortaxesTex https://x.com/MiniMax_AI/status/2022175400093462661 Blog about @MiniMax_AI 's Forge RL system. Core takeaways: [--]. still CISPO [--]. process reward completion time reward [--]. multi-level prefix cache [--]. rollout"
X Link 2026-02-13T23:47Z 10.1K followers, [----] engagements
"shocked Minimax's mobile app has no chat functionality now you can only use agent. Wow M2.5 set a new record on this metric they processed 430B input tokens and 2.64B output tokens on OpenRouter yesterday the in:out ratio is astonishing 163:1. Wow M2.5 set a new record on this metric they processed 430B input tokens and 2.64B output tokens on OpenRouter yesterday the in:out ratio is astonishing 163:1"
X Link 2026-02-14T21:34Z 10.1K followers, [----] engagements
"in:out ratio of K2.5 is 100:1 out:cache-read price ratio is merely 30:1 so Moonshot makes [--] money from cache-read compared to output. Opus [---] 60:1 and 50:1. GPT-5.2-Codex 95:1 and 80:1. Think about it"
X Link 2026-02-08T17:27Z 10.1K followers, 16.8K engagements
"wow Meituan requires at least 50% code needs to be generated by AI"
X Link 2026-02-13T11:38Z 10.1K followers, [----] engagements
"Blog about @MiniMax_AI 's Forge RL system. Core takeaways: [--]. still CISPO [--]. process reward completion time reward [--]. multi-level prefix cache [--]. rollout uses 60% compute [--]. millions of trajectories per day cc @teortaxesTex https://zhuanlan.zhihu.com/p/2005742716252861435 https://zhuanlan.zhihu.com/p/2005742716252861435"
X Link 2026-02-13T15:57Z 10.1K followers, 31.5K engagements
"Why MM needs prefix merge Because they adopt a completion-level training paradigm process each (s a) separately. Traj-level training process all actions in a traj at once but need to tokenize the traj (may diff from rollout) which introduce a mismatch. Blog about @MiniMax_AI 's Forge RL system. Core takeaways: [--]. still CISPO [--]. process reward completion time reward [--]. multi-level prefix cache [--]. rollout uses 60% compute [--]. millions of trajectories per day https://t.co/IrKDOoiKAB cc @teortaxesTex Blog about @MiniMax_AI 's Forge RL system. Core takeaways: [--]. still CISPO [--]. process reward"
X Link 2026-02-13T17:36Z 10.1K followers, [----] engagements
"the idea is straightforward but the result is jaw-dropping. Respect to the team. Salute.๐ซก ๐งต A Transformer-SSM Hybrid with a single layers worth of attention heads Most hybrids use 25-50% of attention heads. We show that 2% is enough scattered across the network to recover 95%+ of the full Transformer's performance on recall math and more. Here's how ๐ https://t.co/1UqcVaSCzi ๐งต A Transformer-SSM Hybrid with a single layers worth of attention heads Most hybrids use 25-50% of attention heads. We show that 2% is enough scattered across the network to recover 95%+ of the full Transformer's"
X Link 2026-02-13T20:22Z 10.1K followers, 13.4K engagements
"I doubt that the Trump administration has a closer to pro-human value than Xi. The @DarioAmodei interview. 0:00:00 - What exactly are we scaling 0:12:36 - Is diffusion cope 0:29:42 - Is continual learning necessary 0:46:20 - If AGI is imminent why not buy more compute 0:58:49 - How will AI labs actually make profit 1:31:19 - Will regulations destroy https://t.co/qsFoNMAy2t The @DarioAmodei interview. 0:00:00 - What exactly are we scaling 0:12:36 - Is diffusion cope 0:29:42 - Is continual learning necessary 0:46:20 - If AGI is imminent why not buy more compute 0:58:49 - How will AI labs"
X Link 2026-02-13T22:18Z 10.1K followers, [----] engagements
"Wow M2.5 set a new record on this metric they processed 430B input tokens and 2.64B output tokens on OpenRouter yesterday the in:out ratio is astonishing 163:1. in:out ratio of K2.5 is 100:1 out:cache-read price ratio is merely 30:1 so Moonshot makes [--] money from cache-read compared to output. Opus [---] 60:1 and 50:1. GPT-5.2-Codex 95:1 and 80:1. Think about it. https://t.co/lwjxLKb08O in:out ratio of K2.5 is 100:1 out:cache-read price ratio is merely 30:1 so Moonshot makes [--] money from cache-read compared to output. Opus [---] 60:1 and 50:1. GPT-5.2-Codex 95:1 and 80:1. Think about it."
X Link 2026-02-14T19:14Z 10.1K followers, 21.2K engagements
"trademark issue"
X Link 2026-02-16T04:48Z 10.1K followers, [---] engagements
"Zai just disclosed its revenue and R&D cost. In 2025H1 Zai spent RMB 1.145B renting compute for R&D. Using $2/GPU/h H100 for estimation the cost is the same as 18k H100s. https://www1.hkexnews.hk/app/sehk/2025/107977/documents/sehk25121901699_c.pdf https://www1.hkexnews.hk/app/sehk/2025/107977/documents/sehk25121901699_c.pdf"
X Link 2025-12-19T21:19Z [----] followers, 29.8K engagements
"checked comments in the PR it seems that "not zero-centered" refer to logits before rescaling that make sense"
X Link 2025-12-27T11:55Z 10K followers, [---] engagements
"@zephyr_z9 in their 2025H1 report "silver-free metallization" is not in the "" (used in mass production) category"
X Link 2025-12-29T16:02Z [----] followers, [---] engagements
"oh google is actually earlier than Bytedance lol. MHC - AltUp Engram - Per Layer Embedding + Ngrammer Both from Gemma-3n MHC - AltUp Engram - Per Layer Embedding + Ngrammer Both from Gemma-3n"
X Link 2026-01-12T17:47Z [----] followers, 17.4K engagements
"Infra is the most important thing. Faster iteration and less bugs. n+e shared the same thought in this podcast hosted by WhyNotTV. https://www.bilibili.com/video/BV1darmBcE4A ๐ถ Some (perhaps) spicy thoughts. Its been a while since my last tweet but I wanted to write about how disorienting it has been from academia to an LLM lab ๐
The kind of research I was trained to do during my PhD almost doesnt exist here. The obsession with mathematical https://www.bilibili.com/video/BV1darmBcE4A ๐ถ Some (perhaps) spicy thoughts. Its been a while since my last tweet but I wanted to write about how"
X Link 2026-01-17T14:13Z [----] followers, 16K engagements
"@jiayi_pirate ๐๐ฎ๐บ"
X Link 2026-01-17T16:53Z [----] followers, [---] engagements
"$1.2 per H100 hour is not a bad business if you don't need to upgrade GPUs. This means you can make $8k per year per H100 after considering $0.28 per H100 hour OpEx. If @realDonaldTrump can lower the interest rate it's easy to cover CapEx in the lifetime of GPUs. Hi everyone who says AI compute is getting cheaper. OpenAI has made about $1.20 per kWh of compute each year for the last [--] years. Zero cost reduction in [--] years. Conveniently takes about [--] kW to run an H100 so they make $1.20 per GPU hour. Reserved instances run anywhere Hi everyone who says AI compute is getting cheaper. OpenAI has"
X Link 2026-01-19T09:34Z [----] followers, [----] engagements
"cuz we have so many embed params now I want to share what I previously found about the bwd of the embed. It is split into three steps all incur significant memory access. [--]. zero a fp32 buffer of size (V d) [--]. atomic add to this buffer [--]. cast it to bf16. maybe we can do sth New NanoGPT Speedrun WR at 99.3s (-5.6s) with a bigram hash embedding that is added to the residual stream before every layer. Inspiration from Svenstrup et al [----] paper on Hash Embeddings and Deepseek's Engram. Modded-NanoGPT now uses fewer training tokens than its parameter https://t.co/ykGVKtrOM0 New NanoGPT Speedrun"
X Link 2026-01-20T08:32Z [----] followers, [----] engagements
"btw you can order food via QwenApp and Alibaba provides special coupons for the Qwen Agent. Two days ago OpenAI clarified the transaction fee it will charge Shopify merchants with its Instant Checkout product: 4%. This reinforces my argument that (independent) agentic commerce is a mirage: many Shopify merchants run on incredibly thin margins (3-8% net) and simply https://t.co/F0ZEBVqTfH Two days ago OpenAI clarified the transaction fee it will charge Shopify merchants with its Instant Checkout product: 4%. This reinforces my argument that (independent) agentic commerce is a mirage: many"
X Link 2026-01-24T17:47Z [----] followers, [----] engagements
"this is classified as an attack New research: When open-source models are fine-tuned on seemingly benign chemical synthesis information generated by frontier models they become much better at chemical weapons tasks. We call this an elicitation attack. https://t.co/44mYnxFKzr New research: When open-source models are fine-tuned on seemingly benign chemical synthesis information generated by frontier models they become much better at chemical weapons tasks. We call this an elicitation attack. https://t.co/44mYnxFKzr"
X Link 2026-01-27T03:31Z [----] followers, [----] engagements
"interesting Another strategy infers meaning using sets. We have seen models keep track of "positive" and "negative" sets that let it narrow its understanding of a symbol using Sudoku-style cancellation. Red bars (a) show the positive set and blue boxes (b) show the negative. https://t.co/iVgl13fMUo Another strategy infers meaning using sets. We have seen models keep track of "positive" and "negative" sets that let it narrow its understanding of a symbol using Sudoku-style cancellation. Red bars (a) show the positive set and blue boxes (b) show the negative. https://t.co/iVgl13fMUo"
X Link 2026-01-27T03:37Z [----] followers, [----] engagements
"I searched and found a review mentioning "FlexPrune" but the text is different. It's How does this differ concretely from FlexPrune or ShearedLLaMA beyond empirical tuning instead of Missing concrete comparison to FlexPrune. https://openreview.net/forumid=DOLF9TUBfa¬eId=Wsd2aIEOvW Reviewer: hallucinates a baseline that doesn't exist. Meta-reviewer: cites the hallucination as the paper's fatal flaw. Decision: reject. @iclr_conf is amazing https://t.co/Z0oWxerqiD https://openreview.net/forumid=DOLF9TUBfa¬eId=Wsd2aIEOvW Reviewer: hallucinates a baseline that doesn't exist. Meta-reviewer:"
X Link 2026-01-27T03:55Z [----] followers, [----] engagements
"So Missing concrete comparison to FlexPrune is from AC Bruh"
X Link 2026-01-27T03:57Z [----] followers, [---] engagements
"Some negative results: [--]. It still cannot answer some hard knowledge queries that K2 can't answer (while Opus [---] can). Its pretraining is probably the same as K2. [--]. It has severe hallucinations in advanced vision use cases"
X Link 2026-01-27T05:34Z [----] followers, [---] engagements
"Interesting price change. cc @zephyr_z9 ๐ฅ Meet Kimi K2.5 Open-Source Visual Agentic Intelligence. ๐น Global SOTA on Agentic Benchmarks: HLE full set (50.2%) BrowseComp (74.9%) ๐น Open-source SOTA on Vision and Coding: MMMU Pro (78.5%) VideoMMMU (86.6%) SWE-bench Verified (76.8%) ๐น Code with Taste: turn chats https://t.co/wp6JZS47bN ๐ฅ Meet Kimi K2.5 Open-Source Visual Agentic Intelligence. ๐น Global SOTA on Agentic Benchmarks: HLE full set (50.2%) BrowseComp (74.9%) ๐น Open-source SOTA on Vision and Coding: MMMU Pro (78.5%) VideoMMMU (86.6%) SWE-bench Verified (76.8%) ๐น Code with Taste:"
X Link 2026-01-27T06:01Z [----] followers, 28.3K engagements
"Engineering VP of Dify"
X Link 2026-01-27T07:15Z [----] followers, 31.3K engagements
"oh the actual title should be "Product & R&D VP""
X Link 2026-01-27T07:32Z [----] followers, [----] engagements
"dude the TV price changes are mainly due to quality adjustment. I wonder if U.S. can design a suitable quality adjustment for healthcare so that President Trump can eliminate inflation in healthcare. Whoever is in charge of TV prices should be put in charge of healthcare education and housing prices https://t.co/Npk4LEuaQK Whoever is in charge of TV prices should be put in charge of healthcare education and housing prices https://t.co/Npk4LEuaQK"
X Link 2026-01-27T22:09Z [----] followers, [----] engagements
"RT @samsja19: Today were releasing Trinity Large a 400B MoE LLM with 13B active parameters trained over 17T tokens The base model is on"
X Link 2026-01-28T01:19Z [----] followers, [--] engagements
"super clean it's live leading the pre-train on this was the most ambitious thing i've ever had the chance to do. we've got a full tech report and a brilliant loss graph to go with it https://t.co/Ail5Rxp711 it's live leading the pre-train on this was the most ambitious thing i've ever had the chance to do. we've got a full tech report and a brilliant loss graph to go with it https://t.co/Ail5Rxp711"
X Link 2026-01-28T07:41Z [----] followers, [----] engagements
"I thought trinity means Arcee Prime Intellect Datology but it seems that Datology is not in the author list lol. how can you not be romantic about open models https://t.co/weBpiRB2i3 how can you not be romantic about open models https://t.co/weBpiRB2i3"
X Link 2026-01-28T08:01Z 10K followers, [----] engagements
"Wow this is uncommon for a humanoid research. Usually the reproduced performance is inferior (and some research just do not provide deployment code.) but in this video the motion is very smooth. Looks like some friends already made it happen in their labs ๐๐ค https://t.co/Sh9eru79y1 Looks like some friends already made it happen in their labs ๐๐ค https://t.co/Sh9eru79y1"
X Link 2026-01-28T09:12Z [----] followers, [----] engagements
"TIL: CSGO players should use golang and VALORANT players should use tilelang. Because golang is go tilelang is "
X Link 2026-01-28T10:13Z [----] followers, [----] engagements
"money problem is more severe than GPUs problem in China. OpenAI burnt like 200B RMB in [----] while Kimi proudly talked about they have 10B RMB cash. China will solve GPUs before the US solves energy I think people are a bit confused on this front. The US will not hit the energy bottleneck for AI until very late in the game if ever. Energy-wise Speciale RL (300K H800-hours) cost like [---] MWh (assuming mediocre PUE) China will solve GPUs before the US solves energy I think people are a bit confused on this front. The US will not hit the energy bottleneck for AI until very late in the game if"
X Link 2026-01-28T13:08Z [----] followers, 47.5K engagements
"@teortaxesTex it's true that ByteDance has enough money. but I don't think startups can have enough money even there are enough GPUs"
X Link 2026-01-28T13:40Z [----] followers, [---] engagements
"1) What Kimi [---] tops DesignArena overall beating the likes of Gemini [--] Pro and Claude Opus [---] by quite some margin. The individual charts have not been updated as yet so cannot tell what categories it excels out but it tops [--] of them. https://t.co/wqqxZSwiCJ Kimi [---] tops DesignArena overall beating the likes of Gemini [--] Pro and Claude Opus [---] by quite some margin. The individual charts have not been updated as yet so cannot tell what categories it excels out but it tops [--] of them. https://t.co/wqqxZSwiCJ"
X Link 2026-01-28T13:42Z [----] followers, [----] engagements
"RT @damekdavis: We've now reached 30+ problems in the repo. Problem 11b has been solved by @PI010101 and collaborators I've learned a l"
X Link 2026-01-29T11:34Z [----] followers, [--] engagements
"RT @bigeagle_xd: ๐Congrats Training Trinity Large on 2k B300 GPUs I envy you American guys ๐ฅฒ"
X Link 2026-01-29T16:05Z [----] followers, [--] engagements
"๐Learning by cheating + DAgger (a.k.a. OPD) is kinda standard in robot learning (include AD). #TBThursday https://t.co/792y2vK7oX #TBThursday https://t.co/792y2vK7oX"
X Link 2026-01-29T16:31Z [----] followers, [----] engagements
"yep the backward part is solidly novel. @YangYou1991 My immediate point is that there actually is meaningful novelty in FlashAttention that was more non-obvious than online softmax and tiling in the backwards (the delta computation). Just as a frame of reference I know of 2(3) other groups who discoveres similar-ish ideas for the @YangYou1991 My immediate point is that there actually is meaningful novelty in FlashAttention that was more non-obvious than online softmax and tiling in the backwards (the delta computation). Just as a frame of reference I know of 2(3) other groups who discoveres"
X Link 2026-01-30T05:20Z [----] followers, [----] engagements
"lol true. if our gov give me 10k I will save at least 9k. 100k I will save 95k. The PRC blocks smuggling subsidized cars to Russia. A funny feature of the Chinese economy is that the Party does everything to boost consoomption everything except give the people more money that is. I see their angle. Give a Chinese free money he'll just go save it. https://t.co/3Nq1gUMDmj The PRC blocks smuggling subsidized cars to Russia. A funny feature of the Chinese economy is that the Party does everything to boost consoomption everything except give the people more money that is. I see their angle."
X Link 2026-01-31T06:07Z [----] followers, [----] engagements
"๐ฅฒI found that K2.5 is basically unusable after 3-5 turns of error-feedback loop. If it can't fix a problem within [--] turns it basically can't fix it"
X Link 2026-01-31T20:41Z [----] followers, 53K engagements
"this rumor looks technically wrong to me. you simply can't provide the bandwidth with disaggregated memory even connected by optics. Rumor: Starting with TPU v8 Google will no longer use HBM The incident was triggered by the global capacity shortage of HBM which will be unable to meet AI growth demands over the next [--] to [--] years. At the same time traditional HBM is limited by its design of being fixed on Rumor: Starting with TPU v8 Google will no longer use HBM The incident was triggered by the global capacity shortage of HBM which will be unable to meet AI growth demands over the next [--] to 3"
X Link 2026-02-03T05:58Z [----] followers, [----] engagements
"@rupspace @Kimi_Moonshot tbf grok-4's ECI is also [---] and was released [--] months earlier"
X Link 2026-02-04T21:37Z [----] followers, [---] engagements
"@zeroXmusashi it's a national lab. Central government and Shanghai government fund it"
X Link 2026-02-04T22:50Z [----] followers, [----] engagements
"@agarwl_ imagenet has [----] classes and MaxRL is a MC estimation of log p so it should approximate CE reasonably well it would be interesting if they also show the curve of the analytic version of Ep loss in addition to its MC estimation a.k.a. REINFORCE"
X Link 2026-02-05T21:05Z [----] followers, [----] engagements
"so this is what will happen if you train a classifier with -p(label) instead of cross-entropy loss -log p(label) didn't know it's that bad. @agarwl_ imagenet has [----] classes and MaxRL is a MC estimation of log p so it should approximate CE reasonably well it would be interesting if they also show the curve of the analytic version of Ep loss in addition to its MC estimation a.k.a. REINFORCE. @agarwl_ imagenet has [----] classes and MaxRL is a MC estimation of log p so it should approximate CE reasonably well it would be interesting if they also show the curve of the analytic version of Ep loss"
X Link 2026-02-05T21:13Z [----] followers, 13.5K engagements
"@FahimTajwar10 @AllVods I considered the vanishing gradient problem but I thought that as long as we use adam (with small enough eps) only relative gradient size matters the absolute gradient size doesn't. Just checked appendix D I found it's optimized by SGD okay that makes sense"
X Link 2026-02-05T22:03Z [----] followers, [--] engagements
"@FahimTajwar10 @AllVods I said "relative gradient size matters". but it should not be that bad"
X Link 2026-02-05T22:12Z [----] followers, [--] engagements
"@FahimTajwar10 @AllVods the whole point is we can't apply SGD with same HP to a problem with [--] OOMs smaller gradient norm. (maybe even Adam can't work without tuning) I know p loss is worse than log p (otherwise we won't use CE). But we should compare them with tuned optimizer HPs"
X Link 2026-02-05T22:18Z [----] followers, [--] engagements
"@FahimTajwar10 my suggestion is that you can keep using SGD for other methods (especially your own methods). if a reasonably large search space and amount of tuning effort can't make p-loss/RL work in image classification then we can safely conclude that it's not good"
X Link 2026-02-05T22:35Z [----] followers, [---] engagements
Limited data mode. Full metrics available with subscription: lunarcrush.com/pricing
/creator/twitter::YouJiacheng