# ![@maharshii Avatar](https://lunarcrush.com/gi/w:26/cr:twitter::1259122686896287744.png) @maharshii maharshi

maharshi posts on X most often about the topics inference, if you, this is, and faster. They currently have [------] followers and [---] posts still getting attention, totaling [------] engagements in the last [--] hours.

### Engagements: [------] [#](/creator/twitter::1259122686896287744/interactions)
![Engagements Line Chart](https://lunarcrush.com/gi/w:600/cr:twitter::1259122686896287744/c:line/m:interactions.svg)

- [--] Week [------] -62%
- [--] Month [-------] -27%
- [--] Months [---------] -52%
- [--] Year [----------] -79%

### Mentions: [--] [#](/creator/twitter::1259122686896287744/posts_active)
![Mentions Line Chart](https://lunarcrush.com/gi/w:600/cr:twitter::1259122686896287744/c:line/m:posts_active.svg)

- [--] Month [--] -74%
- [--] Months [---] -35%
- [--] Year [---] -44%

### Followers: [------] [#](/creator/twitter::1259122686896287744/followers)
![Followers Line Chart](https://lunarcrush.com/gi/w:600/cr:twitter::1259122686896287744/c:line/m:followers.svg)

- [--] Week [------] -0.07%
- [--] Month [------] -0.03%
- [--] Months [------] +6.50%
- [--] Year [------] +28%

### CreatorRank: [-------] [#](/creator/twitter::1259122686896287744/influencer_rank)
![CreatorRank Line Chart](https://lunarcrush.com/gi/w:600/cr:twitter::1259122686896287744/c:line/m:influencer_rank.svg)

### Social Influence

**Social category influence**
[countries](/list/countries)  3% [finance](/list/finance)  2% [social networks](/list/social-networks)  1% [technology brands](/list/technology-brands)  1%

**Social topic influence**
[inference](/topic/inference) 8%, [if you](/topic/if-you) 6%, [this is](/topic/this-is) 5%, [faster](/topic/faster) 5%, [code](/topic/code) 4%, [more than](/topic/more-than) 4%, [books](/topic/books) #1901, [smartly](/topic/smartly) #44, [nano banana](/topic/nano-banana) 3%, [what is](/topic/what-is) 2%

**Top accounts mentioned or mentioned by**
[@mobicham](/creator/undefined) [@drisspg](/creator/undefined) [@gaunernst](/creator/undefined) [@leik0w0](/creator/undefined) [@avalutionary](/creator/undefined) [@snowclipsed](/creator/undefined) [@ailker](/creator/undefined) [@dejavucoder](/creator/undefined) [@sasuke420](/creator/undefined) [@cheatyyyy](/creator/undefined) [@rajvishah30](/creator/undefined) [@fal](/creator/undefined) [@isidentical](/creator/undefined) [@burkaygur](/creator/undefined) [@dprophecyguy](/creator/undefined) [@caneystudios](/creator/undefined) [@svarunid](/creator/undefined) [@alpindale](/creator/undefined) [@pdt2211](/creator/undefined) [@adsha44](/creator/undefined)

### Top Social Posts
Top posts by engagements in the last [--] hours

"sometimes youre just scrolling and then boom you get hit by the deepest most thought provoking sentence you pity the moth confusing a lamp for the moon yet here you are confusing a screen for the world"  
[X Link](https://x.com/maharshii/status/2019821660254413196)  2026-02-06T17:13Z 42.2K followers, 72.4K engagements


"the concept of time does not exist in the 4th dimension and inside an airport"  
[X Link](https://x.com/maharshii/status/2020272333160886289)  2026-02-07T23:03Z 42.2K followers, [----] engagements


"this was an eye-opening [--] hour [--] minutes video none of what he explained would contradict a reasonable mind"  
[X Link](https://x.com/maharshii/status/2018619854832767398)  2026-02-03T09:37Z 42.2K followers, 226.3K engagements


"im publishing a new blog post on this insanely useful feature of triton: it is what makes the custom triton NVFP4 quant kernel go hand-in-hand or beat CUDA. many people may not be aware about it so go read https://blog.fal.ai/instruction-level-control-with-inline-elementwise-asm-in-triton/"  
[X Link](https://x.com/maharshii/status/2021266717641474194)  2026-02-10T16:55Z 42.2K followers, 10.6K engagements


"i explain the usage of inline elementwise asm in triton using [--] examples in the blog post: - single instruction (rcp) - multiple instructions and packing - nvfp4 quantization kernel read and let me know im publishing a new blog post on this insanely useful feature of triton: it is what makes the custom triton NVFP4 quant kernel go hand-in-hand or beat CUDA. many people may not be aware about it so go read https://t.co/4CskBUOSlg"  
[X Link](https://x.com/maharshii/status/2021580938694582557)  2026-02-11T13:43Z 42.2K followers, [----] engagements


"train and inference GPT in [--] line of python code: exec("import osmathrandomurllib.request as u;random.seed(1);e=16;b=8;s=1000nif not 'i')nt=open('i').read().split();c='''$'+sorted(set(''.join(t)));n=len(c);d=x:i for ix in enumerate(c);r=i:x for xi in d.items()nclass V:n def __init__(axp=()):a.x=x;a.g=0;a.p=pn def __add__(ao):o=o if isinstance(oV) else V(o);z=V(a.x+o.x(ao));z.b=lambda:(setattr(a'g'a.g+z.g)setattr(o'g'o.g+z.g));return zn def __mul__(ao):o=o if isinstance(oV) else V(o);z=V(a.x*o.x(ao));z.b=lambda:(setattr(a'g'a.g+o.x*z.g)setattr(o'g'o.g+a.x*z.g));return zn def"  
[X Link](https://x.com/maharshii/status/2021998062374449512)  2026-02-12T17:21Z 42.2K followers, 13.1K engagements


"the butterfly the sun the books he is standing on the universe he himself everything is god https://t.co/ZsJj6jVcck"  
[X Link](https://x.com/maharshii/status/2022263340018565361)  2026-02-13T10:55Z 42.2K followers, [----] engagements


"i think instead of doing courses for technical subjects people should devote some more time and learn things from books. if you want to go into the trenches of any technical subject the amount of knowledge and detail a book can give is unmatched"  
[X Link](https://x.com/maharshii/status/1919026780398035173)  2025-05-04T13:50Z 42.2K followers, 40K engagements


"i have a secret to get to 80-100LPA in india and its not DSA some people might guess what it is. DSA is still the only source of truth if you want to earn 40-50LPA as a fresher in the tech industry."  
[X Link](https://x.com/maharshii/status/1975901966837617013)  2025-10-08T12:31Z 42.2K followers, 393.1K engagements


"@mobicham yes true but will the lookup table method be faster than the instruction here (if you have experimented with that)"  
[X Link](https://x.com/maharshii/status/2021929093462462568)  2026-02-12T12:47Z 42.2K followers, [---] engagements


"@mobicham i think this is for gemv do you have a triton kernel for quantizing to fp4 with lookup table or you are writing it in pytorch and doing fullgraph compile"  
[X Link](https://x.com/maharshii/status/2021938984532987937)  2026-02-12T13:26Z 42.2K followers, [--] engagements


"@mobicham i see will test which one is faster for sm100"  
[X Link](https://x.com/maharshii/status/2021943067478270224)  2026-02-12T13:42Z 42.2K followers, [--] engagements


"@snowclipsed conditional on the shapes or speed yeah its kinda absurd but thats all we have so 😭"  
[X Link](https://x.com/maharshii/status/2022019528662954106)  2026-02-12T18:46Z 42.2K followers, [---] engagements


"@ailker roll credits 😂"  
[X Link](https://x.com/maharshii/status/2022078353126367390)  2026-02-12T22:40Z 42.2K followers, [---] engagements


"@dejavucoder yessir i was here 2h ago"  
[X Link](https://x.com/maharshii/status/2022542919824216288)  2026-02-14T05:26Z 42.2K followers, [----] engagements


"imagine showing this to a caveman"  
[X Link](https://x.com/maharshii/status/2021999318505599112)  2026-02-12T17:26Z 42.2K followers, [----] engagements


"the bad news is that nothing matters the good news is that nothing matters"  
[X Link](https://x.com/maharshii/status/2022879811505656211)  2026-02-15T03:45Z 42.2K followers, [----] engagements


"many people may not be aware that 'smartly' written triton (in less than [---] lines) can achieve the same speed as a CUDA kernel that contains more than [----] lines of code should i write a blog post on this"  
[X Link](https://x.com/maharshii/status/2019039331718152329)  2026-02-04T13:24Z 42.2K followers, 39.1K engagements


"this is not bad at all many people may not be aware that 'smartly' written triton (in less than [---] lines) can achieve the same speed as a CUDA kernel that contains more than [----] lines of code should i write a blog post on this https://t.co/WeyqETJGsx"  
[X Link](https://x.com/maharshii/status/2019429187271217193)  2026-02-05T15:13Z 42.2K followers, [----] engagements


"this is a much better comparison: on bigger shapes i'm able to beat 2k lines of cuda with [---] lines of triton although for smaller shapes there's some overhead that i need to eliminate many people may not be aware that 'smartly' written triton (in less than [---] lines) can achieve the same speed as a CUDA kernel that contains more than [----] lines of code should i write a blog post on this https://t.co/WeyqETJGsx"  
[X Link](https://x.com/maharshii/status/2019777040787103748)  2026-02-06T14:15Z 42.2K followers, 10.5K engagements


"wrote a custom CUDA kernel that uses 256-bit gmem loads starting from version [----] on blackwell and for smaller shapes it is indeed faster than triton until the shape (8192 8192). it still seems to be slower on larger shapes i bet there is room for more speed. this is a much better comparison: on bigger shapes i'm able to beat 2k lines of cuda with [---] lines of triton although for smaller shapes there's some overhead that i need to eliminate https://t.co/w2nuw9DJ7P"  
[X Link](https://x.com/maharshii/status/2021241686031008119)  2026-02-10T15:15Z 42.2K followers, [----] engagements


"it is a short n sweet kernel (200 lines) including all the inline PTX shenanigans :D"  
[X Link](https://x.com/maharshii/status/2021242405689733333)  2026-02-10T15:18Z 42.2K followers, [---] engagements


"seedance [---] has passed the uncanny valley for me its so good i wanna see what kind of dataset is it trained on"  
[X Link](https://x.com/maharshii/status/2021549823321886755)  2026-02-11T11:40Z 42.2K followers, 18.4K engagements


"Always inspect your PTX. For context: below is the instruction used for computing GEMM of nvfp4 x nvfp4 A and B. The instruction does [--] CTA GEMM and block16 is alias for scale_vec_size::4X. It delivers 4X more throughput on blackwell compared to FP8 tensor cores on hopper"  
[X Link](https://x.com/maharshii/status/2022014340807966975)  2026-02-12T18:26Z 42.2K followers, [----] engagements


"i love japanese food"  
[X Link](https://x.com/maharshii/status/2022531534604112117)  2026-02-14T04:41Z 42.2K followers, 20.9K engagements


"what are some fun and underrated things to do in blr kormangala considering im here solo for 2-3 days"  
[X Link](https://x.com/maharshii/status/2022558332171497532)  2026-02-14T06:27Z 42.2K followers, 11.1K engagements


"bit of chill time"  
[X Link](https://x.com/maharshii/status/2023487675433709779)  2026-02-16T20:00Z 42.2K followers, [----] engagements


"i wont be sharing it as of now because i need to check the accuracy of quant - dequant round trip on more matrices along with some other stuff :)"  
[X Link](https://x.com/mrsiipa/status/1985335344615874718)  2025-11-03T13:16Z 41.9K followers, [----] engagements


"@drisspg @gaunernst can you help me understand: i am able to pass input_scale and weight_scale to _scaled_mm but i don't see a way to pass the global scale (needed for nvfp4) what is the correct way to pass it because scale_result has no effect on it"  
[X Link](https://x.com/mrsiipa/status/1985718222277292141)  2025-11-04T14:38Z 41.9K followers, [---] engagements


"update: now using cublas instead of cutlass and i already see a big jump it achieves [---] PFLOPS with quantization overhead (thanks to fast quant kernel) and [---] PFLOPS for the gemm alone pretty cool. maybe i'm doing something wrong but i don't see NVFP4 gemm anywhere close to [--] TFLOPS even with CUTLASS on SM100 does anyone know why this is the case and what else can be done https://t.co/TvOtVkvlov"  
[X Link](https://x.com/mrsiipa/status/1985728016765632530)  2025-11-04T15:17Z 41.9K followers, 11.5K engagements


"@Leik0w0 @drisspg @gaunernst sure i actually might the link which @drisspg sent is pretty self explanatory by itself. i had to use torch nightly though. the accuracy is good too"  
[X Link](https://x.com/mrsiipa/status/1985729424764125242)  2025-11-04T15:22Z 41.9K followers, [---] engagements


"@drisspg @Leik0w0 @gaunernst just curious: its available only in torch nightly or the stable versions as well"  
[X Link](https://x.com/maharshii/status/1985733361751769297)  2025-11-04T15:38Z 41.9K followers, [---] engagements


"another interesting update: for irregular shapes like below that we see in practice the cutlass kernel that i have actually beats cublas on NVFP4"  
[X Link](https://x.com/mrsiipa/status/1986012708433719519)  2025-11-05T10:08Z 41.9K followers, [----] engagements


"made Wan [---] work with NVFP4 using my cutlass kernel and i do see a decent speedup with good enough accuracy i bet i'm missing something and it can go even faster"  
[X Link](https://x.com/maharshii/status/1986122938668782002)  2025-11-05T17:26Z 41.9K followers, 31.9K engagements


"NVFP4 quantization procedure for those who care: In essence it is a block-scaling method so a block of [--] consecutive elements gets assigned a scale factor. Given a tensor x each block b (= 16) of consecutive elements in higher precision gets quantized to NVFP4. There are two scale factors at play here: Global and Local. The global scale factor is a single FP32 value that moves the range of values that a block b can represent (FP4 value x FP8 scale). The local scale moves the individual values x_i to the FP4 representable range. For global scale tensor it is defined as: s_enc = (6 * 448) /"  
[X Link](https://x.com/maharshii/status/1986152319004856491)  2025-11-05T19:23Z 41.9K followers, 19.7K engagements
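The block-scaling procedure described in the post can be sketched in plain Python. This is a simplified model, not the author's kernel: scales are kept in FP32 rather than actually cast to FP8/E4M3, rounding is nearest-representable-level, and the function names are mine. The FP4 (E2M1) level set and the `(6 * 448) / amax` global-scale formula come from the post.

```python
# FP4 (E2M1) representable magnitudes; FP4 max is 6, FP8 (E4M3) max is 448.
FP4_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def nvfp4_quantize(x, block=16):
    """Quantize a flat list of floats in blocks of 16 (NVFP4-style sketch)."""
    tensor_amax = max(abs(v) for v in x) or 1.0
    s_enc = (6.0 * 448.0) / tensor_amax          # global FP32 scale factor
    q, scales = [], []
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        blk_amax = max(abs(v) for v in blk)
        # local per-block scale; stored as FP8 in hardware, FP32 here
        s_local = (blk_amax * s_enc) / 6.0 or 1.0
        for v in blk:
            t = v * s_enc / s_local              # map block max onto +/-6
            mag = min(FP4_LEVELS, key=lambda l: abs(l - abs(t)))
            q.append(mag if v >= 0 else -mag)
        scales.append(s_local)
    return q, scales, s_enc

def nvfp4_dequantize(q, scales, s_enc, block=16):
    """Round trip: FP4 value times local scale, divided by the global scale."""
    return [q[i] * scales[i // block] / s_enc for i in range(len(q))]
```

Because each block's maximum maps exactly onto the FP4 endpoint 6, the worst-case error per element is half the widest FP4 level gap (between 4 and 6) divided by the block's effective scale.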


"i cant believe everything i learned about NVFP4 was in the last [--] days and i managed to calibrate and run Wan [---] with coherent video outputs this is the fastest i have learnt something new (yet)"  
[X Link](https://x.com/mrsiipa/status/1986359084174676202)  2025-11-06T09:04Z 41.9K followers, [----] engagements


"in the current state the end-to-end speedup i get with NVFP4 on Wan [---] is [--] seconds so around 8% overall. in comparison per-row FP8 quantization gave me a speedup of at most [--] seconds. also i overcame the limitation of [---] x [--] tile shape in the quant kernel itself. made Wan [---] work with NVFP4 using my cutlass kernel and i do see a decent speedup with good enough accuracy i bet i'm missing something and it can go even faster https://t.co/o38KTo9xVB"  
[X Link](https://x.com/maharshii/status/1986431618106663004)  2025-11-06T13:52Z 41.9K followers, [----] engagements


"from triton's block scaled matmul tutorial"  
[X Link](https://x.com/maharshii/status/1986684402236334518)  2025-11-07T06:37Z 41.9K followers, [----] engagements


"the NVFP4 quant kernel achieves 3.2TB/s memory throughput and has a runtime of 33us which seems to be good enough: NCU complains about non-fused FP32 operations but all i see is FMUL so that might be a false alarm. oh wow https://t.co/jFu68EFBk1"  
[X Link](https://x.com/maharshii/status/1986786214289301593)  2025-11-07T13:22Z 41.9K followers, [----] engagements


"@sasuke___420 it complains about fmul operations only and i dont see fma ops that can be applied anywhere within the kernel"  
[X Link](https://x.com/mrsiipa/status/1987063772805275840)  2025-11-08T07:44Z 41.9K followers, [--] engagements


"if you only do what you can do youll never be more than what you are now"  
[X Link](https://x.com/maharshii/status/1987151253634490617)  2025-11-08T13:32Z 41.9K followers, [----] engagements


"when you bully someone youre bullying yourself and when you help someone youre helping yourself; when everything comes from the same source the illusion of distinction disappears"  
[X Link](https://x.com/maharshii/status/1987204930382475329)  2025-11-08T17:05Z 41.9K followers, [----] engagements


"why is tritons kernel launch cpu overhead so freaking high the actual kernel takes 10x less execution time than to launch it and i cant use cuda graphs because the shapes are dynamic"  
[X Link](https://x.com/maharshii/status/1987475939085996052)  2025-11-09T11:02Z 41.9K followers, 41.4K engagements


"update: it feels good to be a bit faster than TensorRT i might explain later (no promises though) why is tritons kernel launch cpu overhead so freaking high the actual kernel takes 10x less execution time than to launch it and i cant use cuda graphs because the shapes are dynamic."  
[X Link](https://x.com/maharshii/status/1987583748184089068)  2025-11-09T18:11Z 41.9K followers, [----] engagements


"idk who needs to hear this but you can register forward/backward pre and post hooks in pytorch modules if you want to intercept the values it's quite useful"  
[X Link](https://x.com/maharshii/status/1987884784849473659)  2025-11-10T14:07Z 41.9K followers, 10.7K engagements
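PyTorch's real API here is `Module.register_forward_pre_hook` / `register_forward_hook` (plus backward variants). The mechanism can be sketched without torch; `HookableModule` is a hypothetical stand-in of mine, but the return-value semantics mirror torch's: a hook returning non-None replaces the input (pre-hook) or output (post-hook).

```python
class HookableModule:
    """Minimal stand-in for a torch.nn.Module with forward hooks."""
    def __init__(self):
        self.pre_hooks, self.post_hooks = [], []

    def register_forward_pre_hook(self, fn):
        self.pre_hooks.append(fn)

    def register_forward_hook(self, fn):
        self.post_hooks.append(fn)

    def forward(self, x):
        return x * 2  # the module's actual computation

    def __call__(self, x):
        for h in self.pre_hooks:
            out = h(self, x)        # pre-hook may rewrite the input
            if out is not None:
                x = out
        y = self.forward(x)
        for h in self.post_hooks:
            out = h(self, x, y)     # post-hook sees input and output
            if out is not None:
                y = out
        return y

m = HookableModule()
seen = []
# intercept values without touching forward() itself
m.register_forward_hook(lambda mod, inp, out: seen.append((inp, out)))
m(3)  # records (3, 6)
```

This is exactly the interception pattern the post recommends: observe (or patch) activations from the outside instead of editing the module's `forward`.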


"quick question: will torch dynamo specialize ints and floats and 'bake' them into the graph i.e. not put guards on them when we compile with dynamic=True or not"  
[X Link](https://x.com/maharshii/status/1988219234363854959)  2025-11-11T12:16Z 41.9K followers, [----] engagements


"stop whatever you are doing right now and watch Frankenstein on netflix what an absolute masterpiece"  
[X Link](https://x.com/maharshii/status/1989532126879125896)  2025-11-15T03:13Z 41.9K followers, 61.6K engagements


"wow on consumer GPUs the speedups are quite insane with pure cudagraphs: this is very usable for the cases when there are unnecessary kernel launch overheads"  
[X Link](https://x.com/maharshii/status/1989627817852801514)  2025-11-15T09:33Z 41.9K followers, 13.1K engagements


"TMA transfers require 1024-byte alignment so if you wanna make this work for imperfect shapes you will need to manually align the SMEM to [----] bytes and expect only the required bytes for the transfer but this is amazing for educational purposes"  
[X Link](https://x.com/maharshii/status/1990807620526154177)  2025-11-18T15:41Z 41.9K followers, [----] engagements
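The manual round-up the post refers to is the standard align-up computation. A tiny helper as a sketch (the name `align_up` is mine; the 1024-byte figure is the post's claim about TMA, and the bitmask form requires a power-of-two alignment):

```python
def align_up(nbytes: int, alignment: int = 1024) -> int:
    """Round nbytes up to the next multiple of alignment (power of two)."""
    return (nbytes + alignment - 1) & ~(alignment - 1)
```

For example, a 1000-byte SMEM buffer would be padded to 1024 bytes, and only the required bytes would then be expected from the transfer.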


"we have had more cloud outages ever since AI vibe coding happened what if the way AI improves humanity is by bringing all of the internet down"  
[X Link](https://x.com/maharshii/status/1990889673472815130)  2025-11-18T21:07Z 42.1K followers, [----] engagements


"i love writing my own C++ templates but i hate reading someone elses C++ templates"  
[X Link](https://x.com/maharshii/status/1991056533313237157)  2025-11-19T08:10Z 41.9K followers, [----] engagements


"im insanely deep into the CUDA trenches but im very sure that the gains will be worth it"  
[X Link](https://x.com/maharshii/status/1991556645449789840)  2025-11-20T17:18Z 41.9K followers, 33K engagements


"an underrated part of this is: we can auto-generate CUDA kernels based on some predefined templates and configurations register them globally at load time and then dispatch to the correct kernel at runtime without any overhead. this is how flash attention does it too. im insanely deep into the CUDA trenches but im very sure that the gains will be worth it."  
[X Link](https://x.com/maharshii/status/1991573164678377660)  2025-11-20T18:23Z 42.1K followers, 39K engagements


"in my opinion the best way to use LLMs is to discuss about the problem with them rather than directly asking for the code the mere back and forth of ideas and suggestions will give you much more clarity than trying to one-shot the code"  
[X Link](https://x.com/maharshii/status/1992230943814463563)  2025-11-22T13:57Z 41.9K followers, 17.3K engagements


"nano banana [--] is insane"  
[X Link](https://x.com/maharshii/status/1992668710491009104)  2025-11-23T18:56Z 42.1K followers, 24.3K engagements


"writing an inference engine is a spiritual experience but the feeling is even greater when every building block and every kernel comes together to work like magic"  
[X Link](https://x.com/maharshii/status/1992967977915064822)  2025-11-24T14:46Z 42.1K followers, [----] engagements


"it takes less than [--] second to generate an image with this model on fal thanks to our inference engine the image quality is top-tier as well. 1/ [--] We are pleased to introduce Z-Image an efficient 6-billion-parameter foundation model for image generation. Through systematic optimization it proves that top-tier performance is achievable without relying on enormous model sizes delivering strong results in https://t.co/OhUDL3PfSH"  
[X Link](https://x.com/maharshii/status/1993959059414999513)  2025-11-27T08:24Z 42.1K followers, [----] engagements


"how does [--] second flux2 inference sound 👀"  
[X Link](https://x.com/maharshii/status/1994035125168775627)  2025-11-27T13:26Z 42.1K followers, 16.2K engagements


"flux [--] dev 4:3 image [--] steps [---] seconds how does [--] second flux2 inference sound 👀"  
[X Link](https://x.com/maharshii/status/1994399733008531732)  2025-11-28T13:35Z 42.1K followers, [----] engagements


"flux [--] dev 9:16 image [--] steps [----] seconds flux [--] dev 4:3 image [--] steps [---] seconds https://t.co/uowWa3urcK"  
[X Link](https://x.com/maharshii/status/1994401285815312560)  2025-11-28T13:41Z 42.1K followers, 10.1K engagements


"@cheatyyyy thanks you can try 1:1 too it's still at 1.6s so sub two-second"  
[X Link](https://x.com/maharshii/status/1994404496949366974)  2025-11-28T13:54Z 42.1K followers, [--] engagements


"self referential prompt with flux [--] on fal 9:16 image for [--] steps takes only [----] seconds to generate flux [--] dev 9:16 image [--] steps [----] seconds https://t.co/8ZbLW4jkF2"  
[X Link](https://x.com/maharshii/status/1994799871757619424)  2025-11-29T16:05Z 42.1K followers, [----] engagements


"falcon effect: all these images are generated in [---] seconds with flux2 on fal how does [--] second flux2 inference sound 👀"  
[X Link](https://x.com/maharshii/status/1994907612044038403)  2025-11-29T23:13Z 42.1K followers, [----] engagements


"in summer [----] when i started posting about my projects like smolgrad numpy in C blogs on learning CUDA and so on my motivation to post was not at all for money. it was to find like-minded people who had an actual interest in building these kind of projects. it was to get a chance to discuss cool things with cool people. its really sad to see the state of indian tech twitter (and tpot) reach the point where instead of discussing about the nuances of tech interesting projects and trying something new people are instead focused on selling their souls getting the next banger and the next $10k"  
[X Link](https://x.com/maharshii/status/1995102423141249232)  2025-11-30T12:07Z 42.1K followers, 129.6K engagements


"holy shit if you showed me this image without the gemini watermark i would have believed that it is a real photo tried JSON prompting with nano banana pro to get highly specific image. the result is pretty insane. https://t.co/thbDTloAcz"  
[X Link](https://x.com/maharshii/status/1996186586686586988)  2025-12-03T11:55Z 42.1K followers, [----] engagements


"bytedances async ulysses attention is deceptively simple to understand and when you have a faster all-to-all kernel than NCCL the communication can be very well overlapped with computation"  
[X Link](https://x.com/maharshii/status/1996280889962365380)  2025-12-03T18:10Z 42.1K followers, 10.1K engagements


"in higher dimensional spaces everyone is equally far away from you. as the dimensions grow the distance between the nearest and the farthest point approaches zero difference"  
[X Link](https://x.com/maharshii/status/1996518992404779456)  2025-12-04T09:56Z 42.1K followers, [----] engagements
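The distance-concentration claim is easy to check empirically: for points drawn uniformly in a unit cube, the nearest/farthest distance ratio from a query point approaches 1 as the dimension grows. A small pure-Python experiment (point counts and seed are arbitrary choices of mine):

```python
import random

def nearest_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of nearest to farthest Euclidean distance from a random query."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    q = [rng.random() for _ in range(dim)]
    d = sorted(sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 for p in pts)
    return d[0] / d[-1]  # tends toward 1 as dim grows

low = nearest_farthest_ratio(2)      # 2-D: nearest is much closer
high = nearest_farthest_ratio(1000)  # 1000-D: distances concentrate
```

In 2-D the ratio is tiny (some point lands almost on top of the query); in 1000-D all distances cluster around the same value, so the ratio is close to 1.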


"you can compile flash attention [--] in under 1-2 minutes on H100 by simply setting some environment variables it will only compile for the options and head dims you want e.g. 128"  
[X Link](https://x.com/maharshii/status/1996552990279483648)  2025-12-04T12:11Z 42.1K followers, 11.5K engagements


"nano banana pro has really crossed the uncanny valley for me: i used the prompt which @Aval_utionary sent me to generate the below image describing dynamic dispatch of specialized CUDA kernels look at the details"  
[X Link](https://x.com/maharshii/status/1996960152751382789)  2025-12-05T15:09Z 42.1K followers, 18.4K engagements


"you can just fuse things"  
[X Link](https://x.com/maharshii/status/1997250392233754747)  2025-12-06T10:22Z 42.2K followers, 27.9K engagements


"everyones starting to get into cuda and gpu kernel programming recently and im all here for it we need more"  
[X Link](https://x.com/maharshii/status/1997637619665019374)  2025-12-07T12:01Z 42.1K followers, 20.3K engagements


"if you have an interest in molecular biology or just want to learn cool things about a field other than AI/ML to refresh your mind you should follow Aval she just published her first blog post here and is planning to write more i am publishing my first blog in my "Trust Me I'm a Scientist" series: how transcription works (DNA RNA proteins) topics covered: why DNA never leaves the nucleus what RNA actually does the step-by-step transcription process how cells manufacture proteins https://t.co/I34MkEPa3l"  
[X Link](https://x.com/maharshii/status/1998033713914065235)  2025-12-08T14:15Z 42.1K followers, 11.8K engagements


"it still feels crazy to me that we can have a mathematical model which treats antimatter as going backwards in time and the math works out perfectly"  
[X Link](https://x.com/maharshii/status/1998086733028888824)  2025-12-08T17:46Z 42.1K followers, [----] engagements


"idk if its just a mathematical tool but if antimatter is actually going backwards in time then the quantum possibility realized in a moment must be in coherence with all the past/future events so there wont be any free will or we are most likely missing something"  
[X Link](https://x.com/maharshii/status/1998087946776518932)  2025-12-08T17:51Z 42.1K followers, [----] engagements


"this variety pairing was one for the history books https://t.co/JVn3kEp7V7"  
[X Link](https://x.com/maharshii/status/1998351066417775005)  2025-12-09T11:16Z 42.3K followers, 571.4K engagements


"so torch 2.9.0 with cuda [----] calls cudnn attention backend by default but torch 2.9.0 with cuda [----] does not this is very interesting"  
[X Link](https://x.com/maharshii/status/1998722017894477879)  2025-12-10T11:50Z 42.2K followers, [----] engagements


"fun fact: [--] months ago i figured this out too independently and all of the existing rope kernels then treated QK as separate instead of combining them the results match with unsloth benches here. another such fusion is possible that fuses [--] kernels to only [--] for inference"  
[X Link](https://x.com/maharshii/status/1999060405382193190)  2025-12-11T10:15Z 42.1K followers, [----] engagements


"literally everyone on my TL is just using the recent drama to farm engagement by including indians in their posts nobody is better than anybody here"  
[X Link](https://x.com/maharshii/status/1999786369472995622)  2025-12-13T10:19Z 42.1K followers, [----] engagements


"4 bits is all you need"  
[X Link](https://x.com/maharshii/status/2000191221340496145)  2025-12-14T13:08Z 42.1K followers, [----] engagements


"watched dhurandhar yesterday and it deserves all the hype it gets india needs more movies like this to awaken the people"  
[X Link](https://x.com/maharshii/status/2000860160252919919)  2025-12-16T09:26Z 42.4K followers, 23.2K engagements


"i love japanese food"  
[X Link](https://x.com/maharshii/status/2002274199675810002)  2025-12-20T07:05Z 42.4K followers, 252.9K engagements


"my [----] wrapped: - job at a dream company (fal) - had my first international trip (thailand) - learnt a ton about ML optimizations - wrote (and maintaining) an inference engine from scratch - wrote more and more custom GPU kernels - making the dad you can quit money - got to 40k+ followers on X - still posting quality content on X - travelled to a lot of different cities - met and befriended amazingly talented people im always grateful lets see what [----] brings. my [----] wrapped: - bought a house - remote job at a place i like - can research about ML/DL on weekends - built some cool things"  
[X Link](https://x.com/maharshii/status/2004131890580934998)  2025-12-25T10:07Z 42.3K followers, 94.5K engagements


"@rajvishah30 true another way is to just embrace whatever you are feeling at the moment recently whenever i get overwhelmed by something i always ask myself what is it that im feeling and that works quite good for me"  
[X Link](https://x.com/maharshii/status/2005005680680550573)  2025-12-27T19:59Z 42.4K followers, [---] engagements


"i can really use some motivation what are some cool projects you are working on currently"  
[X Link](https://x.com/maharshii/status/2005641166973444473)  2025-12-29T14:04Z 42.3K followers, 15.1K engagements


"saved this image right before my X feed was auto refreshed generational clutch"  
[X Link](https://x.com/maharshii/status/2006319704756138121)  2025-12-31T11:01Z 42.2K followers, 111.5K engagements


"the saying sufficiently advanced technology is indistinguishable from magic hits a bit harder for computers and their chips than for anything else"  
[X Link](https://x.com/maharshii/status/2006665146199122380)  2026-01-01T09:53Z 42.3K followers, [----] engagements


"for the people who don't know cuda [----] on blackwell (SM100 and higher) added support for 256-bit vectorized load instruction compared to the 128-bit loads that you can write as inline PTX and it emits the corresponding SASS"  
[X Link](https://x.com/maharshii/status/2010024279421923451)  2026-01-10T16:21Z 42.2K followers, 13.7K engagements


"after refactoring the divisions to multiplications along with using the approximate reciprocal instruction my custom kernel became 2-3% faster. there's also a div.approx.f32 instruction which i need to test"  
[X Link](https://x.com/maharshii/status/2010314586520723622)  2026-01-11T11:35Z 42.3K followers, [----] engagements
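The division-to-multiplication refactor described above can be sketched in plain Python (an illustration, not the author's kernel; `refine_reciprocal` and `divide_via_reciprocal` are hypothetical names): hardware approximate-reciprocal instructions such as PTX `rcp.approx.f32` return a rough estimate, and one Newton-Raphson step recovers most of the lost precision.

```python
# Sketch: replace per-element division x / d with x * (1 / d), computing
# the reciprocal once. Approximate-reciprocal instructions give a
# low-precision estimate; one Newton-Raphson step r' = r * (2 - d * r)
# roughly doubles the number of correct bits.

def refine_reciprocal(d: float, r_approx: float) -> float:
    """One Newton-Raphson refinement step for a reciprocal estimate."""
    return r_approx * (2.0 - d * r_approx)

def divide_via_reciprocal(xs, d):
    """Replace many divisions by one reciprocal plus many multiplications."""
    r = 1.0 / d  # stand-in for an approximate rcp instruction
    r = refine_reciprocal(d, r)
    return [x * r for x in xs]
```

For an already-exact reciprocal the refinement step is a no-op; for a rough estimate it pulls the value toward the true reciprocal, which is why pairing `rcp.approx` with one refinement is a common pattern.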


"i like NVFP4 more than MXFP4 and i like FP4 more than FP8 sorry i dont make the rules"  
[X Link](https://x.com/maharshii/status/2010806112124358790)  2026-01-12T20:08Z 42.2K followers, [----] engagements


"@Aval_utionary elite ball knowledge"  
[X Link](https://x.com/maharshii/status/2014054527133958320)  2026-01-21T19:16Z 42.2K followers, [---] engagements


"You can just learn things by yourself"  
[X Link](https://x.com/maharshii/status/1805084233687257446)  2024-06-24T03:43Z 42.3K followers, 59.5K engagements


"can't believe depth segmentation can be done in only [--] milliseconds"  
[X Link](https://x.com/maharshii/status/1829504574832669123)  2024-08-30T13:00Z 42.1K followers, 499.3K engagements


""Kernel [--] - Online softmax" fuses the first two passes over the row to only one pass to calculate the max and norm values. It exploits the property of multiplying exponentials. It results in 28.12% speedup compared to naive kernel"  
[X Link](https://x.com/maharshii/status/1875867782509998573)  2025-01-05T11:31Z 42.2K followers, [---] engagements
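The fusion described in the post can be sketched in plain Python (a minimal illustration, not the linked kernel): the max and the normalizer are computed in a single pass by rescaling the running sum whenever the running max changes, exploiting exp(x - m) = exp(x - m') * exp(m' - m).

```python
import math

# Online softmax: one pass computes both the row max and the normalizer.
# When a new element raises the running max, the running sum is rescaled
# by exp(old_max - new_max).

def online_max_and_norm(row):
    m = float("-inf")  # running max
    s = 0.0            # running sum of exp(x - m)
    for x in row:
        m_new = max(m, x)
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, s

def softmax(row):
    m, s = online_max_and_norm(row)
    return [math.exp(x - m) / s for x in row]
```

This is the same recurrence that a two-pass (max pass, then sum pass) implementation computes, just interleaved, which is what makes the fused kernel possible.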


"hands down the best definition of a tensor"  
[X Link](https://x.com/maharshii/status/1922652408246677996)  2025-05-14T13:57Z 42.2K followers, 81.2K engagements


"ltx video [----] distilled is incredibly good for generating super long videos"  
[X Link](https://x.com/maharshii/status/1947916026890621443)  2025-07-23T07:05Z 42.2K followers, [----] engagements


"thank me later: when you have a function like below which uses float/int arguments then by default torch will create guards around it that uses _as_tensor_fullprec to convert the float/int to a tensor with full precision i.e. int64 or float64 which can increase the dynamo cache lookup time by a lot when executing the compiled function. this can be easily mitigated by setting specialize_int and specialize_float to True in dynamo config"  
[X Link](https://x.com/maharshii/status/1988633532307304451)  2025-11-12T15:42Z 42.2K followers, 15.4K engagements
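A minimal sketch of the mitigation from the post, assuming a recent PyTorch 2.x (`scale_and_shift` is an illustrative function, not from the thread); this is a config fragment rather than a complete benchmark:

```python
import torch

# Per the post: by default dynamo may guard scalar float/int arguments
# via _as_tensor_fullprec (0-d float64/int64 tensors), which inflates
# cache lookup time. Specializing bakes the scalars into the graph.
torch._dynamo.config.specialize_int = True
torch._dynamo.config.specialize_float = True

@torch.compile
def scale_and_shift(x: torch.Tensor, scale: float, shift: float):
    return x * scale + shift
```

The trade-off is that each distinct scalar value now triggers its own recompilation, so this helps most when the scalars are effectively constants.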


"@fal fal is so ahead on every front"  
[X Link](https://x.com/maharshii/status/1993672327293534497)  2025-11-26T13:24Z 42.2K followers, [----] engagements


"i noticed faster compilation times for different shapes if i treat custom kernels (having inline asm elementwise) as opaque torch operators when fullgraph=True and dynamic=True"  
[X Link](https://x.com/maharshii/status/2017573681636237382)  2026-01-31T12:20Z 42.3K followers, [----] engagements


"to the people/bots hating on my indias 5G post cope harder"  
[X Link](https://x.com/anyuser/status/1939767690090152254)  2025-06-30T19:27Z 42.2K followers, 397.9K engagements


"nothing humbles me more than this picture"  
[X Link](https://x.com/maharshii/status/1974075757900263696)  2025-10-03T11:35Z 42.2K followers, 135.5K engagements

Limited data mode. Full metrics available with subscription: lunarcrush.com/pricing

Top accounts mentioned or mentioned by @mobicham @drisspg @gaunernst @leik0w0 @avalutionary @snowclipsed @ailker @dejavucoder @sasuke420 @cheatyyyy @rajvishah30 @fal @isidentical @burkaygur @dprophecyguy @caneystudios @svarunid @alpindale @pdt2211 @adsha44

### Top Social Posts

Top posts by engagements in the last [--] hours

"sometimes youre just scrolling and then boom you get hit by the deepest most thought provoking sentence you pity the moth confusing a lamp for the moon yet here you are confusing a screen for the world"
X Link 2026-02-06T17:13Z 42.2K followers, 72.4K engagements

"the concept of time does not exist in the 4th dimension and inside an airport"
X Link 2026-02-07T23:03Z 42.2K followers, [----] engagements

"this was an eye-opening [--] hour [--] minutes video none of what he explained would contradict a reasonable mind"
X Link 2026-02-03T09:37Z 42.2K followers, 226.3K engagements

"im publishing a new blog post on this insanely useful feature of triton: it is what makes the custom triton NVFP4 quant kernel go hand-in-hand or beat CUDA. many people may not be aware about it so go read https://blog.fal.ai/instruction-level-control-with-inline-elementwise-asm-in-triton/"
X Link 2026-02-10T16:55Z 42.2K followers, 10.6K engagements

"i explain the usage of inline elementwise asm in triton using [--] examples in the blog post: - single instruction (rcp) - multiple instructions and packing - nvfp4 quantization kernel read and let me know im publishing a new blog post on this insanely useful feature of triton: it is what makes the custom triton NVFP4 quant kernel go hand-in-hand or beat CUDA. many people may not be aware about it so go read https://t.co/4CskBUOSlg"
X Link 2026-02-11T13:43Z 42.2K followers, [----] engagements

"train and inference GPT in [--] line of python code: exec("import osmathrandomurllib.request as u;random.seed(1);e=16;b=8;s=1000nif not 'i')nt=open('i').read().split();c='''$'+sorted(set(''.join(t)));n=len(c);d=x:i for ix in enumerate(c);r=i:x for xi in d.items()nclass V:n def init(axp=()):a.x=x;a.g=0;a.p=pn def add(ao):o=o if isinstance(oV) else V(o);z=V(a.x+o.x(ao));z.b=lambda:(setattr(a'g'a.g+z.g)setattr(o'g'o.g+z.g));return zn def mul(ao):o=o if isinstance(oV) else V(o);z=V(a.xo.x(ao));z.b=lambda:(setattr(a'g'a.g+o.xz.g)setattr(o'g'o.g+a.x*z.g));return zn def"
X Link 2026-02-12T17:21Z 42.2K followers, 13.1K engagements

"the butterfly the sun the books he is standing on the universe he himself everything is god https://t.co/ZsJj6jVcck"
X Link 2026-02-13T10:55Z 42.2K followers, [----] engagements

"i think instead of doing courses for technical subjects people should devote some more time and learn things from books. if you want to go into the trenches of any technical subject the amount of knowledge and detail a book can give is unmatched"
X Link 2025-05-04T13:50Z 42.2K followers, 40K engagements

"i have a secret to get to 80-100LPA in india and its not DSA some people might guess what it is. DSA is still the only source of truth if you want to earn 40-50LPA as a fresher in the tech industry"
X Link 2025-10-08T12:31Z 42.2K followers, 393.1K engagements

"@mobicham yes true but will the lookup table method be faster than the instruction here (if you have experimented with that)"
X Link 2026-02-12T12:47Z 42.2K followers, [---] engagements

"@mobicham i think this is for gemv do you have a triton kernel for quantizing to fp4 with lookup table or you are writing it in pytorch and doing fullgraph compile"
X Link 2026-02-12T13:26Z 42.2K followers, [--] engagements

"@mobicham i see will test which one is faster for sm100"
X Link 2026-02-12T13:42Z 42.2K followers, [--] engagements

"@snowclipsed conditional on the shapes or speed yeah its kinda absurd but thats all we have so 😭"
X Link 2026-02-12T18:46Z 42.2K followers, [---] engagements

"@ailker roll credits 😂"
X Link 2026-02-12T22:40Z 42.2K followers, [---] engagements

"@dejavucoder yessir i was here 2h ago"
X Link 2026-02-14T05:26Z 42.2K followers, [----] engagements

"imagine showing this to a caveman"
X Link 2026-02-12T17:26Z 42.2K followers, [----] engagements

"the bad news is that nothing matters the good news is that nothing matters"
X Link 2026-02-15T03:45Z 42.2K followers, [----] engagements

"many people may not be aware that 'smartly' written triton (in less than [---] lines) can achieve the same speed as a CUDA kernel that contains more than [----] lines of code should i write a blog post on this"
X Link 2026-02-04T13:24Z 42.2K followers, 39.1K engagements

"this is not bad at all many people may not be aware that 'smartly' written triton (in less than [---] lines) can achieve the same speed as a CUDA kernel that contains more than [----] lines of code should i write a blog post on this https://t.co/WeyqETJGsx"
X Link 2026-02-05T15:13Z 42.2K followers, [----] engagements

"this is a much better comparison: on bigger shapes i'm able to beat 2k lines of cuda with [---] lines of triton although for smaller shapes there's some overhead that i need to eliminate many people may not be aware that 'smartly' written triton (in less than [---] lines) can achieve the same speed as a CUDA kernel that contains more than [----] lines of code should i write a blog post on this https://t.co/WeyqETJGsx"
X Link 2026-02-06T14:15Z 42.2K followers, 10.5K engagements

"wrote a custom CUDA kernel that uses 256-bit gmem loads starting from version [----] on blackwell and for smaller shapes it is indeed faster than triton until the shape (8192 8192). it still seems to be slower on larger shapes i bet there is room for more speed. this is a much better comparison: on bigger shapes i'm able to beat 2k lines of cuda with [---] lines of triton although for smaller shapes there's some overhead that i need to eliminate https://t.co/w2nuw9DJ7P"
X Link 2026-02-10T15:15Z 42.2K followers, [----] engagements

"it is a short n sweet kernel (200 lines) including all the inline PTX shenanigans :D"
X Link 2026-02-10T15:18Z 42.2K followers, [---] engagements

"seedance [---] has passed the uncanny valley for me its so good i wanna see what kind of dataset is it trained on"
X Link 2026-02-11T11:40Z 42.2K followers, 18.4K engagements

"Always inspect your PTX. For context: below is the instruction used for computing GEMM of nvfp4 x nvfp4 A and B. The instruction does [--] CTA GEMM and block16 is alias for scale_vec_size::4X. It delivers 4X more throughput on blackwell compared to FP8 tensor cores on hopper"
X Link 2026-02-12T18:26Z 42.2K followers, [----] engagements

"i love japanese food"
X Link 2026-02-14T04:41Z 42.2K followers, 20.9K engagements

"what are some fun and underrated things to do in blr kormangala considering im here solo for 2-3 days"
X Link 2026-02-14T06:27Z 42.2K followers, 11.1K engagements

"bit of chill time"
X Link 2026-02-16T20:00Z 42.2K followers, [----] engagements

"i wont be sharing it as of now because i need to check the accuracy of quant - dequant round trip on more matrices along with some other stuff :)"
X Link 2025-11-03T13:16Z 41.9K followers, [----] engagements

"@drisspg @gaunernst can you help me understand: i am able to pass input_scale and weight_scale to _scaled_mm but i don't see a way to pass the global scale (needed for nvfp4) what is the correct way to pass it because scale_result has no effect on it"
X Link 2025-11-04T14:38Z 41.9K followers, [---] engagements

"update: now using cublas instead of cutlass and i already see a big jump it achieves [---] PFLOPS with quantization overhead (thanks to fast quant kernel) and [---] PFLOPS for the gemm alone pretty cool. maybe i'm doing something wrong but i don't see NVFP4 gemm anywhere close to [--] TFLOPS even with CUTLASS on SM100 does anyone know why this is the case and what else can be done https://t.co/TvOtVkvlov"
X Link 2025-11-04T15:17Z 41.9K followers, 11.5K engagements

"@Leik0w0 @drisspg @gaunernst sure i actually might the link which @drisspg sent is pretty self explanatory by itself. i had to use torch nightly though. the accuracy is good too"
X Link 2025-11-04T15:22Z 41.9K followers, [---] engagements

"@drisspg @Leik0w0 @gaunernst just curious: its available only in torch nightly or the stable versions as well"
X Link 2025-11-04T15:38Z 41.9K followers, [---] engagements

"another interesting update: for irregular shapes like below that we see in practice the cutlass kernel that i have actually beats cublas on NVFP4"
X Link 2025-11-05T10:08Z 41.9K followers, [----] engagements

"made Wan [---] work with NVFP4 using my cutlass kernel and i do see a decent speedup with good enough accuracy i bet i'm missing something and it can go even faster"
X Link 2025-11-05T17:26Z 41.9K followers, 31.9K engagements

"NVFP4 quantization procedure for those who care: In essence it is a block-scaling method so a block of [--] consecutive elements gets assigned a scale factor. Given a tensor x each block b (= 16) of consecutive elements in higher precision gets quantized to NVFP4. There are two scale factors at play here: Global and Local. The global scale factor is a single FP32 value that moves the range of values that a block b can represent (FP4 value x FP8 scale). The local scale moves the individual values x_i to the FP4 representable range. For global scale tensor it is defined as: s_enc = (6 * 448) /"
X Link 2025-11-05T19:23Z 41.9K followers, 19.7K engagements
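The procedure in the post can be sketched in plain Python (a simplified illustration in which float arithmetic stands in for the real FP4/FP8 encodings; the function names are mine): FP4 E2M1 represents the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}, the global FP32 scale is s_enc = (6 * 448) / amax, and each 16-element block gets a local scale that maps it into the FP4 range.

```python
# Simplified sketch of NVFP4-style two-level block scaling.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes
FP4_MAX, FP8_MAX, BLOCK = 6.0, 448.0, 16

def to_fp4(x):
    """Round x to the nearest representable FP4 (E2M1) value."""
    sign = -1.0 if x < 0 else 1.0
    return sign * min(FP4_VALUES, key=lambda v: abs(v - abs(x)))

def quantize_nvfp4(xs):
    amax = max(abs(x) for x in xs) or 1.0
    s_global = (FP4_MAX * FP8_MAX) / amax   # s_enc from the post
    out, scales = [], []
    for i in range(0, len(xs), BLOCK):
        block = xs[i:i + BLOCK]
        bmax = max(abs(x) for x in block) or 1.0
        s_local = bmax / FP4_MAX  # maps the block into FP4 range
        # in real NVFP4 the local scale is itself stored at FP8 precision
        # relative to s_global; this sketch keeps full float precision
        scales.append(s_local)
        out.append([to_fp4(x / s_local) for x in block])
    return out, scales, s_global

def dequantize(blocks, scales):
    return [q * s for block, s in zip(blocks, scales) for q in block]
```

The quant/dequant round trip is lossy by construction: the worst-case error per element is half the largest FP4 gap (between 4 and 6) times the block's local scale.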

"i cant believe everything i learned about NVFP4 was in the last [--] days and i managed to calibrate and run Wan [---] with coherent video outputs this is the fastest i have learnt something new (yet)"
X Link 2025-11-06T09:04Z 41.9K followers, [----] engagements

"in the current state the end-to-end speedup i get with NVFP4 on Wan [---] is [--] seconds so around 8% overall. in comparison per-row FP8 quantization gave me a speedup of at most [--] seconds. also i overcame the limitation of [---] x [--] tile shape in the quant kernel itself. made Wan [---] work with NVFP4 using my cutlass kernel and i do see a decent speedup with good enough accuracy i bet i'm missing something and it can go even faster https://t.co/o38KTo9xVB"
X Link 2025-11-06T13:52Z 41.9K followers, [----] engagements

"from triton's block scaled matmul tutorial"
X Link 2025-11-07T06:37Z 41.9K followers, [----] engagements

"the NVFP4 quant kernel achieves 3.2TB/s memory throughput and has a runtime of 33us which seems to be good enough: NCU complains about non-fused FP32 operations but all i see is FMUL so that might be a false alarm. oh wow https://t.co/jFu68EFBk1"
X Link 2025-11-07T13:22Z 41.9K followers, [----] engagements

"@sasuke___420 it complains about fmul operations only and i dont see fma ops that can be applied anywhere within the kernel"
X Link 2025-11-08T07:44Z 41.9K followers, [--] engagements

"if you only do what you can do youll never be more than what you are now"
X Link 2025-11-08T13:32Z 41.9K followers, [----] engagements

"when you bully someone youre bullying yourself and when you help someone youre helping yourself; when everything comes from the same source the illusion of distinction disappears"
X Link 2025-11-08T17:05Z 41.9K followers, [----] engagements

"why is tritons kernel launch cpu overhead so freaking high the actual kernel takes 10x less execution time than to launch it and i cant use cuda graphs because the shapes are dynamic"
X Link 2025-11-09T11:02Z 41.9K followers, 41.4K engagements

"update: it feels good to be a bit faster than TensorRT i might explain later (no promises though) why is tritons kernel launch cpu overhead so freaking high the actual kernel takes 10x less execution time than to launch it and i cant use cuda graphs because the shapes are dynamic"
X Link 2025-11-09T18:11Z 41.9K followers, [----] engagements

"idk who needs to hear this but you can register forward/backward pre and post hooks in pytorch modules if you want to intercept the values it's quite useful"
X Link 2025-11-10T14:07Z 41.9K followers, 10.7K engagements

"quick question: will torch dynamo specialize ints and floats and 'bake' them into the graph i.e. not put guards on them when we compile with dynamic=True or not"
X Link 2025-11-11T12:16Z 41.9K followers, [----] engagements

"stop whatever you are doing right now and watch Frankenstein on netflix what an absolute masterpiece"
X Link 2025-11-15T03:13Z 41.9K followers, 61.6K engagements

"wow on consumer GPUs the speedups are quite insane with pure cudagraphs: this is very usable for the cases when there are unnecessary kernel launch overheads"
X Link 2025-11-15T09:33Z 41.9K followers, 13.1K engagements

"TMA transfers require 1024-byte alignment so if you wanna make this work for imperfect shapes you will need to manually align the SMEM to [----] bytes and expect only the required bytes for the transfer but this is amazing for educational purposes"
X Link 2025-11-18T15:41Z 41.9K followers, [----] engagements
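The manual SMEM alignment mentioned above boils down to rounding a byte count up to a required power-of-two boundary. A generic helper (my own illustration, not fal/NVIDIA code):

```python
# Round a byte count up to a power-of-two alignment, e.g. for sizing a
# shared-memory buffer that a TMA-style transfer must start on.

def align_up(nbytes: int, alignment: int) -> int:
    assert alignment & (alignment - 1) == 0, "alignment must be a power of two"
    return (nbytes + alignment - 1) & ~(alignment - 1)
```

The bit trick works because for a power-of-two alignment, `~(alignment - 1)` masks off the low bits after adding `alignment - 1`.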

"we have had more cloud outages ever since AI vibe coding happened what if the way AI improves humanity is by bringing all of the internet down"
X Link 2025-11-18T21:07Z 42.1K followers, [----] engagements

"i love writing my own C++ templates but i hate reading someone elses C++ templates"
X Link 2025-11-19T08:10Z 41.9K followers, [----] engagements

"im insanely deep into the CUDA trenches but im very sure that the gains will be worth it"
X Link 2025-11-20T17:18Z 41.9K followers, 33K engagements

"an underrated part of this is: we can auto-generate CUDA kernels based on some predefined templates and configurations register them globally at load time and then dispatch to the correct kernel at runtime without any overhead. this is how flash attention does it too. im insanely deep into the CUDA trenches but im very sure that the gains will be worth it"
X Link 2025-11-20T18:23Z 42.1K followers, 39K engagements

"in my opinion the best way to use LLMs is to discuss about the problem with them rather than directly asking for the code the mere back and forth of ideas and suggestions will give you much more clarity than trying to one-shot the code"
X Link 2025-11-22T13:57Z 41.9K followers, 17.3K engagements

"nano banana [--] is insane"
X Link 2025-11-23T18:56Z 42.1K followers, 24.3K engagements

"writing an inference engine is a spiritual experience but the feeling is even greater when every building block and every kernel comes together to work like magic"
X Link 2025-11-24T14:46Z 42.1K followers, [----] engagements

"it takes less than [--] second to generate an image with this model on fal thanks to our inference engine the image quality is top-tier as well. 1/ [--] We are pleased to introduce Z-Image an efficient 6-billion-parameter foundation model for image generation. Through systematic optimization it proves that top-tier performance is achievable without relying on enormous model sizes delivering strong results in https://t.co/OhUDL3PfSH"
X Link 2025-11-27T08:24Z 42.1K followers, [----] engagements

"how does [--] second flux2 inference sound 👀"
X Link 2025-11-27T13:26Z 42.1K followers, 16.2K engagements

"flux [--] dev 4:3 image [--] steps [---] seconds how does [--] second flux2 inference sound 👀"
X Link 2025-11-28T13:35Z 42.1K followers, [----] engagements

"flux [--] dev 9:16 image [--] steps [----] seconds flux [--] dev 4:3 image [--] steps [---] seconds https://t.co/uowWa3urcK"
X Link 2025-11-28T13:41Z 42.1K followers, 10.1K engagements

"@cheatyyyy thanks you can try 1:1 too it's still at 1.6s so sub two-second"
X Link 2025-11-28T13:54Z 42.1K followers, [--] engagements

"self referential prompt with flux [--] on fal 9:16 image for [--] steps takes only [----] seconds to generate flux [--] dev 9:16 image [--] steps [----] seconds https://t.co/8ZbLW4jkF2"
X Link 2025-11-29T16:05Z 42.1K followers, [----] engagements

"falcon effect: all these images are generated in [---] seconds with flux2 on fal how does [--] second flux2 inference sound 👀"
X Link 2025-11-29T23:13Z 42.1K followers, [----] engagements

"in summer [----] when i started posting about my projects like smolgrad numpy in C blogs on learning CUDA and so on my motivation to post was not at all for money. it was to find like-minded people who had an actual interest in building these kind of projects. it was to get a chance to discuss cool things with cool people. its really sad to see the state of indian tech twitter (and tpot) reach the point where instead of discussing about the nuances of tech interesting projects and trying something new people are instead focused on selling their souls getting the next banger and the next $10k"
X Link 2025-11-30T12:07Z 42.1K followers, 129.6K engagements

"holy shit if you showed me this image without the gemini watermark i would have believed that it is a real photo tried JSON prompting with nano banana pro to get highly specific image. the result is pretty insane. https://t.co/thbDTloAcz"
X Link 2025-12-03T11:55Z 42.1K followers, [----] engagements

"bytedances async ulysses attention is deceptively simple to understand and when you have a faster all-to-all kernel than NCCL the communication can be very well overlapped with computation"
X Link 2025-12-03T18:10Z 42.1K followers, 10.1K engagements

"in higher dimensional spaces everyone is equally far away from you. as the dimensions grow the distance between the nearest and the farthest point approaches zero difference"
X Link 2025-12-04T09:56Z 42.1K followers, [----] engagements
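The claim can be checked with a tiny simulation (my own sketch; `relative_contrast` is a name I made up): draw points uniformly in the unit hypercube and compare the nearest and farthest distances from a query point. As the dimension grows, the relative gap shrinks toward zero.

```python
import random

# "Distance concentration": in high dimensions the nearest and farthest
# neighbors of a query point are almost equally far away.

def relative_contrast(dim, n_points=200, seed=0):
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    q = [rng.random() for _ in range(dim)]
    dists = [sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 for p in pts]
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min  # shrinks toward 0 as dim grows
```

Comparing, say, `relative_contrast(2)` against `relative_contrast(1000)` makes the effect obvious: in 2D the contrast is large, in 1000D the distances bunch tightly around their mean.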

"you can compile flash attention [--] in under 1-2 minutes on H100 by simply setting some environment variables it will only compile for the options and head dims you want e.g. 128"
X Link 2025-12-04T12:11Z 42.1K followers, 11.5K engagements

"nano banana pro has really crossed the uncanny valley for me: i used the prompt which @Aval_utionary sent me to generate the below image describing dynamic dispatch of specialized CUDA kernels look at the details"
X Link 2025-12-05T15:09Z 42.1K followers, 18.4K engagements

"you can just fuse things"
X Link 2025-12-06T10:22Z 42.2K followers, 27.9K engagements

"everyones starting to get into cuda and gpu kernel programming recently and im all here for it we need more"
X Link 2025-12-07T12:01Z 42.1K followers, 20.3K engagements

"if you have an interest in molecular biology or just want to learn cool things about a field other than AI/ML to refresh your mind you should follow Aval she just published her first blog post here and is planning to write more i am publishing my first blog in my "Trust Me I'm a Scientist" series: how transcription works (DNA RNA proteins) topics covered: why DNA never leaves the nucleus what RNA actually does the step-by-step transcription process how cells manufacture proteins https://t.co/I34MkEPa3l"
X Link 2025-12-08T14:15Z 42.1K followers, 11.8K engagements

"it still feels crazy to me that we can have a mathematical model which treats antimatter as going backwards in time and the math works out perfectly"
X Link 2025-12-08T17:46Z 42.1K followers, [----] engagements

"idk if its just a mathematical tool but if antimatter is actually going backwards in time then the quantum possibility realized in a moment must be in coherence with all the past/future events so there wont be any free will or we are most likely missing something"
X Link 2025-12-08T17:51Z 42.1K followers, [----] engagements

"this variety pairing was one for the history books https://t.co/JVn3kEp7V7"
X Link 2025-12-09T11:16Z 42.3K followers, 571.4K engagements

"so torch 2.9.0 with cuda [----] calls cudnn attention backend by default but torch 2.9.0 with cuda [----] does not this is very interesting"
X Link 2025-12-10T11:50Z 42.2K followers, [----] engagements

"fun fact: [--] months ago i figured this out too independently and all of the existing rope kernels then treated QK as separate instead of combining them the results match with unsloth benches here. another such fusion is possible that fuses [--] kernels to only [--] for inference"
X Link 2025-12-11T10:15Z 42.1K followers, [----] engagements

"literally everyone on my TL is just using the recent drama to farm engagement by including indians in their posts nobody is better than anybody here"
X Link 2025-12-13T10:19Z 42.1K followers, [----] engagements

"4 bits is all you need"
X Link 2025-12-14T13:08Z 42.1K followers, [----] engagements

"watched dhurandhar yesterday and it deserves all the hype it gets india needs more movies like this to awaken the people"
X Link 2025-12-16T09:26Z 42.4K followers, 23.2K engagements

"i love japanese food"
X Link 2025-12-20T07:05Z 42.4K followers, 252.9K engagements
