@gaunernst
"@QGallouedec I believe u r referring to common non-fused PyTorch impl of RMSNorm. Actually it's doing normalization in FP32 but scaling in BF16. And I guess u found out that doing multiplication in BF16 or FP32 can lead to HUGE difference in the final logits (I was surprised too)"
X Link @gaunernst 2025-11-01T14:31Z 1529 followers, XXX engagements
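To make the distinction in the post above concrete, here is a minimal pure-Python sketch of the two scaling paths. It is an illustration under stated assumptions, not the code of any particular library: BF16 is emulated by rounding FP32 bit patterns to 16 bits (round-to-nearest-even), and the helper names `to_f32`, `to_bf16`, and `rmsnorm`, the `eps` value, and the sample inputs are all hypothetical. It only shows the per-element rounding difference in the norm output; how that propagates into the final logits depends on the rest of the model.

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Emulate BF16: round the FP32 bit pattern to its top 16 bits
    using round-to-nearest-even (finite inputs assumed)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # tie-breaking towards an even mantissa
    bits &= 0xFFFF0000                   # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rmsnorm(x_bf16, w_bf16, scale_in_fp32, eps=1e-6):
    """RMSNorm over one vector of BF16 activations with BF16 weights.

    The normalization always runs in (emulated) FP32. Only the final
    multiplication by the weight differs between the two paths.
    """
    xs = [to_f32(v) for v in x_bf16]
    mean_sq = to_f32(sum(v * v for v in xs) / len(xs))
    inv_rms = to_f32(1.0 / math.sqrt(mean_sq + eps))
    normed = [to_f32(v * inv_rms) for v in xs]

    if scale_in_fp32:
        # Multiply in FP32, round to BF16 once at the very end.
        return [to_bf16(to_f32(n * w)) for n, w in zip(normed, w_bf16)]

    # Cast the normalized value to BF16 first, then multiply in BF16:
    # two roundings instead of one.
    return [to_bf16(to_bf16(n) * w) for n, w in zip(normed, w_bf16)]


# Hypothetical sample data, already representable in BF16.
x = [to_bf16(math.sin(i) * 3.0) for i in range(64)]
w = [to_bf16(1.0 + 0.01 * i) for i in range(64)]

y_bf16 = rmsnorm(x, w, scale_in_fp32=False)
y_fp32 = rmsnorm(x, w, scale_in_fp32=True)

mismatches = sum(a != b for a, b in zip(y_bf16, y_fp32))
print(f"{mismatches} of {len(x)} outputs differ between the two scaling paths")
```

The only difference between the two branches is where the cast to BF16 happens. Scaling in BF16 rounds twice (once on the cast, once on the product), while scaling in FP32 rounds once, so individual outputs can land on neighbouring BF16 values.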
"Writing multi-GPU kernels is easier than I thought - you can just allocate memory, exchange IPC handles, then use it (assuming you have P2P)"
X Link @gaunernst 2025-11-02T14:36Z 1533 followers, 1304 engagements