
@exolabs "We can run these two stages on different devices: Prefill: DGX Spark (high compute device 4x compute) Decode: M3 Ultra (high memory-bandwidth device 3x memory-bandwidth) However now we need to transfer the KV cache over the network (10GbE). This introduces a delay"
X Link @exolabs 2025-10-15T18:18Z 37.8K followers, 9040 engagements
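
For a sense of scale, here is a minimal back-of-envelope sketch (not EXO's code) of that KV-cache transfer delay over 10GbE. The model shape is an assumption (a Llama-3-70B-like config: 80 layers, 8 KV heads, head dim 128, fp16); swap in the real config to get meaningful numbers.

```python
# Rough estimate of KV-cache size and its transfer time over a 10GbE link.
# All model-shape constants below are assumptions for illustration.

N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2            # fp16
LINK_BYTES_PER_S = 10e9 / 8   # 10GbE, ignoring protocol overhead

def kv_cache_bytes(prompt_tokens: int) -> int:
    # K and V tensors per layer: [prompt_tokens, n_kv_heads, head_dim]
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * prompt_tokens

def transfer_delay_s(prompt_tokens: int) -> float:
    return kv_cache_bytes(prompt_tokens) / LINK_BYTES_PER_S

if __name__ == "__main__":
    for tokens in (1_000, 8_000, 32_000):
        print(f"{tokens:>6} prompt tokens: "
              f"{kv_cache_bytes(tokens) / 1e9:.2f} GB KV cache, "
              f"{transfer_delay_s(tokens):.2f} s over 10GbE")
```

With these assumed shapes, an 8K-token prompt produces roughly 2.6 GB of KV cache, or about two seconds on the wire, which is the delay the tweet refers to.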

"LLM inference consists of a prefill and decode stage. Prefill processes the prompt building a KV cache. Its compute-bound so gets faster with more FLOPS. Decode reads the KV cache and generates tokens one by one. Its memory-bound so gets faster with more memory bandwidth"
X Link @exolabs 2025-10-15T18:18Z 37.8K followers, 10.5K engagements
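
A rough sketch of why the two stages hit different limits. All numbers here are assumptions for illustration (a 70B-parameter model with 8-bit weights, 100 TFLOPS peak compute, and the 819 GB/s bandwidth quoted in the thread): prefill cost scales as FLOPs divided by compute throughput, decode cost as bytes read divided by memory bandwidth.

```python
# Back-of-envelope model of compute-bound prefill vs. memory-bound decode.
# Hardware and model figures are assumed placeholders, not measurements.

PARAMS = 70e9          # assumed dense model size
BYTES_PER_PARAM = 1    # assumed 8-bit weights
PEAK_FLOPS = 100e12    # assumed fp16 compute of the prefill device
MEM_BW = 819e9         # bytes/s (M3 Ultra figure from the thread)

def prefill_time_s(prompt_tokens: int) -> float:
    # ~2 FLOPs per parameter per prompt token; limited by compute throughput
    return 2 * PARAMS * prompt_tokens / PEAK_FLOPS

def decode_time_s(new_tokens: int) -> float:
    # every generated token re-reads all weights once; limited by bandwidth
    return new_tokens * PARAMS * BYTES_PER_PARAM / MEM_BW

if __name__ == "__main__":
    print(f"prefill 8k tokens : {prefill_time_s(8_000):.1f} s (compute-bound)")
    print(f"decode 500 tokens : {decode_time_s(500):.1f} s (memory-bound)")
```

The point of the sketch: adding FLOPS only shrinks the first number, adding memory bandwidth only shrinks the second, which is why the two stages favor different hardware.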

"Clustering NVIDIA DGX Spark + M3 Ultra Mac Studio for 4x faster LLM inference. DGX Spark: 128GB @ 273GB/s XXX TFLOPS (fp16) $3999 M3 Ultra: 256GB @ 819GB/s XX TFLOPS (fp16) $5599 The DGX Spark has 3x less memory bandwidth than the M3 Ultra but 4x more FLOPS. By running compute-bound prefill on the DGX Spark memory-bound decode on the M3 Ultra and streaming the KV cache over 10GbE we are able to get the best of both hardware with massive speedups. Short explanation in this thread & link to full blog post below"
X Link @exolabs 2025-10-15T18:17Z 37.8K followers, 301K engagements
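
Putting the pieces together, a hypothetical end-to-end comparison of single-device inference against the split the thread describes: prefill on the DGX Spark, KV cache streamed over 10GbE, decode on the M3 Ultra. The fp16 TFLOPS figures are placeholders (they are masked in the source data); the memory bandwidths and the 10GbE link come from the thread itself.

```python
# Sketch comparing one-device inference vs. disaggregated prefill/decode.
# FLOPS values and model size are assumed placeholders; bandwidths and the
# 10GbE link are taken from the thread.

PARAMS = 70e9
WEIGHT_BYTES = PARAMS * 1                 # assumed 8-bit weights
SPARK = {"flops": 100e12, "bw": 273e9}    # TFLOPS assumed, bandwidth from thread
M3    = {"flops": 26e12,  "bw": 819e9}    # TFLOPS assumed, bandwidth from thread
LINK_BYTES_PER_S = 10e9 / 8

def prefill_s(dev: dict, prompt_tokens: int) -> float:
    return 2 * PARAMS * prompt_tokens / dev["flops"]

def decode_s(dev: dict, new_tokens: int) -> float:
    return new_tokens * WEIGHT_BYTES / dev["bw"]

def kv_transfer_s(prompt_tokens: int, bytes_per_token: int = 327_680) -> float:
    # bytes_per_token is an assumed per-token KV footprint (fp16, 70B-like shape)
    return prompt_tokens * bytes_per_token / LINK_BYTES_PER_S

if __name__ == "__main__":
    prompt, out = 8_000, 500
    spark_only = prefill_s(SPARK, prompt) + decode_s(SPARK, out)
    m3_only    = prefill_s(M3, prompt)    + decode_s(M3, out)
    split      = prefill_s(SPARK, prompt) + kv_transfer_s(prompt) + decode_s(M3, out)
    print(f"DGX Spark only                    : {spark_only:6.1f} s")
    print(f"M3 Ultra only                     : {m3_only:6.1f} s")
    print(f"Prefill on Spark + decode on M3   : {split:6.1f} s")
```

Under these assumed numbers, the split configuration beats either device alone because each stage runs where its bottleneck resource is largest, and the KV-cache transfer adds only a few seconds; the exact speedup depends on the real FLOPS figures and prompt/output lengths.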