
![rohanpaul_ai Avatar](https://lunarcrush.com/gi/w:24/cr:twitter::2588345408.png) Rohan Paul [@rohanpaul_ai](/creator/twitter/rohanpaul_ai) on x 73.7K followers
Created: 2025-07-18 01:27:16 UTC

Multi‑token masks plus gated LoRA cut LLM latency without hurting accuracy, making code output 5× faster.

LLMs can already guess several words ahead; this paper shows how to cash in on that foresight for 5× faster code and math generation with no drop in answer quality.

🚀 What problem are they poking at?

Autoregressive models speak one token at a time, so every extra word forces another full pass through the network. 

That single‑step habit drags out generating code, proofs, or long chat replies. The authors noticed the model’s hidden states quietly predict whole phrases ahead anyway, information that sits unused in the logits.

🧩 Mask tokens pull the future forward

They append k special mask tokens to the prompt, ask the frozen network to fill them, then fine‑tune only small adapters. The trick makes the model treat the masks as “next k words” placeholders instead of blanks, producing k fresh tokens in one shot.
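
A minimal sketch of that single-shot draft step, assuming a Hugging Face-style causal LM with a reserved mask token id; names like `MASK_ID` and `multi_token_step` are illustrative, not from the paper:

```python
import torch

K = 4               # number of appended mask tokens (illustrative choice)
MASK_ID = 128_256   # hypothetical id reserved for the special <mask> token

@torch.no_grad()
def multi_token_step(model, input_ids):
    """Draft K future tokens in one forward pass.

    input_ids: (1, seq_len) prompt or partially generated sequence.
    Returns (1, K) greedy predictions read off the mask positions.
    """
    masks = torch.full((1, K), MASK_ID, dtype=input_ids.dtype, device=input_ids.device)
    extended = torch.cat([input_ids, masks], dim=1)   # prompt + k "next word" placeholders
    logits = model(extended).logits                   # (1, seq_len + K, vocab)
    return logits[:, -K:, :].argmax(dim=-1)           # one token per mask, in one shot
```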

🪟 Gated LoRA keeps the old brain intact

Regular LoRA alters every forward pass and hurts accuracy. Their gated LoRA routes updates only through the masks, leaving standard next‑token paths untouched. A plot on page X shows accuracy staying flat with the gate while standard LoRA drifts downward.
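
A rough picture of the gating, with module and argument names of my own choosing: the low-rank update is multiplied by a 0/1 gate that is 1 only at mask positions, so the frozen next-token path is left mathematically unchanged.

```python
import torch
import torch.nn as nn

class GatedLoRALinear(nn.Module):
    """Frozen linear layer plus a low-rank update that fires only at mask tokens."""

    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # original weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # update starts as a no-op

    def forward(self, x, is_mask):
        # x: (batch, seq, in_features); is_mask: (batch, seq) bool, True at mask tokens
        out = self.base(x)
        delta = self.lora_b(self.lora_a(x))
        gate = is_mask.unsqueeze(-1).to(delta.dtype)   # 1.0 at masks, 0.0 elsewhere
        return out + gate * delta                      # regular positions see frozen weights only
```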

⚡ Sampler head stitches a smooth phrase

Raw multi‑token logits can clash. A tiny 2‑layer MLP looks at the current hidden vector plus the token it just chose, nudging the next pick so the sentence flows. Because the MLP is external, the base model stays frozen and cheap to store.
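
One plausible shape for that head, assuming it sees the backbone’s hidden state plus the embedding of the token it just picked; sizes and activation here are my guesses, not the paper’s.

```python
import torch
import torch.nn as nn

class SamplerHead(nn.Module):
    """Tiny 2-layer MLP that re-scores the next mask prediction so the phrase flows."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size),   # hidden state + previous-token embedding
            nn.SiLU(),
            nn.Linear(hidden_size, vocab_size),
        )

    def forward(self, hidden_state, prev_token_emb):
        # hidden_state, prev_token_emb: (batch, hidden_size)
        # Condition the next pick on what was just chosen, nudging it toward coherence.
        return self.mlp(torch.cat([hidden_state, prev_token_emb], dim=-1))
```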

📈 Speculative decoding without backtracking pain

Linear speculative decoding fails if any predicted word is wrong. They interleave masks between speculative words, a scheme they call quadratic decoding, so at least one new chunk is always verifiable next round. Acceptance rates jump, especially when k ≥ X.
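
In spirit the verification step is the usual speculative-decoding check: keep drafted tokens up to the first one the model itself would not have produced. A simplified greedy-verification sketch (the paper’s quadratic interleaving of masks between speculative words is more elaborate than this):

```python
import torch

@torch.no_grad()
def verify_draft(model, input_ids, draft):
    """Accept the longest prefix of `draft` that matches the model's own greedy choices.

    input_ids: (1, seq_len) committed tokens; draft: (1, K) tokens guessed at the masks.
    """
    candidate = torch.cat([input_ids, draft], dim=1)
    logits = model(candidate).logits
    # The prediction for draft[i] sits at position seq_len - 1 + i of the logits.
    preds = logits[:, input_ids.size(1) - 1 : -1, :].argmax(dim=-1)   # (1, K)
    keep = (preds == draft).int().cumprod(dim=1)    # 1 until the first mismatch, then 0
    n_accept = int(keep.sum())                      # number of drafted tokens that survive
    return torch.cat([input_ids, draft[:, :n_accept]], dim=1), n_accept
```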

🔬 Training - During 50K supervised‑fine‑tune steps:

Cross‑entropy teaches both regular and mask outputs.
A latent consistency loss pulls each mask’s hidden state toward the later true token, so masks imitate real autoregressive states.

Because gradients never touch non‑mask tokens, the base model’s original responses remain stable.
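
A sketch of how those two objectives could combine, assuming `mask_pos` flags the mask positions and `target_hidden` holds hidden states from a true autoregressive pass over the ground-truth tokens; the loss weight and the exact distance are illustrative, not the paper’s.

```python
import torch
import torch.nn.functional as F

def multi_token_loss(logits, labels, mask_hidden, target_hidden, mask_pos, lam=1.0):
    """Cross-entropy on regular and mask outputs plus a latent consistency term.

    logits:        (batch, seq, vocab) model outputs
    labels:        (batch, seq) target token ids
    mask_hidden:   (batch, seq, d) hidden states produced at mask positions
    target_hidden: (batch, seq, d) hidden states from a real autoregressive pass
    mask_pos:      (batch, seq) bool, True where the input token was a mask
    """
    ce = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    # Pull each mask's hidden state toward the state of the true future token,
    # so masks imitate genuine autoregressive states at inference time.
    diff = (mask_hidden - target_hidden)[mask_pos]
    latent = diff.pow(2).mean() if diff.numel() > 0 else logits.new_zeros(())
    return ce + lam * latent
```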

⏩ Speed gains you can measure

With X masks the model averages XXXX tokens per step, and on GSM8K math it rides up to XXXX tokens per step, a direct ≈5× wall‑clock gain. Coding tasks show similar numbers. General chat still lands a neat 2.5×, matching human‑quality scores.

🥡 Bottom line

The paper proves you can graft a tiny mask‑aware head onto an existing 8B model, keep quality, and cut inference time by up to 80%, all with a handful of extra parameters.

![](https://pbs.twimg.com/media/GwGjIHoWsAA20NB.png)

XXXXX engagements

![Engagements Line Chart](https://lunarcrush.com/gi/w:600/p:tweet::1946018904993866009/c:line.svg)

**Related Topics**
[token](/topic/token)
[faster](/topic/faster)
[llm](/topic/llm)

[Post Link](https://x.com/rohanpaul_ai/status/1946018904993866009)
