
![karpathy Avatar](https://lunarcrush.com/gi/w:24/cr:twitter::33836629.png) Andrej Karpathy [@karpathy](/creator/twitter/karpathy) on X · 1.3M followers
Created: 2025-06-20 21:18:41 UTC

Mildly obsessed with what the "highest grade" pretraining data stream looks like for LLM training, if 100% of the focus was on quality, putting aside any quantity considerations. Guessing something like textbook content, in markdown? Or possibly samples from a really giant model? Curious what the most powerful e.g. 1B param model trained on a dataset of 10B tokens looks like, and how far "micromodels" can be pushed.
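
For concreteness, a back-of-the-envelope sketch of the scale in question. Everything below is an illustrative assumption (the architecture shape, vocab size, and the 12·d² per-layer approximation); the post itself only fixes ~1B parameters and a 10B-token dataset.

```python
# Rough parameter count for a GPT-style decoder-only transformer.
# All concrete numbers below are illustrative assumptions, not from the post.

def transformer_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Approximate parameter count: tied embeddings plus attention/MLP blocks."""
    embed = vocab_size * d_model       # token embedding, tied with the output head
    per_layer = 12 * d_model ** 2      # ~4*d^2 for attention + ~8*d^2 for a 4x MLP
    return embed + n_layers * per_layer

# One plausible ~1B-param shape (hypothetical):
n_params = transformer_params(n_layers=24, d_model=2048, vocab_size=50_000)
n_tokens = 10_000_000_000

print(f"params ~ {n_params / 1e9:.2f}B")           # ~1.31B
print(f"tokens/param ~ {n_tokens / n_params:.1f}") # ~7.6
```

At under 8 tokens per parameter, such a run sits well below the ~20 tokens/param of Chinchilla-style compute-optimal training, which is arguably why per-token quality would have to do so much of the work.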

As an example, (text)books are already often included in pretraining data mixtures, but whenever I look closely the data is all messed up: weird formatting, padding, OCR bugs, figure text weirdly interspersed with the main text, etc. The bar is low. I think I've never come across a data stream that felt *perfect* in quality.
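
The defects listed here (OCR noise, layout padding, figure text interleaved with prose) are the kind that cheap line-level heuristics can at least flag. A minimal sketch follows; the regexes and thresholds are illustrative guesses, not taken from any particular cleaning pipeline.

```python
import re

# Heuristic flags for the artifacts described above. Thresholds are
# illustrative assumptions, not tuned values from a real pipeline.

def looks_like_ocr_noise(line: str) -> bool:
    """A high ratio of unusual characters often signals OCR garbage."""
    if not line.strip():
        return False
    junk = sum(1 for c in line
               if not (c.isalnum() or c.isspace() or c in ".,;:!?'\"-()"))
    return junk / len(line) > 0.2

def looks_like_figure_residue(line: str) -> bool:
    """Stray 'Figure 3' / 'Table 2' fragments interrupting body text."""
    return bool(re.match(r"\s*(Figure|Fig\.|Table)\s+\d+", line))

def looks_like_padding(line: str) -> bool:
    """Runs of dots, dashes, or wide whitespace used for visual layout."""
    return bool(re.search(r"(\.{4,}|-{4,}|_{4,}|\s{6,})", line))

def flag_lines(text: str) -> list[tuple[int, str]]:
    """Return (line_number, reason) pairs for suspect lines."""
    flags = []
    for i, line in enumerate(text.splitlines(), 1):
        if looks_like_ocr_noise(line):
            flags.append((i, "ocr-noise"))
        elif looks_like_figure_residue(line):
            flags.append((i, "figure-residue"))
        elif looks_like_padding(line):
            flags.append((i, "padding"))
    return flags
```

Line rules like these only catch the mechanical artifacts; document-level filtering and model-based quality scoring would still be needed for the "perfect" stream the post imagines.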

XXXXXXX engagements

![Engagements Line Chart](https://lunarcrush.com/gi/w:600/p:tweet::1936171874398208202/c:line.svg)

**Related Topics**
[1b](/topic/1b)
[llm](/topic/llm)

[Post Link](https://x.com/karpathy/status/1936171874398208202)
