Andrej Karpathy (@karpathy) on X · 1.3M followers
Created: 2025-06-20 21:18:41 UTC
Mildly obsessed with what the "highest grade" pretraining data stream looks like for LLM training, if 100% of the focus was on quality, putting aside any quantity considerations. Guessing something like textbook content, in markdown? Or possibly samples from a really giant model? Curious what the most powerful e.g. 1B param model trained on a dataset of 10B tokens looks like, and how far "micromodels" can be pushed.
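As a back-of-envelope aside (the ~20 tokens/param reference is from Chinchilla-style scaling results, not from the post): 1B params on 10B tokens is 10 tokens per parameter, about half the compute-optimal ratio, which is part of why per-token quality would have to carry the load in this regime. A minimal sketch of that arithmetic:

```python
# Back-of-envelope for the "1B params on 10B tokens" regime above.
# The ~20 tokens/param reference point comes from Chinchilla-style
# scaling work, not from the post itself.
params = 1e9    # 1B parameter model
tokens = 10e9   # 10B token dataset

print(f"tokens per parameter: {tokens / params:.0f}")  # -> 10

# At ~20 tokens/param, a compute-optimal 1B model would want ~20B tokens,
# so a 10B-token budget leans entirely on data quality to compensate.
print(f"compute-optimal tokens at 20/param: {20 * params:.0e}")  # -> 2e+10
```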
As an example, (text)books are already often included in pretraining data mixtures, but whenever I look closely the data is all messed up - weird formatting, padding, OCR bugs, figure text weirdly interspersed with main text, etc. The bar is low. I think I've never come across a data stream that felt *perfect* in quality.
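For concreteness, here is a minimal sketch of the kind of heuristic filters that would catch the defects named above. The regexes and thresholds are illustrative guesses, not from the post and much cruder than real pipeline rules (e.g. C4/Gopher-style filters):

```python
import re

def looks_clean(text: str) -> bool:
    """Crude screens for the defects named above: OCR debris, padding
    and formatting bugs, and figure text bleeding into prose.
    All patterns and thresholds are illustrative, not tuned values."""
    # OCR artifacts often surface as runs of isolated single characters
    # ("w o r d s"-style letter spacing).
    if re.search(r"(?:\b\w\b ){4,}", text):
        return False
    # Heavy padding or repeated blank lines suggests layout-extraction bugs.
    if re.search(r"[ \t]{4,}|\n{4,}", text):
        return False
    # Figure/table residue interleaved with the main text.
    if re.search(r"^(Figure|Fig\.|Table)\s+\d+", text, re.MULTILINE):
        return False
    # A very low alphabetic ratio usually means garbled extraction.
    alpha = sum(c.isalpha() for c in text)
    if len(text) > 0 and alpha / len(text) < 0.6:
        return False
    return True

print(looks_clean("A clean paragraph of textbook prose about linear algebra."))  # True
print(looks_clean("Figure 3: loss curve\nthe model then converges"))             # False
```

A production filter would combine many more signals (perplexity under a reference model, dedup, language ID), but even screens this simple would flag the formatting and OCR issues the post complains about.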
XXXXXXX engagements
Post link: https://x.com/karpathy/status/1936171874398208202