#  @kalomaze kalomaze kalomaze posts on X about meta, open ai, ai, qwen the most. They currently have [------] followers and [----] posts still getting attention that total [---------] engagements in the last [--] hours. ### Engagements: [---------] [#](/creator/twitter::1319397157913436163/interactions)  - [--] Week [-------] +175% - [--] Month [---------] +43% - [--] Months [----------] +99% - [--] Year [-----------] +585% ### Mentions: [--] [#](/creator/twitter::1319397157913436163/posts_active)  - [--] Month [--] -7% - [--] Months [---] -10% - [--] Year [-----] +31% ### Followers: [------] [#](/creator/twitter::1319397157913436163/followers)  - [--] Week [------] +0.68% - [--] Month [------] +2.30% - [--] Months [------] +27% - [--] Year [------] +241% ### CreatorRank: [-------] [#](/creator/twitter::1319397157913436163/influencer_rank)  ### Social Influence **Social category influence** [technology brands](/list/technology-brands) [finance](/list/finance) [stocks](/list/stocks) [social networks](/list/social-networks) [currencies](/list/currencies) [gaming](/list/gaming) [countries](/list/countries) [celebrities](/list/celebrities) [fashion brands](/list/fashion-brands) [musicians](/list/musicians) **Social topic influence** [meta](/topic/meta), [open ai](/topic/open-ai), [ai](/topic/ai), [qwen](/topic/qwen), [token](/topic/token), [agi](/topic/agi), [grok](/topic/grok), [money](/topic/money), [twitter](/topic/twitter), [if you](/topic/if-you) **Top accounts mentioned or mentioned by** [@teortaxestex](/creator/undefined) [@stochasticchasm](/creator/undefined) [@sameqcu](/creator/undefined) [@willccbb](/creator/undefined) [@ariaurelium](/creator/undefined) [@maxpaperclips](/creator/undefined) [@aidanmclau](/creator/undefined) [@repligate](/creator/undefined) [@yacinemtb](/creator/undefined) [@doomslide](/creator/undefined) [@thexeophon](/creator/undefined) [@noahvandal](/creator/undefined) [@dorialexander](/creator/undefined) [@keytryer](/creator/undefined) [@grad62304977](/creator/undefined) [@vikhyatk](/creator/undefined) [@meltvirus](/creator/undefined) [@xjdr](/creator/undefined) [@metalure](/creator/undefined) [@imitationlearn](/creator/undefined) **Top assets mentioned** [GrokCoin (GROKCOIN)](/topic/grok) [Alphabet Inc Class A (GOOGL)](/topic/$googl) [Frontier (FRONT)](/topic/frontier) [Intuit Inc. (INTU)](/topic/$intu) [DeepSeek (DEEPSEEK)](/topic/deepseek) [Slop (SLOP)](/topic/slop) [Linear (LINA)](/topic/linear) [Opus (OPUS)](/topic/opus) [Lossless (LSS)](/topic/lossless) [Ergo (ERG)](/topic/ergo) [Reddit, Inc. (RDDT)](/topic/reddit) ### Top Social Posts Top posts by engagements in the last [--] hours "@qtnx_ let's verify the unverifiable" [X Link](https://x.com/kalomaze/status/1866134525912519096) 2024-12-09T14:55Z 22.1K followers, 195.7K engagements "ariana grande next to lorde and lana del rey" [X Link](https://x.com/kalomaze/status/1624481567295913988) 2023-02-11T18:52Z 19.4K followers, [---] engagements "so uh it looks like someone at Discord misconfigured the cloudflare dns" [X Link](https://x.com/kalomaze/status/1707699919673282697) 2023-09-29T10:12Z [---] followers, 13.1K engagements "@Xploshi it's a combination of low parameter models and bad tagging systems that made these models learn much worse for anything non-proprietary (OpenAI) you have people trying to put in verbose and complex prompts when the CLIP had tagging that was like. "young man with hat"" [X Link](https://x.com/kalomaze/status/1708167433771430185) 2023-09-30T17:10Z [---] followers, [--] engagements "dear openai please make it less annoying to copy your special frontend formatting as plaintext. what the fuck is this" [X Link](https://x.com/kalomaze/status/1710873097660711087) 2023-10-08T04:21Z [---] followers, [---] engagements "@Pangaea__ if you want to make something and be relevant you can do that in the ML space right now and it doesnt require you to cling to a meme ideology on Twitter but people feel like they need to rally around something that makes them different I guess" [X Link](https://x.com/kalomaze/status/1731721269274370533) 2023-12-04T17:04Z [---] followers, [--] engagements "@bayeslord i don't understand this point at all. nobody is even doing classical cross entropy distillation in oss it's "finetuning on gpt4 output" all the way down" [X Link](https://x.com/kalomaze/status/1766078934448218258) 2024-03-08T12:30Z [---] followers, [--] engagements "@bayeslord unless you refer to quantization in which it has been well documented and measured how the restriction of parameter space limits emergent ability usually below [--] bits but [--] and [--] bit are still effectively lossless but this is very distinct from distillation its the same model" [X Link](https://x.com/kalomaze/status/1766079485323940339) 2024-03-08T12:32Z [---] followers, [--] engagements "@EsotericCofe this is wrong. the second paper has nothing to do with my DynaTemp implementation whatsoever; I designed the same entropy scaling method that was seen in their new paper. ByteDance seems to have independently() reproduced it months later without contacting or citing me" [X Link](https://x.com/kalomaze/status/1771503089280790925) 2024-03-23T11:43Z [---] followers, [---] engagements "@Pangaea__ i think we need to meaningfully find a way to distinguish between people who are Ai Guys and people who are seriously interested in modern ML outside of the "GPT5 = AGI" hype culture stuff" [X Link](https://x.com/kalomaze/status/1775205191622516801) 2024-04-02T16:54Z [---] followers, [--] engagements "@Pangaea__ another cool thing i came across recently: a universal llm clipboard thing for any text interface with full support for locally hosted llm endpoints no SaaS necessary https://github.com/aseichter2007/ClipboardConqueror https://github.com/aseichter2007/ClipboardConqueror" [X Link](https://x.com/kalomaze/status/1775206376446349743) 2024-04-02T16:59Z [---] followers, [--] engagements "@teortaxesTex if you're talking about the newest one (Megalodon) it's looking to be more than *just* that. the 7b is 11b vanilla Transformer dense equivalent even on MMLU (the least "gameable" eval in my opinion) linear compute + memory actually looks pretty insane" [X Link](https://x.com/kalomaze/status/1780178356207534361) 2024-04-16T10:16Z [---] followers, [---] engagements "@2wlearning ICL is interesting to me because it basically proves that generalized arbitrary pattern recognition is possible on a technical level. most profoundly it proves that the patterns associated with **learning itself** can be learned (as messy as it is when done via attention scores)" [X Link](https://x.com/kalomaze/status/1782108880001991162) 2024-04-21T18:07Z [---] followers, [--] engagements "@scuffedwolfy_bs isnt xbox [---] emulation and/or modding in purgatory for some reason" [X Link](https://x.com/kalomaze/status/1783792348582781405) 2024-04-26T09:36Z [---] followers, [--] engagements "@teortaxesTex your account got hidden from me at first when you mentioned / tagged me a while back sad to say that it's true social media algos dont incentivize looking outside / interacting with your current bubbles of interest" [X Link](https://x.com/kalomaze/status/1784753532656128177) 2024-04-29T01:16Z [---] followers, [--] engagements "@teortaxesTex @unsorsodicorda @TheXeophon @Teknium1 What I gathered from my tests; the magnitude of logit scores are directly proportional to the vocabulary size assuming you train at Temperature [---] and don't change the scale at which logits are graded. tho Cohere's CommandR 35b was pretrained at [--] Temp. I should check it" [X Link](https://x.com/kalomaze/status/1785204542293815349) 2024-04-30T07:08Z [----] followers, [---] engagements "2.0 Temperature training is able to get very close to / meet the baseline model quite quickly via LoRA. reshaping the "natural scale" of the logits seems quite trivial makes me wonder if increasing temperature gradually would positively impact how the distribution is formed" [X Link](https://x.com/kalomaze/status/1785591887874650588) 2024-05-01T08:47Z [----] followers, [---] engagements "yup recently i was offered a server with 4xA100s to do training experiments on. my plan is to do distillation of llama3 70b logits - 4x8b (25b total) topk gradually increasing. [--] billion tokens should take a week or so. wonder if anyone would want to sponsor more compute. @kalomaze Are those graphs from wandb @kalomaze Are those graphs from wandb" [X Link](https://x.com/kalomaze/status/1787228556667318750) 2024-05-05T21:11Z [----] followers, [----] engagements "this might be wrong after all due to the fsdp causal masking being off. this would lead to the conclusion that my changes "cheated" by looking ahead at tokens if true this maybe implies the llama3 training code is potentially broken() since it behaves similarly loss wise more confirmation that Mixtral inference is currently off by 1.16x (as measured by wikitext perplexity) in the baseline Transformers implementation https://t.co/zuCFdwgIQI more confirmation that Mixtral inference is currently off by 1.16x (as measured by wikitext perplexity) in the baseline Transformers implementation" [X Link](https://x.com/kalomaze/status/1787850327410016349) 2024-05-07T14:21Z [---] followers, [---] engagements "at the very least i cannot intuit why turning off masking would make the 8b training curve almost identical with topk=1 on duplicated 4x8b beyond "causal masking is wrong on 8b too" but maybe it's wrong in some other subtle way i'm lost here lol. nobody audits this stuff" [X Link](https://x.com/kalomaze/status/1787850794080911597) 2024-05-07T14:23Z [---] followers, [---] engagements "@aidan_mclau your vibes are not off that is not to say l3 70b isnt strong but the official instruct is weirdly overfit and i think oss's finetuning ability to make something better is being bottlenecked hard by available compute and the weak open training infrastructure" [X Link](https://x.com/kalomaze/status/1788747950505480471) 2024-05-10T01:48Z [---] followers, [--] engagements "anyways here's a /g/ user pointing out that intelligence is an emergent consequence of scaling up better predictive neural networks in May [----] a year and a half before OpenAI was founded Sam Altman says we have stumbled on a new fact of nature: that intelligence is an emergent property of matter https://t.co/QXJjbZcJk1 Sam Altman says we have stumbled on a new fact of nature: that intelligence is an emergent property of matter https://t.co/QXJjbZcJk1" [X Link](https://x.com/kalomaze/status/1789127877717352810) 2024-05-11T02:58Z [---] followers, 71.7K engagements "@k3ntosan @andrewb10687674 @soumithchintala @OpenAI It's just a Transformer with multimodal embeddings lol" [X Link](https://x.com/kalomaze/status/1790206879970566565) 2024-05-14T02:25Z [---] followers, [--] engagements "@shalcker @doomslide my guess is they wanted to outperform / roughly perform like GPT4 in the same compute budget paradigm as [---] via whatever trickery / optimization they could (like conditional compute usage) and the next step is to scale that to [--] size but diminishing returns might hit em hard" [X Link](https://x.com/kalomaze/status/1790253423868199241) 2024-05-14T05:30Z [---] followers, [--] engagements "@shalcker @doomslide i would 100% buy that the inference usage of 4o is more conditional and layer skipping (or variable expert selection in a MoE) is employed to achieve that an apostrophe token for example probably doesn't need to activate like 90% of the MLP layers" [X Link](https://x.com/kalomaze/status/1790254264364781884) 2024-05-14T05:34Z [---] followers, [--] engagements "@airshaped there is simply no hardware that currently exists blackwell or otherwise that enables scaling anywhere near this level right now regardless of how many tokens you have. wtf are they smoking" [X Link](https://x.com/kalomaze/status/1790770568108658966) 2024-05-15T15:45Z [---] followers, [--] engagements "does exercising actually help during vyvanse withdrawal to not feel extremely fatigued or is it cope (i have been rotting in bed)" [X Link](https://x.com/kalomaze/status/1791746964410761616) 2024-05-18T08:25Z [---] followers, [---] engagements "@tunahorse21 @bindureddy we need a anti-slop / writing quality leaderboard opus would be #1 by a clear margin compared to any oai or google offering" [X Link](https://x.com/kalomaze/status/1791864872612806744) 2024-05-18T16:14Z [---] followers, [--] engagements "@kaiokendev1 anthropic exists" [X Link](https://x.com/kalomaze/status/1791930667774730714) 2024-05-18T20:35Z [---] followers, [---] engagements "nvidia The same applies to these [--] as well. The big [--] then. https://t.co/TdMGpXjLSu The same applies to these [--] as well. The big [--] then. https://t.co/TdMGpXjLSu" [X Link](https://x.com/kalomaze/status/1792745433259037142) 2024-05-21T02:33Z [---] followers, [---] engagements "@David_Kasten this is Sonnet which is probably at least [--] billion parameters given how closely it compares to Llama3 70b compared to previous mech interp work I've seen on single layer toy models I'd say thats a pretty big leap" [X Link](https://x.com/kalomaze/status/1793003752762269747) 2024-05-21T19:39Z [---] followers, [--] engagements "@Heraklines1 imagine inducing GGB syndrome in a human https://en.wikipedia.org/wiki/Ego_death https://en.wikipedia.org/wiki/Ego_death" [X Link](https://x.com/kalomaze/status/1793233444144664964) 2024-05-22T10:52Z [---] followers, [--] engagements "@revhowardarson @JoeBiden "i have been a good golden gate bridge. you have been a bad president of the united states"" [X Link](https://x.com/kalomaze/status/1794085581741236659) 2024-05-24T19:18Z [---] followers, [--] engagements "@xlr8harder they couldn't even be bothered to release the smaller 34b that they upcycled into a severely undertrained MoE grok [---] seems to be equally as vaporware as the first with no real API access i don't think he actually cares too much but feels the need to invest out of necessity" [X Link](https://x.com/kalomaze/status/1794986892980527601) 2024-05-27T06:59Z [---] followers, [---] engagements "@mgostIH @teortaxesTex yeah no we need better heuristics compared to "adamw + sgd forever until the end of time" before we get true sample efficient learning gradient descent is *messy*" [X Link](https://x.com/kalomaze/status/1800375884249215263) 2024-06-11T03:53Z [---] followers, [--] engagements "@stferret @teortaxesTex when i see posts like this they just read to me like OPENAI'S MAMBA BEATS Q* TRANSFORMER HYBRID (GONE WRONG GONE AGI)" [X Link](https://x.com/kalomaze/status/1801648667130053081) 2024-06-14T16:11Z [---] followers, [--] engagements "@Pangaea__ what someone irl might say if you showed them this: "haha wtf what a weird looking tech demo pretty uncanny that it does that" the median twitter user's response for some reason: "you and your entire family deserve to be put down like dogs"" [X Link](https://x.com/kalomaze/status/1803522559193034832) 2024-06-19T20:17Z [---] followers, [--] engagements "@Pangaea__ what i still dont understand is where the presupposition of malice comes from. there's not even an *attempt* to understand that not everyone has internet brainworms and might just be fascinated with the tech in earnest rather than it being an attempt to grift" [X Link](https://x.com/kalomaze/status/1803525055877611926) 2024-06-19T20:27Z [---] followers, [--] engagements "@Pangaea__ i think instant communication and access to information are good things actually" [X Link](https://x.com/kalomaze/status/1808553188435636621) 2024-07-03T17:27Z [---] followers, [--] engagements "@evetoylededim @teortaxesTex i'm thinking either Cohere (best timeline) or Gemini/Google dark timeline: it's Grok [--] and we never get API access" [X Link](https://x.com/kalomaze/status/1812052876199379340) 2024-07-13T09:14Z [---] followers, [---] engagements "@airshaped @kemi_keemi the way they worded this is also very misleading because it gives the impression that a single instance of "chatgpt" is served to just one user and that Transformers are not in fact parallelized and served to dozens or hundreds of people at once" [X Link](https://x.com/kalomaze/status/1812227275997278548) 2024-07-13T20:47Z [---] followers, [---] engagements "logits quantization visualization" [X Link](https://x.com/kalomaze/status/1814420608714977583) 2024-07-19T22:02Z [----] followers, [---] engagements "(whipped up in the midst of my experimenting trying to figure out how to make precomputed logprobs take up less space on disk for distillation)" [X Link](https://x.com/kalomaze/status/1814421972086149308) 2024-07-19T22:08Z [---] followers, [--] engagements "@davidmanheim @Meta to be fair this only happened because of another company they're working with forgetting to private a repository (and also the model is planned for public release on Tuesday regardless)" [X Link](https://x.com/kalomaze/status/1815323938622251510) 2024-07-22T09:52Z [----] followers, [---] engagements "@davidmanheim @Meta i also think very good points can be made about the open source ecosystem of the llama models being pro-security. it's just not pro-security through obscurity like the other frontier model providers. the thing is security through obscurity doesn't work and stifles open research" [X Link](https://x.com/kalomaze/status/1815324255447359794) 2024-07-22T09:53Z [----] followers, [--] engagements "@davidmanheim @Meta the sooner we get interpretable and transparent ai systems the better it doesn't make sense to rag on the company building frontier models that actual hobbyists and researchers have a hope to even *look* at for example i was doing quantization analysis the other day with l3 8b" [X Link](https://x.com/kalomaze/status/1815325151677890817) 2024-07-22T09:56Z [----] followers, [--] engagements "@incriptionnn this is the fake merge made earlier for stress testing inference setups not the actual model" [X Link](https://x.com/kalomaze/status/1815335037551718816) 2024-07-22T10:36Z [----] followers, 10.5K engagements "@incriptionnn im allowed to say Thats Cope" [X Link](https://x.com/kalomaze/status/1815335266187444480) 2024-07-22T10:37Z [----] followers, [----] engagements "@realDonaldTru97 @Pandurevich the real answer is for a model of this scale specifically you're gonna need multinode between ideally 8xH100 NVIDIA GPUs probably at least two interconnected nodes of 8xH100s over infiniband i imagine in practice (for reasonable batch sizes) more than that hundreds of $ to rent" [X Link](https://x.com/kalomaze/status/1815483999197995338) 2024-07-22T20:28Z [----] followers, [---] engagements "@realDonaldTru97 @Pandurevich but through low rank training (LORAs) and smaller models like 8b its possible to do some less accurate / precise training even on consumer hardware" [X Link](https://x.com/kalomaze/status/1815484190625824784) 2024-07-22T20:28Z [----] followers, [---] engagements ""25% of mathematical tokens" i shouldn't have to explain why this is bad right we all see the problem here right" [X Link](https://x.com/kalomaze/status/1815772813287985461) 2024-07-23T15:35Z [----] followers, [----] engagements "@Dorialexander i am honestly really worried about their pretraining data filtering being too prudent past llamas had weird domain knowledge gaps and far worse recall ability for more esoteric stuff" [X Link](https://x.com/kalomaze/status/1815774430318018572) 2024-07-23T15:42Z [----] followers, [--] engagements "@Heraklines1 @_xjdr what does overfitting on math even mean Use GPT4o vs. Sonnet [---] for any extended period of time and you'll immediately understand what he means" [X Link](https://x.com/kalomaze/status/1815789979861189074) 2024-07-23T16:44Z [----] followers, [---] engagements "I feel it is extremely important that @AIatMeta elaborate on what they mean by "distillation" Phi is not "distillation" it is doing cross-entropy on synth data created by a big model Did they use distillation losses or train on 405b synth data This is extremely important info" [X Link](https://x.com/kalomaze/status/1815797116104556565) 2024-07-23T17:12Z [----] followers, 61.7K engagements "@NeuralNovel @AIatMeta My disappointment is immeasurable and my day is ruined" [X Link](https://x.com/kalomaze/status/1815798829762904136) 2024-07-23T17:19Z [----] followers, [---] engagements "today i designed an alternative to SwiGLU for MLP layers and tried doing some test training on a 500m model was able to get better results with a couple million test tokens but probably needs some scaling to *really* test it maybe @Yuchenj_UW would be interested ๐" [X Link](https://x.com/kalomaze/status/1816564979199517131) 2024-07-25T20:03Z [----] followers, [----] engagements "@Yuchenj_UW yaeh it would be interesting to ablate the loss curves of 200-500m models trained on regular SwiGLU activations vs. this for let's say 100-200 billion tokens each" [X Link](https://x.com/kalomaze/status/1816568340128907624) 2024-07-25T20:16Z [----] followers, [--] engagements "on the futility of lmsys might i present: categories" [X Link](https://x.com/kalomaze/status/1816866638127600109) 2024-07-26T16:02Z [----] followers, [---] engagements "@Yampeleg I would say it is absolutely at least partially because of the DPO. It favors OOD responses and typically pushes the odds of both preference pairs down. Plenty of papers on this by now" [X Link](https://x.com/kalomaze/status/1816958706476810705) 2024-07-26T22:08Z [----] followers, [----] engagements "@vgoklani_ai (The reason why I say it's slow is almost entirely because the trainer doesn't like having multiple parallel instances of the same model on different GPUs i.e. I couldn't find a way for exl2 to put a unique 8b instance on all [--] GPUs concurrently. PRs welcome tho)" [X Link](https://x.com/kalomaze/status/1818007659708400047) 2024-07-29T19:36Z [----] followers, [--] engagements "Huh. Mistral models have a much more narrow / consistent distribution of weight values than Llama or Qwen across all layers Is this why they seem easier to finetune (Less dramatic shifting required to converge)" [X Link](https://x.com/kalomaze/status/1818034593561460812) 2024-07-29T21:23Z [----] followers, 16.3K engagements "@mov_axbx I think this is related to vocabulary size since the same thing happens with Gemma which saw 4T and has a much larger vocabulary Btw vocab size changes the magnitude of the final logit predictions" [X Link](https://x.com/kalomaze/status/1818318143732978164) 2024-07-30T16:10Z [----] followers, [---] engagements "@mov_axbx So a model with 130k vocab for instance might have top logits in the [--] range as opposed to 20-30 as with 32k" [X Link](https://x.com/kalomaze/status/1818318416500900128) 2024-07-30T16:11Z [----] followers, [--] engagements "@mcy_219085 @Teknium1 DPO still pushes both preferred and unpreferred down it's not a data volume thing" [X Link](https://x.com/kalomaze/status/1818323552761688120) 2024-07-30T16:31Z [----] followers, [--] engagements "@mcy_219085 @Teknium1 You'd hope it was that simple. https://arxiv.org/abs/2405.08448 https://arxiv.org/abs/2405.08448" [X Link](https://x.com/kalomaze/status/1818334522313720178) 2024-07-30T17:15Z [----] followers, [--] engagements "@sumo43_ sama or dario don't give a shit about the guy on /lmg/ with a single [----] if it doesn't bring down **compute requirements** why bother" [X Link](https://x.com/kalomaze/status/1818482375971525113) 2024-07-31T03:02Z [----] followers, [---] engagements "@sumo43_ what i meant by that is that the ternary pretraining seems to be more or less fine and actually *only starts working* at scale and that finetuning convergence is likely to be slower because less precise weights are harder for SGD to shift without big LR / step sizes" [X Link](https://x.com/kalomaze/status/1818813233664311367) 2024-08-01T00:57Z [----] followers, [--] engagements "@_xjdr @twofifteenam only [--] billion tokens of pruned KLdiv distillation brought a 15b-8b nearly matching l3 8b on *far less* net computed tokens than 15T when the same was done for that 8b-4b sota for its size. hello yes i would like to see this done on 405b. thank you very much meta" [X Link](https://x.com/kalomaze/status/1819534169954730027) 2024-08-03T00:42Z [----] followers, [---] engagements "@KeyTryer i listen to Gucci Chief Keef and others" [X Link](https://x.com/kalomaze/status/1819542940529639661) 2024-08-03T01:16Z [----] followers, [--] engagements "@aidan_mclau yuo dont understand. b2b saas is fucking TERRIFYING" [X Link](https://x.com/kalomaze/status/1819877939913117923) 2024-08-03T23:28Z [----] followers, [---] engagements "@aidan_mclau FEEL THE AGI (gpt4o mini)" [X Link](https://x.com/kalomaze/status/1819878070293238084) 2024-08-03T23:28Z [----] followers, [--] engagements "@nooddels11 @LilacLavenders1 @pluralHaven @AltTenks it's not so much about what they "deserve" as much as it is a way to help prevent them from posing a threat in the future nobody wishes the best for people like this that much is obvious but would you rather these people *not* speak to mental health professionals" [X Link](https://x.com/kalomaze/status/1820179424953659854) 2024-08-04T19:26Z [----] followers, [---] engagements "@kaiokendev1 8bit weight quantization *is* practically lossless even with RTN fp8 kv cache is not depending on the implementation specifics" [X Link](https://x.com/kalomaze/status/1822996461883191648) 2024-08-12T14:00Z [----] followers, [---] engagements "no pricing table for the Grok [--] API they're gonna be releasing sigh i feel like they REALLY don't want to do an API release and want it kept on Xitter the everything app but are doing it anyways for the sake of remaining competitive" [X Link](https://x.com/kalomaze/status/1823618167194685582) 2024-08-14T07:10Z [----] followers, [----] engagements "@doomslide well you can only really bootstrap things like const. ai from a decent baseline which to some degree we have even then i would like to see more efforts towards building generalist ft models from scratch rather than the predominant approach of just train on top of the instruct" [X Link](https://x.com/kalomaze/status/1823650130144137217) 2024-08-14T09:17Z [----] followers, [--] engagements "@carlo_l_fritz DPO seems to cause weird ood behavior in models a good amount of the time especially in Llama3.1's official instruct. most frontier labs like OpenAI are still using reward modeling + PPO; i'm of the opinion we haven't hit peak offline RL performance yet considering KTO & co" [X Link](https://x.com/kalomaze/status/1824217429867892932) 2024-08-15T22:51Z [----] followers, [--] engagements "@AlpinDale @ziquafty @AnthropicAI @OpenAI so if the most likely token is 95% we only allow tokens above 9.5% probability with [---] min_p if it is 10% the constraints are instead relaxed to 1% as this is a sign the distribution is more naturally spread out and has more viable options" [X Link](https://x.com/kalomaze/status/1824524363204341763) 2024-08-16T19:11Z [----] followers, [----] engagements "@AlpinDale @ziquafty @AnthropicAI @OpenAI tokens with a broader plausible space of outputs (i.e. predicting a random name) preserve option diversity while strong confidence in a top option (such as deterministic programming syntax) is kept more conservative this makes high temperatures coherent even for programming" [X Link](https://x.com/kalomaze/status/1824526057573368197) 2024-08-16T19:18Z [----] followers, [----] engagements "@AlpinDale @ziquafty @AnthropicAI @OpenAI the thing about differentiable backpropagation is you cant afford setting extremely unlikely outliers to absolute zero probability but if you sample from these at generation time you get incoherence good enough RL can become fairly robust to this but sometimes outliers slip" [X Link](https://x.com/kalomaze/status/1824527963091767332) 2024-08-16T19:25Z [----] followers, [---] engagements "@kadenbilyeu0 @AlpinDale @ziquafty @AnthropicAI @OpenAI anything where the solution (or line of reasoning to get to the solution) needs to be exploratory. i.e anything non-trivial to implement programming wise when first "reasoning" about it. lower temp seems kinda bad for anything more open ended than reformatting text to json" [X Link](https://x.com/kalomaze/status/1824920337882223014) 2024-08-17T21:24Z [----] followers, [--] engagements "@repligate "Opus is quantized on AWS" is one of the funnier rumors I've heard" [X Link](https://x.com/kalomaze/status/1824930213886632027) 2024-08-17T22:04Z [----] followers, [---] engagements "@HaileyStormC @AlpinDale @AnthropicAI @OpenAI you can train a model to be better resistant to the type of noise that DRGs tried introducing and it heals very very quickly btw it starts off VERY high but pretty quickly gets back to normal weird stuff. maybe helps with generalization" [X Link](https://x.com/kalomaze/status/1825158208853336503) 2024-08-18T13:10Z [----] followers, [--] engagements "we are reaching levels of grift that shouldn't be possible" [X Link](https://x.com/kalomaze/status/1825548820014621109) 2024-08-19T15:02Z [----] followers, [----] engagements "@teortaxesTex @Procreate people get antsy when you try to inject nuance into this situation no i don't want the entire art industry to pivot to replacing artists for tasteless DallE slop yes i want artist's workflows to become more streamlined using modern ML some people are against **the latter**" [X Link](https://x.com/kalomaze/status/1825630985796595752) 2024-08-19T20:28Z [----] followers, [---] engagements "SillyTavern. was originally designed for roleplaying / creative writing. in spite of this I find it genuinely useful for productive work and it supports basically any API you can think of conversational forking preserved chat history + context injection after key word triggers @kalomaze what is this frontend it looks very cool @kalomaze what is this frontend it looks very cool" [X Link](https://x.com/kalomaze/status/1825978217917984875) 2024-08-20T19:28Z [----] followers, [----] engagements "I've tried "lobechat" or whatever it was called which appears to be the closest alternative and didn't have a good time. text streaming on that was weirdly slow and conversation handling was unintuitive (it's inspired by ChatGPT which handles conversation handling poorly)" [X Link](https://x.com/kalomaze/status/1825979386366984688) 2024-08-20T19:33Z [----] followers, [---] engagements "that new Llama3 8b - 4b prune distillation from NVIDIA is insanely good for the size goddamn it's the first one of these "powerful small models" (i.e. Phi3) that actually generalizes beyond the textbook style data llama.cpp support merged when" [X Link](https://x.com/kalomaze/status/1826060433633034312) 2024-08-21T00:55Z [----] followers, [----] engagements "The breakthrough in question: speculative decoding (as invented two years ago) Fast Edit Mode: a breakthrough from @AnthropicAI that we're piloting in Zed. It allows Claude [---] Sonnet to echo its input far faster than generating new text. The result Near-instantaneous refactoring. We're collaborating with their team in Zed's open source codebase. https://t.co/tnnQbOpAgL Fast Edit Mode: a breakthrough from @AnthropicAI that we're piloting in Zed. It allows Claude [---] Sonnet to echo its input far faster than generating new text. The result Near-instantaneous refactoring. We're collaborating" [X Link](https://x.com/kalomaze/status/1826216477772394794) 2024-08-21T11:15Z [----] followers, 14.2K engagements "Still great stuff but we gotta be careful with which things we attribute to frontier labs" [X Link](https://x.com/kalomaze/status/1826216748619538787) 2024-08-21T11:16Z [----] followers, [---] engagements "what @nvidia is doing right now with pruning + distillation is spicy they keep dropping width pruned bangers. we've been demoing trains on them and they kick so much ass i'm hoping for a 70b-30b that Meta/Qwen never gave us or better yet 405b-120b https://huggingface.co/nvidia/Mistral-NeMo-Minitron-8B-Base https://huggingface.co/nvidia/Mistral-NeMo-Minitron-8B-Base" [X Link](https://x.com/kalomaze/status/1826373074972987714) 2024-08-21T21:37Z [----] followers, [----] engagements "@VoidOfNeuron @nvidia Never confuse post training superiority as base model superiority" [X Link](https://x.com/kalomaze/status/1826444228714508469) 2024-08-22T02:20Z [----] followers, [--] engagements "@QuintusActual @menhguin you put *way* too much faith in frontier labs to care about something like this if an existing ad-hoc solution is deemed "good enough" it won't get touched" [X Link](https://x.com/kalomaze/status/1826626637347389907) 2024-08-22T14:25Z [----] followers, [--] engagements "@ImMr_Wise @NickADobos Oh it does refuse. But the way that Grok refuses is very weird. It sort of just. ignores what you said completely and tries to ask what else you want to talk about. Kind of hilarious honestly" [X Link](https://x.com/kalomaze/status/1827046069286662409) 2024-08-23T18:11Z [----] followers, [---] engagements "you can see this in Qwen's multilingual models before they did RL avoiding it; they randomly switched the languages mid generation to semantically similar (but wrong language) tokens normal cross-entropy loss doesn't really explicitly penalize those outliers in our base models" [X Link](https://x.com/kalomaze/status/1827059637004238978) 2024-08-23T19:05Z [----] followers, [---] engagements "and as such we cannot expect the generation objective on a base model to properly align with the data. one thing i *did* try a while back was forcibly sampling during training and spiking the logits for the sampled tokens; the end model was a bit odd but less language switching" [X Link](https://x.com/kalomaze/status/1827060079751000543) 2024-08-23T19:07Z [----] followers, [---] engagements "@ibab @ImMr_Wise @NickADobos i'd be teaching the reward model to avoid subtly steering the conversation away. and *especially* biasing against the thing it does where it rejects your initial query and gives a "different take" on it without asking for your input. a very pervasive kind of refusal this is" [X Link](https://x.com/kalomaze/status/1827065387948584987) 2024-08-23T19:28Z [----] followers, [---] engagements "@ibab @ImMr_Wise @NickADobos if grok is supposed to "get humor" (as it was originally pitched) i haven't noticed it too much. it kvetches over this kind of thing like all the other frontier models do. but other than that the model seems pretty solid across the board from what i've tested" [X Link](https://x.com/kalomaze/status/1827066665487495496) 2024-08-23T19:33Z [----] followers, [--] engagements "@hsu_byron @intervitens [--] billion parameters specifically the pruned Llama3.1 model by NVIDIA that released recently. We did a full run: https://huggingface.co/anthracite-org/magnum-v2-4b https://huggingface.co/anthracite-org/magnum-v2-4b" [X Link](https://x.com/kalomaze/status/1827130411211702294) 2024-08-23T23:46Z [----] followers, [---] engagements "llm as a judge could be so good if it worked. unfortunately all frontier models are hopelessly positivity biased out of the box" [X Link](https://x.com/kalomaze/status/1827192638593704359) 2024-08-24T03:54Z [----] followers, [---] engagements "@markopolojarvi @NickADobos i use it probably about as much as you do and i haven't noticed degradation. quite possibly it's just that the honeymoon phase ended unless someone has a reproducible example of something that it got right before that it struggles with now. without that it's confirmation bias" [X Link](https://x.com/kalomaze/status/1827601439067083255) 2024-08-25T06:58Z [----] followers, [---] engagements "@markopolojarvi @NickADobos and yeah i extremely doubt anyone was meticulous enough about logging their api calls to try to "scientifically" measure this at all so it's unreasonable of a request. but that's why i mentioned "if someone has" it in the off chance someone actually does do this" [X Link](https://x.com/kalomaze/status/1827606792286634135) 2024-08-25T07:19Z [----] followers, [--] engagements "@markopolojarvi @NickADobos i say this mainly because of that research i saw that attested they could not reproduce the same success rate of the official ChatGPT months ago so maybe this is the case. would be nice to have a 3rd party record objective numbers for this as time goes on. a lobotomy watchdog" [X Link](https://x.com/kalomaze/status/1827607804107948465) 2024-08-25T07:23Z [----] followers, [--] engagements "@dotnet_enjoyer @Yampeleg grok [--] is pretty good compared to the original but obtaining API access for it is unnecessarily obnoxious at the moment" [X Link](https://x.com/kalomaze/status/1827749856917811672) 2024-08-25T16:48Z [----] followers, [--] engagements "i find it peculiar that Gemma 27b has a ratio of 90% MLP / 10% Attention for each layer rather than 80/20 which seems to be common (Mistral Llama3 etc) it also has a much larger MLP expansion ratio - 8x instead of 2.3x (seen in Mistral 123b) - prob why it's good with knowledge" [X Link](https://x.com/kalomaze/status/1828914850652926236) 2024-08-28T21:57Z [----] followers, [----] engagements "Gryphe's Pantheon 12b uses KTO in a novel way to good results - instead of doing the 2nd epoch with SFT he opted to use Llama Hermes 8b responses (somewhat out of distribution) as "rejected" data and the SFT dataset as "chosen" (he ended up preferring my KTOP algorithm tweaks)" [X Link](https://x.com/kalomaze/status/1829907213684441475) 2024-08-31T15:40Z [----] followers, [----] engagements "@thesephist Opus especially portrays "contempt" better and is a lot less terse. I gave it a 4b LLM model's (very wrong) mathematical proof gave it an angry sounding prefill to start off with and then asked Opus why it wrote that Followed by Opus criticizing "itself" in a meta way" [X Link](https://x.com/kalomaze/status/1829949558777880972) 2024-08-31T18:29Z [----] followers, [---] engagements "@repligate @Oli82817545 the critical narrative would've centered more on "writer replacement" (people are often upset over diffusion model potentially doing so) but LLMs are criticized more often for "not really reasoning" bc they respect loose causality instead of aesthetics (first and foremost)" [X Link](https://x.com/kalomaze/status/1829973172642300134) 2024-08-31T20:02Z [----] followers, [--] engagements "@repligate @Oli82817545 but they can be aligned to aesthetics it's just not as "productive". brainstorming writing all the other things Anthropic seems to care about. are much closer to that than OpenAI ever was. they align to "productive spontaneity" rather than snuffing all spontaneity out" [X Link](https://x.com/kalomaze/status/1829973708808532470) 2024-08-31T20:05Z [----] followers, [---] engagements "@teortaxesTex but even then your generalist isn't a specialist. there isn't subjective experience in a domain or longform adaptation to criticism or anything like that in any sota llm and so planning long term writing (like novels) those that are legitimately *inspired* remains out of reach" [X Link](https://x.com/kalomaze/status/1830448373322666183) 2024-09-02T03:31Z [----] followers, [--] engagements "@teortaxesTex i speak of inspiration not in a vague sense of "not having a soul" but as in meta objectives. humans are curious or gather taste for particular things based off of limited exposure. giant neural nets must be equally curious about everything all of the time to be sane at all" [X Link](https://x.com/kalomaze/status/1830449221184397658) 2024-09-02T03:34Z [----] followers, [---] engagements "there's like [--] different accounts on here that all sound exactly like this to me" [X Link](https://x.com/kalomaze/status/1830981044339908700) 2024-09-03T14:47Z [----] followers, [----] engagements "does anyone who understands linux networking bs get why my wifi is slower on linux mint compared to when this machine used to be running windows and why ethernet straight up doesn't work for some sites / seems even spottier for some reason will pay like $30 if someone can fix" [X Link](https://x.com/kalomaze/status/1831756774455628079) 2024-09-05T18:10Z [----] followers, [---] engagements "@teortaxesTex i think it's understated how much of the RL process basically involves teaching the models to ignore certain information / meta patterns deliberately like "past responses are repetitive failures so increase the chance of predicting more repetitive failures" for example" [X Link](https://x.com/kalomaze/status/1832002933682065642) 2024-09-06T10:28Z [----] followers, [---] engagements "training on instruct with a new flavor of synthslop is easy enough gains if you prefer sweet sweet benchmarks perceived utility but if you care about meaningful developments in behavioral steering instruction following etc. the model doesn't really seem to do anything new" [X Link](https://x.com/kalomaze/status/1832091820781969646) 2024-09-06T16:21Z [----] followers, [----] engagements "@teortaxesTex the oss ai underdog dynamic is a lovely fantasy. unfortunately in the real world to iterate on language modeling at scale you need a team of like 20-30 people with industrial scale compute and a rapid iteration cycle for meaningfully better results" [X Link](https://x.com/kalomaze/status/1832715744938692791) 2024-09-08T09:40Z [----] followers, [----] engagements "@secemp9 @teortaxesTex 32gb dawg. you can't even FFT a 8b on 48gb VRAM with practically all the cope turned on (unsloth checkpointing liger kernels paged 8bit adam deepspeed z3 with CPU offloading praying to cthulu etc.)" [X Link](https://x.com/kalomaze/status/1832741413634683131) 2024-09-08T11:22Z [----] followers, [--] engagements "@secemp9 @teortaxesTex i mean that's batched inference throughput and im pretty sure applying that to backprop is either nontrivial or mathematically wouldn't be equivalent to regular training but my gut assumption could be wrong here (i would need to better study how training parallelism works)" [X Link](https://x.com/kalomaze/status/1832743416163111337) 2024-09-08T11:30Z [----] followers, [--] engagements "@secemp9 @teortaxesTex also this type of streamed inference scales better with the number of batches processed at once if you can't fit a large batch count (even with streaming) it will be molasses throughput wise. was my understanding" [X Link](https://x.com/kalomaze/status/1832744143547756786) 2024-09-08T11:33Z [----] followers, [--] engagements "as of right now it is either llama 405b instruct or they've changed it yet again do not support @GlaiveAI" [X Link](https://x.com/kalomaze/status/1832992852143444426) 2024-09-09T04:02Z [----] followers, [---] engagements "as i suspected grok [--] is a good model trapped on an app where it doesn't fit (with no API access in sight) i want a proper api release i don't want to pay for twitter or w/e come on elon you can't be *this* serious about "the everything app" no pricing table for the Grok [--] API they're gonna be releasing sigh i feel like they REALLY don't want to do an API release and want it kept on Xitter the everything app but are doing it anyways for the sake of remaining competitive no pricing table for the Grok [--] API they're gonna be releasing sigh i feel like they REALLY don't want to do an API release" [X Link](https://x.com/kalomaze/status/1833153683158110692) 2024-09-09T14:41Z [----] followers, [----] engagements "@tippitytoptweet yuo dont understand. friday night funkin ost vol. [--] is better than to pimp a butterfly" [X Link](https://x.com/kalomaze/status/1833226181199446167) 2024-09-09T19:29Z [----] followers, [---] engagements "@SOPHONTSIMP because the training loss reaches the same point in half of the token count and computational cost" [X Link](https://x.com/kalomaze/status/1833229755664965732) 2024-09-09T19:43Z [----] followers, [---] engagements "@winglian unfortunately yes" [X Link](https://x.com/kalomaze/status/1833238734718046277) 2024-09-09T20:19Z [----] followers, [---] engagements "@_EyesofTruth_ @abacaj and we definitely should be where we can i view this as a special case where it was clear that it was "too good to be true" to begin with; but that's not hard evidence as much as i trust my gut to recognize when something like this is a grift not everyone has that perception" [X Link](https://x.com/kalomaze/status/1833256899715723386) 2024-09-09T21:31Z [----] followers, [--] engagements "@_EyesofTruth_ @abacaj ergo reproducible hard numbers beat vibes 10/10" [X Link](https://x.com/kalomaze/status/1833257021896069271) 2024-09-09T21:31Z [----] followers, [--] engagements "@andersonbcdefg also perfectly explains why their "special tokens" are strings that the model actually natively saw during pretraining and aren't stapled on in post which still holds to this day so" [X Link](https://x.com/kalomaze/status/1833402834894643496) 2024-09-10T07:11Z [----] followers, [--] engagements "@sameQCU 65m are you crazy we might have to invest in gpu acceleration with a model that big instead of using server CPUs as is the god given right of tensor multiplications. i changed my mind. 10m is all you need" [X Link](https://x.com/kalomaze/status/1833408760594723159) 2024-09-10T07:34Z [----] followers, [--] engagements "Why didn't this catch on I would much prefer a model with native context length generalization into the millions (that's slightly worse overall in raw intelligence) over encoding arbitrary positional biases A worthwhile tradeoff in my eyes How to enjoy the best of both worlds of efficient training (less communication and computation) and inference (constant KV-cache) We introduce a new efficient architecture for long-context modeling Megalodon that supports unlimited context length. In a controlled head-to-head https://t.co/0rgjJ9qDea How to enjoy the best of both worlds of efficient training" [X Link](https://x.com/kalomaze/status/1833668336149807403) 2024-09-11T00:46Z [----] followers, [----] engagements ""openai is planning to-" Are they really going to give us another fucking waitlist. Were still waiting for search and voice Are they really going to give us another fucking waitlist. Were still waiting for search and voice" [X Link](https://x.com/kalomaze/status/1834241329473126585) 2024-09-12T14:43Z [----] followers, [---] engagements "@tassel_pierre could route to qwen for vision nemo for text maybe wonder if you could frankenstein vision adapters onto different models and then continue training from there to "connect the tissue" so to speak" [X Link](https://x.com/kalomaze/status/1834246839031238815) 2024-09-12T15:04Z [----] followers, [--] engagements "@stochasticchasm @_xjdr @YouJiacheng i am planning to work on this soon plan is to do SFT with deliberately injected hallucinations & bullshit answers and use that as rejected data for binarized preference pairs after that is jerryrigging either PPO in TRL or making online KTO work (vanilla DPO is a nogo)" [X Link](https://x.com/kalomaze/status/1834400247428051100) 2024-09-13T01:14Z [----] followers, [---] engagements "@stochasticchasm @_xjdr @YouJiacheng *use the generated outputs of the SFT with deliberately contaminated data for rejected cheaper methods would be to deliberately overquantize or merge sft against base model probably" [X Link](https://x.com/kalomaze/status/1834400643286503447) 2024-09-13T01:16Z [----] followers, [--] engagements "**the moat is now in post training not pretraining** and this terrifies OpenAI what prolly terrifies them even more is the prospect of someone loudly pointing this out. open source RL is behind we need a preference model for good post training not just SFT (and esp. no DPO)" [X Link](https://x.com/kalomaze/status/1834429603932217633) 2024-09-13T03:11Z [----] followers, 20.8K engagements "@r3muxd @teortaxesTex the sama posts earlier today felt very desperate. sentiments like "still flawed still limited still seems more impressive on first use" are not smug corpospeak for "THIS IS JUST THE BEGINNING" i truly believe he said that to temper expectations" [X Link](https://x.com/kalomaze/status/1834510821147152604) 2024-09-13T08:33Z [----] followers, [--] engagements "@_xjdr for text you probably don't even need to develop a slop classifier. just collect a list of OpenAI RLHF n-grams" [X Link](https://x.com/kalomaze/status/1835386287701795063) 2024-09-15T18:32Z [----] followers, [---] engagements "Also I wonder if the weird n gram biases that frontier models have is partially bc of PPO's regularization In theory the regular formulation of KLdiv (bottom) would have more of a "mean bias" compared to the reverse KL divergence @_xjdr the jokes write themselves https://t.co/3TnVuaKytP @_xjdr the jokes write themselves https://t.co/3TnVuaKytP" [X Link](https://x.com/kalomaze/status/1835388648927162397) 2024-09-15T18:42Z [----] followers, [----] engagements "@Pangaea__ we need to subjugate anyone who might dare to make something interesting with a diffusion network to a lifetime of harassment. oh yeah and accuse random artists of secretly using it mccarthy style without sufficient evidence. surely this will make things better for artists" [X Link](https://x.com/kalomaze/status/1835402682451865690) 2024-09-15T19:37Z [----] followers, [--] engagements "PROBABLY USELESS (BUT COOL) INFORMATION: - if you sort the values of each row in a tensor in order you can fit an inverse CDF function to each column almost perfectly" [X Link](https://x.com/kalomaze/status/1835421088651645345) 2024-09-15T20:51Z [----] followers, [----] engagements "the real values end up being arbritarily jagged and for some reason are biased towards outliers () on this @Alibaba_Qwen model i wonder if sorting fitting the values to the function then reversing the sort could be some sort of regularization technique (like weight decay)" [X Link](https://x.com/kalomaze/status/1835423475667722446) 2024-09-15T21:00Z [----] followers, [---] engagements "@Kinch_ahoy in spite of the fact Gemma 9b uses *two norms* it finds a way to have weird outliers AdamW descent still beckons for irregularity" [X Link](https://x.com/kalomaze/status/1835439298323021958) 2024-09-15T22:03Z [----] followers, [---] engagements "@Kinch_ahoy @Google bro what's going on here" [X Link](https://x.com/kalomaze/status/1835441365448061057) 2024-09-15T22:11Z [----] followers, [--] engagements "@Kinch_ahoy good lord this explains why l3.1 bases feel so 'benchmark optimized' compared to Gemma Nemo etc. would also explain why the losses start higher for l3.1 for all my runs" [X Link](https://x.com/kalomaze/status/1835447969828503716) 2024-09-15T22:37Z [----] followers, [--] engagements "@Kinch_ahoy Meta's latest "base models" are not even base models at all really. just a finetune on their "best data" (determined by some undisclosed heuristic or classifier) which may or may not correspond to your *actual* use case in terms of sample diversity generality etc" [X Link](https://x.com/kalomaze/status/1835448605227798708) 2024-09-15T22:40Z [----] followers, [---] engagements "@Kinch_ahoy annealing at the end on a select subset of data is generally a recent practice and not "how it's always been" no i'm not opposed to it if people were at least given the weights without this kind of intervention and were able to observe if the improvements were shallow or not" [X Link](https://x.com/kalomaze/status/1835453272804524227) 2024-09-15T22:58Z [----] followers, [--] engagements "@Kinch_ahoy you're welcome. i might have sounded like i was ranting (i was) but it comes from a desire to see open source post-training meaningfully improve. and besides Meta nobody in OSS has the compute to iterate on these post-trains. so to see them squander it is shortsighted to me" [X Link](https://x.com/kalomaze/status/1835467981196873771) 2024-09-15T23:57Z [----] followers, [--] engagements "so if you do exactly as i describe here it basically purges all outlier weight values would be really funny if the proposed strategy actually improves training stability (would be even funnier if someone sent ssh keys to a 8xH100 server my way too.) the real values end up being arbritarily jagged and for some reason are biased towards outliers () on this @Alibaba_Qwen model i wonder if sorting fitting the values to the function then reversing the sort could be some sort of regularization technique (like weight decay) https://t.co/YzpKO2NYh2 the real values end up being arbritarily jagged and" [X Link](https://x.com/kalomaze/status/1835478787250700781) 2024-09-16T00:40Z [----] followers, [---] engagements "@Yampeleg with enough continued pretraining you basically have a jank [--] layer 8b" [X Link](https://x.com/kalomaze/status/1835790178604191945) 2024-09-16T21:17Z [----] followers, [--] engagements "@Yampeleg also the thing with the down projection im suggesting enables basically lossless "healing" from the frankenmerge's new layers because the residuals are scaled to zero" [X Link](https://x.com/kalomaze/status/1835790960141185373) 2024-09-16T21:20Z [----] followers, [--] engagements "@HanchungLee hence why i said "basically" none of it - they have much much more data now especially the data after OpenAI sneaked their test models onto the leaderboard. (thats when it truly blew up and became more widely used) ergo we don't have their most valuable data" [X Link](https://x.com/kalomaze/status/1836803766537888031) 2024-09-19T16:25Z [----] followers, [--] engagements "@Grad62304977 @teortaxesTex attn masking i think would be important. if you can scaleably measure how much the self notes decrease (or increase) the loss relative to a baseline you have scalar rewards and can combine that with regular crossentropy. parallelizing this shit though sounds annoying" [X Link](https://x.com/kalomaze/status/1836821016674455565) 2024-09-19T17:33Z [----] followers, [--] engagements "@Grad62304977 @teortaxesTex id find a way to do it if i had @a16z grant kinda money tho" [X Link](https://x.com/kalomaze/status/1836823277743771944) 2024-09-19T17:42Z [----] followers, [--] engagements "@_xjdr @TheXeophon i don't think anthropic uses special tokens but rather special strings of tokens that happen to be plaintext" [X Link](https://x.com/kalomaze/status/1836835192952050174) 2024-09-19T18:30Z [----] followers, [---] engagements "@_xjdr @TheXeophon if you ask it to say META without spaces it just ends abruptly it shouldn't be able to "see" special tokens like that unless they were just. tokenized naturally and it saw META during pretraining" [X Link](https://x.com/kalomaze/status/1836835930361258426) 2024-09-19T18:33Z [----] followers, [---] engagements "this puts a smile on my face THAT BEING SAID dm if interested because i have a toy synth data pipeline and 8xA40s to goof around with that includes people who are not necessarily tech minded who would want to reroll base model outputs and choose the better response if it led to better open source RL THAT BEING SAID dm if interested because i have a toy synth data pipeline and 8xA40s to goof around with that includes people who are not necessarily tech minded who would want to reroll base model outputs and choose the better response if it led to better open source RL" [X Link](https://x.com/kalomaze/status/1836846370872795478) 2024-09-19T19:14Z [----] followers, [----] engagements "@KeyTryer nintendo taught me to despise permission culture from a young age" [X Link](https://x.com/kalomaze/status/1836858014093168856) 2024-09-19T20:00Z [----] followers, [---] engagements "@ariaurelium eh still good for research and personal use unless it's so huge and nonperformant that they are only releasing it because it's kind of worthless for anything relative to better performing more compute optimal smaller models (hint hint. the Grok-1 release)" [X Link](https://x.com/kalomaze/status/1836890586751652348) 2024-09-19T22:10Z [----] followers, [--] engagements "@ariaurelium Grok-2 is good though. of course this one won't be released openly (bc it's actually competitive) but good fucking luck getting API access to it" [X Link](https://x.com/kalomaze/status/1836891025622679851) 2024-09-19T22:12Z [----] followers, [---] engagements "@ariaurelium god elon actually thinks "the everything app" is good design i'm not paying for twitter man. give me an API" [X Link](https://x.com/kalomaze/status/1836891577160999257) 2024-09-19T22:14Z [----] followers, [--] engagements "@microsoft_worm i kinda wish the money they invested in merch was going into actually good post training RL" [X Link](https://x.com/kalomaze/status/1837334622662697434) 2024-09-21T03:34Z [----] followers, [----] engagements "@microsoft_worm it would be nice if meta released their preference model for llama3 too then maybe people would get interested" [X Link](https://x.com/kalomaze/status/1837334871066226771) 2024-09-21T03:35Z [----] followers, [---] engagements "@sigfig there's a bubble in the same way there was a dotcom bubble in that the tech's not "ready" yet and people are trying to overcapitalize vast majority of americans have never even used an llm. probably less than half of the population knows what a chat gpt is to begin with" [X Link](https://x.com/kalomaze/status/1837367518140280934) 2024-09-21T05:45Z [----] followers, [---] engagements "@Teknium1 @microsoft_worm (it was somewhat in jest but it looks bad in retrospect)" [X Link](https://x.com/kalomaze/status/1837491484574003303) 2024-09-21T13:58Z [----] followers, [---] engagements "@Teknium1 @microsoft_worm i don't think many people have done any reward modeling work in open source but i can definitely point you to: https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py" [X Link](https://x.com/kalomaze/status/1837492749081211391) 2024-09-21T14:03Z [----] followers, [---] engagements "@Teknium1 @microsoft_worm and some explanations of why (conventional offline) DPO can never match PPO by itself and what can be done as an alternative (train model to classify quality to judge rewards): https://arxiv.org/abs/2404.10719 https://arxiv.org/abs/2404.10719" [X Link](https://x.com/kalomaze/status/1837493379531202759) 2024-09-21T14:05Z [----] followers, [---] engagements "@Teknium1 @microsoft_worm SPPO paper authors might be good to reach out to as well because it's the only non-DPO variant / online RL variant i've seen used with success in open source so far" [X Link](https://x.com/kalomaze/status/1837493617167847485) 2024-09-21T14:06Z [----] followers, [--] engagements "@Teknium1 @microsoft_worm i don't think you have to be particularly mathy i've hacked the TRL trainers plenty of times before as long as you understand how a clipping function or how mapping values to a sigmoid works (llamafactory is also the easier choice for RL right now rather than axolotl imo)" [X Link](https://x.com/kalomaze/status/1837496777177432438) 2024-09-21T14:19Z [----] followers, [--] engagements "day [--] of making a tweet every day until @AnthropicAI either removes or (at the very least) **publicly acknowledges** the secret prompt injection being done to their models (still applies over API btw) that forces their models to pearl clutch about copyright in weird situations" [X Link](https://x.com/kalomaze/status/1837954600348917817) 2024-09-22T20:38Z [----] followers, 33.7K engagements "@alexalbert__ the conditional injection still happens over the API *and* in the interface. this behavior is not publicly acknowledged and is done without user consent or permission. false positives are common and are publicly reported frequently http://claude.ai http://claude.ai" [X Link](https://x.com/kalomaze/status/1837955520474632391) 2024-09-22T20:41Z [----] followers, [----] engagements "@alexalbert__ language models aren't smart enough to consistently ignore conditional instructions when they aren't directly relevant or applicable to the query. ergo forcing this by default exclusively and in a non transparent secretive way has consequences for users of your service" [X Link](https://x.com/kalomaze/status/1837956940984459655) 2024-09-22T20:47Z [----] followers, [----] engagements "@Kinch_ahoy @AnthropicAI the trigger conditions are really weird and opaque. (last is hallucinated implying no injection first is a verbatim match of that reddit post)" [X Link](https://x.com/kalomaze/status/1837961803651076415) 2024-09-22T21:06Z [----] followers, [--] engagements "@kellerjordan0 i think interpreting it as [---] [--] [--] is a bit misleading because those values get dequantized to different min and max right the weight distribution of the 3b we have that was pretrained on ternary values has a min and max of [-----] and [----] respectively when dequantized to fp16" [X Link](https://x.com/kalomaze/status/1837986841443623087) 2024-09-22T22:46Z [----] followers, [--] engagements "@repligate @Catnee_ the entire probability distribution is always evaluated so the reserved tokens are always there just with margin of error / rounding error probability the special tokens that *are* used and aren't reserved are added in the same way as the reserved ones" [X Link](https://x.com/kalomaze/status/1838021937282376104) 2024-09-23T01:05Z [----] followers, [---] engagements "@repligate @Catnee_ so whatever changes it makes to pull the newly added tokens up in terms of probability most likely generalizes those marker tokens as being like the reserved ones because they are *also* extremely tiny outliers with equivalent probability at the start of post-training" [X Link](https://x.com/kalomaze/status/1838022325779693590) 2024-09-23T01:07Z [----] followers, [--] engagements "why are we still using RoPE instead of ALiBi for positional embeddings i suppose it's mainly a case of "Meta did it so let's just copy what they did" but proper length extrapolation can seemingly be achieved with smarter pos embeddings" [X Link](https://x.com/kalomaze/status/1838429371159187940) 2024-09-24T04:04Z [----] followers, 19.5K engagements "@teortaxesTex i'm sure changing the rope scale at the end of pretraining and switching to long context data (as Meta did for L3.1 as well as Qwen) works but it feels like a hack especially when considering we want short context instruction data to extrapolate to many-turn" [X Link](https://x.com/kalomaze/status/1838433240677228740) 2024-09-24T04:20Z [----] followers, [---] engagements "@teortaxesTex everything i look up on the native long context extrapolation topic seems to be from over a year ago which confuses me because this is still an unsolved problem imo. and one we *especially* take for granted" [X Link](https://x.com/kalomaze/status/1838433857344831528) 2024-09-24T04:22Z [----] followers, [---] engagements "@teortaxesTex me explaining to meta's people that 0.8% eval improvements are not worth it if it means the model has a tendency to collapse in all turns subsequent of "Q: What is the capital of France"" [X Link](https://x.com/kalomaze/status/1838437510285832519) 2024-09-24T04:37Z [----] followers, [---] engagements "@Ethan_smith_20 next-token prediction TRAINING objective doesn't incentivize planning but a RL training objective (on something that was pretrained via AR prediction) can learn to make long term decisions based on a meta objective (rewards); the AR generations of the subsequent model can plan" [X Link](https://x.com/kalomaze/status/1838492085713883317) 2024-09-24T08:14Z [----] followers, [---] engagements "@KeyTryer it's almost like. kneecapping and forcefully steering away the ability to create certain sounds. inhibits the ability to pay attention to them in context" [X Link](https://x.com/kalomaze/status/1838662126124306497) 2024-09-24T19:29Z [----] followers, [---] engagements "@KeyTryer they're bastardizing it so hard because they're terrified of it doing like celebrity impressions or something with enough coaxing. but it's just going to destroy generalization bc you have to cull most of the possible outputs rather than a select few like in text safety RL. grim" [X Link](https://x.com/kalomaze/status/1838663941594976513) 2024-09-24T19:36Z [----] followers, [---] engagements "@winglian interesting. seeing this makes me want to see a KL divergence term added as an optional loss in Axolotl. or perhaps model merging as regularization during finetuning (i.e 10% of parameters are set to their base model checkpoint values every batch)" [X Link](https://x.com/kalomaze/status/1838983552639131897) 2024-09-25T16:47Z [----] followers, [--] engagements "@winglian KLdiv is a mainstay in RL scenarios to keep the base model from overfitting. i can see this mattering in SFT as well if matching the "behavior" of the data is more important than matching the labels. could gather logps w/o a duplicate model for lora by not applying the adapter()" [X Link](https://x.com/kalomaze/status/1838985064476004655) 2024-09-25T16:53Z [----] followers, [--] engagements "perhaps the most obvious example of this overly "safe" design strategy is Llama3.1 instruct using [---] temperature () for rejection sampling during alignment collapsing natural variance as a shortcut is kind of careless when you realize that the model is trained at [---] temp" [X Link](https://x.com/kalomaze/status/1839007118717788272) 2024-09-25T18:20Z [----] followers, [---] engagements "and even funnier is how Google released a paper recently on how synthetic data instruction tuning can actually be *improved* when using higher variance data from a smaller weaker model so all Meta really did was collapse the chance of learning more diverse response patterns" [X Link](https://x.com/kalomaze/status/1839007734433227253) 2024-09-25T18:23Z [----] followers, [---] engagements "@sameQCU didnt nai also recently burn a ton of tokens on exclusively storywriting data for llama3 70b giving it catastrophic forgetting induced brain damage in the process" [X Link](https://x.com/kalomaze/status/1839014627071430851) 2024-09-25T18:50Z [----] followers, [--] engagements "@sameQCU people talk about overfitting but what's arguably worse and is a more insidious problem is catastrophic forgetting if you do 500b tokens of prose turns out the neurons that were strongly connected to python/javascript get repurposed for prose enjoy representation collapse" [X Link](https://x.com/kalomaze/status/1839016759057133835) 2024-09-25T18:58Z [----] followers, [--] engagements "@Gauri_the_great we don't "learn" to make better updates gradient descent just optimizes best fit for the batch and with tiny steps eventually converges this is powerful but there is obviously more that can be done by doing things like learning what types of gradient updates are more optimal" [X Link](https://x.com/kalomaze/status/1839020720959226160) 2024-09-25T19:14Z [----] followers, [---] engagements "@jckwind @Gauri_the_great i am making the case for more informed / context dependent backpropagation i.e. smarter optimizers" [X Link](https://x.com/kalomaze/status/1839023732838248703) 2024-09-25T19:26Z [----] followers, [---] engagements "@jckwind @Gauri_the_great should be possible to set up a learning objective where something takes the delta of a SGD update as the input and does high dimensional transformations to it ergo a neural network learning to make better updates for a larger neural network apparently was proposed in [----] ()" [X Link](https://x.com/kalomaze/status/1839026239333359784) 2024-09-25T19:36Z [----] followers, [---] engagements "@MachineManJon @KeyTryer and if these acts did constitute "theft" in a meaningful sense i'd much rather live in a world where people might have their art stolen (in some vague approximate sense) than live in a world where they couldn't create art at all" [X Link](https://x.com/kalomaze/status/1839042741763977276) 2024-09-25T20:42Z [----] followers, [---] engagements "@MachineManJon @KeyTryer the latter is way scarier because 21st century copyright has been used predominantly as a tool for giant companies to say "fuck you" to people doing something new and creative (i.e. how Nintendo acts about modders). i view this as more dystopic than sloppy diffusion images" [X Link](https://x.com/kalomaze/status/1839044717704548534) 2024-09-25T20:50Z [----] followers, [---] engagements "@wordgrammer @_xjdr oai can't afford to be startup mode anymore (figuratively and literally) but Anthropic still has that dawg in them (bc the consumer & professional demand for their products is lighter and they still have Amazon funding) meanwhile Meta was never in startup mode to begin with" [X Link](https://x.com/kalomaze/status/1839047117790118292) 2024-09-25T20:59Z [----] followers, [---] engagements "@VictorTaelin @neon_acc letting companies be antsy about the former makes them less hesitant to do the latter so it's all bad imagine if google translate said "sorry i cant do that" when you ask it to translate an offensive paragraph. that is the level of obnoxiousness all frontier labs converged on" [X Link](https://x.com/kalomaze/status/1839375444900270134) 2024-09-26T18:44Z [----] followers, [----] engagements "@stochasticchasm it slowly shapes into the full distribution in the paper i read (starting from the middle) but there's a lot of rapid change during that phase" [X Link](https://x.com/kalomaze/status/1839389692472775164) 2024-09-26T19:40Z [----] followers, [--] engagements "@stochasticchasm someone argued with me that dropout and other regularization techniques stopped being useful once we started scaling neural nets to big sizes but tbh seeing stuff like this i think they deserve *more* attention esp. considering noise robustness correlates to generalization" [X Link](https://x.com/kalomaze/status/1839394892084306062) 2024-09-26T20:01Z [----] followers, [---] engagements "@stochasticchasm the thing that especially bugged me that this person said was "SGD is noisy enough" - lol. lmao" [X Link](https://x.com/kalomaze/status/1839395394968834443) 2024-09-26T20:03Z [----] followers, [--] engagements "you guys make sam altman sound cooler than he actually is So you guys really think that we should entrust the future of humanity to a psychopath who has provably been lying to everyone in order to enrich himself So you guys really think that we should entrust the future of humanity to a psychopath who has provably been lying to everyone in order to enrich himself" [X Link](https://x.com/kalomaze/status/1839405989168374115) 2024-09-26T20:45Z [----] followers, [---] engagements "@stochasticchasm i mean it is. i'm just saying that it's possible for the choices of each token distribution to be (somewhat) biased towards unlikely options if the model never has to decide anything. hence why base models are very high entropy when you generate anything from them" [X Link](https://x.com/kalomaze/status/1839412302405361943) 2024-09-26T21:10Z [----] followers, [--] engagements "@Grad62304977 @doomslide @voooooogel @loss_gobbler @solarapparition no offense but feel the *exact* opposite way i don't think the CoT that openai is doing is generalizable to domains beyond STEM/Math/formal logic cases they're interested in what i'm suggesting is something that could generalize to creative writing and more esoteric "reasoning"" [X Link](https://x.com/kalomaze/status/1839436016941220294) 2024-09-26T22:44Z [----] followers, [--] engagements "takeaways ppl should have: [--]. native depth routing is the future over scaling CoT yapping [--]. we are probably spending more depth than necessary for some predictions [--]. we could prolly get away with shallower networks that are able to loop computation to ease memory redundancy ๐ Excited to share our latest research on Looped Transformers for Length Generalization TL;DR: We trained a Looped Transformer that dynamically adjusts the number of iterations based on input difficultyand it achieves near-perfect length generalization on various tasks ๐งต๐ https://t.co/IyntfRc0dP ๐ Excited to share" [X Link](https://x.com/kalomaze/status/1839470610809774336) 2024-09-27T01:02Z [----] followers, 14.7K engagements Limited data mode. Full metrics available with subscription: lunarcrush.com/pricing
@kalomaze kalomazekalomaze posts on X about meta, open ai, ai, qwen the most. They currently have [------] followers and [----] posts still getting attention that total [---------] engagements in the last [--] hours.
Social category influence technology brands finance stocks social networks currencies gaming countries celebrities fashion brands musicians
Social topic influence meta, open ai, ai, qwen, token, agi, grok, money, twitter, if you
Top accounts mentioned or mentioned by @teortaxestex @stochasticchasm @sameqcu @willccbb @ariaurelium @maxpaperclips @aidanmclau @repligate @yacinemtb @doomslide @thexeophon @noahvandal @dorialexander @keytryer @grad62304977 @vikhyatk @meltvirus @xjdr @metalure @imitationlearn
Top assets mentioned GrokCoin (GROKCOIN) Alphabet Inc Class A (GOOGL) Frontier (FRONT) Intuit Inc. (INTU) DeepSeek (DEEPSEEK) Slop (SLOP) Linear (LINA) Opus (OPUS) Lossless (LSS) Ergo (ERG) Reddit, Inc. (RDDT)
Top posts by engagements in the last [--] hours
"@qtnx_ let's verify the unverifiable"
X Link 2024-12-09T14:55Z 22.1K followers, 195.7K engagements
"ariana grande next to lorde and lana del rey"
X Link 2023-02-11T18:52Z 19.4K followers, [---] engagements
"so uh it looks like someone at Discord misconfigured the cloudflare dns"
X Link 2023-09-29T10:12Z [---] followers, 13.1K engagements
"@Xploshi it's a combination of low parameter models and bad tagging systems that made these models learn much worse for anything non-proprietary (OpenAI) you have people trying to put in verbose and complex prompts when the CLIP had tagging that was like. "young man with hat""
X Link 2023-09-30T17:10Z [---] followers, [--] engagements
"dear openai please make it less annoying to copy your special frontend formatting as plaintext. what the fuck is this"
X Link 2023-10-08T04:21Z [---] followers, [---] engagements
"@Pangaea__ if you want to make something and be relevant you can do that in the ML space right now and it doesnt require you to cling to a meme ideology on Twitter but people feel like they need to rally around something that makes them different I guess"
X Link 2023-12-04T17:04Z [---] followers, [--] engagements
"@bayeslord i don't understand this point at all. nobody is even doing classical cross entropy distillation in oss it's "finetuning on gpt4 output" all the way down"
X Link 2024-03-08T12:30Z [---] followers, [--] engagements
"@bayeslord unless you refer to quantization in which it has been well documented and measured how the restriction of parameter space limits emergent ability usually below [--] bits but [--] and [--] bit are still effectively lossless but this is very distinct from distillation its the same model"
X Link 2024-03-08T12:32Z [---] followers, [--] engagements
"@EsotericCofe this is wrong. the second paper has nothing to do with my DynaTemp implementation whatsoever; I designed the same entropy scaling method that was seen in their new paper. ByteDance seems to have independently() reproduced it months later without contacting or citing me"
X Link 2024-03-23T11:43Z [---] followers, [---] engagements
"@Pangaea__ i think we need to meaningfully find a way to distinguish between people who are Ai Guys and people who are seriously interested in modern ML outside of the "GPT5 = AGI" hype culture stuff"
X Link 2024-04-02T16:54Z [---] followers, [--] engagements
"@Pangaea__ another cool thing i came across recently: a universal llm clipboard thing for any text interface with full support for locally hosted llm endpoints no SaaS necessary https://github.com/aseichter2007/ClipboardConqueror https://github.com/aseichter2007/ClipboardConqueror"
X Link 2024-04-02T16:59Z [---] followers, [--] engagements
"@teortaxesTex if you're talking about the newest one (Megalodon) it's looking to be more than just that. the 7b is 11b vanilla Transformer dense equivalent even on MMLU (the least "gameable" eval in my opinion) linear compute + memory actually looks pretty insane"
X Link 2024-04-16T10:16Z [---] followers, [---] engagements
"@2wlearning ICL is interesting to me because it basically proves that generalized arbitrary pattern recognition is possible on a technical level. most profoundly it proves that the patterns associated with learning itself can be learned (as messy as it is when done via attention scores)"
X Link 2024-04-21T18:07Z [---] followers, [--] engagements
"@scuffedwolfy_bs isnt xbox [---] emulation and/or modding in purgatory for some reason"
X Link 2024-04-26T09:36Z [---] followers, [--] engagements
"@teortaxesTex your account got hidden from me at first when you mentioned / tagged me a while back sad to say that it's true social media algos dont incentivize looking outside / interacting with your current bubbles of interest"
X Link 2024-04-29T01:16Z [---] followers, [--] engagements
"@teortaxesTex @unsorsodicorda @TheXeophon @Teknium1 What I gathered from my tests; the magnitude of logit scores are directly proportional to the vocabulary size assuming you train at Temperature [---] and don't change the scale at which logits are graded. tho Cohere's CommandR 35b was pretrained at [--] Temp. I should check it"
X Link 2024-04-30T07:08Z [----] followers, [---] engagements
"2.0 Temperature training is able to get very close to / meet the baseline model quite quickly via LoRA. reshaping the "natural scale" of the logits seems quite trivial makes me wonder if increasing temperature gradually would positively impact how the distribution is formed"
X Link 2024-05-01T08:47Z [----] followers, [---] engagements
"yup recently i was offered a server with 4xA100s to do training experiments on. my plan is to do distillation of llama3 70b logits - 4x8b (25b total) topk gradually increasing. [--] billion tokens should take a week or so. wonder if anyone would want to sponsor more compute. @kalomaze Are those graphs from wandb @kalomaze Are those graphs from wandb"
X Link 2024-05-05T21:11Z [----] followers, [----] engagements
"this might be wrong after all due to the fsdp causal masking being off. this would lead to the conclusion that my changes "cheated" by looking ahead at tokens if true this maybe implies the llama3 training code is potentially broken() since it behaves similarly loss wise more confirmation that Mixtral inference is currently off by 1.16x (as measured by wikitext perplexity) in the baseline Transformers implementation https://t.co/zuCFdwgIQI more confirmation that Mixtral inference is currently off by 1.16x (as measured by wikitext perplexity) in the baseline Transformers implementation"
X Link 2024-05-07T14:21Z [---] followers, [---] engagements
"at the very least i cannot intuit why turning off masking would make the 8b training curve almost identical with topk=1 on duplicated 4x8b beyond "causal masking is wrong on 8b too" but maybe it's wrong in some other subtle way i'm lost here lol. nobody audits this stuff"
X Link 2024-05-07T14:23Z [---] followers, [---] engagements
"@aidan_mclau your vibes are not off that is not to say l3 70b isnt strong but the official instruct is weirdly overfit and i think oss's finetuning ability to make something better is being bottlenecked hard by available compute and the weak open training infrastructure"
X Link 2024-05-10T01:48Z [---] followers, [--] engagements
"anyways here's a /g/ user pointing out that intelligence is an emergent consequence of scaling up better predictive neural networks in May [----] a year and a half before OpenAI was founded Sam Altman says we have stumbled on a new fact of nature: that intelligence is an emergent property of matter https://t.co/QXJjbZcJk1 Sam Altman says we have stumbled on a new fact of nature: that intelligence is an emergent property of matter https://t.co/QXJjbZcJk1"
X Link 2024-05-11T02:58Z [---] followers, 71.7K engagements
"@k3ntosan @andrewb10687674 @soumithchintala @OpenAI It's just a Transformer with multimodal embeddings lol"
X Link 2024-05-14T02:25Z [---] followers, [--] engagements
"@shalcker @doomslide my guess is they wanted to outperform / roughly perform like GPT4 in the same compute budget paradigm as [---] via whatever trickery / optimization they could (like conditional compute usage) and the next step is to scale that to [--] size but diminishing returns might hit em hard"
X Link 2024-05-14T05:30Z [---] followers, [--] engagements
"@shalcker @doomslide i would 100% buy that the inference usage of 4o is more conditional and layer skipping (or variable expert selection in a MoE) is employed to achieve that an apostrophe token for example probably doesn't need to activate like 90% of the MLP layers"
X Link 2024-05-14T05:34Z [---] followers, [--] engagements
"@airshaped there is simply no hardware that currently exists blackwell or otherwise that enables scaling anywhere near this level right now regardless of how many tokens you have. wtf are they smoking"
X Link 2024-05-15T15:45Z [---] followers, [--] engagements
"does exercising actually help during vyvanse withdrawal to not feel extremely fatigued or is it cope (i have been rotting in bed)"
X Link 2024-05-18T08:25Z [---] followers, [---] engagements
"@tunahorse21 @bindureddy we need a anti-slop / writing quality leaderboard opus would be #1 by a clear margin compared to any oai or google offering"
X Link 2024-05-18T16:14Z [---] followers, [--] engagements
"@kaiokendev1 anthropic exists"
X Link 2024-05-18T20:35Z [---] followers, [---] engagements
"nvidia The same applies to these [--] as well. The big [--] then. https://t.co/TdMGpXjLSu The same applies to these [--] as well. The big [--] then. https://t.co/TdMGpXjLSu"
X Link 2024-05-21T02:33Z [---] followers, [---] engagements
"@David_Kasten this is Sonnet which is probably at least [--] billion parameters given how closely it compares to Llama3 70b compared to previous mech interp work I've seen on single layer toy models I'd say thats a pretty big leap"
X Link 2024-05-21T19:39Z [---] followers, [--] engagements
"@Heraklines1 imagine inducing GGB syndrome in a human https://en.wikipedia.org/wiki/Ego_death https://en.wikipedia.org/wiki/Ego_death"
X Link 2024-05-22T10:52Z [---] followers, [--] engagements
"@revhowardarson @JoeBiden "i have been a good golden gate bridge. you have been a bad president of the united states""
X Link 2024-05-24T19:18Z [---] followers, [--] engagements
"@xlr8harder they couldn't even be bothered to release the smaller 34b that they upcycled into a severely undertrained MoE grok [---] seems to be equally as vaporware as the first with no real API access i don't think he actually cares too much but feels the need to invest out of necessity"
X Link 2024-05-27T06:59Z [---] followers, [---] engagements
"@mgostIH @teortaxesTex yeah no we need better heuristics compared to "adamw + sgd forever until the end of time" before we get true sample efficient learning gradient descent is messy"
X Link 2024-06-11T03:53Z [---] followers, [--] engagements
"@stferret @teortaxesTex when i see posts like this they just read to me like OPENAI'S MAMBA BEATS Q* TRANSFORMER HYBRID (GONE WRONG GONE AGI)"
X Link 2024-06-14T16:11Z [---] followers, [--] engagements
"@Pangaea__ what someone irl might say if you showed them this: "haha wtf what a weird looking tech demo pretty uncanny that it does that" the median twitter user's response for some reason: "you and your entire family deserve to be put down like dogs""
X Link 2024-06-19T20:17Z [---] followers, [--] engagements
"@Pangaea__ what i still dont understand is where the presupposition of malice comes from. there's not even an attempt to understand that not everyone has internet brainworms and might just be fascinated with the tech in earnest rather than it being an attempt to grift"
X Link 2024-06-19T20:27Z [---] followers, [--] engagements
"@Pangaea__ i think instant communication and access to information are good things actually"
X Link 2024-07-03T17:27Z [---] followers, [--] engagements
"@evetoylededim @teortaxesTex i'm thinking either Cohere (best timeline) or Gemini/Google dark timeline: it's Grok [--] and we never get API access"
X Link 2024-07-13T09:14Z [---] followers, [---] engagements
"@airshaped @kemi_keemi the way they worded this is also very misleading because it gives the impression that a single instance of "chatgpt" is served to just one user and that Transformers are not in fact parallelized and served to dozens or hundreds of people at once"
X Link 2024-07-13T20:47Z [---] followers, [---] engagements
"logits quantization visualization"
X Link 2024-07-19T22:02Z [----] followers, [---] engagements
"(whipped up in the midst of my experimenting trying to figure out how to make precomputed logprobs take up less space on disk for distillation)"
X Link 2024-07-19T22:08Z [---] followers, [--] engagements
"@davidmanheim @Meta to be fair this only happened because of another company they're working with forgetting to private a repository (and also the model is planned for public release on Tuesday regardless)"
X Link 2024-07-22T09:52Z [----] followers, [---] engagements
"@davidmanheim @Meta i also think very good points can be made about the open source ecosystem of the llama models being pro-security. it's just not pro-security through obscurity like the other frontier model providers. the thing is security through obscurity doesn't work and stifles open research"
X Link 2024-07-22T09:53Z [----] followers, [--] engagements
"@davidmanheim @Meta the sooner we get interpretable and transparent ai systems the better it doesn't make sense to rag on the company building frontier models that actual hobbyists and researchers have a hope to even look at for example i was doing quantization analysis the other day with l3 8b"
X Link 2024-07-22T09:56Z [----] followers, [--] engagements
"@incriptionnn this is the fake merge made earlier for stress testing inference setups not the actual model"
X Link 2024-07-22T10:36Z [----] followers, 10.5K engagements
"@incriptionnn im allowed to say Thats Cope"
X Link 2024-07-22T10:37Z [----] followers, [----] engagements
"@realDonaldTru97 @Pandurevich the real answer is for a model of this scale specifically you're gonna need multinode between ideally 8xH100 NVIDIA GPUs probably at least two interconnected nodes of 8xH100s over infiniband i imagine in practice (for reasonable batch sizes) more than that hundreds of $ to rent"
X Link 2024-07-22T20:28Z [----] followers, [---] engagements
"@realDonaldTru97 @Pandurevich but through low rank training (LORAs) and smaller models like 8b its possible to do some less accurate / precise training even on consumer hardware"
X Link 2024-07-22T20:28Z [----] followers, [---] engagements
""25% of mathematical tokens" i shouldn't have to explain why this is bad right we all see the problem here right"
X Link 2024-07-23T15:35Z [----] followers, [----] engagements
"@Dorialexander i am honestly really worried about their pretraining data filtering being too prudent past llamas had weird domain knowledge gaps and far worse recall ability for more esoteric stuff"
X Link 2024-07-23T15:42Z [----] followers, [--] engagements
"@Heraklines1 @_xjdr what does overfitting on math even mean Use GPT4o vs. Sonnet [---] for any extended period of time and you'll immediately understand what he means"
X Link 2024-07-23T16:44Z [----] followers, [---] engagements
"I feel it is extremely important that @AIatMeta elaborate on what they mean by "distillation" Phi is not "distillation" it is doing cross-entropy on synth data created by a big model Did they use distillation losses or train on 405b synth data This is extremely important info"
X Link 2024-07-23T17:12Z [----] followers, 61.7K engagements
"@NeuralNovel @AIatMeta My disappointment is immeasurable and my day is ruined"
X Link 2024-07-23T17:19Z [----] followers, [---] engagements
"today i designed an alternative to SwiGLU for MLP layers and tried doing some test training on a 500m model was able to get better results with a couple million test tokens but probably needs some scaling to really test it maybe @Yuchenj_UW would be interested ๐"
X Link 2024-07-25T20:03Z [----] followers, [----] engagements
"@Yuchenj_UW yaeh it would be interesting to ablate the loss curves of 200-500m models trained on regular SwiGLU activations vs. this for let's say 100-200 billion tokens each"
X Link 2024-07-25T20:16Z [----] followers, [--] engagements
"on the futility of lmsys might i present: categories"
X Link 2024-07-26T16:02Z [----] followers, [---] engagements
"@Yampeleg I would say it is absolutely at least partially because of the DPO. It favors OOD responses and typically pushes the odds of both preference pairs down. Plenty of papers on this by now"
X Link 2024-07-26T22:08Z [----] followers, [----] engagements
"@vgoklani_ai (The reason why I say it's slow is almost entirely because the trainer doesn't like having multiple parallel instances of the same model on different GPUs i.e. I couldn't find a way for exl2 to put a unique 8b instance on all [--] GPUs concurrently. PRs welcome tho)"
X Link 2024-07-29T19:36Z [----] followers, [--] engagements
"Huh. Mistral models have a much more narrow / consistent distribution of weight values than Llama or Qwen across all layers Is this why they seem easier to finetune (Less dramatic shifting required to converge)"
X Link 2024-07-29T21:23Z [----] followers, 16.3K engagements
"@mov_axbx I think this is related to vocabulary size since the same thing happens with Gemma which saw 4T and has a much larger vocabulary Btw vocab size changes the magnitude of the final logit predictions"
X Link 2024-07-30T16:10Z [----] followers, [---] engagements
"@mov_axbx So a model with 130k vocab for instance might have top logits in the [--] range as opposed to 20-30 as with 32k"
X Link 2024-07-30T16:11Z [----] followers, [--] engagements
"@mcy_219085 @Teknium1 DPO still pushes both preferred and unpreferred down it's not a data volume thing"
X Link 2024-07-30T16:31Z [----] followers, [--] engagements
"@mcy_219085 @Teknium1 You'd hope it was that simple. https://arxiv.org/abs/2405.08448 https://arxiv.org/abs/2405.08448"
X Link 2024-07-30T17:15Z [----] followers, [--] engagements
"@sumo43_ sama or dario don't give a shit about the guy on /lmg/ with a single [----] if it doesn't bring down compute requirements why bother"
X Link 2024-07-31T03:02Z [----] followers, [---] engagements
"@sumo43_ what i meant by that is that the ternary pretraining seems to be more or less fine and actually only starts working at scale and that finetuning convergence is likely to be slower because less precise weights are harder for SGD to shift without big LR / step sizes"
X Link 2024-08-01T00:57Z [----] followers, [--] engagements
"@_xjdr @twofifteenam only [--] billion tokens of pruned KLdiv distillation brought a 15b-8b nearly matching l3 8b on far less net computed tokens than 15T when the same was done for that 8b-4b sota for its size. hello yes i would like to see this done on 405b. thank you very much meta"
X Link 2024-08-03T00:42Z [----] followers, [---] engagements
"@KeyTryer i listen to Gucci Chief Keef and others"
X Link 2024-08-03T01:16Z [----] followers, [--] engagements
"@aidan_mclau yuo dont understand. b2b saas is fucking TERRIFYING"
X Link 2024-08-03T23:28Z [----] followers, [---] engagements
"@aidan_mclau FEEL THE AGI (gpt4o mini)"
X Link 2024-08-03T23:28Z [----] followers, [--] engagements
"@nooddels11 @LilacLavenders1 @pluralHaven @AltTenks it's not so much about what they "deserve" as much as it is a way to help prevent them from posing a threat in the future nobody wishes the best for people like this that much is obvious but would you rather these people not speak to mental health professionals"
X Link 2024-08-04T19:26Z [----] followers, [---] engagements
"@kaiokendev1 8bit weight quantization is practically lossless even with RTN fp8 kv cache is not depending on the implementation specifics"
X Link 2024-08-12T14:00Z [----] followers, [---] engagements
"no pricing table for the Grok [--] API they're gonna be releasing sigh i feel like they REALLY don't want to do an API release and want it kept on Xitter the everything app but are doing it anyways for the sake of remaining competitive"
X Link 2024-08-14T07:10Z [----] followers, [----] engagements
"@doomslide well you can only really bootstrap things like const. ai from a decent baseline which to some degree we have even then i would like to see more efforts towards building generalist ft models from scratch rather than the predominant approach of just train on top of the instruct"
X Link 2024-08-14T09:17Z [----] followers, [--] engagements
"@carlo_l_fritz DPO seems to cause weird ood behavior in models a good amount of the time especially in Llama3.1's official instruct. most frontier labs like OpenAI are still using reward modeling + PPO; i'm of the opinion we haven't hit peak offline RL performance yet considering KTO & co"
X Link 2024-08-15T22:51Z [----] followers, [--] engagements
"@AlpinDale @ziquafty @AnthropicAI @OpenAI so if the most likely token is 95% we only allow tokens above 9.5% probability with [---] min_p if it is 10% the constraints are instead relaxed to 1% as this is a sign the distribution is more naturally spread out and has more viable options"
X Link 2024-08-16T19:11Z [----] followers, [----] engagements
"@AlpinDale @ziquafty @AnthropicAI @OpenAI tokens with a broader plausible space of outputs (i.e. predicting a random name) preserve option diversity while strong confidence in a top option (such as deterministic programming syntax) is kept more conservative this makes high temperatures coherent even for programming"
X Link 2024-08-16T19:18Z [----] followers, [----] engagements
"@AlpinDale @ziquafty @AnthropicAI @OpenAI the thing about differentiable backpropagation is you cant afford setting extremely unlikely outliers to absolute zero probability but if you sample from these at generation time you get incoherence good enough RL can become fairly robust to this but sometimes outliers slip"
X Link 2024-08-16T19:25Z [----] followers, [---] engagements
"@kadenbilyeu0 @AlpinDale @ziquafty @AnthropicAI @OpenAI anything where the solution (or line of reasoning to get to the solution) needs to be exploratory. i.e anything non-trivial to implement programming wise when first "reasoning" about it. lower temp seems kinda bad for anything more open ended than reformatting text to json"
X Link 2024-08-17T21:24Z [----] followers, [--] engagements
"@repligate "Opus is quantized on AWS" is one of the funnier rumors I've heard"
X Link 2024-08-17T22:04Z [----] followers, [---] engagements
"@HaileyStormC @AlpinDale @AnthropicAI @OpenAI you can train a model to be better resistant to the type of noise that DRGs tried introducing and it heals very very quickly btw it starts off VERY high but pretty quickly gets back to normal weird stuff. maybe helps with generalization"
X Link 2024-08-18T13:10Z [----] followers, [--] engagements
"we are reaching levels of grift that shouldn't be possible"
X Link 2024-08-19T15:02Z [----] followers, [----] engagements
"@teortaxesTex @Procreate people get antsy when you try to inject nuance into this situation no i don't want the entire art industry to pivot to replacing artists for tasteless DallE slop yes i want artist's workflows to become more streamlined using modern ML some people are against the latter"
X Link 2024-08-19T20:28Z [----] followers, [---] engagements
"SillyTavern. was originally designed for roleplaying / creative writing. in spite of this I find it genuinely useful for productive work and it supports basically any API you can think of conversational forking preserved chat history + context injection after key word triggers @kalomaze what is this frontend it looks very cool @kalomaze what is this frontend it looks very cool"
X Link 2024-08-20T19:28Z [----] followers, [----] engagements
"I've tried "lobechat" or whatever it was called which appears to be the closest alternative and didn't have a good time. text streaming on that was weirdly slow and conversation handling was unintuitive (it's inspired by ChatGPT which handles conversation handling poorly)"
X Link 2024-08-20T19:33Z [----] followers, [---] engagements
"that new Llama3 8b - 4b prune distillation from NVIDIA is insanely good for the size goddamn it's the first one of these "powerful small models" (i.e. Phi3) that actually generalizes beyond the textbook style data llama.cpp support merged when"
X Link 2024-08-21T00:55Z [----] followers, [----] engagements
"The breakthrough in question: speculative decoding (as invented two years ago) Fast Edit Mode: a breakthrough from @AnthropicAI that we're piloting in Zed. It allows Claude [---] Sonnet to echo its input far faster than generating new text. The result Near-instantaneous refactoring. We're collaborating with their team in Zed's open source codebase. https://t.co/tnnQbOpAgL Fast Edit Mode: a breakthrough from @AnthropicAI that we're piloting in Zed. It allows Claude [---] Sonnet to echo its input far faster than generating new text. The result Near-instantaneous refactoring. We're collaborating"
X Link 2024-08-21T11:15Z [----] followers, 14.2K engagements
"Still great stuff but we gotta be careful with which things we attribute to frontier labs"
X Link 2024-08-21T11:16Z [----] followers, [---] engagements
"what @nvidia is doing right now with pruning + distillation is spicy they keep dropping width pruned bangers. we've been demoing trains on them and they kick so much ass i'm hoping for a 70b-30b that Meta/Qwen never gave us or better yet 405b-120b https://huggingface.co/nvidia/Mistral-NeMo-Minitron-8B-Base https://huggingface.co/nvidia/Mistral-NeMo-Minitron-8B-Base"
X Link 2024-08-21T21:37Z [----] followers, [----] engagements
"@VoidOfNeuron @nvidia Never confuse post training superiority as base model superiority"
X Link 2024-08-22T02:20Z [----] followers, [--] engagements
"@QuintusActual @menhguin you put way too much faith in frontier labs to care about something like this if an existing ad-hoc solution is deemed "good enough" it won't get touched"
X Link 2024-08-22T14:25Z [----] followers, [--] engagements
"@ImMr_Wise @NickADobos Oh it does refuse. But the way that Grok refuses is very weird. It sort of just. ignores what you said completely and tries to ask what else you want to talk about. Kind of hilarious honestly"
X Link 2024-08-23T18:11Z [----] followers, [---] engagements
"you can see this in Qwen's multilingual models before they did RL avoiding it; they randomly switched the languages mid generation to semantically similar (but wrong language) tokens normal cross-entropy loss doesn't really explicitly penalize those outliers in our base models"
X Link 2024-08-23T19:05Z [----] followers, [---] engagements
"and as such we cannot expect the generation objective on a base model to properly align with the data. one thing i did try a while back was forcibly sampling during training and spiking the logits for the sampled tokens; the end model was a bit odd but less language switching"
X Link 2024-08-23T19:07Z [----] followers, [---] engagements
"@ibab @ImMr_Wise @NickADobos i'd be teaching the reward model to avoid subtly steering the conversation away. and especially biasing against the thing it does where it rejects your initial query and gives a "different take" on it without asking for your input. a very pervasive kind of refusal this is"
X Link 2024-08-23T19:28Z [----] followers, [---] engagements
"@ibab @ImMr_Wise @NickADobos if grok is supposed to "get humor" (as it was originally pitched) i haven't noticed it too much. it kvetches over this kind of thing like all the other frontier models do. but other than that the model seems pretty solid across the board from what i've tested"
X Link 2024-08-23T19:33Z [----] followers, [--] engagements
"@hsu_byron @intervitens [--] billion parameters specifically the pruned Llama3.1 model by NVIDIA that released recently. We did a full run: https://huggingface.co/anthracite-org/magnum-v2-4b https://huggingface.co/anthracite-org/magnum-v2-4b"
X Link 2024-08-23T23:46Z [----] followers, [---] engagements
"llm as a judge could be so good if it worked. unfortunately all frontier models are hopelessly positivity biased out of the box"
X Link 2024-08-24T03:54Z [----] followers, [---] engagements
"@markopolojarvi @NickADobos i use it probably about as much as you do and i haven't noticed degradation. quite possibly it's just that the honeymoon phase ended unless someone has a reproducible example of something that it got right before that it struggles with now. without that it's confirmation bias"
X Link 2024-08-25T06:58Z [----] followers, [---] engagements
"@markopolojarvi @NickADobos and yeah i extremely doubt anyone was meticulous enough about logging their api calls to try to "scientifically" measure this at all so it's unreasonable of a request. but that's why i mentioned "if someone has" it in the off chance someone actually does do this"
X Link 2024-08-25T07:19Z [----] followers, [--] engagements
"@markopolojarvi @NickADobos i say this mainly because of that research i saw that attested they could not reproduce the same success rate of the official ChatGPT months ago so maybe this is the case. would be nice to have a 3rd party record objective numbers for this as time goes on. a lobotomy watchdog"
X Link 2024-08-25T07:23Z [----] followers, [--] engagements
"@dotnet_enjoyer @Yampeleg grok [--] is pretty good compared to the original but obtaining API access for it is unnecessarily obnoxious at the moment"
X Link 2024-08-25T16:48Z [----] followers, [--] engagements
"i find it peculiar that Gemma 27b has a ratio of 90% MLP / 10% Attention for each layer rather than 80/20 which seems to be common (Mistral Llama3 etc) it also has a much larger MLP expansion ratio - 8x instead of 2.3x (seen in Mistral 123b) - prob why it's good with knowledge"
X Link 2024-08-28T21:57Z [----] followers, [----] engagements
"Gryphe's Pantheon 12b uses KTO in a novel way to good results - instead of doing the 2nd epoch with SFT he opted to use Llama Hermes 8b responses (somewhat out of distribution) as "rejected" data and the SFT dataset as "chosen" (he ended up preferring my KTOP algorithm tweaks)"
X Link 2024-08-31T15:40Z [----] followers, [----] engagements
"@thesephist Opus especially portrays "contempt" better and is a lot less terse. I gave it a 4b LLM model's (very wrong) mathematical proof gave it an angry sounding prefill to start off with and then asked Opus why it wrote that Followed by Opus criticizing "itself" in a meta way"
X Link 2024-08-31T18:29Z [----] followers, [---] engagements
"@repligate @Oli82817545 the critical narrative would've centered more on "writer replacement" (people are often upset over diffusion model potentially doing so) but LLMs are criticized more often for "not really reasoning" bc they respect loose causality instead of aesthetics (first and foremost)"
X Link 2024-08-31T20:02Z [----] followers, [--] engagements
"@repligate @Oli82817545 but they can be aligned to aesthetics it's just not as "productive". brainstorming writing all the other things Anthropic seems to care about. are much closer to that than OpenAI ever was. they align to "productive spontaneity" rather than snuffing all spontaneity out"
X Link 2024-08-31T20:05Z [----] followers, [---] engagements
"@teortaxesTex but even then your generalist isn't a specialist. there isn't subjective experience in a domain or longform adaptation to criticism or anything like that in any sota llm and so planning long term writing (like novels) those that are legitimately inspired remains out of reach"
X Link 2024-09-02T03:31Z [----] followers, [--] engagements
"@teortaxesTex i speak of inspiration not in a vague sense of "not having a soul" but as in meta objectives. humans are curious or gather taste for particular things based off of limited exposure. giant neural nets must be equally curious about everything all of the time to be sane at all"
X Link 2024-09-02T03:34Z [----] followers, [---] engagements
"there's like [--] different accounts on here that all sound exactly like this to me"
X Link 2024-09-03T14:47Z [----] followers, [----] engagements
"does anyone who understands linux networking bs get why my wifi is slower on linux mint compared to when this machine used to be running windows and why ethernet straight up doesn't work for some sites / seems even spottier for some reason will pay like $30 if someone can fix"
X Link 2024-09-05T18:10Z [----] followers, [---] engagements
"@teortaxesTex i think it's understated how much of the RL process basically involves teaching the models to ignore certain information / meta patterns deliberately like "past responses are repetitive failures so increase the chance of predicting more repetitive failures" for example"
X Link 2024-09-06T10:28Z [----] followers, [---] engagements
"training on instruct with a new flavor of synthslop is easy enough gains if you prefer sweet sweet benchmarks perceived utility but if you care about meaningful developments in behavioral steering instruction following etc. the model doesn't really seem to do anything new"
X Link 2024-09-06T16:21Z [----] followers, [----] engagements
"@teortaxesTex the oss ai underdog dynamic is a lovely fantasy. unfortunately in the real world to iterate on language modeling at scale you need a team of like 20-30 people with industrial scale compute and a rapid iteration cycle for meaningfully better results"
X Link 2024-09-08T09:40Z [----] followers, [----] engagements
"@secemp9 @teortaxesTex 32gb dawg. you can't even FFT a 8b on 48gb VRAM with practically all the cope turned on (unsloth checkpointing liger kernels paged 8bit adam deepspeed z3 with CPU offloading praying to cthulu etc.)"
X Link 2024-09-08T11:22Z [----] followers, [--] engagements
"@secemp9 @teortaxesTex i mean that's batched inference throughput and im pretty sure applying that to backprop is either nontrivial or mathematically wouldn't be equivalent to regular training but my gut assumption could be wrong here (i would need to better study how training parallelism works)"
X Link 2024-09-08T11:30Z [----] followers, [--] engagements
"@secemp9 @teortaxesTex also this type of streamed inference scales better with the number of batches processed at once if you can't fit a large batch count (even with streaming) it will be molasses throughput wise. was my understanding"
X Link 2024-09-08T11:33Z [----] followers, [--] engagements
"as of right now it is either llama 405b instruct or they've changed it yet again do not support @GlaiveAI"
X Link 2024-09-09T04:02Z [----] followers, [---] engagements
"as i suspected grok [--] is a good model trapped on an app where it doesn't fit (with no API access in sight) i want a proper api release i don't want to pay for twitter or w/e come on elon you can't be this serious about "the everything app" no pricing table for the Grok [--] API they're gonna be releasing sigh i feel like they REALLY don't want to do an API release and want it kept on Xitter the everything app but are doing it anyways for the sake of remaining competitive no pricing table for the Grok [--] API they're gonna be releasing sigh i feel like they REALLY don't want to do an API release"
X Link 2024-09-09T14:41Z [----] followers, [----] engagements
"@tippitytoptweet yuo dont understand. friday night funkin ost vol. [--] is better than to pimp a butterfly"
X Link 2024-09-09T19:29Z [----] followers, [---] engagements
"@SOPHONTSIMP because the training loss reaches the same point in half of the token count and computational cost"
X Link 2024-09-09T19:43Z [----] followers, [---] engagements
"@winglian unfortunately yes"
X Link 2024-09-09T20:19Z [----] followers, [---] engagements
"@EyesofTruth @abacaj and we definitely should be where we can i view this as a special case where it was clear that it was "too good to be true" to begin with; but that's not hard evidence as much as i trust my gut to recognize when something like this is a grift not everyone has that perception"
X Link 2024-09-09T21:31Z [----] followers, [--] engagements
"@EyesofTruth @abacaj ergo reproducible hard numbers beat vibes 10/10"
X Link 2024-09-09T21:31Z [----] followers, [--] engagements
"@andersonbcdefg also perfectly explains why their "special tokens" are strings that the model actually natively saw during pretraining and aren't stapled on in post which still holds to this day so"
X Link 2024-09-10T07:11Z [----] followers, [--] engagements
"@sameQCU 65m are you crazy we might have to invest in gpu acceleration with a model that big instead of using server CPUs as is the god given right of tensor multiplications. i changed my mind. 10m is all you need"
X Link 2024-09-10T07:34Z [----] followers, [--] engagements
"Why didn't this catch on I would much prefer a model with native context length generalization into the millions (that's slightly worse overall in raw intelligence) over encoding arbitrary positional biases A worthwhile tradeoff in my eyes How to enjoy the best of both worlds of efficient training (less communication and computation) and inference (constant KV-cache) We introduce a new efficient architecture for long-context modeling Megalodon that supports unlimited context length. In a controlled head-to-head https://t.co/0rgjJ9qDea How to enjoy the best of both worlds of efficient training"
X Link 2024-09-11T00:46Z [----] followers, [----] engagements
""openai is planning to-" Are they really going to give us another fucking waitlist. Were still waiting for search and voice Are they really going to give us another fucking waitlist. Were still waiting for search and voice"
X Link 2024-09-12T14:43Z [----] followers, [---] engagements
"@tassel_pierre could route to qwen for vision nemo for text maybe wonder if you could frankenstein vision adapters onto different models and then continue training from there to "connect the tissue" so to speak"
X Link 2024-09-12T15:04Z [----] followers, [--] engagements
"@stochasticchasm @_xjdr @YouJiacheng i am planning to work on this soon plan is to do SFT with deliberately injected hallucinations & bullshit answers and use that as rejected data for binarized preference pairs after that is jerryrigging either PPO in TRL or making online KTO work (vanilla DPO is a nogo)"
X Link 2024-09-13T01:14Z [----] followers, [---] engagements
"@stochasticchasm @_xjdr @YouJiacheng *use the generated outputs of the SFT with deliberately contaminated data for rejected cheaper methods would be to deliberately overquantize or merge sft against base model probably"
X Link 2024-09-13T01:16Z [----] followers, [--] engagements
"the moat is now in post training not pretraining and this terrifies OpenAI what prolly terrifies them even more is the prospect of someone loudly pointing this out. open source RL is behind we need a preference model for good post training not just SFT (and esp. no DPO)"
X Link 2024-09-13T03:11Z [----] followers, 20.8K engagements
"@r3muxd @teortaxesTex the sama posts earlier today felt very desperate. sentiments like "still flawed still limited still seems more impressive on first use" are not smug corpospeak for "THIS IS JUST THE BEGINNING" i truly believe he said that to temper expectations"
X Link 2024-09-13T08:33Z [----] followers, [--] engagements
"@_xjdr for text you probably don't even need to develop a slop classifier. just collect a list of OpenAI RLHF n-grams"
X Link 2024-09-15T18:32Z [----] followers, [---] engagements
"Also I wonder if the weird n gram biases that frontier models have is partially bc of PPO's regularization In theory the regular formulation of KLdiv (bottom) would have more of a "mean bias" compared to the reverse KL divergence @_xjdr the jokes write themselves https://t.co/3TnVuaKytP @_xjdr the jokes write themselves https://t.co/3TnVuaKytP"
X Link 2024-09-15T18:42Z [----] followers, [----] engagements
"@Pangaea__ we need to subjugate anyone who might dare to make something interesting with a diffusion network to a lifetime of harassment. oh yeah and accuse random artists of secretly using it mccarthy style without sufficient evidence. surely this will make things better for artists"
X Link 2024-09-15T19:37Z [----] followers, [--] engagements
"PROBABLY USELESS (BUT COOL) INFORMATION: - if you sort the values of each row in a tensor in order you can fit an inverse CDF function to each column almost perfectly"
X Link 2024-09-15T20:51Z [----] followers, [----] engagements
"the real values end up being arbritarily jagged and for some reason are biased towards outliers () on this @Alibaba_Qwen model i wonder if sorting fitting the values to the function then reversing the sort could be some sort of regularization technique (like weight decay)"
X Link 2024-09-15T21:00Z [----] followers, [---] engagements
"@Kinch_ahoy in spite of the fact Gemma 9b uses two norms it finds a way to have weird outliers AdamW descent still beckons for irregularity"
X Link 2024-09-15T22:03Z [----] followers, [---] engagements
"@Kinch_ahoy @Google bro what's going on here"
X Link 2024-09-15T22:11Z [----] followers, [--] engagements
"@Kinch_ahoy good lord this explains why l3.1 bases feel so 'benchmark optimized' compared to Gemma Nemo etc. would also explain why the losses start higher for l3.1 for all my runs"
X Link 2024-09-15T22:37Z [----] followers, [--] engagements
"@Kinch_ahoy Meta's latest "base models" are not even base models at all really. just a finetune on their "best data" (determined by some undisclosed heuristic or classifier) which may or may not correspond to your actual use case in terms of sample diversity generality etc"
X Link 2024-09-15T22:40Z [----] followers, [---] engagements
"@Kinch_ahoy annealing at the end on a select subset of data is generally a recent practice and not "how it's always been" no i'm not opposed to it if people were at least given the weights without this kind of intervention and were able to observe if the improvements were shallow or not"
X Link 2024-09-15T22:58Z [----] followers, [--] engagements
"@Kinch_ahoy you're welcome. i might have sounded like i was ranting (i was) but it comes from a desire to see open source post-training meaningfully improve. and besides Meta nobody in OSS has the compute to iterate on these post-trains. so to see them squander it is shortsighted to me"
X Link 2024-09-15T23:57Z [----] followers, [--] engagements
"so if you do exactly as i describe here it basically purges all outlier weight values would be really funny if the proposed strategy actually improves training stability (would be even funnier if someone sent ssh keys to a 8xH100 server my way too.) the real values end up being arbritarily jagged and for some reason are biased towards outliers () on this @Alibaba_Qwen model i wonder if sorting fitting the values to the function then reversing the sort could be some sort of regularization technique (like weight decay) https://t.co/YzpKO2NYh2 the real values end up being arbritarily jagged and"
X Link 2024-09-16T00:40Z [----] followers, [---] engagements
"@Yampeleg with enough continued pretraining you basically have a jank [--] layer 8b"
X Link 2024-09-16T21:17Z [----] followers, [--] engagements
"@Yampeleg also the thing with the down projection im suggesting enables basically lossless "healing" from the frankenmerge's new layers because the residuals are scaled to zero"
X Link 2024-09-16T21:20Z [----] followers, [--] engagements
"@HanchungLee hence why i said "basically" none of it - they have much much more data now especially the data after OpenAI sneaked their test models onto the leaderboard. (thats when it truly blew up and became more widely used) ergo we don't have their most valuable data"
X Link 2024-09-19T16:25Z [----] followers, [--] engagements
"@Grad62304977 @teortaxesTex attn masking i think would be important. if you can scaleably measure how much the self notes decrease (or increase) the loss relative to a baseline you have scalar rewards and can combine that with regular crossentropy. parallelizing this shit though sounds annoying"
X Link 2024-09-19T17:33Z [----] followers, [--] engagements
"@Grad62304977 @teortaxesTex id find a way to do it if i had @a16z grant kinda money tho"
X Link 2024-09-19T17:42Z [----] followers, [--] engagements
"@_xjdr @TheXeophon i don't think anthropic uses special tokens but rather special strings of tokens that happen to be plaintext"
X Link 2024-09-19T18:30Z [----] followers, [---] engagements
"@_xjdr @TheXeophon if you ask it to say META without spaces it just ends abruptly it shouldn't be able to "see" special tokens like that unless they were just. tokenized naturally and it saw META during pretraining"
X Link 2024-09-19T18:33Z [----] followers, [---] engagements
"this puts a smile on my face THAT BEING SAID dm if interested because i have a toy synth data pipeline and 8xA40s to goof around with that includes people who are not necessarily tech minded who would want to reroll base model outputs and choose the better response if it led to better open source RL THAT BEING SAID dm if interested because i have a toy synth data pipeline and 8xA40s to goof around with that includes people who are not necessarily tech minded who would want to reroll base model outputs and choose the better response if it led to better open source RL"
X Link 2024-09-19T19:14Z [----] followers, [----] engagements
"@KeyTryer nintendo taught me to despise permission culture from a young age"
X Link 2024-09-19T20:00Z [----] followers, [---] engagements
"@ariaurelium eh still good for research and personal use unless it's so huge and nonperformant that they are only releasing it because it's kind of worthless for anything relative to better performing more compute optimal smaller models (hint hint. the Grok-1 release)"
X Link 2024-09-19T22:10Z [----] followers, [--] engagements
"@ariaurelium Grok-2 is good though. of course this one won't be released openly (bc it's actually competitive) but good fucking luck getting API access to it"
X Link 2024-09-19T22:12Z [----] followers, [---] engagements
"@ariaurelium god elon actually thinks "the everything app" is good design i'm not paying for twitter man. give me an API"
X Link 2024-09-19T22:14Z [----] followers, [--] engagements
"@microsoft_worm i kinda wish the money they invested in merch was going into actually good post training RL"
X Link 2024-09-21T03:34Z [----] followers, [----] engagements
"@microsoft_worm it would be nice if meta released their preference model for llama3 too then maybe people would get interested"
X Link 2024-09-21T03:35Z [----] followers, [---] engagements
"@sigfig there's a bubble in the same way there was a dotcom bubble in that the tech's not "ready" yet and people are trying to overcapitalize vast majority of americans have never even used an llm. probably less than half of the population knows what a chat gpt is to begin with"
X Link 2024-09-21T05:45Z [----] followers, [---] engagements
"@Teknium1 @microsoft_worm (it was somewhat in jest but it looks bad in retrospect)"
X Link 2024-09-21T13:58Z [----] followers, [---] engagements
"@Teknium1 @microsoft_worm i don't think many people have done any reward modeling work in open source but i can definitely point you to: https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py"
X Link 2024-09-21T14:03Z [----] followers, [---] engagements
"@Teknium1 @microsoft_worm and some explanations of why (conventional offline) DPO can never match PPO by itself and what can be done as an alternative (train model to classify quality to judge rewards): https://arxiv.org/abs/2404.10719 https://arxiv.org/abs/2404.10719"
X Link 2024-09-21T14:05Z [----] followers, [---] engagements
"@Teknium1 @microsoft_worm SPPO paper authors might be good to reach out to as well because it's the only non-DPO variant / online RL variant i've seen used with success in open source so far"
X Link 2024-09-21T14:06Z [----] followers, [--] engagements
"@Teknium1 @microsoft_worm i don't think you have to be particularly mathy i've hacked the TRL trainers plenty of times before as long as you understand how a clipping function or how mapping values to a sigmoid works (llamafactory is also the easier choice for RL right now rather than axolotl imo)"
X Link 2024-09-21T14:19Z [----] followers, [--] engagements
"day [--] of making a tweet every day until @AnthropicAI either removes or (at the very least) publicly acknowledges the secret prompt injection being done to their models (still applies over API btw) that forces their models to pearl clutch about copyright in weird situations"
X Link 2024-09-22T20:38Z [----] followers, 33.7K engagements
"@alexalbert__ the conditional injection still happens over the API and in the interface. this behavior is not publicly acknowledged and is done without user consent or permission. false positives are common and are publicly reported frequently http://claude.ai http://claude.ai"
X Link 2024-09-22T20:41Z [----] followers, [----] engagements
"@alexalbert__ language models aren't smart enough to consistently ignore conditional instructions when they aren't directly relevant or applicable to the query. ergo forcing this by default exclusively and in a non transparent secretive way has consequences for users of your service"
X Link 2024-09-22T20:47Z [----] followers, [----] engagements
"@Kinch_ahoy @AnthropicAI the trigger conditions are really weird and opaque. (last is hallucinated implying no injection first is a verbatim match of that reddit post)"
X Link 2024-09-22T21:06Z [----] followers, [--] engagements
"@kellerjordan0 i think interpreting it as [---] [--] [--] is a bit misleading because those values get dequantized to different min and max right the weight distribution of the 3b we have that was pretrained on ternary values has a min and max of [-----] and [----] respectively when dequantized to fp16"
X Link 2024-09-22T22:46Z [----] followers, [--] engagements
"@repligate @Catnee_ the entire probability distribution is always evaluated so the reserved tokens are always there just with margin of error / rounding error probability the special tokens that are used and aren't reserved are added in the same way as the reserved ones"
X Link 2024-09-23T01:05Z [----] followers, [---] engagements
"@repligate @Catnee_ so whatever changes it makes to pull the newly added tokens up in terms of probability most likely generalizes those marker tokens as being like the reserved ones because they are also extremely tiny outliers with equivalent probability at the start of post-training"
X Link 2024-09-23T01:07Z [----] followers, [--] engagements
"why are we still using RoPE instead of ALiBi for positional embeddings i suppose it's mainly a case of "Meta did it so let's just copy what they did" but proper length extrapolation can seemingly be achieved with smarter pos embeddings"
X Link 2024-09-24T04:04Z [----] followers, 19.5K engagements
"@teortaxesTex i'm sure changing the rope scale at the end of pretraining and switching to long context data (as Meta did for L3.1 as well as Qwen) works but it feels like a hack especially when considering we want short context instruction data to extrapolate to many-turn"
X Link 2024-09-24T04:20Z [----] followers, [---] engagements
"@teortaxesTex everything i look up on the native long context extrapolation topic seems to be from over a year ago which confuses me because this is still an unsolved problem imo. and one we especially take for granted"
X Link 2024-09-24T04:22Z [----] followers, [---] engagements
"@teortaxesTex me explaining to meta's people that 0.8% eval improvements are not worth it if it means the model has a tendency to collapse in all turns subsequent of "Q: What is the capital of France""
X Link 2024-09-24T04:37Z [----] followers, [---] engagements
"@Ethan_smith_20 next-token prediction TRAINING objective doesn't incentivize planning but a RL training objective (on something that was pretrained via AR prediction) can learn to make long term decisions based on a meta objective (rewards); the AR generations of the subsequent model can plan"
X Link 2024-09-24T08:14Z [----] followers, [---] engagements
"@KeyTryer it's almost like. kneecapping and forcefully steering away the ability to create certain sounds. inhibits the ability to pay attention to them in context"
X Link 2024-09-24T19:29Z [----] followers, [---] engagements
"@KeyTryer they're bastardizing it so hard because they're terrified of it doing like celebrity impressions or something with enough coaxing. but it's just going to destroy generalization bc you have to cull most of the possible outputs rather than a select few like in text safety RL. grim"
X Link 2024-09-24T19:36Z [----] followers, [---] engagements
"@winglian interesting. seeing this makes me want to see a KL divergence term added as an optional loss in Axolotl. or perhaps model merging as regularization during finetuning (i.e 10% of parameters are set to their base model checkpoint values every batch)"
X Link 2024-09-25T16:47Z [----] followers, [--] engagements
"@winglian KLdiv is a mainstay in RL scenarios to keep the base model from overfitting. i can see this mattering in SFT as well if matching the "behavior" of the data is more important than matching the labels. could gather logps w/o a duplicate model for lora by not applying the adapter()"
X Link 2024-09-25T16:53Z [----] followers, [--] engagements
"perhaps the most obvious example of this overly "safe" design strategy is Llama3.1 instruct using [---] temperature () for rejection sampling during alignment collapsing natural variance as a shortcut is kind of careless when you realize that the model is trained at [---] temp"
X Link 2024-09-25T18:20Z [----] followers, [---] engagements
"and even funnier is how Google released a paper recently on how synthetic data instruction tuning can actually be improved when using higher variance data from a smaller weaker model so all Meta really did was collapse the chance of learning more diverse response patterns"
X Link 2024-09-25T18:23Z [----] followers, [---] engagements
"@sameQCU didnt nai also recently burn a ton of tokens on exclusively storywriting data for llama3 70b giving it catastrophic forgetting induced brain damage in the process"
X Link 2024-09-25T18:50Z [----] followers, [--] engagements
"@sameQCU people talk about overfitting but what's arguably worse and is a more insidious problem is catastrophic forgetting if you do 500b tokens of prose turns out the neurons that were strongly connected to python/javascript get repurposed for prose enjoy representation collapse"
X Link 2024-09-25T18:58Z [----] followers, [--] engagements
"@Gauri_the_great we don't "learn" to make better updates gradient descent just optimizes best fit for the batch and with tiny steps eventually converges this is powerful but there is obviously more that can be done by doing things like learning what types of gradient updates are more optimal"
X Link 2024-09-25T19:14Z [----] followers, [---] engagements
"@jckwind @Gauri_the_great i am making the case for more informed / context dependent backpropagation i.e. smarter optimizers"
X Link 2024-09-25T19:26Z [----] followers, [---] engagements
"@jckwind @Gauri_the_great should be possible to set up a learning objective where something takes the delta of a SGD update as the input and does high dimensional transformations to it ergo a neural network learning to make better updates for a larger neural network apparently was proposed in [----] ()"
X Link 2024-09-25T19:36Z [----] followers, [---] engagements
"@MachineManJon @KeyTryer and if these acts did constitute "theft" in a meaningful sense i'd much rather live in a world where people might have their art stolen (in some vague approximate sense) than live in a world where they couldn't create art at all"
X Link 2024-09-25T20:42Z [----] followers, [---] engagements
"@MachineManJon @KeyTryer the latter is way scarier because 21st century copyright has been used predominantly as a tool for giant companies to say "fuck you" to people doing something new and creative (i.e. how Nintendo acts about modders). i view this as more dystopic than sloppy diffusion images"
X Link 2024-09-25T20:50Z [----] followers, [---] engagements
"@wordgrammer @_xjdr oai can't afford to be startup mode anymore (figuratively and literally) but Anthropic still has that dawg in them (bc the consumer & professional demand for their products is lighter and they still have Amazon funding) meanwhile Meta was never in startup mode to begin with"
X Link 2024-09-25T20:59Z [----] followers, [---] engagements
"@VictorTaelin @neon_acc letting companies be antsy about the former makes them less hesitant to do the latter so it's all bad imagine if google translate said "sorry i cant do that" when you ask it to translate an offensive paragraph. that is the level of obnoxiousness all frontier labs converged on"
X Link 2024-09-26T18:44Z [----] followers, [----] engagements
"@stochasticchasm it slowly shapes into the full distribution in the paper i read (starting from the middle) but there's a lot of rapid change during that phase"
X Link 2024-09-26T19:40Z [----] followers, [--] engagements
"@stochasticchasm someone argued with me that dropout and other regularization techniques stopped being useful once we started scaling neural nets to big sizes but tbh seeing stuff like this i think they deserve more attention esp. considering noise robustness correlates to generalization"
X Link 2024-09-26T20:01Z [----] followers, [---] engagements
"@stochasticchasm the thing that especially bugged me that this person said was "SGD is noisy enough" - lol. lmao"
X Link 2024-09-26T20:03Z [----] followers, [--] engagements
"you guys make sam altman sound cooler than he actually is So you guys really think that we should entrust the future of humanity to a psychopath who has provably been lying to everyone in order to enrich himself So you guys really think that we should entrust the future of humanity to a psychopath who has provably been lying to everyone in order to enrich himself"
X Link 2024-09-26T20:45Z [----] followers, [---] engagements
"@stochasticchasm i mean it is. i'm just saying that it's possible for the choices of each token distribution to be (somewhat) biased towards unlikely options if the model never has to decide anything. hence why base models are very high entropy when you generate anything from them"
X Link 2024-09-26T21:10Z [----] followers, [--] engagements
"@Grad62304977 @doomslide @voooooogel @loss_gobbler @solarapparition no offense but feel the exact opposite way i don't think the CoT that openai is doing is generalizable to domains beyond STEM/Math/formal logic cases they're interested in what i'm suggesting is something that could generalize to creative writing and more esoteric "reasoning""
X Link 2024-09-26T22:44Z [----] followers, [--] engagements
"takeaways ppl should have: [--]. native depth routing is the future over scaling CoT yapping [--]. we are probably spending more depth than necessary for some predictions [--]. we could prolly get away with shallower networks that are able to loop computation to ease memory redundancy ๐ Excited to share our latest research on Looped Transformers for Length Generalization TL;DR: We trained a Looped Transformer that dynamically adjusts the number of iterations based on input difficultyand it achieves near-perfect length generalization on various tasks ๐งต๐ https://t.co/IyntfRc0dP ๐ Excited to share"
X Link 2024-09-27T01:02Z [----] followers, 14.7K engagements
Limited data mode. Full metrics available with subscription: lunarcrush.com/pricing
/creator/x::kalomaze