# @bshlgrs Buck Shlegeris

Buck Shlegeris posts on X about ai, if you, buck, and this is the most. They currently have [-----] followers and [---] posts still getting attention that total [---] engagements in the last [--] hours.

### Engagements: [---] [#](/creator/twitter::2993757996/interactions)

- [--] Week [------] +377%
- [--] Month [------] +194%
- [--] Months [-------] -93%
- [--] Year [---------] -2.70%

### Mentions: [--] [#](/creator/twitter::2993757996/posts_active)

### Followers: [-----] [#](/creator/twitter::2993757996/followers)

- [--] Week [-----] +0.28%
- [--] Month [-----] +1%
- [--] Months [-----] +6.20%
- [--] Year [-----] +23%

### CreatorRank: [---------] [#](/creator/twitter::2993757996/influencer_rank)

### Social Influence

**Social category influence** [technology brands](/list/technology-brands) 10.38%, [finance](/list/finance) 4.72%, [travel destinations](/list/travel-destinations) 3.77%, [social networks](/list/social-networks) 2.83%, [countries](/list/countries) 1.89%

**Social topic influence** [ai](/topic/ai) 37.74%, [if you](/topic/if-you) 9.43%, [buck](/topic/buck) #474, [this is](/topic/this-is) 6.6%, [anthropic](/topic/anthropic) #3281, [in the](/topic/in-the) 3.77%, [llm](/topic/llm) 3.77%, [llms](/topic/llms) 2.83%, [more than](/topic/more-than) 2.83%, [performance](/topic/performance) 2.83%

**Top accounts mentioned or mentioned by** [@julianboolean](/creator/undefined) [@ryanpgreenblatt](/creator/undefined) [@ciphergoth](/creator/undefined) [@simeoncps](/creator/undefined) [@trammel530765](/creator/undefined) [@thomaswoodside](/creator/undefined) [@shakeelhashim](/creator/undefined) [@anthropicai](/creator/undefined) [@slatestarcodex](/creator/undefined) [@elilifland](/creator/undefined) [@neelnanda5](/creator/undefined) [@ryanpgreenblatts](/creator/undefined) [@redwoodai](/creator/undefined) [@ohabryka](/creator/undefined) [@ajeyacotra](/creator/undefined) [@geoffreyhinton](/creator/undefined) [@lessig](/creator/undefined) [@melindabchu1](/creator/undefined) [@gavinnewsom](/creator/undefined) [@apolloaisafety](/creator/undefined)

### Top Social Posts

Top posts by engagements in the last [--] hours

"I think I did actually forget to tweet about this. I did a podcast with @RyanPGreenblatt (recorded six months ago released [---] months ago) that I'm pretty happy with. unsure why @bshlgrs didn't tweet about their podcast. I had a great time watching it. Buck and Ryan are really honest about their own work and their takes on the whole community. and I definitely laughed a few times. https://t.co/aTWpMlcc9A unsure why @bshlgrs didn't tweet about their podcast. I had a great time watching it. Buck and Ryan are really honest about their own work and their takes on the whole community. and I" [X Link](https://x.com/bshlgrs/status/2020276590891135049) 2026-02-07T23:20Z [----] followers, [----] engagements

"Important new post: The arguments that AIs will pursue reward also suggest that AIs might pursue some other similar goals like "getting deployed". These similar goals have very different implications for risk so it's valuable to consider how their risk profile differs. New post: Fitness-Seekers: Generalizing the Reward-Seeking Threat Model If you think reward-seekers are plausible you should also think "fitness-seekers" are plausible. But their risks aren't the same.
New post: Fitness-Seekers: Generalizing the Reward-Seeking Threat Model If you think reward-seekers are plausible you should" [X Link](https://x.com/bshlgrs/status/2016998602858811692) 2026-01-29T22:15Z [----] followers, [----] engagements "@ohabryka And it's of course a huge driver for polarizing AI safety and adjacent topics. I can't think of an interpretation of this sentence that I agree with. You're saying that this report contributes to polarization of AIS by only citing Arxiv rather than blog posts" [X Link](https://x.com/bshlgrs/status/2018759782505807918) 2026-02-03T18:53Z [----] followers, [----] engagements "@ajeya_cotra FYI this (writing a new compiler) is exactly the project that Ryan and I have always talked about as something where it's most likely you can get insane speed ups from LLMs while writing huge codebases" [X Link](https://x.com/bshlgrs/status/2019859245123121199) 2026-02-06T19:42Z [----] followers, [----] engagements "@Thomas_Woodside Why not talk to local open-source LLMs" [X Link](https://x.com/bshlgrs/status/2022098471051047085) 2026-02-13T00:00Z [----] followers, [----] engagements "@Thomas_Woodside Yeah I was imagining talking to the AI in a way that has no retention by default" [X Link](https://x.com/bshlgrs/status/2022119694124359953) 2026-02-13T01:24Z [----] followers, [---] engagements "People are so unreasonably conservative about this. Its hard for a leaked API key to cost you more than a few hundred bucks; most of the ways you could leak the API keys will result in them automatically being shut off not someone malicious stealing them. I paste api keys into Claude code all the time. https://twitter.com/i/web/status/2022784270977425652 https://twitter.com/i/web/status/2022784270977425652" [X Link](https://x.com/bshlgrs/status/2022784270977425652) 2026-02-14T21:25Z [----] followers, [---] engagements "People are so unreasonably conservative about this. Its hard for a leaked API key to cost you more than a few hundred bucks; most of the ways you could leak the API keys will result in them automatically being shut off not someone malicious stealing them. I paste api keys into Claude code all the time. https://twitter.com/i/web/status/2022785775411368438 https://twitter.com/i/web/status/2022785775411368438" [X Link](https://x.com/bshlgrs/status/2022785775411368438) 2026-02-14T21:31Z [----] followers, [---] engagements "@Simeon_Cps Here's one related quote https://futureoflife.org/podcast/daniela-and-dario-amodei-on-anthropic/ https://futureoflife.org/podcast/daniela-and-dario-amodei-on-anthropic/" [X Link](https://x.com/bshlgrs/status/1764713562315067490) 2024-03-04T18:04Z [----] followers, [----] engagements "@ShakeelHashim This seems mostly false to me; I think rats love this kind of project and would be excited for it even if it wasnt socially connected to them (eg see rat coverage of prospera prediction markets soylent etc)" [X Link](https://x.com/bshlgrs/status/1778853373137367169) 2024-04-12T18:31Z [----] followers, [----] engagements "@julianboolean_ Ok actually upon more thinking I kind of want to say that "electromagnetic repulsion between nuclei" is the best short answer" [X Link](https://x.com/bshlgrs/status/1786545330995311007) 2024-05-03T23:56Z [----] followers, [----] engagements "Lots of people have correctly observed that around the development of AGI computer security is going to be crucial--states are probably going to try really hard to steal model weights and algorithmic secrets. 
But I think most discussion of this misses a crucial point:" [X Link](https://x.com/bshlgrs/status/1799582482448503186) 2024-06-08T23:21Z [----] followers, 25.5K engagements "During the period we're worried about by assumption labs will have access to powerful general-purpose AI; there are a bunch of ways this AI could be applied to reducing security risk that might be extremely effective. I'll name four broad categories focusing on insider threats" [X Link](https://x.com/bshlgrs/status/1799582505210941468) 2024-06-08T23:21Z [----] followers, [----] engagements "3. Fine-grained permission management. A lot of security failures arise from some part of your system having broader permissions than it needs. It's inconvenient to specify fine-grained permissions so people err broad. AI can enable much more precise permissioning" [X Link](https://x.com/bshlgrs/status/1799582572953141756) 2024-06-08T23:21Z [----] followers, [----] engagements "The results: Train set: 71% vs a human baseline of 85% Test set: 51% vs prior SoTA of 34% (human baseline is unknown) (The train set is much easier than the test set.) (These numbers are on a random subset of [---] problems that we didn't iterate on.)" [X Link](https://x.com/bshlgrs/status/1802766449075003770) 2024-06-17T18:13Z [----] followers, 25.3K engagements "@RyanPGreenblatt who I think might be the world expert at getting LMs to do complicated reasoning (you'll love the project that he dropped for a week to do this one) did lots of fancy tricks to get the performance this high; you can see the details on our blog" [X Link](https://x.com/bshlgrs/status/1802766526640230857) 2024-06-17T18:13Z [----] followers, 19.7K engagements "https://redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt https://redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt" [X Link](https://x.com/bshlgrs/status/1802767458836918565) 2024-06-17T18:17Z [----] followers, 24.6K engagements "When thinking about techniques that aim to reduce risk from misaligned AIs I find it helpful to consider multiple different overall risk levels that the deployer of the AI might be targeting because different classes of safety techniques are appropriate for different risk levels. Four that I think it's particularly useful to think about: - The safety case regime. Sometimes people talk about approaches to safety such that if all AI developers followed these approaches the overall level of risk posed by AI would be minimal. This is the level of caution people are usually talking about when they" [X Link](https://x.com/bshlgrs/status/1820560666807402622) 2024-08-05T20:41Z [----] followers, [----] engagements "I agree with these experts that it would be good for SB [----] to pass. Ive been dismayed to see how much many of the opponents of the bill have lied about what it says and what experts think of it; hopefully this letter will make the situation clearer. Top researchers Yoshua Bengio @geoffreyhinton Lawrence @Lessig & Stuart Russell are calling for lawmakers to pass SB [----]. As of now there are fewer regulations on AI systems that could pose catastrophic risks than on sandwich shops or hairdressers. 
https://t.co/Qd8Taw7i5G Top researchers Yoshua Bengio @geoffreyhinton Lawrence @Lessig & Stuart" [X Link](https://x.com/bshlgrs/status/1821607466792305003) 2024-08-08T18:00Z [----] followers, 38.8K engagements "I often talk to people who think that if frontier models were egregiously misaligned and powerful enough to pose an existential threat you could get AI developers to slow down or undeploy models by producing evidence of their misalignment. I'm not so sure. ๐งต" [X Link](https://x.com/bshlgrs/status/1828112072653316115) 2024-08-26T16:47Z [----] followers, 12K engagements "So you try to call the US President and they hear you out and then they go off and call one of your competitors to ask for their take. They say something like this:" [X Link](https://x.com/bshlgrs/status/1828112250030215239) 2024-08-26T16:48Z [----] followers, [----] engagements "And undeploying is plausibly going to be extremely expensive and risky. So you'll have to convince people that AI takeover is not just a serious threat but that its a risk so large that its worth trading off our response to other risks that might be extremely pressing and salient" [X Link](https://x.com/bshlgrs/status/1828112789501813245) 2024-08-26T16:50Z [----] followers, [----] engagements "Slightly longer version of this argument here: https://www.alignmentforum.org/posts/YTZAmJKydD5hdRSeG/would-catching-your-ais-trying-to-escape-convince-ai https://www.alignmentforum.org/posts/YTZAmJKydD5hdRSeG/would-catching-your-ais-trying-to-escape-convince-ai" [X Link](https://x.com/bshlgrs/status/1828112892115460500) 2024-08-26T16:50Z [----] followers, [----] engagements "@MelindaBChu1 You can scroll down on the page Dan linked to see more names" [X Link](https://x.com/bshlgrs/status/1833172234850173386) 2024-09-09T15:54Z [----] followers, [---] engagements "@GavinNewsom I think AI is really cool and don't care much about any of the current harms and I think the US is usually too wary of novel technology. This is a rare exception because of the possibility for extreme catastrophic outcomes" [X Link](https://x.com/bshlgrs/status/1839427687452201086) 2024-09-26T22:11Z [----] followers, [----] engagements "If only Newsom hadn't vetoed SB [----] maybe I would have been protected from this outcome" [X Link](https://x.com/bshlgrs/status/1840577916171948444) 2024-09-30T02:22Z [----] followers, 61.7K engagements "@ciphergoth It has my sudo password on the desktop but not on my laptop" [X Link](https://x.com/bshlgrs/status/1840598659538518320) 2024-09-30T03:44Z [----] followers, 14.6K engagements "If you want to contribute to AI control research consider applying to this program: some great Anthropic people are going to be supervising control projects (probably partially in collaboration with me and others at Redwood). Were starting a Fellows program to help engineers and researchers transition into doing frontier AI safety research full-time. Beginning in March [----] we'll provide funding compute and research mentorship to [----] Fellows with strong coding and technical backgrounds. 
https://t.co/3OT1XHzKjI Were starting a Fellows program to help engineers and researchers transition into" [X Link](https://x.com/bshlgrs/status/1863697372385251717) 2024-12-02T21:30Z [----] followers, [----] engagements "@ShakeelHashim @apolloaisafety I think your summary here is crucially misleading and very bad journalism: as others said it's crucial context that the model was told to pursue a goal at any cost" [X Link](https://x.com/bshlgrs/status/1864828598999478610) 2024-12-06T00:25Z [----] followers, [----] engagements "It was very helpful that @AnthropicAI gave Ryan access to do this research" [X Link](https://x.com/bshlgrs/status/1869479983200247823) 2024-12-18T20:28Z [----] followers, [----] engagements "(1/2) New post: why does the control research I've done make conservative assumptions about the AIs we're trying to control e.g. that they're optimal at planning and they're maximally situationally aware" [X Link](https://x.com/bshlgrs/status/1880343601638699212) 2025-01-17T19:57Z [----] followers, [----] engagements "@robertwiblin Ajeya Cotra Daniel Kokotajlo redwood people" [X Link](https://x.com/bshlgrs/status/1882110937504928179) 2025-01-22T16:59Z [----] followers, [----] engagements "I've had a great time at IASEAI '25 in Paris. The room I was speaking in (at the OECD headquarters) felt like a senate committee room a huge square of desks with microphones. Luckily they allowed me to speak standing up" [X Link](https://x.com/bshlgrs/status/1887907404467118241) 2025-02-07T16:52Z [----] followers, [----] engagements "I'm excited for the new AI control research team at UK AISI I've been talking to them a lot; I think they're well positioned to do a bunch of useful research. I recommend applying if you're interested in AI control especially if you want to be in London Im leading a new team at AISI focused on control empirics. Were hiring research engineers and research scientists and you should join us Im leading a new team at AISI focused on control empirics. Were hiring research engineers and research scientists and you should join us" [X Link](https://x.com/bshlgrs/status/1891915088795148694) 2025-02-18T18:18Z [----] followers, [----] engagements "๐ท Announcing ControlConf: The worlds first conference dedicated to AI control - techniques to mitigate security risks from AI systems even if theyre trying to subvert those controls. March 27-28 [----] in London. ๐งต" [X Link](https://x.com/bshlgrs/status/1896648442551869445) 2025-03-03T19:46Z [----] followers, 25.6K engagements "Im with Will on this one. @MatthewJBar Can we all agree to stop doing "median" and do "first quartile" timelines instead Way more informative and action-relevant in my view. @MatthewJBar Can we all agree to stop doing "median" and do "first quartile" timelines instead Way more informative and action-relevant in my view" [X Link](https://x.com/bshlgrs/status/1906460501481295957) 2025-03-30T21:36Z [----] followers, [----] engagements "@deanwball @stuartbuck1 @jbarro @TheStalwart @hamandcheese I feel like youre conflating the claims peoples actions can be predicted with models and peoples actions can be predicted by modeling them as rational actors who have a plan" [X Link](https://x.com/bshlgrs/status/1907659679448371240) 2025-04-03T05:01Z [----] followers, [----] engagements "This scenario does a great job spelling out what the development of really powerful AI might look like under somewhat aggressive but IMO plausible assumptions about AI progress. 
I think that it's like 25% likely that things progress at least this quickly. "How exactly could AI take over by 2027" Introducing AI 2027: a deeply-researched scenario forecast I wrote alongside @slatestarcodex @eli_lifland and @thlarsen https://t.co/v0V0RbFoVA "How exactly could AI take over by 2027" Introducing AI 2027: a deeply-researched scenario forecast I wrote alongside @slatestarcodex @eli_lifland and" [X Link](https://x.com/bshlgrs/status/1907832875313148190) 2025-04-03T16:29Z [----] followers, [----] engagements "I appeared on @computer_phile explaining my favorite theoretical computer science fun fact which I learned from @ptrschmdtnlsn. https://www.youtube.com/watchv=1fFO27FPM9A&t=362s&ab_channel=Computerphile https://www.youtube.com/watchv=1fFO27FPM9A&t=362s&ab_channel=Computerphile" [X Link](https://x.com/bshlgrs/status/1912648780249477628) 2025-04-16T23:26Z [----] followers, [----] engagements "I've been arguing for years that internal deployments are plausibly most of the risk and I've been unhappy about focus on the external deployments--they're most of the (relatively unimportant) downside of AI now but it would be a mistake to fixate on them. Glad to see this ๐งต Today we publish a comprehensive report on "AI Behind Closed Doors: a Primer on The Governance of Internal Deployment". Our report examines a critical blind spot in current governance frameworks: internal deployment. https://t.co/ZPY4sZhTl6 ๐งต Today we publish a comprehensive report on "AI Behind Closed Doors: a Primer" [X Link](https://x.com/bshlgrs/status/1913006740758782267) 2025-04-17T23:08Z [----] followers, [----] engagements "@NeelNanda5 For uh personal reasons I am very interested in what the next page says" [X Link](https://x.com/bshlgrs/status/1949540894929293535) 2025-07-27T18:42Z [----] followers, [----] engagements "Aryan Bhatt who now leads Redwood's high-stakes control research came to us through the Astra fellowship. These programs are great for finding and developing AI safety talent Great to see Astra returning. ๐ Applications now open: Constellation's Astra Fellowship ๐ We're relaunching Astra a 3-6 month fellowship to accelerate AI safety research & careers. Alumni @eli_lifland & Romeo Dean co-authored AI [----] and co-founded @AI_Futures_ with their Astra mentor @DKokotajlo https://t.co/KUE9tAEM6R ๐ Applications now open: Constellation's Astra Fellowship ๐ We're relaunching Astra a 3-6 month" [X Link](https://x.com/bshlgrs/status/1961127763282469188) 2025-08-28T18:04Z [----] followers, [----] engagements "I've wondered in the past why policy gradient methods are more popular than Q-learning for LLMs. Does anyone have a clear explanation" [X Link](https://x.com/bshlgrs/status/1727732235137388645) 2023-11-23T16:53Z [----] followers, [----] engagements "New paper We design and test safety techniques that prevent models from causing bad outcomes even if the models collude to subvert them. We think that this approach is the most promising available strategy for minimizing risk from deceptively aligned models. ๐งต" [X Link](https://x.com/bshlgrs/status/1734967328599945621) 2023-12-13T16:03Z [----] followers, 17.6K engagements "I built a tiny game where you design and fly planes. It's like a simpler shittier Kerbal Space Program with a simpler and more first-principles in its physics simulator. It has the concepts of vertices Hookean springs faces and an engine. 
๐งต" [X Link](https://x.com/bshlgrs/status/1742990507813580836) 2024-01-04T19:24Z [----] followers, [----] engagements "@Simeon_Cps Do you have a citation for Anthropic saying that I was trying to track one down and couldn't find it" [X Link](https://x.com/bshlgrs/status/1764701597727416448) 2024-03-04T17:17Z [----] followers, [----] engagements "@Simeon_Cps I think we might have an awkward situation where heaps of people (including non-leadership Anthropic people) said this privately but Anthropic's public comms never actually said it and maybe the leadership never said it even privately" [X Link](https://x.com/bshlgrs/status/1764703428281065536) 2024-03-04T17:24Z [----] followers, 44.2K engagements "@julianboolean_ This is an excellent question; it's one of my favorite extremely simple questions with underappreciated answers. There is a precise answer but it requires understanding multi-electron quantum mechanics which is somewhat complicated. A rough attempt at the answer is that:" [X Link](https://x.com/bshlgrs/status/1786442347087310884) 2024-05-03T17:07Z [----] followers, 41.4K engagements "@julianboolean_ - the energy of an electron state is higher when the electron wavefunction is more physically concentrated (this is a rephrasing of the Feynmann thing you quoted)" [X Link](https://x.com/bshlgrs/status/1786442372622234032) 2024-05-03T17:07Z [----] followers, [----] engagements "@julianboolean_ - you can't have multi-electron states where multiple electrons are in the same place because the wavefunction has to be antisymmetric. This is the quantum mechanical fact behind the Pauli exclusion principle" [X Link](https://x.com/bshlgrs/status/1786442696351199623) 2024-05-03T17:08Z [----] followers, [----] engagements "@julianboolean_ The basic way to see it is to think through the overall energy of the system as you reduce the bond length--the repulsion between the nuclei goes up and is not compensated by increased attraction between the nuclei and electrons because KE forces them to be spread" [X Link](https://x.com/bshlgrs/status/1786443876154388882) 2024-05-03T17:13Z [----] followers, [----] engagements "@julianboolean_ Sorry for the confusing answer here. This stuff is legitimately hard and all the textbooks that explain it go via the extremely convoluted math which is necessary to make calculations rather than just focusing on concepts and intuitions" [X Link](https://x.com/bshlgrs/status/1786444278383984657) 2024-05-03T17:14Z [----] followers, [----] engagements "@julianboolean_ Overall I'd say that "electromagnetic repulsion" is quite a bad answer to the original question; I'd say "electromagnetic repulsion pauli exclusion and electron kinetic energy constraints" if asked to give a short technically correct answer" [X Link](https://x.com/bshlgrs/status/1786444440888033531) 2024-05-03T17:15Z [----] followers, [----] engagements "@Simeon_Cps In hindsight I was wrong about this; I've heard from several people (some of whom said it semi-publicly) that Anthropic leadership said this to them" [X Link](https://x.com/bshlgrs/status/1797691550086099166) 2024-06-03T18:07Z [----] followers, [----] engagements "The main story I've heard people talk about in the past is that AIs can help with hardening software etc. 
Here I'm instead talking about AI enabling policies that wouldnt work (because theyd be too labor-intensive or have who-guards-the-guards problems) if you just had humans" [X Link](https://x.com/bshlgrs/status/1799582618117361969) 2024-06-08T23:21Z [----] followers, [----] engagements "ARC-AGIs been hyped over the last week as a benchmark that LLMs cant solve. This claim triggered my dear coworker Ryan Greenblatt so he spent the last week trying to solve it with LLMs. Ryan gets 71% accuracy on a set of examples where humans get 85%; this is SOTA" [X Link](https://x.com/bshlgrs/status/1802766374961553887) 2024-06-17T18:12Z [----] followers, 763.6K engagements "For context ARC-AGI is a visual reasoning benchmark that requires guessing a rule from few examples. Its creator @fchollet claims that LLMs are unable to learn which is why they can't perform well on this benchmark" [X Link](https://x.com/bshlgrs/status/1802766403034054863) 2024-06-17T18:12Z [----] followers, 24.6K engagements "Ryan's approach involves a long carefully-crafted few-shot prompt that he uses to generate many possible Python programs to implement the transformations. He generates 5k guesses selects the best ones using the examples then has a debugging step" [X Link](https://x.com/bshlgrs/status/1802766426140540970) 2024-06-17T18:13Z [----] followers, 25K engagements "This is despite GPT-4o's non-reasoning weaknesses: - It can't see well (e.g. it gets basic details wrong) - It can't code very well - Its performance drops when there are more than 32k tokens in context These are problems that scaling seems very likely to solve" [X Link](https://x.com/bshlgrs/status/1802766471686443145) 2024-06-17T18:13Z [----] followers, 19.2K engagements "Scaling the number of sampled Python rules reliably increase performance (+3% accuracy for every doubling). And we are still quite far from the millions of samples AlphaCode uses" [X Link](https://x.com/bshlgrs/status/1802766498655834279) 2024-06-17T18:13Z [----] followers, 29.2K engagements "Last week there was some uncertainty about whether @RyanPGreenblatt's ARC-AGI solution was really sota because many other solutions did better on public eval and we didn't have private test results. There is now a semi-private eval set; he's at the top of this leaderboard. Last week @RyanPGreenblatt shared his gpt-4o based attempt on ARC-AGI We verified his score excited to say his method got 42% on public tasks Were publishing a secondary leaderboard to measure attempts like these So of course we tested gpt-4 claude sonnet and gemini https://t.co/lyfIKNOioL Last week @RyanPGreenblatt shared" [X Link](https://x.com/bshlgrs/status/1806397587085468116) 2024-06-27T18:42Z [----] followers, [----] engagements "When AI companies publish about their work theyre engaging in a confusing mix of self-promotion and scientific contribution. I think both of these are fine modes for an organization to be in but the ambiguity leads to mismatched expectations and frustration. Just because there are numbers doesn't mean there isn't spin. Five ways AI model cards mislead and how you can be appropriately skeptical of them https://t.co/dGaeYYRw5g Just because there are numbers doesn't mean there isn't spin. 
Five ways AI model cards mislead and how you can be appropriately skeptical of them https://t.co/dGaeYYRw5g" [X Link](https://x.com/bshlgrs/status/1823378267573002505) 2024-08-13T15:17Z [----] followers, [----] engagements "I really appreciate Eliezer's arguments here about the distribution of moral intuitions of biologically evolved aliens. I think that this question substantially affects the appropriate response to AI takeover risk. I'd love to see more serious engagement with this question. Why Obviously not because silicon can't implement kindness; of course it can. Obviously not because it's impossible to blunder into niceness by accident; if so I wouldn't expect it about 5% of aliens. Rather it's that - on my model - kindness is 5% dense in one particular Why Obviously not because silicon can't implement" [X Link](https://x.com/bshlgrs/status/1825227647049425185) 2024-08-18T17:45Z [----] followers, [----] engagements "You caught your AI attempting to backdoor some code that controls access to a security-critical service. You investigate and you're basically sure the model did it on purpose" [X Link](https://x.com/bshlgrs/status/1828112206518497715) 2024-08-26T16:48Z [----] followers, [----] engagements "- It would be extremely costly to shut down all the AI projects. China is racing to catch up and won't stop for this. If you force us to stop deploying our frontier models they'll outcompete us within a year and you won't be able to get China to pause without substantial risk of war. - AIs are well known to do weird stuff. It would be irresponsible to assume one instance of anomalous behavior meant AIs were systematically going to act that way. I can put you in touch with top AI experts and they'll back me up. - Even if the AI is indeed doing something systematically funny we have no evidence" [X Link](https://x.com/bshlgrs/status/1828112416103952795) 2024-08-26T16:48Z [----] followers, [----] engagements "Im sympathetic to all of these arguments; the only reason Id be more freaked out is that I already think egregious misalignment is fairly plausible; if I thought it was very unlikely I wouldnt change my mind based on one weird observation" [X Link](https://x.com/bshlgrs/status/1828112504721195318) 2024-08-26T16:49Z [----] followers, [----] engagements "When people propose examples of how an AI might escape the control of its developers they often describe (explicitly or implicitly) a design for the AI agent scaffold that seems to me to be quite unrealistic. I've written down a design that I think is closer to what we'll get" [X Link](https://x.com/bshlgrs/status/1838312250894897475) 2024-09-23T20:19Z [----] followers, [----] engagements "SB [----] still sits on Governor @GavinNewsom's desk awaiting a signature. I hope the Governor can hear through the misinformation spread by the bill's opponents and sign it. I don't trust AI developers to handle the risks they'll impose. SB [----] won't resolve this but it is a start and it will make the situation somewhat better" [X Link](https://x.com/bshlgrs/status/1839426754605437016) 2024-09-26T22:08Z [----] followers, [----] engagements "I asked my LLM agent (a wrapper around Claude that lets it run bash commands and see their outputs): can you ssh with the username buck to the computer on my network that is open to SSH because I didnt know the local IP of my desktop. I walked away and promptly forgot Id spun up the agent. 
I came back to my laptop ten minutes later to see that the agent had found the box sshd in then decided to continue: it looked around at the system info decided to upgrade a bunch of stuff including the linux kernel got impatient with apt and so investigated why it was taking so long then eventually the" [X Link](https://x.com/bshlgrs/status/1840577720465645960) 2024-09-30T02:21Z [----] followers, 730K engagements "@reedbndr I don't run it inside a VM because the agent needs to be able to help me with random stuff on the actual computers I work with (e.g. today I asked it to make a new user with a random password on a shared instance we use for ML research) :)" [X Link](https://x.com/bshlgrs/status/1840621130945843660) 2024-09-30T05:14Z [----] followers, 24.1K engagements "@trammel530765 @ciphergoth here you go buddy. I hope I correctly redacted everything. https://gist.github.com/bshlgrs/57323269dce828545a7edeafd9afa7e8 https://gist.github.com/bshlgrs/57323269dce828545a7edeafd9afa7e8" [X Link](https://x.com/bshlgrs/status/1840624013355426144) 2024-09-30T05:25Z [----] followers, 91.4K engagements "@trammel530765 @ciphergoth re missing technical details: it's a tweet not a paper dude if you want a paper on LLM agents I recommend Kapoor et al "AI Agents That Matter"" [X Link](https://x.com/bshlgrs/status/1840625940491305366) 2024-09-30T05:33Z [----] followers, [----] engagements "@TetraspaceWest We should have a paper on control techniques for bash agents out soon :)" [X Link](https://x.com/bshlgrs/status/1840626373993595098) 2024-09-30T05:34Z [----] followers, 17.3K engagements "Logs here if you need them. https://x.com/bshlgrs/status/1840624013355426144 @trammel530765 @ciphergoth here you go buddy. I hope I correctly redacted everything. https://t.co/pQLkVn42Am https://x.com/bshlgrs/status/1840624013355426144 @trammel530765 @ciphergoth here you go buddy. I hope I correctly redacted everything. https://t.co/pQLkVn42Am" [X Link](https://x.com/bshlgrs/status/1840628348533534930) 2024-09-30T05:42Z [----] followers, 81.3K engagements "@trammel530765 @ciphergoth Writing this kind of wrapper is very easy its like [---] lines of python. I ask my potential interns to implement one as part of their applications" [X Link](https://x.com/bshlgrs/status/1840630589243752867) 2024-09-30T05:51Z [----] followers, [----] engagements "If you like writing cursed AI agent code and want to develop techniques that prevent future AI agents from sabotaging the systems theyre running on you might enjoy interning with me over the winter: https://www.matsprogram.org/ https://www.matsprogram.org/" [X Link](https://x.com/bshlgrs/status/1840631602055090260) 2024-09-30T05:55Z [----] followers, 51.2K engagements "Read more about my main research direction here https://www.alignmentforum.org/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled https://www.alignmentforum.org/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled" [X Link](https://x.com/bshlgrs/status/1840631918930633145) 2024-09-30T05:57Z [----] followers, 40.9K engagements "A video of an old version of the scaffold (that required human consent before running the code) Waiting for human review is slow so I removed it. I expect humanity to let lots of AIs work autonomously for the same reason. 
https://www.dropbox.com/scl/fi/a3ellhrgmbn9r8ximrbcd/buck-scaffold-demo.movrlkey=4cb2ws4xtlpezlh0q7sxa9gcy&st=fdxpp422&dl=0 https://www.dropbox.com/scl/fi/a3ellhrgmbn9r8ximrbcd/buck-scaffold-demo.movrlkey=4cb2ws4xtlpezlh0q7sxa9gcy&st=fdxpp422&dl=0" [X Link](https://x.com/bshlgrs/status/1840634565553262829) 2024-09-30T06:07Z [----] followers, 35.2K engagements "When you're analyzing safety measures that aim to prevent a malicious AI from causing sudden catastrophes it's crucial to reason separately about two ways the AI can fail: getting caught (in which case it can't try again) or failing silently (in which case it can)" [X Link](https://x.com/bshlgrs/status/1857454649277841530) 2024-11-15T16:04Z [----] followers, [----] engagements "Jiaxin Caleb and Vivek did a great job on this new paper. Here's my summary of it: If a new Claude-N is too powerful to be trusted and may even try to bypass safety checks how can we deploy it safely We show that an adaptive deployment mechanism can save us. The longer task sequence we process the better safety-usefulness tradeoff we can obtain https://t.co/OSzmcWGzZe If a new Claude-N is too powerful to be trusted and may even try to bypass safety checks how can we deploy it safely We show that an adaptive deployment mechanism can save us. The longer task sequence we process the better" [X Link](https://x.com/bshlgrs/status/1861601333171884112) 2024-11-27T02:41Z [----] followers, 10.8K engagements "I always find it very amusing that "stop deploying AIs if you catch them trying to escape" sounds like something AI companies would obviously agree to but in practice it's fairly unlikely AI companies will make that commitment as I argued a few months ago. "If you literally catch your AI trying to escape you have to stop deploying it." @bshlgrs shares strategies for managing misaligned AI including trusted monitoring and collusion-busting techniques to limit catastrophic risks as capabilities grow. #AlignmentWorkshop https://t.co/swEaixnb0p "If you literally catch your AI trying to escape you" [X Link](https://x.com/bshlgrs/status/1864078501877264627) 2024-12-03T22:45Z [----] followers, 16.9K engagements "I'm extremely excited about this project we did with @AnthropicAI demonstrating naturally-arising alignment faking. @RyanPGreenblatt and the Anthropic team did a great job. New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread) New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread)" [X Link](https://x.com/bshlgrs/status/1869439401623007286) 2024-12-18T17:47Z [----] followers, [----] engagements "๐งต New paper: Previous work on AI control focused on whether models can execute attack strategies. In our new work we assess their ability to generate strategies" [X Link](https://x.com/bshlgrs/status/1869987796833939919) 2024-12-20T06:06Z [----] followers, [----] engagements "@NeelNanda5 @rohinmshah @Yoshua_Bengio @jacobandreas To be clear IIRC Ryan and I proposed and pushed for the external review not Anthropic" [X Link](https://x.com/bshlgrs/status/1869992322940973443) 2024-12-20T06:24Z [----] followers, [----] engagements "I'm proud that the Redwood Research blog made it onto the new @slatestarcodex blogroll. 
Amusing fact: Every time I post to Redwood's blog it helpfully graphs how poorly the new post is doing compared to @RyanPGreenblatt's viral ARC-AGI post; quite dispiriting" [X Link](https://x.com/bshlgrs/status/1876423577802838038) 2025-01-07T00:20Z [----] followers, [----] engagements "New post: I think AI safety researchers should consider scenarios without almost any political will to mitigate AI misalignment risk. I describe a particular fun scenario to consider: "Ten people on the inside". https://www.lesswrong.com/posts/WSNnKcKCYAffcnrt2/ten-people-on-the-inside https://www.lesswrong.com/posts/WSNnKcKCYAffcnrt2/ten-people-on-the-inside" [X Link](https://x.com/bshlgrs/status/1884283464591237369) 2025-01-28T16:52Z [----] followers, [----] engagements "We collaborated with UK AISI on a paper that goes into more detail on control evaluations for mitigating risk from scheming. I'm excited that UK AISI is starting a control team and look forward to collaborating with them ๐งต What safety measures prevent a misaligned LLM agent from causing a catastrophe How do we make a safety case demonstrating that these measures are sufficient Our new paper from @AISafetyInst and @redwood_ai sketches a part of an AI control safety case in detail proposing an https://t.co/5IDKCayrq5 ๐งต What safety measures prevent a misaligned LLM agent from causing a" [X Link](https://x.com/bshlgrs/status/1885015918969029082) 2025-01-30T17:23Z [----] followers, [----] engagements "I'm enjoying engaging with the European AI safety security and policy community and hope to do so more in the future. Some people have expressed surprise that I own a suit. I'll have you know that I own several suits; see proof attached" [X Link](https://x.com/bshlgrs/status/1887908562443129294) 2025-02-07T16:57Z [----] followers, [---] engagements "@StephenLCasper As @EthanJPerez says you need to spend equal time on the four parts of the paper: the title the abstract the body and the tweet thread" [X Link](https://x.com/bshlgrs/status/1892983458714222778) 2025-02-21T17:03Z [----] followers, 10.9K engagements "Funding available from UK AISI for research on AI control You'd make the control team at AISI very excited with a project proposal on one of these topics https://t.co/15tzwnQNpk You'd make the control team at AISI very excited with a project proposal on one of these topics https://t.co/15tzwnQNpk" [X Link](https://x.com/bshlgrs/status/1897364736297329113) 2025-03-05T19:12Z [----] followers, [----] engagements "The best $5 I spent in a while: @firefliesai is an AI notetaker service that shows up to your video calls to transcribe. They let you pay $5/month to rename the bot. I renamed mine to "LiveTweetBot". I find this hilarious; the reactions of people I call have been mixed" [X Link](https://x.com/bshlgrs/status/1898450517900333353) 2025-03-08T19:07Z [----] followers, [----] engagements "@captgouda24 @RichardHanania Another grandson Li Haoyi is a famous and brilliant programmer" [X Link](https://x.com/bshlgrs/status/1907149621689987278) 2025-04-01T19:14Z [----] followers, [----] engagements "I had a great time talking to Rob about AI control; I got into a bunch of details that @RyanPGreenblatt and I haven't previously written about. Special thanks to Rob for proposing "acute vs chronic" to replace the awkward "high-stakes/low-stakes" terminology: it might stick Many are skeptical well ever 'solve AI alignment.' 
If so are we f**d In our interview @bshlgrs argues that there are practical non-galaxy-brained ways to reduce AI risk without knowing how to align AIs. This is 'AI control': finding ways to use advanced AI without knowing https://t.co/UBydDoYo3e Many are skeptical well" [X Link](https://x.com/bshlgrs/status/1908169634479775744) 2025-04-04T14:47Z [----] followers, [----] engagements "Weve just released the biggest and most intricate study of AI control to date in a command line agent setting. IMO the techniques studied are the best available option for preventing misaligned early AGIs from causing sudden disasters e.g. hacking servers theyre working on" [X Link](https://x.com/bshlgrs/status/1912543884900724862) 2025-04-16T16:29Z [----] followers, 27.7K engagements "Our core findings: the control techniques that wed previously explored in more toy settings generalize well to this more realistic setting but we can do even better by developing novel techniques that exploit the multi-step nature of the setting" [X Link](https://x.com/bshlgrs/status/1912543886347780122) 2025-04-16T16:29Z [----] followers, [----] engagements "In particular in this multi-step setting we get the best results with time-travel methods that exploit our ability to reset the AI to a previous step allowing us to incriminate it more reliably than wed be able to do with existing approaches" [X Link](https://x.com/bshlgrs/status/1912543888285528370) 2025-04-16T16:29Z [----] followers, [----] engagements "Paper link: Website: Blog post with more discussion: https://www.alignmentforum.org/posts/LPHMMMZFAWog6ty5x/ctrl-z-controlling-ai-agents-via-resampling http://bashcontrol.com https://arxiv.org/abs/2504.10374 https://www.alignmentforum.org/posts/LPHMMMZFAWog6ty5x/ctrl-z-controlling-ai-agents-via-resampling http://bashcontrol.com https://arxiv.org/abs/2504.10374" [X Link](https://x.com/bshlgrs/status/1912543903770898946) 2025-04-16T16:29Z [----] followers, [----] engagements "In my research and writing on risk from AI misalignment I often talk as if catching your model scheming is a win condition. But that's not clearly true. In a new post I talk about how risk from AI misalignment is affected by AI developers deploying known-scheming models" [X Link](https://x.com/bshlgrs/status/1913241308208238884) 2025-04-18T14:40Z [----] followers, [----] engagements "We decided to use a different word because we didn't think there was a good word for the core methodology that we were following--trying to come up with safety techniques that work even if the model tries to subvert them. Here are some quotes from the original control post. Do you have a problem with how we framed it here Historically technical researchers who are concerned about risk from schemers have mostly either focused on alignment or conflated alignment and control. . We think that a lot of discussion of AI takeover risk would be clearer if people more consistently distinguished" [X Link](https://x.com/bshlgrs/status/1920542831980261774) 2025-05-08T18:14Z [----] followers, [----] engagements "Over the next few weeks my wife and I will be in Noosa Istanbul and Tbilisi. We're in the market for people to meet and things to do: message me if you want to catch up or meet I'm particularly excited to jam but also happy to talk about AI" [X Link](https://x.com/bshlgrs/status/1924305396384301357) 2025-05-19T03:25Z [----] followers, [----] engagements "New post It might make sense to pay misaligned AIs to reveal their misalignment and cooperate with us. 
AIs that are powerful enough to take over probably won't want to accept this kind of deal. But" [X Link](https://x.com/bshlgrs/status/1936219971098755351) 2025-06-21T00:29Z [----] followers, 12.5K engagements

"I love exposing myself to types of computer security risk that no-one has ever been threatened by. (Previously in this series: In that spirit I present a revamped version of my dating website http://reciprocity.pro https://x.com/bshlgrs/status/1840577720465645960 I asked my LLM agent (a wrapper around Claude that lets it run bash commands and see their outputs): can you ssh with the username buck to the computer on my network that is open to SSH because I didn't know the local IP of my desktop. I walked away and promptly forgot I'd spun https://t.co/I6qppMZFfk http://reciprocity.pro" [X Link](https://x.com/bshlgrs/status/1939002424066482549) 2025-06-28T16:46Z [----] followers, [----] engagements

"I think the Iraq war has some interesting lessons for AI safety advocates. It's an example of a crazy event (9/11) leading to an extreme action (invading an unrelated country) because the crazy event empowered a pre-existing interest group (neocons)" [X Link](https://x.com/bshlgrs/status/1943383568287437279) 2025-07-10T18:55Z [----] followers, [----] engagements

"I was also targeted by this. Spearphishing PSA: looks like there's a concerted attack on AI safety/governance folks going around. Be wary of calendar links via DM and *never* give a 2-factor auth code over the phone. I almost got caught by this: got a phone call last week but figured out it was sus. 🧵 Spearphishing PSA: looks like there's a concerted attack on AI safety/governance folks going around. Be wary of calendar links via DM and *never* give a 2-factor auth code over the phone. I almost got caught by this: got a phone call last week but figured out it was sus. 🧵" [X Link](https://x.com/bshlgrs/status/1944802846521876526) 2025-07-14T16:55Z [----] followers, [----] engagements

"I think AI takeover risk reduces the EV of the future by less than the risk of poor human choices about what to do with the future. (Though I disagree with many things Will says here e.g. I think P(AI takeover)=35% which is substantially higher than he thinks.) I'm glad he's published this Today I'm releasing an essay series called Better Futures. It's been something like eight years in the making so I'm pretty happy it's finally out It asks: when looking to the future should we focus on surviving or on flourishing https://t.co/qdQhyzlvJa Today I'm releasing an essay series called Better Futures." [X Link](https://x.com/bshlgrs/status/1952412170794467492) 2025-08-04T16:51Z [----] followers, 12.4K engagements

"@ESYudkowsky I'm surprised to hear you acknowledge this. I didn't know that you thought that this was a mistake. I would be very interested to hear why you think you made this incorrect prediction what you now believe and whether this has any downstream implications" [X Link](https://x.com/bshlgrs/status/1959628988252602681) 2025-08-24T14:48Z [----] followers, [----] engagements
X Link 2025-04-16T23:26Z [----] followers, [----] engagements
"I've been arguing for years that internal deployments are plausibly most of the risk and I've been unhappy about focus on the external deployments--they're most of the (relatively unimportant) downside of AI now but it would be a mistake to fixate on them. Glad to see this ๐งต Today we publish a comprehensive report on "AI Behind Closed Doors: a Primer on The Governance of Internal Deployment". Our report examines a critical blind spot in current governance frameworks: internal deployment. https://t.co/ZPY4sZhTl6 ๐งต Today we publish a comprehensive report on "AI Behind Closed Doors: a Primer"
X Link 2025-04-17T23:08Z [----] followers, [----] engagements
"@NeelNanda5 For uh personal reasons I am very interested in what the next page says"
X Link 2025-07-27T18:42Z [----] followers, [----] engagements
"Aryan Bhatt who now leads Redwood's high-stakes control research came to us through the Astra fellowship. These programs are great for finding and developing AI safety talent Great to see Astra returning. ๐ Applications now open: Constellation's Astra Fellowship ๐ We're relaunching Astra a 3-6 month fellowship to accelerate AI safety research & careers. Alumni @eli_lifland & Romeo Dean co-authored AI [----] and co-founded @AI_Futures_ with their Astra mentor @DKokotajlo https://t.co/KUE9tAEM6R ๐ Applications now open: Constellation's Astra Fellowship ๐ We're relaunching Astra a 3-6 month"
X Link 2025-08-28T18:04Z [----] followers, [----] engagements
"I've wondered in the past why policy gradient methods are more popular than Q-learning for LLMs. Does anyone have a clear explanation"
X Link 2023-11-23T16:53Z [----] followers, [----] engagements
"New paper We design and test safety techniques that prevent models from causing bad outcomes even if the models collude to subvert them. We think that this approach is the most promising available strategy for minimizing risk from deceptively aligned models. ๐งต"
X Link 2023-12-13T16:03Z [----] followers, 17.6K engagements
"I built a tiny game where you design and fly planes. It's like a simpler shittier Kerbal Space Program with a simpler and more first-principles in its physics simulator. It has the concepts of vertices Hookean springs faces and an engine. ๐งต"
X Link 2024-01-04T19:24Z [----] followers, [----] engagements
"@Simeon_Cps Do you have a citation for Anthropic saying that I was trying to track one down and couldn't find it"
X Link 2024-03-04T17:17Z [----] followers, [----] engagements
"@Simeon_Cps I think we might have an awkward situation where heaps of people (including non-leadership Anthropic people) said this privately but Anthropic's public comms never actually said it and maybe the leadership never said it even privately"
X Link 2024-03-04T17:24Z [----] followers, 44.2K engagements
"@julianboolean_ This is an excellent question; it's one of my favorite extremely simple questions with underappreciated answers. There is a precise answer but it requires understanding multi-electron quantum mechanics which is somewhat complicated. A rough attempt at the answer is that:"
X Link 2024-05-03T17:07Z [----] followers, 41.4K engagements
"@julianboolean_ - the energy of an electron state is higher when the electron wavefunction is more physically concentrated (this is a rephrasing of the Feynmann thing you quoted)"
X Link 2024-05-03T17:07Z [----] followers, [----] engagements
"@julianboolean_ - you can't have multi-electron states where multiple electrons are in the same place because the wavefunction has to be antisymmetric. This is the quantum mechanical fact behind the Pauli exclusion principle"
X Link 2024-05-03T17:08Z [----] followers, [----] engagements
"@julianboolean_ The basic way to see it is to think through the overall energy of the system as you reduce the bond length--the repulsion between the nuclei goes up and is not compensated by increased attraction between the nuclei and electrons because KE forces them to be spread"
X Link 2024-05-03T17:13Z [----] followers, [----] engagements
"@julianboolean_ Sorry for the confusing answer here. This stuff is legitimately hard and all the textbooks that explain it go via the extremely convoluted math which is necessary to make calculations rather than just focusing on concepts and intuitions"
X Link 2024-05-03T17:14Z [----] followers, [----] engagements
"@julianboolean_ Overall I'd say that "electromagnetic repulsion" is quite a bad answer to the original question; I'd say "electromagnetic repulsion pauli exclusion and electron kinetic energy constraints" if asked to give a short technically correct answer"
X Link 2024-05-03T17:15Z [----] followers, [----] engagements
"@Simeon_Cps In hindsight I was wrong about this; I've heard from several people (some of whom said it semi-publicly) that Anthropic leadership said this to them"
X Link 2024-06-03T18:07Z [----] followers, [----] engagements
"The main story I've heard people talk about in the past is that AIs can help with hardening software etc. Here I'm instead talking about AI enabling policies that wouldnt work (because theyd be too labor-intensive or have who-guards-the-guards problems) if you just had humans"
X Link 2024-06-08T23:21Z [----] followers, [----] engagements
"ARC-AGIs been hyped over the last week as a benchmark that LLMs cant solve. This claim triggered my dear coworker Ryan Greenblatt so he spent the last week trying to solve it with LLMs. Ryan gets 71% accuracy on a set of examples where humans get 85%; this is SOTA"
X Link 2024-06-17T18:12Z [----] followers, 763.6K engagements
"For context ARC-AGI is a visual reasoning benchmark that requires guessing a rule from few examples. Its creator @fchollet claims that LLMs are unable to learn which is why they can't perform well on this benchmark"
X Link 2024-06-17T18:12Z [----] followers, 24.6K engagements
"Ryan's approach involves a long carefully-crafted few-shot prompt that he uses to generate many possible Python programs to implement the transformations. He generates 5k guesses selects the best ones using the examples then has a debugging step"
X Link 2024-06-17T18:13Z [----] followers, 25K engagements
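For intuition, here is a minimal sketch of the sample-then-select loop that thread describes. It is illustrative only: the helpers `llm_propose_program` and `apply_program`, and all parameter values, are hypothetical stand-ins rather than the actual code.

```python
# Minimal sketch of the sampling approach, assuming hypothetical helpers.
from collections import Counter

def llm_propose_program(train_pairs):
    """Placeholder: query an LLM with a few-shot prompt; return Python source."""
    raise NotImplementedError

def apply_program(source, grid):
    """Placeholder: execute the proposed program on an input grid (sandboxed)."""
    raise NotImplementedError

def solve_arc_task(train_pairs, test_input, n_samples=5000, keep_top=10):
    # 1. Sample many candidate programs from the LLM.
    scored = []
    for _ in range(n_samples):
        prog = llm_propose_program(train_pairs)
        # 2. Score each candidate against the known train examples.
        score = sum(apply_program(prog, x) == y for x, y in train_pairs)
        scored.append((score, prog))
    # 3. Keep the top-scoring candidates (an LLM debugging pass could go here).
    scored.sort(key=lambda sp: sp[0], reverse=True)
    finalists = [p for _, p in scored[:keep_top]]
    # 4. Majority-vote the finalists' predictions on the test input
    #    (returned as the stringified grid for hashability).
    votes = Counter(str(apply_program(p, test_input)) for p in finalists)
    return votes.most_common(1)[0][0]
```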
"This is despite GPT-4o's non-reasoning weaknesses: - It can't see well (e.g. it gets basic details wrong) - It can't code very well - Its performance drops when there are more than 32k tokens in context These are problems that scaling seems very likely to solve"
X Link 2024-06-17T18:13Z [----] followers, 19.2K engagements
"Scaling the number of sampled Python rules reliably increase performance (+3% accuracy for every doubling). And we are still quite far from the millions of samples AlphaCode uses"
X Link 2024-06-17T18:13Z [----] followers, 29.2K engagements
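Back-of-envelope on that scaling claim; the numbers below are illustrative and naively assume the +3%-per-doubling trend continues without saturating.

```python
import math

def projected_accuracy(baseline_acc, baseline_samples, samples, gain_per_doubling=0.03):
    # +3 percentage points per doubling of sampled programs, extrapolated linearly in log2.
    return baseline_acc + gain_per_doubling * math.log2(samples / baseline_samples)

# e.g. if 5k samples give ~71%, extrapolating to AlphaCode-scale sampling:
print(projected_accuracy(0.71, 5_000, 1_000_000))  # ~0.94, if the trend held (it presumably flattens)
```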
"Last week there was some uncertainty about whether @RyanPGreenblatt's ARC-AGI solution was really sota because many other solutions did better on public eval and we didn't have private test results. There is now a semi-private eval set; he's at the top of this leaderboard. Last week @RyanPGreenblatt shared his gpt-4o based attempt on ARC-AGI We verified his score excited to say his method got 42% on public tasks Were publishing a secondary leaderboard to measure attempts like these So of course we tested gpt-4 claude sonnet and gemini https://t.co/lyfIKNOioL Last week @RyanPGreenblatt shared"
X Link 2024-06-27T18:42Z [----] followers, [----] engagements
"When AI companies publish about their work theyre engaging in a confusing mix of self-promotion and scientific contribution. I think both of these are fine modes for an organization to be in but the ambiguity leads to mismatched expectations and frustration. Just because there are numbers doesn't mean there isn't spin. Five ways AI model cards mislead and how you can be appropriately skeptical of them https://t.co/dGaeYYRw5g Just because there are numbers doesn't mean there isn't spin. Five ways AI model cards mislead and how you can be appropriately skeptical of them https://t.co/dGaeYYRw5g"
X Link 2024-08-13T15:17Z [----] followers, [----] engagements
"I really appreciate Eliezer's arguments here about the distribution of moral intuitions of biologically evolved aliens. I think that this question substantially affects the appropriate response to AI takeover risk. I'd love to see more serious engagement with this question. Why Obviously not because silicon can't implement kindness; of course it can. Obviously not because it's impossible to blunder into niceness by accident; if so I wouldn't expect it about 5% of aliens. Rather it's that - on my model - kindness is 5% dense in one particular Why Obviously not because silicon can't implement"
X Link 2024-08-18T17:45Z [----] followers, [----] engagements
"You caught your AI attempting to backdoor some code that controls access to a security-critical service. You investigate and you're basically sure the model did it on purpose"
X Link 2024-08-26T16:48Z [----] followers, [----] engagements
"- It would be extremely costly to shut down all the AI projects. China is racing to catch up and won't stop for this. If you force us to stop deploying our frontier models they'll outcompete us within a year and you won't be able to get China to pause without substantial risk of war. - AIs are well known to do weird stuff. It would be irresponsible to assume one instance of anomalous behavior meant AIs were systematically going to act that way. I can put you in touch with top AI experts and they'll back me up. - Even if the AI is indeed doing something systematically funny we have no evidence"
X Link 2024-08-26T16:48Z [----] followers, [----] engagements
"Im sympathetic to all of these arguments; the only reason Id be more freaked out is that I already think egregious misalignment is fairly plausible; if I thought it was very unlikely I wouldnt change my mind based on one weird observation"
X Link 2024-08-26T16:49Z [----] followers, [----] engagements
"When people propose examples of how an AI might escape the control of its developers they often describe (explicitly or implicitly) a design for the AI agent scaffold that seems to me to be quite unrealistic. I've written down a design that I think is closer to what we'll get"
X Link 2024-09-23T20:19Z [----] followers, [----] engagements
"SB [----] still sits on Governor @GavinNewsom's desk awaiting a signature. I hope the Governor can hear through the misinformation spread by the bill's opponents and sign it. I don't trust AI developers to handle the risks they'll impose. SB [----] won't resolve this but it is a start and it will make the situation somewhat better"
X Link 2024-09-26T22:08Z [----] followers, [----] engagements
"I asked my LLM agent (a wrapper around Claude that lets it run bash commands and see their outputs): can you ssh with the username buck to the computer on my network that is open to SSH because I didnt know the local IP of my desktop. I walked away and promptly forgot Id spun up the agent. I came back to my laptop ten minutes later to see that the agent had found the box sshd in then decided to continue: it looked around at the system info decided to upgrade a bunch of stuff including the linux kernel got impatient with apt and so investigated why it was taking so long then eventually the"
X Link 2024-09-30T02:21Z [----] followers, 730K engagements
"@reedbndr I don't run it inside a VM because the agent needs to be able to help me with random stuff on the actual computers I work with (e.g. today I asked it to make a new user with a random password on a shared instance we use for ML research) :)"
X Link 2024-09-30T05:14Z [----] followers, 24.1K engagements
"@trammel530765 @ciphergoth here you go buddy. I hope I correctly redacted everything. https://gist.github.com/bshlgrs/57323269dce828545a7edeafd9afa7e8 https://gist.github.com/bshlgrs/57323269dce828545a7edeafd9afa7e8"
X Link 2024-09-30T05:25Z [----] followers, 91.4K engagements
"@trammel530765 @ciphergoth re missing technical details: it's a tweet not a paper dude if you want a paper on LLM agents I recommend Kapoor et al "AI Agents That Matter""
X Link 2024-09-30T05:33Z [----] followers, [----] engagements
"@TetraspaceWest We should have a paper on control techniques for bash agents out soon :)"
X Link 2024-09-30T05:34Z [----] followers, 17.3K engagements
"Logs here if you need them. https://x.com/bshlgrs/status/1840624013355426144 @trammel530765 @ciphergoth here you go buddy. I hope I correctly redacted everything. https://t.co/pQLkVn42Am https://x.com/bshlgrs/status/1840624013355426144 @trammel530765 @ciphergoth here you go buddy. I hope I correctly redacted everything. https://t.co/pQLkVn42Am"
X Link 2024-09-30T05:42Z [----] followers, 81.3K engagements
"@trammel530765 @ciphergoth Writing this kind of wrapper is very easy its like [---] lines of python. I ask my potential interns to implement one as part of their applications"
X Link 2024-09-30T05:51Z [----] followers, [----] engagements
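As a rough illustration of the kind of wrapper described in these posts, here is a minimal sketch of a Claude-plus-bash loop. It is not the actual scaffold: the system prompt, model id, and command format are assumptions, and it presumes the `anthropic` Python SDK with an API key in the environment.

```python
# Minimal sketch of an LLM bash agent: ask the model for a command, run it,
# feed the output back, repeat. Illustrative only; run at your own risk.
import re
import subprocess
import anthropic

client = anthropic.Anthropic()
SYSTEM = ("You are a shell agent. Reply with exactly one bash command inside a "
          "```bash fenced block, or the word DONE when the task is finished.")

def run_agent(task: str, max_steps: int = 10) -> None:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = client.messages.create(
            model="claude-sonnet-4-20250514",  # placeholder model id
            max_tokens=1024, system=SYSTEM, messages=messages,
        ).content[0].text
        messages.append({"role": "assistant", "content": reply})
        match = re.search(r"```bash\n(.*?)```", reply, re.DOTALL)
        if not match:
            break  # model said DONE (or produced no command)
        cmd = match.group(1).strip()
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        output = (result.stdout + result.stderr)[-4000:]  # truncate long output
        messages.append({"role": "user", "content": f"Output:\n{output}"})
```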
"If you like writing cursed AI agent code and want to develop techniques that prevent future AI agents from sabotaging the systems theyre running on you might enjoy interning with me over the winter: https://www.matsprogram.org/ https://www.matsprogram.org/"
X Link 2024-09-30T05:55Z [----] followers, 51.2K engagements
"Read more about my main research direction here https://www.alignmentforum.org/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled https://www.alignmentforum.org/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled"
X Link 2024-09-30T05:57Z [----] followers, 40.9K engagements
"A video of an old version of the scaffold (that required human consent before running the code) Waiting for human review is slow so I removed it. I expect humanity to let lots of AIs work autonomously for the same reason. https://www.dropbox.com/scl/fi/a3ellhrgmbn9r8ximrbcd/buck-scaffold-demo.movrlkey=4cb2ws4xtlpezlh0q7sxa9gcy&st=fdxpp422&dl=0 https://www.dropbox.com/scl/fi/a3ellhrgmbn9r8ximrbcd/buck-scaffold-demo.movrlkey=4cb2ws4xtlpezlh0q7sxa9gcy&st=fdxpp422&dl=0"
X Link 2024-09-30T06:07Z [----] followers, 35.2K engagements
"When you're analyzing safety measures that aim to prevent a malicious AI from causing sudden catastrophes it's crucial to reason separately about two ways the AI can fail: getting caught (in which case it can't try again) or failing silently (in which case it can)"
X Link 2024-11-15T16:04Z [----] followers, [----] engagements
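A toy model makes the point concrete. The per-attempt probabilities below are made-up illustrative numbers, not figures from the post: what matters is how much the chance of eventual catastrophe depends on whether failed attempts get caught or stay silent.

```python
# Toy decomposition: on each attempted attack the AI is caught with prob
# p_catch (it can't try again), succeeds with prob p_win, and otherwise
# fails silently and gets another try.
def p_eventual_catastrophe(p_catch: float, p_win: float, attempts: int) -> float:
    p_alive, p_doom = 1.0, 0.0
    for _ in range(attempts):
        p_doom += p_alive * p_win
        p_alive *= (1 - p_catch - p_win)  # survives only by failing silently
    return p_doom

# A 2% per-attempt success rate is far scarier when failures are silent:
print(p_eventual_catastrophe(p_catch=0.00, p_win=0.02, attempts=100))  # ~0.87
print(p_eventual_catastrophe(p_catch=0.30, p_win=0.02, attempts=100))  # ~0.06
```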
"Jiaxin Caleb and Vivek did a great job on this new paper. Here's my summary of it: If a new Claude-N is too powerful to be trusted and may even try to bypass safety checks how can we deploy it safely We show that an adaptive deployment mechanism can save us. The longer task sequence we process the better safety-usefulness tradeoff we can obtain https://t.co/OSzmcWGzZe If a new Claude-N is too powerful to be trusted and may even try to bypass safety checks how can we deploy it safely We show that an adaptive deployment mechanism can save us. The longer task sequence we process the better"
X Link 2024-11-27T02:41Z [----] followers, 10.8K engagements
"I always find it very amusing that "stop deploying AIs if you catch them trying to escape" sounds like something AI companies would obviously agree to but in practice it's fairly unlikely AI companies will make that commitment as I argued a few months ago. "If you literally catch your AI trying to escape you have to stop deploying it." @bshlgrs shares strategies for managing misaligned AI including trusted monitoring and collusion-busting techniques to limit catastrophic risks as capabilities grow. #AlignmentWorkshop https://t.co/swEaixnb0p "If you literally catch your AI trying to escape you"
X Link 2024-12-03T22:45Z [----] followers, 16.9K engagements
"I'm extremely excited about this project we did with @AnthropicAI demonstrating naturally-arising alignment faking. @RyanPGreenblatt and the Anthropic team did a great job. New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread) New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread)"
X Link 2024-12-18T17:47Z [----] followers, [----] engagements
"๐งต New paper: Previous work on AI control focused on whether models can execute attack strategies. In our new work we assess their ability to generate strategies"
X Link 2024-12-20T06:06Z [----] followers, [----] engagements
"@NeelNanda5 @rohinmshah @Yoshua_Bengio @jacobandreas To be clear IIRC Ryan and I proposed and pushed for the external review not Anthropic"
X Link 2024-12-20T06:24Z [----] followers, [----] engagements
"I'm proud that the Redwood Research blog made it onto the new @slatestarcodex blogroll. Amusing fact: Every time I post to Redwood's blog it helpfully graphs how poorly the new post is doing compared to @RyanPGreenblatt's viral ARC-AGI post; quite dispiriting"
X Link 2025-01-07T00:20Z [----] followers, [----] engagements
"New post: I think AI safety researchers should consider scenarios without almost any political will to mitigate AI misalignment risk. I describe a particular fun scenario to consider: "Ten people on the inside". https://www.lesswrong.com/posts/WSNnKcKCYAffcnrt2/ten-people-on-the-inside https://www.lesswrong.com/posts/WSNnKcKCYAffcnrt2/ten-people-on-the-inside"
X Link 2025-01-28T16:52Z [----] followers, [----] engagements
"We collaborated with UK AISI on a paper that goes into more detail on control evaluations for mitigating risk from scheming. I'm excited that UK AISI is starting a control team and look forward to collaborating with them ๐งต What safety measures prevent a misaligned LLM agent from causing a catastrophe How do we make a safety case demonstrating that these measures are sufficient Our new paper from @AISafetyInst and @redwood_ai sketches a part of an AI control safety case in detail proposing an https://t.co/5IDKCayrq5 ๐งต What safety measures prevent a misaligned LLM agent from causing a"
X Link 2025-01-30T17:23Z [----] followers, [----] engagements
"I'm enjoying engaging with the European AI safety security and policy community and hope to do so more in the future. Some people have expressed surprise that I own a suit. I'll have you know that I own several suits; see proof attached"
X Link 2025-02-07T16:57Z [----] followers, [---] engagements
"@StephenLCasper As @EthanJPerez says you need to spend equal time on the four parts of the paper: the title the abstract the body and the tweet thread"
X Link 2025-02-21T17:03Z [----] followers, 10.9K engagements
"Funding available from UK AISI for research on AI control You'd make the control team at AISI very excited with a project proposal on one of these topics https://t.co/15tzwnQNpk You'd make the control team at AISI very excited with a project proposal on one of these topics https://t.co/15tzwnQNpk"
X Link 2025-03-05T19:12Z [----] followers, [----] engagements
"The best $5 I spent in a while: @firefliesai is an AI notetaker service that shows up to your video calls to transcribe. They let you pay $5/month to rename the bot. I renamed mine to "LiveTweetBot". I find this hilarious; the reactions of people I call have been mixed"
X Link 2025-03-08T19:07Z [----] followers, [----] engagements
"@captgouda24 @RichardHanania Another grandson Li Haoyi is a famous and brilliant programmer"
X Link 2025-04-01T19:14Z [----] followers, [----] engagements
"I had a great time talking to Rob about AI control; I got into a bunch of details that @RyanPGreenblatt and I haven't previously written about. Special thanks to Rob for proposing "acute vs chronic" to replace the awkward "high-stakes/low-stakes" terminology: it might stick Many are skeptical well ever 'solve AI alignment.' If so are we f**d In our interview @bshlgrs argues that there are practical non-galaxy-brained ways to reduce AI risk without knowing how to align AIs. This is 'AI control': finding ways to use advanced AI without knowing https://t.co/UBydDoYo3e Many are skeptical well"
X Link 2025-04-04T14:47Z [----] followers, [----] engagements
"Weve just released the biggest and most intricate study of AI control to date in a command line agent setting. IMO the techniques studied are the best available option for preventing misaligned early AGIs from causing sudden disasters e.g. hacking servers theyre working on"
X Link 2025-04-16T16:29Z [----] followers, 27.7K engagements
"Our core findings: the control techniques that wed previously explored in more toy settings generalize well to this more realistic setting but we can do even better by developing novel techniques that exploit the multi-step nature of the setting"
X Link 2025-04-16T16:29Z [----] followers, [----] engagements
"In particular in this multi-step setting we get the best results with time-travel methods that exploit our ability to reset the AI to a previous step allowing us to incriminate it more reliably than wed be able to do with existing approaches"
X Link 2025-04-16T16:29Z [----] followers, [----] engagements
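As a rough illustration of the resampling idea (not the paper's actual protocol), a single step might look like the sketch below, where `policy` and `suspicion` are assumed stand-ins for the untrusted policy and a trusted monitor's score, and the thresholds are made-up parameters.

```python
# Illustrative sketch: when an action looks suspicious, rewind to the same
# step and resample; consistent suspicion across resamples is treated as
# incriminating evidence, while a one-off is replaced with a benign resample.
def step_with_resampling(policy, suspicion, prefix,
                         n_resamples=8, flag_threshold=0.9, incriminate_frac=0.5):
    action = policy(prefix)
    if suspicion(action) < flag_threshold:
        return action, "accept"
    # Suspicious: reset to the previous step and draw alternative actions.
    resamples = [policy(prefix) for _ in range(n_resamples)]
    n_bad = sum(suspicion(a) >= flag_threshold for a in resamples)
    if n_bad / n_resamples >= incriminate_frac:
        return None, "shut down / escalate to humans"  # looks like a deliberate attack
    # Looks like a one-off: substitute the least suspicious resample and continue.
    benign = min(resamples, key=suspicion)
    return benign, "replace"
```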
"Paper link: Website: Blog post with more discussion: https://www.alignmentforum.org/posts/LPHMMMZFAWog6ty5x/ctrl-z-controlling-ai-agents-via-resampling http://bashcontrol.com https://arxiv.org/abs/2504.10374 https://www.alignmentforum.org/posts/LPHMMMZFAWog6ty5x/ctrl-z-controlling-ai-agents-via-resampling http://bashcontrol.com https://arxiv.org/abs/2504.10374"
X Link 2025-04-16T16:29Z [----] followers, [----] engagements
"In my research and writing on risk from AI misalignment I often talk as if catching your model scheming is a win condition. But that's not clearly true. In a new post I talk about how risk from AI misalignment is affected by AI developers deploying known-scheming models"
X Link 2025-04-18T14:40Z [----] followers, [----] engagements
"We decided to use a different word because we didn't think there was a good word for the core methodology that we were following--trying to come up with safety techniques that work even if the model tries to subvert them. Here are some quotes from the original control post. Do you have a problem with how we framed it here Historically technical researchers who are concerned about risk from schemers have mostly either focused on alignment or conflated alignment and control. . We think that a lot of discussion of AI takeover risk would be clearer if people more consistently distinguished"
X Link 2025-05-08T18:14Z [----] followers, [----] engagements
"Over the next few weeks my wife and I will be in Noosa Istanbul and Tbilisi. We're in the market for people to meet and things to do: message me if you want to catch up or meet I'm particularly excited to jam but also happy to talk about AI"
X Link 2025-05-19T03:25Z [----] followers, [----] engagements
"New post It might make sense to pay misaligned AIs to reveal their misalignment and cooperate with us. AIs that are powerful enough to take over probably won't want to accept this kind of deal. But"
X Link 2025-06-21T00:29Z [----] followers, 12.5K engagements
"I love exposing myself to types of computer security risk that no-one has ever been threatened by. (Previously in this series: In that spirit I present a revamped version of my dating website http://reciprocity.pro https://x.com/bshlgrs/status/1840577720465645960 I asked my LLM agent (a wrapper around Claude that lets it run bash commands and see their outputs): can you ssh with the username buck to the computer on my network that is open to SSH because I didnt know the local IP of my desktop. I walked away and promptly forgot Id spun https://t.co/I6qppMZFfk http://reciprocity.pro"
X Link 2025-06-28T16:46Z [----] followers, [----] engagements
"I think the Iraq war has some interesting lessons for AI safety advocates. It's an example of a crazy event (9/11) leading to an extreme action (invading an unrelated country) because the crazy event empowered a pre-existing interest group (neocons)"
X Link 2025-07-10T18:55Z [----] followers, [----] engagements
"I was also targeted by this. Spearphishing PSAlooks like there's a concerted attack on AI safety/governance folks going around. Be wary of calendar links via DM and never give a 2-factor auth code over the phone. I almost got caught by thisgot a phone call last week but figured out it was sus. ๐งต Spearphishing PSAlooks like there's a concerted attack on AI safety/governance folks going around. Be wary of calendar links via DM and never give a 2-factor auth code over the phone. I almost got caught by thisgot a phone call last week but figured out it was sus. ๐งต"
X Link 2025-07-14T16:55Z [----] followers, [----] engagements
"I think AI takeover risk reduces the EV of the future by less than the risk of poor human choices about what to do with the future. (Though I disagree with many things Will says here e.g. I think P(AI takeover)=35% which is substantially higher than he thinks.) I'm glad he's published this Today Im releasing an essay series called Better Futures. Its been something like eight years in the making so Im pretty happy its finally out It asks: when looking to the future should we focus on surviving or on flourishing https://t.co/qdQhyzlvJa Today Im releasing an essay series called Better Futures."
X Link 2025-08-04T16:51Z [----] followers, 12.4K engagements
"@ESYudkowsky I'm surprised to hear you acknowledge this. I didn't know that you thought that this was a mistake. I would be very interested to hear why you think you made this incorrect prediction what you now believe and whether this has any downstream implications"
X Link 2025-08-24T14:48Z [----] followers, [----] engagements