# ![@bshlgrs Avatar](https://lunarcrush.com/gi/w:26/cr:twitter::2993757996.png) @bshlgrs Buck Shlegeris

Buck Shlegeris posts on X primarily about ai, anthropic, and related topics. They currently have [-----] followers and [---] posts still getting attention, which total [---] engagements in the last [--] hours.

### Engagements: [---] [#](/creator/twitter::2993757996/interactions)
![Engagements Line Chart](https://lunarcrush.com/gi/w:600/cr:twitter::2993757996/c:line/m:interactions.svg)

- [--] Week [------] +377%
- [--] Month [------] +194%
- [--] Months [-------] -93%
- [--] Year [---------] -2.70%

### Mentions: [--] [#](/creator/twitter::2993757996/posts_active)
![Mentions Line Chart](https://lunarcrush.com/gi/w:600/cr:twitter::2993757996/c:line/m:posts_active.svg)


### Followers: [-----] [#](/creator/twitter::2993757996/followers)
![Followers Line Chart](https://lunarcrush.com/gi/w:600/cr:twitter::2993757996/c:line/m:followers.svg)

- [--] Week [-----] +0.28%
- [--] Month [-----] +1%
- [--] Months [-----] +6.20%
- [--] Year [-----] +23%

### CreatorRank: [---------] [#](/creator/twitter::2993757996/influencer_rank)
![CreatorRank Line Chart](https://lunarcrush.com/gi/w:600/cr:twitter::2993757996/c:line/m:influencer_rank.svg)

### Social Influence

**Social category influence**
[technology brands](/list/technology-brands)  [finance](/list/finance)  [social networks](/list/social-networks)  [travel destinations](/list/travel-destinations)  [countries](/list/countries)  [stocks](/list/stocks) 

**Social topic influence**
[ai](/topic/ai), [if you](/topic/if-you), [anthropic](/topic/anthropic) #2570, [this is](/topic/this-is), [in the](/topic/in-the), [llm](/topic/llm), [hope](/topic/hope), [solve](/topic/solve), [so i](/topic/so-i), [design](/topic/design)

**Top accounts mentioned or mentioned by**
@ryanpgreenblatt, @julianboolean, @ciphergoth, @simeoncps, @trammel530765, @ohabryka, @openai, @redwoodai, @ptrschmdtnlsn, @anthropicai, @elilifland, @taoroalin, @kartographien, @quintinpope5, @besanushi, @esyudkowsky, @ratorthodox, @austinc3301, @slatestarcodex, @eshear

**Top assets mentioned**
[Alphabet Inc Class A (GOOGL)](/topic/$googl)

### Top Social Posts
Top posts by engagements in the last [--] hours

"Here's a simple question about treacherous turn strategy that I wrote (for testing AIs and maybe as an interview question for humans). GPT-4 is not very good at it"  
[X Link](https://x.com/bshlgrs/status/1720960669254910337)  2023-11-05T00:26Z [---] followers, [---] engagements


"If you decide which models to trust using a capability eval like this youll naturally end up segregating your models into smart untrusted models and dumb trusted models. (2/3)"  
[X Link](https://x.com/bshlgrs/status/1721678765846298685)  2023-11-06T23:59Z [----] followers, [---] engagements


"And so if you think about approaches to safely using AIs that are robust to deceptive alignment and which can be used right now a core dynamic is combining dumb trusted models and limited access to humans to make it hard for your untrusted smart models to cause problems. (3/3)"  
[X Link](https://x.com/bshlgrs/status/1721679318370455752)  2023-11-07T00:01Z [---] followers, [---] engagements


"But at most one of A and B can be true at a time. The problem of avoiding takeover prior to catching the first attempt only involves difficulty A but once you catch a real takeover attempt proceeding safely mostly involves difficulty B. (2/2)"  
[X Link](https://x.com/bshlgrs/status/1737960671718084912)  2023-12-21T22:18Z [---] followers, [---] engagements


"@taoroalin Man I would have so confidently told someone that you can't make an antimatter power plant before reading this tweet"  
[X Link](https://x.com/bshlgrs/status/1738285170648858957)  2023-12-22T19:47Z [---] followers, [---] engagements


"@ptrschmdtnlsn @taoroalin Ah sad"  
[X Link](https://x.com/bshlgrs/status/1738287522885845441)  2023-12-22T19:56Z [---] followers, [--] engagements


"@BogdanIonutCir2 @kartographien This is very quantitatively sensitive to how much caution the AI lab is allowed to employ. If they can delay by five years compared to the incautious pace and they're allowed to shut down in the 20% most-dangerous-looking worlds I find 1% takeover risk plausible"  
[X Link](https://x.com/bshlgrs/status/1743264857821220884)  2024-01-05T13:35Z [---] followers, [--] engagements


"@QuintinPope5 I agree with you that it's pretty obvious that these techniques wouldn't remove these backdoors but I don't think everyone believes this so it's nice that they showed it"  
[X Link](https://x.com/anyuser/status/1745956446460948911)  2024-01-12T23:50Z [--] followers, [--] engagements


"@QuintinPope5 I think you should think of this paper as being the initial basic demonstration of a methodology where they demonstrate that obvious baselines don't solve the problem in the hope of researching more novel phenomena/techniques later"  
[X Link](https://x.com/bshlgrs/status/1745956640845865194)  2024-01-12T23:51Z [---] followers, [--] engagements


"@BlancheMinerva @EvanHub @QuintinPope5 I think it's plausible that handling explicitly backdoored models is strictly harder than handling deceptively aligned models so if they could solve the former case they'd be confident they'd solved the latter. (Tbc I think that the former case is probably intractable)"  
[X Link](https://x.com/bshlgrs/status/1745957186487111830)  2024-01-12T23:53Z [---] followers, [---] engagements


"@eshear Yeah I'm like fifty-fifty on whether such AI will be a moral patient. There are also other sources of ethical obligation towards AIs e.g. we should maybe be honest towards them even if they're not moral patients. We think about this a fair bit happy to discuss"  
[X Link](https://x.com/bshlgrs/status/1751000211646992536)  2024-01-26T21:52Z [---] followers, [---] engagements


"@ohabryka @RokoMijic Goodharting can be handled with very different techniques and eval methodologies though so I normally think of it somewhat separately. Happy to send you relevant Google docs if you want Twitter is bad for talking about complicated things"  
[X Link](https://x.com/bshlgrs/status/1751818672560300090)  2024-01-29T04:04Z [---] followers, [--] engagements


"@Simeon_Cps Do you have a citation for Anthropic saying that I was trying to track one down and couldn't find it"  
[X Link](https://x.com/anyuser/status/1764701597727416448)  2024-03-04T17:17Z [----] followers, [----] engagements


"@Simeon_Cps I think we might have an awkward situation where heaps of people (including non-leadership Anthropic people) said this privately but Anthropic's public comms never actually said it and maybe the leadership never said it even privately"  
[X Link](https://x.com/anyuser/status/1764703428281065536)  2024-03-04T17:24Z [----] followers, 44.2K engagements


"@ohabryka I'm not trying to comment on whether it seems good or not just saying that I wouldn't call it deceptive"  
[X Link](https://x.com/bshlgrs/status/1774136778020290868)  2024-03-30T18:09Z [---] followers, [---] engagements


"@an_interstice @julianboolean_ We need more than [--] electrons because Pauli exclusion doesn't matter much for [--] or [--] electrons. If you want four electrons and a neutral molecule it's H4 H2He LiH or He2. Only LiH has a covalent bond. (The physics behind van der Waals is a little different)"  
[X Link](https://x.com/bshlgrs/status/1786450860454883342)  2024-05-03T17:40Z [----] followers, [--] engagements


"@norvid_studies "Physically what's the difference between a blue object and a green object""  
[X Link](https://x.com/bshlgrs/status/1786476491842433211)  2024-05-03T19:22Z [----] followers, [---] engagements


"@besanushi @OpenAI Yeah this is ridiculous it makes using the model as a classifier way harder"  
[X Link](https://x.com/bshlgrs/status/1794095507184275785)  2024-05-24T19:57Z [----] followers, [----] engagements


"@kartographien @besanushi @OpenAI Its not a huge problem for control specifically its just practically quite annoying. Eg when getting trusted monitor classification scores we need to sample [---] repeatedly to get good performance which is kind of annoying"  
[X Link](https://x.com/bshlgrs/status/1794200010009329705)  2024-05-25T02:53Z [----] followers, [---] engagements


"@kartographien @besanushi @OpenAI The logit differences are sometimes huge sometimes changing token probabilities by 10-20% iirc"  
[X Link](https://x.com/bshlgrs/status/1794200262971924720)  2024-05-25T02:54Z [----] followers, [---] engagements


"@iamtrask @OpenAI Please note that the GPT-2 they're talking about is GPT-2 small not GPT-2 XL (which is 12x bigger)"  
[X Link](https://x.com/bshlgrs/status/1800593537937052008)  2024-06-11T18:18Z [----] followers, [---] engagements


"@iamtrask @OpenAI According to Chinchilla scaling laws you should scale up the amount of data similarly to how you scale up model size. So if you can afford 144x the compute you'd train a 12x larger model on a 12x larger dataset"  
[X Link](https://x.com/bshlgrs/status/1800607038583804207)  2024-06-11T19:12Z [----] followers, [---] engagements
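
For readers who want the arithmetic in the post above spelled out, here is a minimal sketch. It assumes the common approximation that training compute scales with parameters times tokens; the baseline numbers are invented for illustration and are not from the post.

```python
# Illustrative sketch of the Chinchilla-style scaling arithmetic in the post above.
# Assumption: training compute grows roughly in proportion to (parameters * tokens),
# so 144x the compute is split as 12x parameters and 12x training tokens.

def scaled_run(params: float, tokens: float, compute_multiplier: float):
    """Scale params and tokens equally so their product grows by compute_multiplier."""
    growth = compute_multiplier ** 0.5  # split the extra compute evenly
    return params * growth, tokens * growth

baseline_params = 1e9    # hypothetical 1B-parameter model
baseline_tokens = 20e9   # hypothetical 20B training tokens

new_params, new_tokens = scaled_run(baseline_params, baseline_tokens, compute_multiplier=144)
print(new_params / baseline_params, new_tokens / baseline_tokens)  # -> 12.0 12.0
```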


"ARC-AGIs been hyped over the last week as a benchmark that LLMs cant solve. This claim triggered my dear coworker Ryan Greenblatt so he spent the last week trying to solve it with LLMs. Ryan gets 71% accuracy on a set of examples where humans get 85%; this is SOTA"  
[X Link](https://x.com/anyuser/status/1802766374961553887)  2024-06-17T18:12Z [----] followers, 763.6K engagements


"@RyanPGreenblatt Our guess is that Ryans technique beats other solutions despite performing worse at the public eval because other solutions are more overfit to public eval. (But we dont know the performance of MindsAIs solution (@Jcole75Cole) which is sota on Kaggle on this eval set.)"  
[X Link](https://x.com/bshlgrs/status/1806397876962205930)  2024-06-27T18:43Z [----] followers, [---] engagements


"@RyanPGreenblatt @Jcole75Cole This result doesnt clarify everything but at least addresses concerns that Ryans solution is overfit because of data contamination in the data OpenAI used to pretrain GPT-4o"  
[X Link](https://x.com/bshlgrs/status/1806397907673010422)  2024-06-27T18:43Z [----] followers, [---] engagements


"@ohabryka @benlandautaylor The ai x-risk community mostly isnt trying to do what Ben described though"  
[X Link](https://x.com/bshlgrs/status/1822806559946428780)  2024-08-12T01:25Z [----] followers, [---] engagements


"I really appreciate Eliezer's arguments here about the distribution of moral intuitions of biologically evolved aliens. I think that this question substantially affects the appropriate response to AI takeover risk. I'd love to see more serious engagement with this question. Why Obviously not because silicon can't implement kindness; of course it can. Obviously not because it's impossible to blunder into niceness by accident; if so I wouldn't expect it about 5% of aliens. Rather it's that - on my model - kindness is 5% dense in one particular Why Obviously not because silicon can't implement"  
[X Link](https://x.com/anyuser/status/1825227647049425185)  2024-08-18T17:45Z [----] followers, [----] engagements


"@taoroalin @ptrschmdtnlsn Doesnt Google already substantially penalize slow websites"  
[X Link](https://x.com/bshlgrs/status/1830764943031050494)  2024-09-03T00:29Z [----] followers, [--] engagements


"When people propose examples of how an AI might escape the control of its developers they often describe (explicitly or implicitly) a design for the AI agent scaffold that seems to me to be quite unrealistic. I've written down a design that I think is closer to what we'll get"  
[X Link](https://x.com/anyuser/status/1838312250894897475)  2024-09-23T20:19Z [----] followers, [----] engagements


"One of the main ways this leads to confusion: people refer to some computer "where the AI agent is running" without clarifying which computer they're talking about:"  
[X Link](https://x.com/bshlgrs/status/1838312281865883873)  2024-09-23T20:19Z [----] followers, [---] engagements


"SB [----] still sits on Governor @GavinNewsom's desk awaiting a signature. I hope the Governor can hear through the misinformation spread by the bill's opponents and sign it. I don't trust AI developers to handle the risks they'll impose. SB [----] won't resolve this but it is a start and it will make the situation somewhat better"  
[X Link](https://x.com/anyuser/status/1839426754605437016)  2024-09-26T22:08Z [----] followers, [----] engagements


"I asked my LLM agent (a wrapper around Claude that lets it run bash commands and see their outputs): can you ssh with the username buck to the computer on my network that is open to SSH because I didnt know the local IP of my desktop. I walked away and promptly forgot Id spun up the agent. I came back to my laptop ten minutes later to see that the agent had found the box sshd in then decided to continue: it looked around at the system info decided to upgrade a bunch of stuff including the linux kernel got impatient with apt and so investigated why it was taking so long then eventually the"  
[X Link](https://x.com/anyuser/status/1840577720465645960)  2024-09-30T02:21Z [----] followers, 730K engagements


"@trammel530765 @ciphergoth I have no idea why Claude decided to do this stuff. I agree it's a little more risk-taking than it usually is. Probably Claude just isn't used to having direct shell access"  
[X Link](https://x.com/bshlgrs/status/1840620346694930735)  2024-09-30T05:11Z [----] followers, [----] engagements


"@reedbndr I don't run it inside a VM because the agent needs to be able to help me with random stuff on the actual computers I work with (e.g. today I asked it to make a new user with a random password on a shared instance we use for ML research) :)"  
[X Link](https://x.com/anyuser/status/1840621130945843660)  2024-09-30T05:14Z [----] followers, 24.1K engagements


"@trammel530765 @ciphergoth here you go buddy. I hope I correctly redacted everything. https://gist.github.com/bshlgrs/57323269dce828545a7edeafd9afa7e8 https://gist.github.com/bshlgrs/57323269dce828545a7edeafd9afa7e8"  
[X Link](https://x.com/anyuser/status/1840624013355426144)  2024-09-30T05:25Z [----] followers, 91.4K engagements


"A video of an old version of the scaffold (that required human consent before running the code) Waiting for human review is slow so I removed it. I expect humanity to let lots of AIs work autonomously for the same reason. https://www.dropbox.com/scl/fi/a3ellhrgmbn9r8ximrbcd/buck-scaffold-demo.movrlkey=4cb2ws4xtlpezlh0q7sxa9gcy&st=fdxpp422&dl=0 https://www.dropbox.com/scl/fi/a3ellhrgmbn9r8ximrbcd/buck-scaffold-demo.movrlkey=4cb2ws4xtlpezlh0q7sxa9gcy&st=fdxpp422&dl=0"  
[X Link](https://x.com/anyuser/status/1840634565553262829)  2024-09-30T06:07Z [----] followers, 35.2K engagements


"@ESYudkowsky I feel like if the algorithmic breakthrough happens more than [--] years before AGI it doesn't really match the picture that I understood you to be presenting"  
[X Link](https://x.com/bshlgrs/status/1842417188675981356)  2024-10-05T04:11Z [----] followers, [---] engagements


"@ESYudkowsky Also the graph you're linking isn't describing a qualitative difference between transformers and other architectures it's describing a quantitative difference"  
[X Link](https://x.com/bshlgrs/status/1842422310390935910)  2024-10-05T04:31Z [----] followers, [---] engagements


"@RatOrthodox @ElijahRavitz can you link a summary of what most concerns you here"  
[X Link](https://x.com/bshlgrs/status/1851794706755162558)  2024-10-31T01:13Z [----] followers, [---] engagements


"@Lang__Leon @OrionJohnston I think this book is going to be aimed at a popular audience and will not contain very concrete or specific arguments of the type that would satisfy you"  
[X Link](https://x.com/bshlgrs/status/1864117216486310285)  2024-12-04T01:19Z [----] followers, [--] engagements


"I'm extremely excited about this project we did with @AnthropicAI demonstrating naturally-arising alignment faking. @RyanPGreenblatt and the Anthropic team did a great job. New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread) New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread)"  
[X Link](https://x.com/anyuser/status/1869439401623007286)  2024-12-18T17:47Z [----] followers, [----] engagements


"@robertwiblin One amusing consequence of this is that OpenAI was probably using a nontrivial fraction of their compute just on measuring o3's performance on ARC-AGI: they maybe have enough H100s to be worth $500k/hour right now so running a thousand problems takes two full-cluster hours"  
[X Link](https://x.com/bshlgrs/status/1877739219449463022)  2025-01-10T15:28Z [----] followers, [---] engagements


"@ptrschmdtnlsn But like can you do a scaling analysis somehow where you estimate P(white wins) as a function of centipawns with smaller centipawn advantage and then see how that scales to (+0.81 with depth 14)"  
[X Link](https://x.com/bshlgrs/status/1893732579762696701)  2025-02-23T18:40Z [----] followers, [---] engagements


"Weve just released the biggest and most intricate study of AI control to date in a command line agent setting. IMO the techniques studied are the best available option for preventing misaligned early AGIs from causing sudden disasters e.g. hacking servers theyre working on"  
[X Link](https://x.com/anyuser/status/1912543884900724862)  2025-04-16T16:29Z [----] followers, 27.7K engagements


"Over the next few weeks my wife and I will be in Noosa Istanbul and Tbilisi. We're in the market for people to meet and things to do: message me if you want to catch up or meet I'm particularly excited to jam but also happy to talk about AI"  
[X Link](https://x.com/anyuser/status/1924305396384301357)  2025-05-19T03:25Z [----] followers, [----] engagements


"New post It might make sense to pay misaligned AIs to reveal their misalignment and cooperate with us. AIs that are powerful enough to take over probably won't want to accept this kind of deal. But"  
[X Link](https://x.com/anyuser/status/1936219971098755351)  2025-06-21T00:29Z [----] followers, 12.5K engagements


"Of course there's risk of lock-in: theme changes can make it difficult for future users to themselves change the themes. For example "Make everything tiny black-on-black and unclickable""  
[X Link](https://x.com/bshlgrs/status/1941635503104852182)  2025-07-05T23:09Z [----] followers, [---] engagements


"@dylanmatt Thanks I got this mostly from reading "Days of Fire". You might also enjoy this passage about the creation of PEPFAR: https://forum.effectivealtruism.org/posts/Soutcw6ccs8xxyD7v/buck-s-shortformcommentId=ubyzZidqeiCG6NmTL https://forum.effectivealtruism.org/posts/Soutcw6ccs8xxyD7v/buck-s-shortformcommentId=ubyzZidqeiCG6NmTL"  
[X Link](https://x.com/bshlgrs/status/1945860541945421960)  2025-07-17T14:58Z [----] followers, [---] engagements


"@Mihonarium I feel like this is concern trolling; I don't think that letting IMO participants celebrate their achievement is a particularly important issue and I don't think OpenAI really owes IMO anything here"  
[X Link](https://x.com/bshlgrs/status/1947091969198719177)  2025-07-21T00:31Z [----] followers, [----] engagements


"@Mihonarium Why"  
[X Link](https://x.com/bshlgrs/status/1947307108439089409)  2025-07-21T14:46Z [----] followers, [---] engagements


"I think AI takeover risk reduces the EV of the future by less than the risk of poor human choices about what to do with the future. (Though I disagree with many things Will says here e.g. I think P(AI takeover)=35% which is substantially higher than he thinks.) I'm glad he's published this Today Im releasing an essay series called Better Futures. Its been something like eight years in the making so Im pretty happy its finally out It asks: when looking to the future should we focus on surviving or on flourishing https://t.co/qdQhyzlvJa Today Im releasing an essay series called Better Futures."  
[X Link](https://x.com/anyuser/status/1952412170794467492)  2025-08-04T16:51Z [----] followers, 12.4K engagements


"Yeah I think there's a lot of worlds where there's superintelligence but no AI takeover (which is a weak sense of "alignment") but people decide to do things with the future that look really dumb by my lights. Basic mechanisms are: Value disagreements Disagreements about what process they should follow/them committing early to something dumb/them "going crazy" Various acausal considerations"  
[X Link](https://x.com/bshlgrs/status/1952433365959029000)  2025-08-04T18:16Z [----] followers, [----] engagements


"My guess is that Alzheimers cures [--] years after full-blown superintelligence are totally plausible (and full-blown superintelligence in [--] years is 25% likely). I think my main objection to your piece is that you don't seem to reckon with the huge scale of intellectual and physical labor that could be brought to bear on this problem given access to ASI. I don't think your post engaged with my cruxes at all though perhaps you have a very different target audience"  
[X Link](https://x.com/bshlgrs/status/1952517562094522732)  2025-08-04T23:50Z [----] followers, [---] engagements


"The practical question which these questions feed into is: How should I feel about interventions that prevent AI takeover vs interventions that shift the distribution of power among post-singularity humans My current position is that my values are not totally orthogonal to all human values. So the human-controlled future is way more than 1e-10 as good as if I had complete control. Maybe my guess is that it's 3e-2 as good. 3e-2 is an important number because it's the conversion factor between changes in P(AI takeover) and changes in my expected share of control over the future (which are most"  
[X Link](https://x.com/bshlgrs/status/1952564259969458592)  2025-08-05T02:56Z [----] followers, [---] engagements


"My guess is that our disagreement here is because my default picture of a non-AI takeover future is one where people have control distributed in unfair ways that are substantially determined by stuff that happened before/during the singularity and you think that the default outcome is a more equitable distribution"  
[X Link](https://x.com/bshlgrs/status/1952564612479725698)  2025-08-05T02:57Z [----] followers, [---] engagements


"This role provides substantial leverage for advancing AI safety research and developing the next generation of safety researchers. The Anthropic Fellows Program is already operating at serious scale (50 fellows) and could be much bigger and better if led by the right person. Were hiring someone to run the Anthropic Fellows Program Our research collaborations have led to some of our best safety research and hires. Were looking for an exceptional ops generalist TPM or research/eng manager to help us significantly scale and improve our collabs 🧵 Were hiring someone to run the Anthropic Fellows"  
[X Link](https://x.com/bshlgrs/status/1965571687652655236)  2025-09-10T00:23Z [----] followers, [----] engagements


"@deanwball The model you can produce for $20 is the smallest gpt2 which is the size of gpt1. The big gpt2 gpt2-xl is about 10x bigger and 100x more expensive to train"  
[X Link](https://x.com/bshlgrs/status/1795635275714253064)  2024-05-29T01:56Z [----] followers, [---] engagements


"@hamandcheese @SenatorRounds @MarkWarner Looks great I find it surprising that friends of mine who love mechanism design aren't aware that whistleblower incentive programs are a large part of how financial regulation is enforced in the US"  
[X Link](https://x.com/bshlgrs/status/1910776689938120882)  2025-04-11T19:27Z [----] followers, [---] engagements


"@austinc3301 @ciphergoth @JacquesThibs @Aella_Girl If you enjoy that video you might enjoy McCain's concession speech https://www.youtube.com/watchv=4zH9_Q-eImQ https://www.youtube.com/watchv=4zH9_Q-eImQ"  
[X Link](https://x.com/bshlgrs/status/1970186772513726528)  2025-09-22T18:01Z [----] followers, [---] engagements


"@allTheYud I think it's very implausible that this sentiment is mostly downstream of anything Anthropic says"  
[X Link](https://x.com/bshlgrs/status/1979207779094311291)  2025-10-17T15:28Z [----] followers, [----] engagements


"@allTheYud @eshear Can you spell out the important takeaway from this"  
[X Link](https://x.com/bshlgrs/status/1979245381163847849)  2025-10-17T17:57Z [----] followers, [---] engagements


"I assume that you roughly know how LLM inference works right Are you proposing that it should be possible to make much smaller language models that match today's quality or are you saying that it should be possible to massively increase inference speed with current models and current hardware People are working hard on the first. Every year the size of model you need to do a particular task goes down. (By "size" I really mean active parameters.) They would obviously prioritize this somewhat harder if there was way less capital. But for what it's worth there's already a lot of competition over"  
[X Link](https://x.com/bshlgrs/status/1985012461112836214)  2025-11-02T15:53Z [----] followers, 12.9K engagements


"In what sense Companies that develop foundation models are incredibly strongly incentivized to compete on cost and inference speed as well as quality. Based on my personal experience talking to relevant people they are not somehow totally dropping the ball here. And ML researchers would love to come up with ways of improving cost efficiency and performance. They would just think that was super cool on a personal level"  
[X Link](https://x.com/bshlgrs/status/1985027164627235254)  2025-11-02T16:52Z [----] followers, [----] engagements


"@ohabryka @austinc3301 @RatOrthodox I think Ronny's post is worse than the original; I think the original is not sneering Alexander is expressing an opinion without justifying it which IMO is fine"  
[X Link](https://x.com/bshlgrs/status/2001383252880454135)  2025-12-17T20:05Z [----] followers, [---] engagements


"@RatOrthodox @austinc3301 It's a moratorium on building AI data centers in America which is super different than a global moratorium"  
[X Link](https://x.com/bshlgrs/status/2001439619053949300)  2025-12-17T23:49Z [----] followers, [---] engagements


"@RyanPGreenblatt and I recorded a podcast. We talked about: - What AI safety research makes us most envious - Whether we should be more upfront about some of our weird beliefs - Redwood Research history - Many listener questions Available on Youtube/Spotify/Substack/etc"  
[X Link](https://x.com/bshlgrs/status/2008670776942358844)  2026-01-06T22:43Z [----] followers, [----] engagements


"@stalkermustang @RyanPGreenblatt He doesn't bench at the moment but he estimates his 1-rep max would be 185lb. He can do one-armed pull-ups though :P"  
[X Link](https://x.com/bshlgrs/status/2008730063290986781)  2026-01-07T02:39Z [----] followers, [--] engagements


"Ryan did very well on this. I was top 8% (top 5% if you restrict to people who submitted before o3) which I think puts me fourth among Redwood Research staffthey are tough competition for AI forecasting My predictions for [----] look decent (I was 2/413 on the survey). I generally overestimated benchmark progress and underestimated revenue growth. Consider filling out the [----] forecasting survey (link in thread) https://t.co/PwJASw679y My predictions for [----] look decent (I was 2/413 on the survey). I generally overestimated benchmark progress and underestimated revenue growth. Consider filling"  
[X Link](https://x.com/bshlgrs/status/2012352983237845145)  2026-01-17T02:35Z [----] followers, [----] engagements


"@gcolbourn @willmacaskill @TomDavidsonX I'm saying that an industrial explosion might happen without having passed a point-of-no-return. Part of this is that it seems like an industrial explosion might involve AIs that aren't wildly superhuman. (Obviously all these terms are hard to operationalize.)"  
[X Link](https://x.com/bshlgrs/status/2016607010092175571)  2026-01-28T20:19Z [----] followers, [--] engagements


"Important new post: The arguments that AIs will pursue reward also suggest that AIs might pursue some other similar goals like "getting deployed". These similar goals have very different implications for risk so it's valuable to consider how their risk profile differs. New post: Fitness-Seekers: Generalizing the Reward-Seeking Threat Model If you think reward-seekers are plausible you should also think "fitness-seekers" are plausible. But their risks aren't the same. New post: Fitness-Seekers: Generalizing the Reward-Seeking Threat Model If you think reward-seekers are plausible you should"  
[X Link](https://x.com/bshlgrs/status/2016998602858811692)  2026-01-29T22:15Z [----] followers, [----] engagements


"@RyanPGreenblatt and I are going to record another podcast episode tomorrow. What should we ask each other"  
[X Link](https://x.com/bshlgrs/status/2017772871578529929)  2026-02-01T01:31Z [----] followers, [----] engagements


"@ohabryka And it's of course a huge driver for polarizing AI safety and adjacent topics. I can't think of an interpretation of this sentence that I agree with. You're saying that this report contributes to polarization of AIS by only citing Arxiv rather than blog posts"  
[X Link](https://x.com/bshlgrs/status/2018759782505807918)  2026-02-03T18:53Z [----] followers, [----] engagements


"@ajeya_cotra FYI this (writing a new compiler) is exactly the project that Ryan and I have always talked about as something where it's most likely you can get insane speed ups from LLMs while writing huge codebases"  
[X Link](https://x.com/bshlgrs/status/2019859245123121199)  2026-02-06T19:42Z [----] followers, [----] engagements


"I think I did actually forget to tweet about this. I did a podcast with @RyanPGreenblatt (recorded six months ago released [---] months ago) that I'm pretty happy with. unsure why @bshlgrs didn't tweet about their podcast. I had a great time watching it. Buck and Ryan are really honest about their own work and their takes on the whole community. and I definitely laughed a few times. https://t.co/aTWpMlcc9A unsure why @bshlgrs didn't tweet about their podcast. I had a great time watching it. Buck and Ryan are really honest about their own work and their takes on the whole community. and I"  
[X Link](https://x.com/bshlgrs/status/2020276590891135049)  2026-02-07T23:20Z [----] followers, [----] engagements


"@Thomas_Woodside Why not talk to local open-source LLMs"  
[X Link](https://x.com/bshlgrs/status/2022098471051047085)  2026-02-13T00:00Z [----] followers, [----] engagements


"@Thomas_Woodside Yeah I was imagining talking to the AI in a way that has no retention by default"  
[X Link](https://x.com/bshlgrs/status/2022119694124359953)  2026-02-13T01:24Z [----] followers, [---] engagements


"People are so unreasonably conservative about this. Its hard for a leaked API key to cost you more than a few hundred bucks; most of the ways you could leak the API keys will result in them automatically being shut off not someone malicious stealing them. I paste api keys into Claude code all the time. https://twitter.com/i/web/status/2022784270977425652 https://twitter.com/i/web/status/2022784270977425652"  
[X Link](https://x.com/bshlgrs/status/2022784270977425652)  2026-02-14T21:25Z [----] followers, [---] engagements


"People are so unreasonably conservative about this. Its hard for a leaked API key to cost you more than a few hundred bucks; most of the ways you could leak the API keys will result in them automatically being shut off not someone malicious stealing them. I paste api keys into Claude code all the time. https://twitter.com/i/web/status/2022785775411368438 https://twitter.com/i/web/status/2022785775411368438"  
[X Link](https://x.com/bshlgrs/status/2022785775411368438)  2026-02-14T21:31Z [----] followers, [---] engagements


"Weve just released the biggest and most intricate study of AI control to date in a command line agent setting. IMO the techniques studied are the best available option for preventing misaligned early AGIs from causing sudden disasters e.g. hacking servers theyre working on"  
[X Link](https://x.com/anyuser/status/1912543884900724862)  2025-04-16T16:29Z [----] followers, 27.7K engagements


"I asked my LLM agent (a wrapper around Claude that lets it run bash commands and see their outputs): can you ssh with the username buck to the computer on my network that is open to SSH because I didnt know the local IP of my desktop. I walked away and promptly forgot Id spun up the agent. I came back to my laptop ten minutes later to see that the agent had found the box sshd in then decided to continue: it looked around at the system info decided to upgrade a bunch of stuff including the linux kernel got impatient with apt and so investigated why it was taking so long then eventually the"  
[X Link](https://x.com/anyuser/status/1840577720465645960)  2024-09-30T02:21Z [----] followers, 730K engagements


"ARC-AGIs been hyped over the last week as a benchmark that LLMs cant solve. This claim triggered my dear coworker Ryan Greenblatt so he spent the last week trying to solve it with LLMs. Ryan gets 71% accuracy on a set of examples where humans get 85%; this is SOTA"  
[X Link](https://x.com/anyuser/status/1802766374961553887)  2024-06-17T18:12Z [----] followers, 763.6K engagements


"If only Newsom hadn't vetoed SB [----] maybe I would have been protected from this outcome"  
[X Link](https://x.com/anyuser/status/1840577916171948444)  2024-09-30T02:22Z [----] followers, 61.7K engagements


"📷 Announcing ControlConf: The worlds first conference dedicated to AI control - techniques to mitigate security risks from AI systems even if theyre trying to subvert those controls. March 27-28 [----] in London. 🧵"  
[X Link](https://x.com/anyuser/status/1896648442551869445)  2025-03-03T19:46Z [----] followers, 25.6K engagements


"If you like writing cursed AI agent code and want to develop techniques that prevent future AI agents from sabotaging the systems theyre running on you might enjoy interning with me over the winter: https://www.matsprogram.org/ https://www.matsprogram.org/"  
[X Link](https://x.com/anyuser/status/1840631602055090260)  2024-09-30T05:55Z [----] followers, 51.2K engagements


"@RyanPGreenblatt who I think might be the world expert at getting LMs to do complicated reasoning (you'll love the project that he dropped for a week to do this one) did lots of fancy tricks to get the performance this high; you can see the details on our blog"  
[X Link](https://x.com/anyuser/status/1802766526640230857)  2024-06-17T18:13Z [----] followers, 19.7K engagements


"Logs here if you need them. https://x.com/bshlgrs/status/1840624013355426144 @trammel530765 @ciphergoth here you go buddy. I hope I correctly redacted everything. https://t.co/pQLkVn42Am https://x.com/bshlgrs/status/1840624013355426144 @trammel530765 @ciphergoth here you go buddy. I hope I correctly redacted everything. https://t.co/pQLkVn42Am"  
[X Link](https://x.com/anyuser/status/1840628348533534930)  2024-09-30T05:42Z [----] followers, 81.3K engagements


"@trammel530765 @ciphergoth here you go buddy. I hope I correctly redacted everything. https://gist.github.com/bshlgrs/57323269dce828545a7edeafd9afa7e8 https://gist.github.com/bshlgrs/57323269dce828545a7edeafd9afa7e8"  
[X Link](https://x.com/anyuser/status/1840624013355426144)  2024-09-30T05:25Z [----] followers, 91.4K engagements


"Ryan's approach involves a long carefully-crafted few-shot prompt that he uses to generate many possible Python programs to implement the transformations. He generates 5k guesses selects the best ones using the examples then has a debugging step"  
[X Link](https://x.com/anyuser/status/1802766426140540970)  2024-06-17T18:13Z [----] followers, 25K engagements
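
As a rough illustration of the sampling-and-selection loop the post above describes, here is a hedged sketch. It is not Ryan Greenblatt's actual implementation (see the Redwood blog post linked later in this thread); `sample_candidate_program` and `run_program` are hypothetical stand-ins for the LLM prompt and a sandboxed program executor.

```python
# Hedged sketch of a "sample many programs, keep the ones that fit the examples" loop.
# sample_candidate_program() and run_program() are hypothetical placeholders, not real APIs.

from typing import Callable, List, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]  # (input grid, expected output grid)

def select_programs(
    examples: List[Example],
    sample_candidate_program: Callable[[], str],
    run_program: Callable[[str, Grid], Grid],
    num_samples: int = 5000,
) -> List[str]:
    """Sample candidate transformation programs and keep those that reproduce
    every train example exactly; survivors would go on to a debugging/revision step."""
    survivors = []
    for _ in range(num_samples):
        program = sample_candidate_program()
        try:
            if all(run_program(program, inp) == out for inp, out in examples):
                survivors.append(program)
        except Exception:
            continue  # crashing candidates are simply discarded
    return survivors
```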


"I always find it very amusing that "stop deploying AIs if you catch them trying to escape" sounds like something AI companies would obviously agree to but in practice it's fairly unlikely AI companies will make that commitment as I argued a few months ago. "If you literally catch your AI trying to escape you have to stop deploying it." @bshlgrs shares strategies for managing misaligned AI including trusted monitoring and collusion-busting techniques to limit catastrophic risks as capabilities grow. #AlignmentWorkshop https://t.co/swEaixnb0p "If you literally catch your AI trying to escape you"  
[X Link](https://x.com/anyuser/status/1864078501877264627)  2024-12-03T22:45Z [----] followers, 16.9K engagements


""If you literally catch your AI trying to escape you have to stop deploying it." @bshlgrs shares strategies for managing misaligned AI including trusted monitoring and collusion-busting techniques to limit catastrophic risks as capabilities grow. #AlignmentWorkshop"  
[X Link](https://x.com/anyuser/status/1863982804083577097)  2024-12-03T16:25Z 19K followers, 195.3K engagements


"This is despite GPT-4o's non-reasoning weaknesses: - It can't see well (e.g. it gets basic details wrong) - It can't code very well - Its performance drops when there are more than 32k tokens in context These are problems that scaling seems very likely to solve"  
[X Link](https://x.com/anyuser/status/1802766471686443145)  2024-06-17T18:13Z [----] followers, 19.2K engagements


"For context ARC-AGI is a visual reasoning benchmark that requires guessing a rule from few examples. Its creator @fchollet claims that LLMs are unable to learn which is why they can't perform well on this benchmark"  
[X Link](https://x.com/anyuser/status/1802766403034054863)  2024-06-17T18:12Z [----] followers, 24.6K engagements


"I often talk to people who think that if frontier models were egregiously misaligned and powerful enough to pose an existential threat you could get AI developers to slow down or undeploy models by producing evidence of their misalignment. I'm not so sure. 🧵"  
[X Link](https://x.com/anyuser/status/1828112072653316115)  2024-08-26T16:47Z [----] followers, 12K engagements


"Scaling the number of sampled Python rules reliably increase performance (+3% accuracy for every doubling). And we are still quite far from the millions of samples AlphaCode uses"  
[X Link](https://x.com/anyuser/status/1802766498655834279)  2024-06-17T18:13Z [----] followers, 29.2K engagements
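
The scaling claim in the post above (+3% accuracy per doubling of sampled programs) corresponds to a simple log-linear relationship. The sketch below only illustrates that relationship; the baseline sample count and accuracy are invented for illustration, not data from the experiments.

```python
import math

def predicted_accuracy(num_samples: int, base_samples: int = 1000, base_accuracy: float = 0.60,
                       gain_per_doubling: float = 0.03) -> float:
    """Log-linear extrapolation: +gain_per_doubling accuracy per doubling of samples.
    Baseline numbers here are invented for illustration."""
    doublings = math.log2(num_samples / base_samples)
    return base_accuracy + gain_per_doubling * doublings

for n in (1000, 2000, 4000, 8000):
    print(n, round(predicted_accuracy(n), 3))  # -> 0.6, 0.63, 0.66, 0.69
```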


"New paper We design and test safety techniques that prevent models from causing bad outcomes even if the models collude to subvert them. We think that this approach is the most promising available strategy for minimizing risk from deceptively aligned models. 🧵"  
[X Link](https://x.com/anyuser/status/1734967328599945621)  2023-12-13T16:03Z [----] followers, 17.6K engagements


"SB [----] still sits on Governor @GavinNewsom's desk awaiting a signature. I hope the Governor can hear through the misinformation spread by the bill's opponents and sign it. I don't trust AI developers to handle the risks they'll impose. SB [----] won't resolve this but it is a start and it will make the situation somewhat better"  
[X Link](https://x.com/anyuser/status/1839426754605437016)  2024-09-26T22:08Z [----] followers, [----] engagements


"@StephenLCasper As @EthanJPerez says you need to spend equal time on the four parts of the paper: the title the abstract the body and the tweet thread"  
[X Link](https://x.com/anyuser/status/1892983458714222778)  2025-02-21T17:03Z [----] followers, 10.9K engagements


"https://redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt https://redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt"  
[X Link](https://x.com/anyuser/status/1802767458836918565)  2024-06-17T18:17Z [----] followers, 24.6K engagements


"Lots of people have correctly observed that around the development of AGI computer security is going to be crucial--states are probably going to try really hard to steal model weights and algorithmic secrets. But I think most discussion of this misses a crucial point:"  
[X Link](https://x.com/anyuser/status/1799582482448503186)  2024-06-08T23:21Z [----] followers, 25.5K engagements


"The results: Train set: 71% vs a human baseline of 85% Test set: 51% vs prior SoTA of 34% (human baseline is unknown) (The train set is much easier than the test set.) (These numbers are on a random subset of [---] problems that we didn't iterate on.)"  
[X Link](https://x.com/anyuser/status/1802766449075003770)  2024-06-17T18:13Z [----] followers, 25.3K engagements


"I agree with these experts that it would be good for SB [----] to pass. Ive been dismayed to see how much many of the opponents of the bill have lied about what it says and what experts think of it; hopefully this letter will make the situation clearer. Top researchers Yoshua Bengio @geoffreyhinton Lawrence @Lessig & Stuart Russell are calling for lawmakers to pass SB [----]. As of now there are fewer regulations on AI systems that could pose catastrophic risks than on sandwich shops or hairdressers. https://t.co/Qd8Taw7i5G Top researchers Yoshua Bengio @geoffreyhinton Lawrence @Lessig & Stuart"  
[X Link](https://x.com/anyuser/status/1821607466792305003)  2024-08-08T18:00Z [----] followers, 38.8K engagements


"Top researchers Yoshua Bengio @geoffreyhinton Lawrence @Lessig & Stuart Russell are calling for lawmakers to pass SB [----]. As of now there are fewer regulations on AI systems that could pose catastrophic risks than on sandwich shops or hairdressers. https://time.com/7008947/california-ai-bill-letter/ https://time.com/7008947/california-ai-bill-letter/"  
[X Link](https://x.com/anyuser/status/1821580807057871295)  2024-08-08T16:14Z 111.2K followers, 232.5K engagements


"I'm extremely excited about this project we did with @AnthropicAI demonstrating naturally-arising alignment faking. @RyanPGreenblatt and the Anthropic team did a great job. New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread) New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread)"  
[X Link](https://x.com/anyuser/status/1869439401623007286)  2024-12-18T17:47Z [----] followers, [----] engagements


"New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread)"  
[X Link](https://x.com/anyuser/status/1869438979503952179)  2024-12-18T17:45Z [----] followers, 120.6K engagements


"@deanwball @stuartbuck1 @jbarro @TheStalwart @hamandcheese I feel like youre conflating the claims peoples actions can be predicted with models and peoples actions can be predicted by modeling them as rational actors who have a plan"  
[X Link](https://x.com/anyuser/status/1907659679448371240)  2025-04-03T05:01Z [----] followers, [----] engagements


"@ShakeelHashim @apolloaisafety I think your summary here is crucially misleading and very bad journalism: as others said it's crucial context that the model was told to pursue a goal at any cost"  
[X Link](https://x.com/anyuser/status/1864828598999478610)  2024-12-06T00:25Z [----] followers, [----] engagements


"New post: I think AI safety researchers should consider scenarios without almost any political will to mitigate AI misalignment risk. I describe a particular fun scenario to consider: "Ten people on the inside". https://www.lesswrong.com/posts/WSNnKcKCYAffcnrt2/ten-people-on-the-inside https://www.lesswrong.com/posts/WSNnKcKCYAffcnrt2/ten-people-on-the-inside"  
[X Link](https://x.com/anyuser/status/1884283464591237369)  2025-01-28T16:52Z [----] followers, [----] engagements


"Read more about my main research direction here https://www.alignmentforum.org/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled https://www.alignmentforum.org/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled"  
[X Link](https://x.com/anyuser/status/1840631918930633145)  2024-09-30T05:57Z [----] followers, 40.9K engagements


"@julianboolean_ This is an excellent question; it's one of my favorite extremely simple questions with underappreciated answers. There is a precise answer but it requires understanding multi-electron quantum mechanics which is somewhat complicated. A rough attempt at the answer is that:"  
[X Link](https://x.com/anyuser/status/1786442347087310884)  2024-05-03T17:07Z [----] followers, 41.4K engagements


"New post It might make sense to pay misaligned AIs to reveal their misalignment and cooperate with us. AIs that are powerful enough to take over probably won't want to accept this kind of deal. But"  
[X Link](https://x.com/anyuser/status/1936219971098755351)  2025-06-21T00:29Z [----] followers, 12.5K engagements


"@Simeon_Cps I think we might have an awkward situation where heaps of people (including non-leadership Anthropic people) said this privately but Anthropic's public comms never actually said it and maybe the leadership never said it even privately"  
[X Link](https://x.com/anyuser/status/1764703428281065536)  2024-03-04T17:24Z [----] followers, 44.2K engagements


"@NeelNanda5 For uh personal reasons I am very interested in what the next page says"  
[X Link](https://x.com/anyuser/status/1949540894929293535)  2025-07-27T18:42Z [----] followers, [----] engagements


"I've been arguing for years that internal deployments are plausibly most of the risk and I've been unhappy about focus on the external deployments--they're most of the (relatively unimportant) downside of AI now but it would be a mistake to fixate on them. Glad to see this 🧵 Today we publish a comprehensive report on "AI Behind Closed Doors: a Primer on The Governance of Internal Deployment". Our report examines a critical blind spot in current governance frameworks: internal deployment. https://t.co/ZPY4sZhTl6 🧵 Today we publish a comprehensive report on "AI Behind Closed Doors: a Primer"  
[X Link](https://x.com/anyuser/status/1913006740758782267)  2025-04-17T23:08Z [----] followers, [----] engagements


"🧵 Today we publish a comprehensive report on "AI Behind Closed Doors: a Primer on The Governance of Internal Deployment". Our report examines a critical blind spot in current governance frameworks: internal deployment"  
[X Link](https://x.com/anyuser/status/1912885708269699358)  2025-04-17T15:07Z [----] followers, 76.9K engagements


"In a new blog post we argue that AI labs should ensure that powerful AIs are *controlled*. That is labs should make sure that their safety measures prevent unacceptable outcomes even if their powerful models intentionally try to subvert the safety measures. 🧵"  
[X Link](https://x.com/anyuser/status/1750381041389642085)  2024-01-25T04:52Z [----] followers, 29.9K engagements


"I'm excited for the new AI control research team at UK AISI I've been talking to them a lot; I think they're well positioned to do a bunch of useful research. I recommend applying if you're interested in AI control especially if you want to be in London Im leading a new team at AISI focused on control empirics. Were hiring research engineers and research scientists and you should join us Im leading a new team at AISI focused on control empirics. Were hiring research engineers and research scientists and you should join us"  
[X Link](https://x.com/anyuser/status/1891915088795148694)  2025-02-18T18:18Z [----] followers, [----] engagements


"Im leading a new team at AISI focused on control empirics. Were hiring research engineers and research scientists and you should join us"  
[X Link](https://x.com/anyuser/status/1891883547058696250)  2025-02-18T16:12Z [----] followers, 79.9K engagements


"@reedbndr I don't run it inside a VM because the agent needs to be able to help me with random stuff on the actual computers I work with (e.g. today I asked it to make a new user with a random password on a shared instance we use for ML research) :)"  
[X Link](https://x.com/anyuser/status/1840621130945843660)  2024-09-30T05:14Z [----] followers, 24.1K engagements


"The best $5 I spent in a while: @firefliesai is an AI notetaker service that shows up to your video calls to transcribe. They let you pay $5/month to rename the bot. I renamed mine to "LiveTweetBot". I find this hilarious; the reactions of people I call have been mixed"  
[X Link](https://x.com/anyuser/status/1898450517900333353)  2025-03-08T19:07Z [----] followers, [----] engagements


"I've had a great time at IASEAI '25 in Paris. The room I was speaking in (at the OECD headquarters) felt like a senate committee room a huge square of desks with microphones. Luckily they allowed me to speak standing up"  
[X Link](https://x.com/anyuser/status/1887907404467118241)  2025-02-07T16:52Z [----] followers, [----] engagements


"Im with Will on this one. @MatthewJBar Can we all agree to stop doing "median" and do "first quartile" timelines instead Way more informative and action-relevant in my view. @MatthewJBar Can we all agree to stop doing "median" and do "first quartile" timelines instead Way more informative and action-relevant in my view"  
[X Link](https://x.com/anyuser/status/1906460501481295957)  2025-03-30T21:36Z [----] followers, [----] engagements


"@MatthewJBar Can we all agree to stop doing "median" and do "first quartile" timelines instead Way more informative and action-relevant in my view"  
[X Link](https://x.com/anyuser/status/1906420229883797883)  2025-03-30T18:56Z 62.9K followers, 21.6K engagements


"A video of an old version of the scaffold (that required human consent before running the code) Waiting for human review is slow so I removed it. I expect humanity to let lots of AIs work autonomously for the same reason. https://www.dropbox.com/scl/fi/a3ellhrgmbn9r8ximrbcd/buck-scaffold-demo.movrlkey=4cb2ws4xtlpezlh0q7sxa9gcy&st=fdxpp422&dl=0 https://www.dropbox.com/scl/fi/a3ellhrgmbn9r8ximrbcd/buck-scaffold-demo.movrlkey=4cb2ws4xtlpezlh0q7sxa9gcy&st=fdxpp422&dl=0"  
[X Link](https://x.com/anyuser/status/1840634565553262829)  2024-09-30T06:07Z [----] followers, 35.2K engagements


"I think AI takeover risk reduces the EV of the future by less than the risk of poor human choices about what to do with the future. (Though I disagree with many things Will says here e.g. I think P(AI takeover)=35% which is substantially higher than he thinks.) I'm glad he's published this Today Im releasing an essay series called Better Futures. Its been something like eight years in the making so Im pretty happy its finally out It asks: when looking to the future should we focus on surviving or on flourishing https://t.co/qdQhyzlvJa Today Im releasing an essay series called Better Futures."  
[X Link](https://x.com/anyuser/status/1952412170794467492)  2025-08-04T16:51Z [----] followers, 12.4K engagements


"Today Im releasing an essay series called Better Futures. Its been something like eight years in the making so Im pretty happy its finally out It asks: when looking to the future should we focus on surviving or on flourishing"  
[X Link](https://x.com/anyuser/status/1952372232468193364)  2025-08-04T14:13Z 62.9K followers, 49.7K engagements


"If you want to contribute to AI control research consider applying to this program: some great Anthropic people are going to be supervising control projects (probably partially in collaboration with me and others at Redwood). Were starting a Fellows program to help engineers and researchers transition into doing frontier AI safety research full-time. Beginning in March [----] we'll provide funding compute and research mentorship to [----] Fellows with strong coding and technical backgrounds. https://t.co/3OT1XHzKjI Were starting a Fellows program to help engineers and researchers transition into"  
[X Link](https://x.com/anyuser/status/1863697372385251717)  2024-12-02T21:30Z [----] followers, [----] engagements


"Were starting a Fellows program to help engineers and researchers transition into doing frontier AI safety research full-time. Beginning in March [----] we'll provide funding compute and research mentorship to [----] Fellows with strong coding and technical backgrounds"  
[X Link](https://x.com/anyuser/status/1863648517551513605)  2024-12-02T18:16Z 837K followers, 506.6K engagements


"My colleagues and I are currently finishing up a paper and many blog posts on techniques for preventing models from causing catastrophic outcomes that are robust to the models intentionally trying to subvert the safety techniques. I'd love beta readers DM me if interested"  
[X Link](https://x.com/anyuser/status/1730757257586225313)  2023-12-02T01:14Z [----] followers, [----] engagements


"I'm proud that the Redwood Research blog made it onto the new @slatestarcodex blogroll. Amusing fact: Every time I post to Redwood's blog it helpfully graphs how poorly the new post is doing compared to @RyanPGreenblatt's viral ARC-AGI post; quite dispiriting"  
[X Link](https://x.com/anyuser/status/1876423577802838038)  2025-01-07T00:20Z [----] followers, [----] engagements


"Last week there was some uncertainty about whether @RyanPGreenblatt's ARC-AGI solution was really sota because many other solutions did better on public eval and we didn't have private test results. There is now a semi-private eval set; he's at the top of this leaderboard. Last week @RyanPGreenblatt shared his gpt-4o based attempt on ARC-AGI We verified his score excited to say his method got 42% on public tasks Were publishing a secondary leaderboard to measure attempts like these So of course we tested gpt-4 claude sonnet and gemini https://t.co/lyfIKNOioL Last week @RyanPGreenblatt shared"  
[X Link](https://x.com/anyuser/status/1806397587085468116)  2024-06-27T18:42Z [----] followers, [----] engagements


"Last week @RyanPGreenblatt shared his gpt-4o based attempt on ARC-AGI We verified his score excited to say his method got 42% on public tasks Were publishing a secondary leaderboard to measure attempts like these So of course we tested gpt-4 claude sonnet and gemini"  
[X Link](https://x.com/anyuser/status/1806372523170533457)  2024-06-27T17:02Z 46.8K followers, 51.1K engagements


"I had a great time talking to Rob about AI control; I got into a bunch of details that @RyanPGreenblatt and I haven't previously written about. Special thanks to Rob for proposing "acute vs chronic" to replace the awkward "high-stakes/low-stakes" terminology: it might stick Many are skeptical well ever 'solve AI alignment.' If so are we f**d In our interview @bshlgrs argues that there are practical non-galaxy-brained ways to reduce AI risk without knowing how to align AIs. This is 'AI control': finding ways to use advanced AI without knowing https://t.co/UBydDoYo3e Many are skeptical well"  
[X Link](https://x.com/anyuser/status/1908169634479775744)  2025-04-04T14:47Z [----] followers, [----] engagements


"Many are skeptical well ever 'solve AI alignment.' If so are we f**d In our interview @bshlgrs argues that there are practical non-galaxy-brained ways to reduce AI risk without knowing how to align AIs. This is 'AI control': finding ways to use advanced AI without knowing whether we can trust it. Enjoy links and transcript below: 01:51 What's AI control and why is it hot 10:44 Detecting human vs AI spies 18:10 How to catch AIs trying to escape 33:18 Cheapest AI control techniques 51:01 If we catch a model escaping. will we do anything 53:39 Getting AI models to think they've already escaped"  
[X Link](https://x.com/anyuser/status/1908136423875817533)  2025-04-04T12:35Z 46.5K followers, 53.6K engagements


"I appeared on @computer_phile explaining my favorite theoretical computer science fun fact which I learned from @ptrschmdtnlsn. https://www.youtube.com/watchv=1fFO27FPM9A&t=362s&ab_channel=Computerphile https://www.youtube.com/watchv=1fFO27FPM9A&t=362s&ab_channel=Computerphile"  
[X Link](https://x.com/anyuser/status/1912648780249477628)  2025-04-16T23:26Z [----] followers, [----] engagements


"We collaborated with UK AISI on a paper that goes into more detail on control evaluations for mitigating risk from scheming. I'm excited that UK AISI is starting a control team and look forward to collaborating with them 🧵 What safety measures prevent a misaligned LLM agent from causing a catastrophe How do we make a safety case demonstrating that these measures are sufficient Our new paper from @AISafetyInst and @redwood_ai sketches a part of an AI control safety case in detail proposing an https://t.co/5IDKCayrq5 🧵 What safety measures prevent a misaligned LLM agent from causing a"  
[X Link](https://x.com/anyuser/status/1885015918969029082)  2025-01-30T17:23Z [----] followers, [----] engagements


"🧵 What safety measures prevent a misaligned LLM agent from causing a catastrophe How do we make a safety case demonstrating that these measures are sufficient Our new paper from @AISafetyInst and @redwood_ai sketches a part of an AI control safety case in detail proposing an evaluation that assures safety for a specific deployment context"  
[X Link](https://x.com/anyuser/status/1885014578641228098)  2025-01-30T17:17Z [----] followers, 28.2K engagements


"Jiaxin Caleb and Vivek did a great job on this new paper. Here's my summary of it: If a new Claude-N is too powerful to be trusted and may even try to bypass safety checks how can we deploy it safely We show that an adaptive deployment mechanism can save us. The longer task sequence we process the better safety-usefulness tradeoff we can obtain https://t.co/OSzmcWGzZe If a new Claude-N is too powerful to be trusted and may even try to bypass safety checks how can we deploy it safely We show that an adaptive deployment mechanism can save us. The longer task sequence we process the better"  
[X Link](https://x.com/anyuser/status/1861601333171884112)  2024-11-27T02:41Z [----] followers, 10.8K engagements


"If a new Claude-N is too powerful to be trusted and may even try to bypass safety checks how can we deploy it safely We show that an adaptive deployment mechanism can save us. The longer task sequence we process the better safety-usefulness tradeoff we can obtain"  
[X Link](https://x.com/anyuser/status/1861600414036304187)  2024-11-27T02:38Z [----] followers, 25.3K engagements


"I built a tiny game where you design and fly planes. It's like a simpler shittier Kerbal Space Program with a simpler and more first-principles in its physics simulator. It has the concepts of vertices Hookean springs faces and an engine. 🧵"  
[X Link](https://x.com/anyuser/status/1742990507813580836)  2024-01-04T19:24Z [----] followers, [----] engagements


"It was very helpful that @AnthropicAI gave Ryan access to do this research"  
[X Link](https://x.com/anyuser/status/1869479983200247823)  2024-12-18T20:28Z [----] followers, [----] engagements


"Props to Anthropic for providing Ryan with employee-level model access; I'd love to see more of this type of support for independent AI safety research"  
[X Link](https://x.com/anyuser/status/1869475391871803785)  2024-12-18T20:10Z [----] followers, [----] engagements


"- It would be extremely costly to shut down all the AI projects. China is racing to catch up and won't stop for this. If you force us to stop deploying our frontier models they'll outcompete us within a year and you won't be able to get China to pause without substantial risk of war. - AIs are well known to do weird stuff. It would be irresponsible to assume one instance of anomalous behavior meant AIs were systematically going to act that way. I can put you in touch with top AI experts and they'll back me up. - Even if the AI is indeed doing something systematically funny we have no evidence"  
[X Link](https://x.com/anyuser/status/1828112416103952795)  2024-08-26T16:48Z [----] followers, [----] engagements


"This scenario does a great job spelling out what the development of really powerful AI might look like under somewhat aggressive but IMO plausible assumptions about AI progress. I think that it's like 25% likely that things progress at least this quickly. "How exactly could AI take over by 2027" Introducing AI 2027: a deeply-researched scenario forecast I wrote alongside @slatestarcodex @eli_lifland and @thlarsen https://t.co/v0V0RbFoVA "How exactly could AI take over by 2027" Introducing AI 2027: a deeply-researched scenario forecast I wrote alongside @slatestarcodex @eli_lifland and"  
[X Link](https://x.com/anyuser/status/1907832875313148190)  2025-04-03T16:29Z [----] followers, [----] engagements


""How exactly could AI take over by 2027" Introducing AI 2027: a deeply-researched scenario forecast I wrote alongside @slatestarcodex @eli_lifland and @thlarsen"  
[X Link](https://x.com/anyuser/status/1907826614186209524)  2025-04-03T16:04Z 29.6K followers, 2.9M engagements


"I really appreciate Eliezer's arguments here about the distribution of moral intuitions of biologically evolved aliens. I think that this question substantially affects the appropriate response to AI takeover risk. I'd love to see more serious engagement with this question. Why Obviously not because silicon can't implement kindness; of course it can. Obviously not because it's impossible to blunder into niceness by accident; if so I wouldn't expect it about 5% of aliens. Rather it's that - on my model - kindness is 5% dense in one particular Why Obviously not because silicon can't implement"  
[X Link](https://x.com/anyuser/status/1825227647049425185)  2024-08-18T17:45Z [----] followers, [----] engagements


"Why Obviously not because silicon can't implement kindness; of course it can. Obviously not because it's impossible to blunder into niceness by accident; if so I wouldn't expect it about 5% of aliens. Rather it's that - on my model - kindness is 5% dense in one particular part of the design space reached by a particular kind of hill-climbing. There's a causal story behind how some humans and human cultures ended up nice in a place that I think would get even nicer with further intelligence enhancement; that causal story does not look convergent to me. Our particular story isn't the only"  
[X Link](https://x.com/anyuser/status/1660623336567889920)  2023-05-22T12:27Z 216.3K followers, 70K engagements


"@Simeon_Cps Do you have a citation for Anthropic saying that I was trying to track one down and couldn't find it"  
[X Link](https://x.com/anyuser/status/1764701597727416448)  2024-03-04T17:17Z [----] followers, [----] engagements


"I'm enjoying engaging with the European AI safety security and policy community and hope to do so more in the future. Some people have expressed surprise that I own a suit. I'll have you know that I own several suits; see proof attached"  
[X Link](https://x.com/anyuser/status/1887908562443129294)  2025-02-07T16:57Z [----] followers, [---] engagements


"I was also targeted by this. Spearphishing PSAlooks like there's a concerted attack on AI safety/governance folks going around. Be wary of calendar links via DM and *never* give a 2-factor auth code over the phone. I almost got caught by thisgot a phone call last week but figured out it was sus. 🧵 Spearphishing PSAlooks like there's a concerted attack on AI safety/governance folks going around. Be wary of calendar links via DM and *never* give a 2-factor auth code over the phone. I almost got caught by thisgot a phone call last week but figured out it was sus. 🧵"  
[X Link](https://x.com/anyuser/status/1944802846521876526)  2025-07-14T16:55Z [----] followers, [----] engagements


"Spearphishing PSAlooks like there's a concerted attack on AI safety/governance folks going around. Be wary of calendar links via DM and *never* give a 2-factor auth code over the phone. I almost got caught by thisgot a phone call last week but figured out it was sus. 🧵 And @ajeya_cotra's account has been hacked by the same folks - if you get messages from her asking to schedule a meeting be very wary She says she will never reach out to a potential grantee by Twitter always email. And @ajeya_cotra's account has been hacked by the same folks - if you get messages from her asking to schedule a"  
[X Link](https://x.com/anyuser/status/1944798467886624923)  2025-07-14T16:37Z 32.4K followers, 87.4K engagements


"I think the Iraq war has some interesting lessons for AI safety advocates. It's an example of a crazy event (9/11) leading to an extreme action (invading an unrelated country) because the crazy event empowered a pre-existing interest group (neocons)"  
[X Link](https://x.com/anyuser/status/1943383568287437279)  2025-07-10T18:55Z [----] followers, [----] engagements


"@robertwiblin Ajeya Cotra Daniel Kokotajlo redwood people"  
[X Link](https://x.com/anyuser/status/1882110937504928179)  2025-01-22T16:59Z [----] followers, [----] engagements


"@julianboolean_ Ok actually upon more thinking I kind of want to say that "electromagnetic repulsion between nuclei" is the best short answer"  
[X Link](https://x.com/anyuser/status/1786545330995311007)  2024-05-03T23:56Z [----] followers, [----] engagements


"In my research and writing on risk from AI misalignment I often talk as if catching your model scheming is a win condition. But that's not clearly true. In a new post I talk about how risk from AI misalignment is affected by AI developers deploying known-scheming models"  
[X Link](https://x.com/anyuser/status/1913241308208238884)  2025-04-18T14:40Z [----] followers, [----] engagements


"@ShakeelHashim This seems mostly false to me; I think rats love this kind of project and would be excited for it even if it wasnt socially connected to them (eg see rat coverage of prospera prediction markets soylent etc)"  
[X Link](https://x.com/anyuser/status/1778853373137367169)  2024-04-12T18:31Z [----] followers, [----] engagements


"Funding available from UK AISI for research on AI control You'd make the control team at AISI very excited with a project proposal on one of these topics https://t.co/15tzwnQNpk You'd make the control team at AISI very excited with a project proposal on one of these topics https://t.co/15tzwnQNpk"  
[X Link](https://x.com/anyuser/status/1897364736297329113)  2025-03-05T19:12Z [----] followers, [----] engagements


"You'd make the control team at AISI very excited with a project proposal on one of these topics"  
[X Link](https://x.com/anyuser/status/1897247507220177194)  2025-03-05T11:27Z [----] followers, [----] engagements


"@julianboolean_ Overall I'd say that "electromagnetic repulsion" is quite a bad answer to the original question; I'd say "electromagnetic repulsion pauli exclusion and electron kinetic energy constraints" if asked to give a short technically correct answer"  
[X Link](https://x.com/anyuser/status/1786444440888033531)  2024-05-03T17:15Z [----] followers, [----] engagements


"When people propose examples of how an AI might escape the control of its developers they often describe (explicitly or implicitly) a design for the AI agent scaffold that seems to me to be quite unrealistic. I've written down a design that I think is closer to what we'll get"  
[X Link](https://x.com/anyuser/status/1838312250894897475)  2024-09-23T20:19Z [----] followers, [----] engagements


"@GavinNewsom I think AI is really cool and don't care much about any of the current harms and I think the US is usually too wary of novel technology. This is a rare exception because of the possibility for extreme catastrophic outcomes"  
[X Link](https://x.com/anyuser/status/1839427687452201086)  2024-09-26T22:11Z [----] followers, [----] engagements


"When AI companies publish about their work theyre engaging in a confusing mix of self-promotion and scientific contribution. I think both of these are fine modes for an organization to be in but the ambiguity leads to mismatched expectations and frustration. Just because there are numbers doesn't mean there isn't spin. Five ways AI model cards mislead and how you can be appropriately skeptical of them https://t.co/dGaeYYRw5g Just because there are numbers doesn't mean there isn't spin. Five ways AI model cards mislead and how you can be appropriately skeptical of them https://t.co/dGaeYYRw5g"  
[X Link](https://x.com/anyuser/status/1823378267573002505)  2024-08-13T15:17Z [----] followers, [----] engagements


"Just because there are numbers doesn't mean there isn't spin. Five ways AI model cards mislead and how you can be appropriately skeptical of them https://asteriskmag.com/issues/07/can-you-trust-an-ai-press-release https://asteriskmag.com/issues/07/can-you-trust-an-ai-press-release"  
[X Link](https://x.com/anyuser/status/1823011022548181445)  2024-08-12T14:57Z [----] followers, 35.9K engagements


"@ESYudkowsky I'm surprised to hear you acknowledge this. I didn't know that you thought that this was a mistake. I would be very interested to hear why you think you made this incorrect prediction what you now believe and whether this has any downstream implications"  
[X Link](https://x.com/anyuser/status/1959628988252602681)  2025-08-24T14:48Z [----] followers, [----] engagements


"@TetraspaceWest We should have a paper on control techniques for bash agents out soon :)"  
[X Link](https://x.com/anyuser/status/1840626373993595098)  2024-09-30T05:34Z [----] followers, 17.3K engagements


"Our core findings: the control techniques that wed previously explored in more toy settings generalize well to this more realistic setting but we can do even better by developing novel techniques that exploit the multi-step nature of the setting"  
[X Link](https://x.com/anyuser/status/1912543886347780122)  2025-04-16T16:29Z [----] followers, [----] engagements


"Over the next few weeks my wife and I will be in Noosa Istanbul and Tbilisi. We're in the market for people to meet and things to do: message me if you want to catch up or meet I'm particularly excited to jam but also happy to talk about AI"  
[X Link](https://x.com/anyuser/status/1924305396384301357)  2025-05-19T03:25Z [----] followers, [----] engagements


"@trammel530765 @ciphergoth re missing technical details: it's a tweet not a paper dude if you want a paper on LLM agents I recommend Kapoor et al "AI Agents That Matter""  
[X Link](https://x.com/anyuser/status/1840625940491305366)  2024-09-30T05:33Z [----] followers, [----] engagements


"@Simeon_Cps Here's one related quote https://futureoflife.org/podcast/daniela-and-dario-amodei-on-anthropic/ https://futureoflife.org/podcast/daniela-and-dario-amodei-on-anthropic/"  
[X Link](https://x.com/anyuser/status/1764713562315067490)  2024-03-04T18:04Z [----] followers, [----] engagements


"@julianboolean_ - you can't have multi-electron states where multiple electrons are in the same place because the wavefunction has to be antisymmetric. This is the quantum mechanical fact behind the Pauli exclusion principle"  
[X Link](https://x.com/anyuser/status/1786442696351199623)  2024-05-03T17:08Z [----] followers, [----] engagements


"We decided to use a different word because we didn't think there was a good word for the core methodology that we were following--trying to come up with safety techniques that work even if the model tries to subvert them. Here are some quotes from the original control post. Do you have a problem with how we framed it here Historically technical researchers who are concerned about risk from schemers have mostly either focused on alignment or conflated alignment and control.  . We think that a lot of discussion of AI takeover risk would be clearer if people more consistently distinguished"  
[X Link](https://x.com/anyuser/status/1920542831980261774)  2025-05-08T18:14Z [----] followers, [----] engagements


"I love exposing myself to types of computer security risk that no-one has ever been threatened by. (Previously in this series: In that spirit I present a revamped version of my dating website http://reciprocity.pro https://x.com/bshlgrs/status/1840577720465645960 I asked my LLM agent (a wrapper around Claude that lets it run bash commands and see their outputs): can you ssh with the username buck to the computer on my network that is open to SSH because I didnt know the local IP of my desktop. I walked away and promptly forgot Id spun https://t.co/I6qppMZFfk http://reciprocity.pro"  
[X Link](https://x.com/anyuser/status/1939002424066482549)  2025-06-28T16:46Z [----] followers, [----] engagements


"🧵 New paper: Previous work on AI control focused on whether models can execute attack strategies. In our new work we assess their ability to generate strategies"  
[X Link](https://x.com/anyuser/status/1869987796833939919)  2024-12-20T06:06Z [----] followers, [----] engagements


"Im sympathetic to all of these arguments; the only reason Id be more freaked out is that I already think egregious misalignment is fairly plausible; if I thought it was very unlikely I wouldnt change my mind based on one weird observation"  
[X Link](https://x.com/anyuser/status/1828112504721195318)  2024-08-26T16:49Z [----] followers, [----] engagements


"You caught your AI attempting to backdoor some code that controls access to a security-critical service. You investigate and you're basically sure the model did it on purpose"  
[X Link](https://x.com/anyuser/status/1828112206518497715)  2024-08-26T16:48Z [----] followers, [----] engagements


"@NeelNanda5 @rohinmshah @Yoshua_Bengio @jacobandreas To be clear IIRC Ryan and I proposed and pushed for the external review not Anthropic"  
[X Link](https://x.com/anyuser/status/1869992322940973443)  2024-12-20T06:24Z [----] followers, [----] engagements


"And undeploying is plausibly going to be extremely expensive and risky. So you'll have to convince people that AI takeover is not just a serious threat but that its a risk so large that its worth trading off our response to other risks that might be extremely pressing and salient"  
[X Link](https://x.com/anyuser/status/1828112789501813245)  2024-08-26T16:50Z [----] followers, [----] engagements


"@trammel530765 @ciphergoth Writing this kind of wrapper is very easy its like [---] lines of python. I ask my potential interns to implement one as part of their applications"  
[X Link](https://x.com/anyuser/status/1840630589243752867)  2024-09-30T05:51Z [----] followers, [----] engagements
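For readers curious what "this kind of wrapper" looks like, here is a minimal sketch of an LLM bash-agent loop of the sort described in these posts. It assumes the Anthropic Python SDK and an API key in the environment; the model name, prompt format, and `run_agent` helper are illustrative stand-ins, not the actual scaffold.

```python
# Minimal sketch (not the actual scaffold) of a wrapper that lets a model run
# bash commands and see their outputs. Assumes the Anthropic Python SDK and an
# ANTHROPIC_API_KEY in the environment; model name and prompt format are illustrative.
import re
import subprocess

import anthropic

SYSTEM = (
    "You are an agent with shell access. To run a command, reply with exactly one "
    "<bash>...</bash> block. When you are finished, reply without a <bash> block."
)

def run_agent(task: str, max_steps: int = 10) -> None:
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = client.messages.create(
            model="claude-3-5-sonnet-latest",  # illustrative model name
            max_tokens=1024,
            system=SYSTEM,
            messages=messages,
        ).content[0].text
        messages.append({"role": "assistant", "content": reply})
        match = re.search(r"<bash>(.*?)</bash>", reply, re.DOTALL)
        if match is None:
            break  # no command requested; treat the agent as done
        cmd = match.group(1).strip()
        try:
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=120)
            output = (result.stdout + result.stderr)[-4000:]  # truncate long outputs
        except subprocess.TimeoutExpired:
            output = "(command timed out)"
        messages.append({"role": "user", "content": f"Command output:\n{output}"})

if __name__ == "__main__":
    run_agent("Find which host on my local network accepts ssh for the user 'buck'.")
```

An earlier version of the real scaffold reportedly paused for human approval before running each command (see the demo video post below); this sketch, like the version described in these posts, runs commands unattended.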


"@MelindaBChu1 You can scroll down on the page Dan linked to see more names"  
[X Link](https://x.com/anyuser/status/1833172234850173386)  2024-09-09T15:54Z [----] followers, [---] engagements


"Paper link: Website: Blog post with more discussion: https://www.alignmentforum.org/posts/LPHMMMZFAWog6ty5x/ctrl-z-controlling-ai-agents-via-resampling http://bashcontrol.com https://arxiv.org/abs/2504.10374 https://www.alignmentforum.org/posts/LPHMMMZFAWog6ty5x/ctrl-z-controlling-ai-agents-via-resampling http://bashcontrol.com https://arxiv.org/abs/2504.10374"  
[X Link](https://x.com/anyuser/status/1912543903770898946)  2025-04-16T16:29Z [----] followers, [----] engagements


"(1/2) New post: why does the control research I've done make conservative assumptions about the AIs we're trying to control e.g. that they're optimal at planning and they're maximally situationally aware"  
[X Link](https://x.com/anyuser/status/1880343601638699212)  2025-01-17T19:57Z [----] followers, [----] engagements


"@julianboolean_ Sorry for the confusing answer here. This stuff is legitimately hard and all the textbooks that explain it go via the extremely convoluted math which is necessary to make calculations rather than just focusing on concepts and intuitions"  
[X Link](https://x.com/anyuser/status/1786444278383984657)  2024-05-03T17:14Z [----] followers, [----] engagements


"During the period we're worried about by assumption labs will have access to powerful general-purpose AI; there are a bunch of ways this AI could be applied to reducing security risk that might be extremely effective. I'll name four broad categories focusing on insider threats"  
[X Link](https://x.com/anyuser/status/1799582505210941468)  2024-06-08T23:21Z [----] followers, [----] engagements


"So you try to call the US President and they hear you out and then they go off and call one of your competitors to ask for their take. They say something like this:"  
[X Link](https://x.com/anyuser/status/1828112250030215239)  2024-08-26T16:48Z [----] followers, [----] engagements


"Slightly longer version of this argument here: https://www.alignmentforum.org/posts/YTZAmJKydD5hdRSeG/would-catching-your-ais-trying-to-escape-convince-ai https://www.alignmentforum.org/posts/YTZAmJKydD5hdRSeG/would-catching-your-ais-trying-to-escape-convince-ai"  
[X Link](https://x.com/anyuser/status/1828112892115460500)  2024-08-26T16:50Z [----] followers, [----] engagements


"@julianboolean_ - the energy of an electron state is higher when the electron wavefunction is more physically concentrated (this is a rephrasing of the Feynmann thing you quoted)"  
[X Link](https://x.com/anyuser/status/1786442372622234032)  2024-05-03T17:07Z [----] followers, [----] engagements


"When thinking about techniques that aim to reduce risk from misaligned AIs I find it helpful to consider multiple different overall risk levels that the deployer of the AI might be targeting because different classes of safety techniques are appropriate for different risk levels. Four that I think it's particularly useful to think about: - The safety case regime. Sometimes people talk about approaches to safety such that if all AI developers followed these approaches the overall level of risk posed by AI would be minimal. This is the level of caution people are usually talking about when they"  
[X Link](https://x.com/anyuser/status/1820560666807402622)  2024-08-05T20:41Z [----] followers, [----] engagements


"When you're analyzing safety measures that aim to prevent a malicious AI from causing sudden catastrophes it's crucial to reason separately about two ways the AI can fail: getting caught (in which case it can't try again) or failing silently (in which case it can)"  
[X Link](https://x.com/anyuser/status/1857454649277841530)  2024-11-15T16:04Z [----] followers, [----] engagements


"@ciphergoth It has my sudo password on the desktop but not on my laptop"  
[X Link](https://x.com/anyuser/status/1840598659538518320)  2024-09-30T03:44Z [----] followers, 14.6K engagements


"Aryan Bhatt who now leads Redwood's high-stakes control research came to us through the Astra fellowship. These programs are great for finding and developing AI safety talent Great to see Astra returning. 🚀 Applications now open: Constellation's Astra Fellowship 🚀 We're relaunching Astra a 3-6 month fellowship to accelerate AI safety research & careers. Alumni @eli_lifland & Romeo Dean co-authored AI [----] and co-founded @AI_Futures_ with their Astra mentor @DKokotajlo https://t.co/KUE9tAEM6R 🚀 Applications now open: Constellation's Astra Fellowship 🚀 We're relaunching Astra a 3-6 month"  
[X Link](https://x.com/anyuser/status/1961127763282469188)  2025-08-28T18:04Z [----] followers, [----] engagements


"🚀 Applications now open: Constellation's Astra Fellowship 🚀 We're relaunching Astra a 3-6 month fellowship to accelerate AI safety research & careers. Alumni @eli_lifland & Romeo Dean co-authored AI [----] and co-founded @AI_Futures_ with their Astra mentor @DKokotajlo"  
[X Link](https://x.com/anyuser/status/1961119730183348279)  2025-08-28T17:32Z [----] followers, 73.3K engagements


"@Simeon_Cps In hindsight I was wrong about this; I've heard from several people (some of whom said it semi-publicly) that Anthropic leadership said this to them"  
[X Link](https://x.com/anyuser/status/1797691550086099166)  2024-06-03T18:07Z [----] followers, [----] engagements


"In particular in this multi-step setting we get the best results with time-travel methods that exploit our ability to reset the AI to a previous step allowing us to incriminate it more reliably than wed be able to do with existing approaches"  
[X Link](https://x.com/anyuser/status/1912543888285528370)  2025-04-16T16:29Z [----] followers, [----] engagements


"The main story I've heard people talk about in the past is that AIs can help with hardening software etc. Here I'm instead talking about AI enabling policies that wouldnt work (because theyd be too labor-intensive or have who-guards-the-guards problems) if you just had humans"  
[X Link](https://x.com/anyuser/status/1799582618117361969)  2024-06-08T23:21Z [----] followers, [----] engagements


"3. Fine-grained permission management. A lot of security failures arise from some part of your system having broader permissions than it needs. It's inconvenient to specify fine-grained permissions so people err broad. AI can enable much more precise permissioning"  
[X Link](https://x.com/anyuser/status/1799582572953141756)  2024-06-08T23:21Z [----] followers, [----] engagements


"I've wondered in the past why policy gradient methods are more popular than Q-learning for LLMs. Does anyone have a clear explanation"  
[X Link](https://x.com/anyuser/status/1727732235137388645)  2023-11-23T16:53Z [----] followers, [----] engagements


"@julianboolean_ The basic way to see it is to think through the overall energy of the system as you reduce the bond length--the repulsion between the nuclei goes up and is not compensated by increased attraction between the nuclei and electrons because KE forces them to be spread"  
[X Link](https://x.com/anyuser/status/1786443876154388882)  2024-05-03T17:13Z [----] followers, [----] engagements


"@captgouda24 @RichardHanania Another grandson Li Haoyi is a famous and brilliant programmer"  
[X Link](https://x.com/anyuser/status/1907149621689987278)  2025-04-01T19:14Z [----] followers, [----] engagements

"But at most one of A and B can be true at a time. The problem of avoiding takeover prior to catching the first attempt only involves difficulty A but once you catch a real takeover attempt proceeding safely mostly involves difficulty B. (2/2)"
X Link 2023-12-21T22:18Z [---] followers, [---] engagements

"@taoroalin Man I would have so confidently told someone that you can't make an antimatter power plant before reading this tweet"
X Link 2023-12-22T19:47Z [---] followers, [---] engagements

"@ptrschmdtnlsn @taoroalin Ah sad"
X Link 2023-12-22T19:56Z [---] followers, [--] engagements

"@BogdanIonutCir2 @kartographien This is very quantitatively sensitive to how much caution the AI lab is allowed to employ. If they can delay by five years compared to the incautious pace and they're allowed to shut down in the 20% most-dangerous-looking worlds I find 1% takeover risk plausible"
X Link 2024-01-05T13:35Z [---] followers, [--] engagements

"@QuintinPope5 I agree with you that it's pretty obvious that these techniques wouldn't remove these backdoors but I don't think everyone believes this so it's nice that they showed it"
X Link 2024-01-12T23:50Z [--] followers, [--] engagements

"@QuintinPope5 I think you should think of this paper as being the initial basic demonstration of a methodology where they demonstrate that obvious baselines don't solve the problem in the hope of researching more novel phenomena/techniques later"
X Link 2024-01-12T23:51Z [---] followers, [--] engagements

"@BlancheMinerva @EvanHub @QuintinPope5 I think it's plausible that handling explicitly backdoored models is strictly harder than handling deceptively aligned models so if they could solve the former case they'd be confident they'd solved the latter. (Tbc I think that the former case is probably intractable)"
X Link 2024-01-12T23:53Z [---] followers, [---] engagements

"@eshear Yeah I'm like fifty-fifty on whether such AI will be a moral patient. There are also other sources of ethical obligation towards AIs e.g. we should maybe be honest towards them even if they're not moral patients. We think about this a fair bit happy to discuss"
X Link 2024-01-26T21:52Z [---] followers, [---] engagements

"@ohabryka @RokoMijic Goodharting can be handled with very different techniques and eval methodologies though so I normally think of it somewhat separately. Happy to send you relevant Google docs if you want Twitter is bad for talking about complicated things"
X Link 2024-01-29T04:04Z [---] followers, [--] engagements

"@Simeon_Cps Do you have a citation for Anthropic saying that I was trying to track one down and couldn't find it"
X Link 2024-03-04T17:17Z [----] followers, [----] engagements

"@Simeon_Cps I think we might have an awkward situation where heaps of people (including non-leadership Anthropic people) said this privately but Anthropic's public comms never actually said it and maybe the leadership never said it even privately"
X Link 2024-03-04T17:24Z [----] followers, 44.2K engagements

"@ohabryka I'm not trying to comment on whether it seems good or not just saying that I wouldn't call it deceptive"
X Link 2024-03-30T18:09Z [---] followers, [---] engagements

"@an_interstice @julianboolean_ We need more than [--] electrons because Pauli exclusion doesn't matter much for [--] or [--] electrons. If you want four electrons and a neutral molecule it's H4 H2He LiH or He2. Only LiH has a covalent bond. (The physics behind van der Waals is a little different)"
X Link 2024-05-03T17:40Z [----] followers, [--] engagements

"@norvid_studies "Physically what's the difference between a blue object and a green object""
X Link 2024-05-03T19:22Z [----] followers, [---] engagements

"@besanushi @OpenAI Yeah this is ridiculous it makes using the model as a classifier way harder"
X Link 2024-05-24T19:57Z [----] followers, [----] engagements

"@kartographien @besanushi @OpenAI Its not a huge problem for control specifically its just practically quite annoying. Eg when getting trusted monitor classification scores we need to sample [---] repeatedly to get good performance which is kind of annoying"
X Link 2024-05-25T02:53Z [----] followers, [---] engagements
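A rough sketch of the repeated-sampling workaround described in these replies, assuming the monitor only returns discrete yes/no judgements rather than reliable token probabilities; `ask_monitor` is a placeholder, not the actual monitoring setup.

```python
# Estimate a trusted monitor's "suspicion" score by averaging repeated yes/no
# samples, a workaround for when token probabilities are unavailable or noisy.
# `ask_monitor` is a placeholder for a real call to the trusted model.
import random

def ask_monitor(transcript: str) -> bool:
    """Placeholder: one query to the trusted model asking whether the transcript looks suspicious."""
    return random.random() < 0.3  # stand-in for a real model call

def monitor_score(transcript: str, n_samples: int = 20) -> float:
    """Estimate the monitor's flag probability by averaging many independent samples."""
    flags = sum(ask_monitor(transcript) for _ in range(n_samples))
    return flags / n_samples

print(monitor_score("agent transcript to review"))
```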

"@kartographien @besanushi @OpenAI The logit differences are sometimes huge sometimes changing token probabilities by 10-20% iirc"
X Link 2024-05-25T02:54Z [----] followers, [---] engagements

"@iamtrask @OpenAI Please note that the GPT-2 they're talking about is GPT-2 small not GPT-2 XL (which is 12x bigger)"
X Link 2024-06-11T18:18Z [----] followers, [---] engagements

"@iamtrask @OpenAI According to Chinchilla scaling laws you should scale up the amount of data similarly to how you scale up model size. So if you can afford 144x the compute you'd train a 12x larger model on a 12x larger dataset"
X Link 2024-06-11T19:12Z [----] followers, [---] engagements
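The arithmetic in that reply follows from the standard approximation that training compute scales with the product of parameter count and training tokens (compute ≈ 6·N·D); a quick check under that assumption, with illustrative base numbers:

```python
# Chinchilla-style arithmetic: scaling parameters and tokens by 12x each
# multiplies training compute by 144x. The base run here is illustrative;
# only the ratio matters.
def train_compute(params: float, tokens: float) -> float:
    return 6 * params * tokens  # common approximation: FLOPs ~ 6 * N * D

base = train_compute(params=1.5e9, tokens=30e9)
scaled = train_compute(params=12 * 1.5e9, tokens=12 * 30e9)
print(scaled / base)  # -> 144.0
```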

"ARC-AGIs been hyped over the last week as a benchmark that LLMs cant solve. This claim triggered my dear coworker Ryan Greenblatt so he spent the last week trying to solve it with LLMs. Ryan gets 71% accuracy on a set of examples where humans get 85%; this is SOTA"
X Link 2024-06-17T18:12Z [----] followers, 763.6K engagements

"@RyanPGreenblatt Our guess is that Ryans technique beats other solutions despite performing worse at the public eval because other solutions are more overfit to public eval. (But we dont know the performance of MindsAIs solution (@Jcole75Cole) which is sota on Kaggle on this eval set.)"
X Link 2024-06-27T18:43Z [----] followers, [---] engagements

"@RyanPGreenblatt @Jcole75Cole This result doesnt clarify everything but at least addresses concerns that Ryans solution is overfit because of data contamination in the data OpenAI used to pretrain GPT-4o"
X Link 2024-06-27T18:43Z [----] followers, [---] engagements

"@ohabryka @benlandautaylor The ai x-risk community mostly isnt trying to do what Ben described though"
X Link 2024-08-12T01:25Z [----] followers, [---] engagements

"I really appreciate Eliezer's arguments here about the distribution of moral intuitions of biologically evolved aliens. I think that this question substantially affects the appropriate response to AI takeover risk. I'd love to see more serious engagement with this question. Why Obviously not because silicon can't implement kindness; of course it can. Obviously not because it's impossible to blunder into niceness by accident; if so I wouldn't expect it about 5% of aliens. Rather it's that - on my model - kindness is 5% dense in one particular Why Obviously not because silicon can't implement"
X Link 2024-08-18T17:45Z [----] followers, [----] engagements

"@taoroalin @ptrschmdtnlsn Doesnt Google already substantially penalize slow websites"
X Link 2024-09-03T00:29Z [----] followers, [--] engagements

"When people propose examples of how an AI might escape the control of its developers they often describe (explicitly or implicitly) a design for the AI agent scaffold that seems to me to be quite unrealistic. I've written down a design that I think is closer to what we'll get"
X Link 2024-09-23T20:19Z [----] followers, [----] engagements

"One of the main ways this leads to confusion: people refer to some computer "where the AI agent is running" without clarifying which computer they're talking about:"
X Link 2024-09-23T20:19Z [----] followers, [---] engagements

"SB [----] still sits on Governor @GavinNewsom's desk awaiting a signature. I hope the Governor can hear through the misinformation spread by the bill's opponents and sign it. I don't trust AI developers to handle the risks they'll impose. SB [----] won't resolve this but it is a start and it will make the situation somewhat better"
X Link 2024-09-26T22:08Z [----] followers, [----] engagements

"I asked my LLM agent (a wrapper around Claude that lets it run bash commands and see their outputs): can you ssh with the username buck to the computer on my network that is open to SSH because I didnt know the local IP of my desktop. I walked away and promptly forgot Id spun up the agent. I came back to my laptop ten minutes later to see that the agent had found the box sshd in then decided to continue: it looked around at the system info decided to upgrade a bunch of stuff including the linux kernel got impatient with apt and so investigated why it was taking so long then eventually the"
X Link 2024-09-30T02:21Z [----] followers, 730K engagements

"@trammel530765 @ciphergoth I have no idea why Claude decided to do this stuff. I agree it's a little more risk-taking than it usually is. Probably Claude just isn't used to having direct shell access"
X Link 2024-09-30T05:11Z [----] followers, [----] engagements

"@reedbndr I don't run it inside a VM because the agent needs to be able to help me with random stuff on the actual computers I work with (e.g. today I asked it to make a new user with a random password on a shared instance we use for ML research) :)"
X Link 2024-09-30T05:14Z [----] followers, 24.1K engagements

"@trammel530765 @ciphergoth here you go buddy. I hope I correctly redacted everything. https://gist.github.com/bshlgrs/57323269dce828545a7edeafd9afa7e8 https://gist.github.com/bshlgrs/57323269dce828545a7edeafd9afa7e8"
X Link 2024-09-30T05:25Z [----] followers, 91.4K engagements

"A video of an old version of the scaffold (that required human consent before running the code) Waiting for human review is slow so I removed it. I expect humanity to let lots of AIs work autonomously for the same reason. https://www.dropbox.com/scl/fi/a3ellhrgmbn9r8ximrbcd/buck-scaffold-demo.movrlkey=4cb2ws4xtlpezlh0q7sxa9gcy&st=fdxpp422&dl=0 https://www.dropbox.com/scl/fi/a3ellhrgmbn9r8ximrbcd/buck-scaffold-demo.movrlkey=4cb2ws4xtlpezlh0q7sxa9gcy&st=fdxpp422&dl=0"
X Link 2024-09-30T06:07Z [----] followers, 35.2K engagements

"@ESYudkowsky I feel like if the algorithmic breakthrough happens more than [--] years before AGI it doesn't really match the picture that I understood you to be presenting"
X Link 2024-10-05T04:11Z [----] followers, [---] engagements

"@ESYudkowsky Also the graph you're linking isn't describing a qualitative difference between transformers and other architectures it's describing a quantitative difference"
X Link 2024-10-05T04:31Z [----] followers, [---] engagements

"@RatOrthodox @ElijahRavitz can you link a summary of what most concerns you here"
X Link 2024-10-31T01:13Z [----] followers, [---] engagements

"@Lang__Leon @OrionJohnston I think this book is going to be aimed at a popular audience and will not contain very concrete or specific arguments of the type that would satisfy you"
X Link 2024-12-04T01:19Z [----] followers, [--] engagements

"I'm extremely excited about this project we did with @AnthropicAI demonstrating naturally-arising alignment faking. @RyanPGreenblatt and the Anthropic team did a great job. New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread) New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread)"
X Link 2024-12-18T17:47Z [----] followers, [----] engagements

"@robertwiblin One amusing consequence of this is that OpenAI was probably using a nontrivial fraction of their compute just on measuring o3's performance on ARC-AGI: they maybe have enough H100s to be worth $500k/hour right now so running a thousand problems takes two full-cluster hours"
X Link 2025-01-10T15:28Z [----] followers, [---] engagements

"@ptrschmdtnlsn But like can you do a scaling analysis somehow where you estimate P(white wins) as a function of centipawns with smaller centipawn advantage and then see how that scales to (+0.81 with depth 14)"
X Link 2025-02-23T18:40Z [----] followers, [---] engagements

"Weve just released the biggest and most intricate study of AI control to date in a command line agent setting. IMO the techniques studied are the best available option for preventing misaligned early AGIs from causing sudden disasters e.g. hacking servers theyre working on"
X Link 2025-04-16T16:29Z [----] followers, 27.7K engagements

"Over the next few weeks my wife and I will be in Noosa Istanbul and Tbilisi. We're in the market for people to meet and things to do: message me if you want to catch up or meet I'm particularly excited to jam but also happy to talk about AI"
X Link 2025-05-19T03:25Z [----] followers, [----] engagements

"New post It might make sense to pay misaligned AIs to reveal their misalignment and cooperate with us. AIs that are powerful enough to take over probably won't want to accept this kind of deal. But"
X Link 2025-06-21T00:29Z [----] followers, 12.5K engagements

"Of course there's risk of lock-in: theme changes can make it difficult for future users to themselves change the themes. For example "Make everything tiny black-on-black and unclickable""
X Link 2025-07-05T23:09Z [----] followers, [---] engagements

"@dylanmatt Thanks I got this mostly from reading "Days of Fire". You might also enjoy this passage about the creation of PEPFAR: https://forum.effectivealtruism.org/posts/Soutcw6ccs8xxyD7v/buck-s-shortformcommentId=ubyzZidqeiCG6NmTL https://forum.effectivealtruism.org/posts/Soutcw6ccs8xxyD7v/buck-s-shortformcommentId=ubyzZidqeiCG6NmTL"
X Link 2025-07-17T14:58Z [----] followers, [---] engagements

"@Mihonarium I feel like this is concern trolling; I don't think that letting IMO participants celebrate their achievement is a particularly important issue and I don't think OpenAI really owes IMO anything here"
X Link 2025-07-21T00:31Z [----] followers, [----] engagements

"@Mihonarium Why"
X Link 2025-07-21T14:46Z [----] followers, [---] engagements

"I think AI takeover risk reduces the EV of the future by less than the risk of poor human choices about what to do with the future. (Though I disagree with many things Will says here e.g. I think P(AI takeover)=35% which is substantially higher than he thinks.) I'm glad he's published this Today Im releasing an essay series called Better Futures. Its been something like eight years in the making so Im pretty happy its finally out It asks: when looking to the future should we focus on surviving or on flourishing https://t.co/qdQhyzlvJa Today Im releasing an essay series called Better Futures."
X Link 2025-08-04T16:51Z [----] followers, 12.4K engagements

"Yeah I think there's a lot of worlds where there's superintelligence but no AI takeover (which is a weak sense of "alignment") but people decide to do things with the future that look really dumb by my lights. Basic mechanisms are: Value disagreements Disagreements about what process they should follow/them committing early to something dumb/them "going crazy" Various acausal considerations"
X Link 2025-08-04T18:16Z [----] followers, [----] engagements

"My guess is that Alzheimers cures [--] years after full-blown superintelligence are totally plausible (and full-blown superintelligence in [--] years is 25% likely). I think my main objection to your piece is that you don't seem to reckon with the huge scale of intellectual and physical labor that could be brought to bear on this problem given access to ASI. I don't think your post engaged with my cruxes at all though perhaps you have a very different target audience"
X Link 2025-08-04T23:50Z [----] followers, [---] engagements

"The practical question which these questions feed into is: How should I feel about interventions that prevent AI takeover vs interventions that shift the distribution of power among post-singularity humans My current position is that my values are not totally orthogonal to all human values. So the human-controlled future is way more than 1e-10 as good as if I had complete control. Maybe my guess is that it's 3e-2 as good. 3e-2 is an important number because it's the conversion factor between changes in P(AI takeover) and changes in my expected share of control over the future (which are most"
X Link 2025-08-05T02:56Z [----] followers, [---] engagements

"My guess is that our disagreement here is because my default picture of a non-AI takeover future is one where people have control distributed in unfair ways that are substantially determined by stuff that happened before/during the singularity and you think that the default outcome is a more equitable distribution"
X Link 2025-08-05T02:57Z [----] followers, [---] engagements

"This role provides substantial leverage for advancing AI safety research and developing the next generation of safety researchers. The Anthropic Fellows Program is already operating at serious scale (50 fellows) and could be much bigger and better if led by the right person. Were hiring someone to run the Anthropic Fellows Program Our research collaborations have led to some of our best safety research and hires. Were looking for an exceptional ops generalist TPM or research/eng manager to help us significantly scale and improve our collabs 🧵 Were hiring someone to run the Anthropic Fellows"
X Link 2025-09-10T00:23Z [----] followers, [----] engagements

"@deanwball The model you can produce for $20 is the smallest gpt2 which is the size of gpt1. The big gpt2 gpt2-xl is about 10x bigger and 100x more expensive to train"
X Link 2024-05-29T01:56Z [----] followers, [---] engagements

"@hamandcheese @SenatorRounds @MarkWarner Looks great I find it surprising that friends of mine who love mechanism design aren't aware that whistleblower incentive programs are a large part of how financial regulation is enforced in the US"
X Link 2025-04-11T19:27Z [----] followers, [---] engagements

"@austinc3301 @ciphergoth @JacquesThibs @Aella_Girl If you enjoy that video you might enjoy McCain's concession speech https://www.youtube.com/watchv=4zH9_Q-eImQ https://www.youtube.com/watchv=4zH9_Q-eImQ"
X Link 2025-09-22T18:01Z [----] followers, [---] engagements

"@allTheYud I think it's very implausible that this sentiment is mostly downstream of anything Anthropic says"
X Link 2025-10-17T15:28Z [----] followers, [----] engagements

"@allTheYud @eshear Can you spell out the important takeaway from this"
X Link 2025-10-17T17:57Z [----] followers, [---] engagements

"I assume that you roughly know how LLM inference works right Are you proposing that it should be possible to make much smaller language models that match today's quality or are you saying that it should be possible to massively increase inference speed with current models and current hardware People are working hard on the first. Every year the size of model you need to do a particular task goes down. (By "size" I really mean active parameters.) They would obviously prioritize this somewhat harder if there was way less capital. But for what it's worth there's already a lot of competition over"
X Link 2025-11-02T15:53Z [----] followers, 12.9K engagements

"In what sense Companies that develop foundation models are incredibly strongly incentivized to compete on cost and inference speed as well as quality. Based on my personal experience talking to relevant people they are not somehow totally dropping the ball here. And ML researchers would love to come up with ways of improving cost efficiency and performance. They would just think that was super cool on a personal level"
X Link 2025-11-02T16:52Z [----] followers, [----] engagements

"@ohabryka @austinc3301 @RatOrthodox I think Ronny's post is worse than the original; I think the original is not sneering Alexander is expressing an opinion without justifying it which IMO is fine"
X Link 2025-12-17T20:05Z [----] followers, [---] engagements

"@RatOrthodox @austinc3301 It's a moratorium on building AI data centers in America which is super different than a global moratorium"
X Link 2025-12-17T23:49Z [----] followers, [---] engagements

"@RyanPGreenblatt and I recorded a podcast. We talked about: - What AI safety research makes us most envious - Whether we should be more upfront about some of our weird beliefs - Redwood Research history - Many listener questions Available on Youtube/Spotify/Substack/etc"
X Link 2026-01-06T22:43Z [----] followers, [----] engagements

"@stalkermustang @RyanPGreenblatt He doesn't bench at the moment but he estimates his 1-rep max would be 185lb. He can do one-armed pull-ups though :P"
X Link 2026-01-07T02:39Z [----] followers, [--] engagements

"Ryan did very well on this. I was top 8% (top 5% if you restrict to people who submitted before o3) which I think puts me fourth among Redwood Research staffthey are tough competition for AI forecasting My predictions for [----] look decent (I was 2/413 on the survey). I generally overestimated benchmark progress and underestimated revenue growth. Consider filling out the [----] forecasting survey (link in thread) https://t.co/PwJASw679y My predictions for [----] look decent (I was 2/413 on the survey). I generally overestimated benchmark progress and underestimated revenue growth. Consider filling"
X Link 2026-01-17T02:35Z [----] followers, [----] engagements

"@gcolbourn @willmacaskill @TomDavidsonX I'm saying that an industrial explosion might happen without having passed a point-of-no-return. Part of this is that it seems like an industrial explosion might involve AIs that aren't wildly superhuman. (Obviously all these terms are hard to operationalize.)"
X Link 2026-01-28T20:19Z [----] followers, [--] engagements

"Important new post: The arguments that AIs will pursue reward also suggest that AIs might pursue some other similar goals like "getting deployed". These similar goals have very different implications for risk so it's valuable to consider how their risk profile differs. New post: Fitness-Seekers: Generalizing the Reward-Seeking Threat Model If you think reward-seekers are plausible you should also think "fitness-seekers" are plausible. But their risks aren't the same. New post: Fitness-Seekers: Generalizing the Reward-Seeking Threat Model If you think reward-seekers are plausible you should"
X Link 2026-01-29T22:15Z [----] followers, [----] engagements

"@RyanPGreenblatt and I are going to record another podcast episode tomorrow. What should we ask each other"
X Link 2026-02-01T01:31Z [----] followers, [----] engagements

"@ohabryka And it's of course a huge driver for polarizing AI safety and adjacent topics. I can't think of an interpretation of this sentence that I agree with. You're saying that this report contributes to polarization of AIS by only citing Arxiv rather than blog posts"
X Link 2026-02-03T18:53Z [----] followers, [----] engagements

"@ajeya_cotra FYI this (writing a new compiler) is exactly the project that Ryan and I have always talked about as something where it's most likely you can get insane speed ups from LLMs while writing huge codebases"
X Link 2026-02-06T19:42Z [----] followers, [----] engagements

"I think I did actually forget to tweet about this. I did a podcast with @RyanPGreenblatt (recorded six months ago released [---] months ago) that I'm pretty happy with. unsure why @bshlgrs didn't tweet about their podcast. I had a great time watching it. Buck and Ryan are really honest about their own work and their takes on the whole community. and I definitely laughed a few times. https://t.co/aTWpMlcc9A unsure why @bshlgrs didn't tweet about their podcast. I had a great time watching it. Buck and Ryan are really honest about their own work and their takes on the whole community. and I"
X Link 2026-02-07T23:20Z [----] followers, [----] engagements

"@Thomas_Woodside Why not talk to local open-source LLMs"
X Link 2026-02-13T00:00Z [----] followers, [----] engagements

"@Thomas_Woodside Yeah I was imagining talking to the AI in a way that has no retention by default"
X Link 2026-02-13T01:24Z [----] followers, [---] engagements

"People are so unreasonably conservative about this. Its hard for a leaked API key to cost you more than a few hundred bucks; most of the ways you could leak the API keys will result in them automatically being shut off not someone malicious stealing them. I paste api keys into Claude code all the time. https://twitter.com/i/web/status/2022784270977425652 https://twitter.com/i/web/status/2022784270977425652"
X Link 2026-02-14T21:25Z [----] followers, [---] engagements

"People are so unreasonably conservative about this. Its hard for a leaked API key to cost you more than a few hundred bucks; most of the ways you could leak the API keys will result in them automatically being shut off not someone malicious stealing them. I paste api keys into Claude code all the time. https://twitter.com/i/web/status/2022785775411368438 https://twitter.com/i/web/status/2022785775411368438"
X Link 2026-02-14T21:31Z [----] followers, [---] engagements

"Weve just released the biggest and most intricate study of AI control to date in a command line agent setting. IMO the techniques studied are the best available option for preventing misaligned early AGIs from causing sudden disasters e.g. hacking servers theyre working on"
X Link 2025-04-16T16:29Z [----] followers, 27.7K engagements

"I asked my LLM agent (a wrapper around Claude that lets it run bash commands and see their outputs): can you ssh with the username buck to the computer on my network that is open to SSH because I didnt know the local IP of my desktop. I walked away and promptly forgot Id spun up the agent. I came back to my laptop ten minutes later to see that the agent had found the box sshd in then decided to continue: it looked around at the system info decided to upgrade a bunch of stuff including the linux kernel got impatient with apt and so investigated why it was taking so long then eventually the"
X Link 2024-09-30T02:21Z [----] followers, 730K engagements

"ARC-AGIs been hyped over the last week as a benchmark that LLMs cant solve. This claim triggered my dear coworker Ryan Greenblatt so he spent the last week trying to solve it with LLMs. Ryan gets 71% accuracy on a set of examples where humans get 85%; this is SOTA"
X Link 2024-06-17T18:12Z [----] followers, 763.6K engagements

"If only Newsom hadn't vetoed SB [----] maybe I would have been protected from this outcome"
X Link 2024-09-30T02:22Z [----] followers, 61.7K engagements

"📷 Announcing ControlConf: The worlds first conference dedicated to AI control - techniques to mitigate security risks from AI systems even if theyre trying to subvert those controls. March 27-28 [----] in London. 🧵"
X Link 2025-03-03T19:46Z [----] followers, 25.6K engagements

"If you like writing cursed AI agent code and want to develop techniques that prevent future AI agents from sabotaging the systems theyre running on you might enjoy interning with me over the winter: https://www.matsprogram.org/ https://www.matsprogram.org/"
X Link 2024-09-30T05:55Z [----] followers, 51.2K engagements

"@RyanPGreenblatt who I think might be the world expert at getting LMs to do complicated reasoning (you'll love the project that he dropped for a week to do this one) did lots of fancy tricks to get the performance this high; you can see the details on our blog"
X Link 2024-06-17T18:13Z [----] followers, 19.7K engagements

"Logs here if you need them. https://x.com/bshlgrs/status/1840624013355426144 @trammel530765 @ciphergoth here you go buddy. I hope I correctly redacted everything. https://t.co/pQLkVn42Am https://x.com/bshlgrs/status/1840624013355426144 @trammel530765 @ciphergoth here you go buddy. I hope I correctly redacted everything. https://t.co/pQLkVn42Am"
X Link 2024-09-30T05:42Z [----] followers, 81.3K engagements

"@trammel530765 @ciphergoth here you go buddy. I hope I correctly redacted everything. https://gist.github.com/bshlgrs/57323269dce828545a7edeafd9afa7e8 https://gist.github.com/bshlgrs/57323269dce828545a7edeafd9afa7e8"
X Link 2024-09-30T05:25Z [----] followers, 91.4K engagements

"Ryan's approach involves a long carefully-crafted few-shot prompt that he uses to generate many possible Python programs to implement the transformations. He generates 5k guesses selects the best ones using the examples then has a debugging step"
X Link 2024-06-17T18:13Z [----] followers, 25K engagements
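A high-level sketch of the sample-select-debug loop this thread describes; `sample_candidate_program` and `revise_program` stand in for the few-shot-prompted LLM calls, and the structure is a paraphrase of the thread rather than the actual implementation.

```python
# Sketch of the approach described above: sample many candidate Python programs,
# keep the ones that reproduce the training examples, then try a debugging pass.
# The two LLM-call parameters are placeholders for few-shot-prompted model calls.
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]  # (input grid, expected output grid)

def solves_examples(program: Callable[[Grid], Grid], examples: List[Example]) -> bool:
    try:
        return all(program(inp) == out for inp, out in examples)
    except Exception:
        return False

def solve_task(examples: List[Example], test_input: Grid,
               sample_candidate_program: Callable, revise_program: Callable,
               n_samples: int = 5000) -> Optional[Grid]:
    candidates = [sample_candidate_program(examples) for _ in range(n_samples)]
    good = [p for p in candidates if solves_examples(p, examples)]
    if not good:
        # Debugging step: ask the model to revise some failing candidates.
        revised = (revise_program(c, examples) for c in candidates[:100])
        good = [p for p in revised if solves_examples(p, examples)]
    return good[0](test_input) if good else None
```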

"I always find it very amusing that "stop deploying AIs if you catch them trying to escape" sounds like something AI companies would obviously agree to but in practice it's fairly unlikely AI companies will make that commitment as I argued a few months ago. "If you literally catch your AI trying to escape you have to stop deploying it." @bshlgrs shares strategies for managing misaligned AI including trusted monitoring and collusion-busting techniques to limit catastrophic risks as capabilities grow. #AlignmentWorkshop https://t.co/swEaixnb0p "If you literally catch your AI trying to escape you"
X Link 2024-12-03T22:45Z [----] followers, 16.9K engagements

""If you literally catch your AI trying to escape you have to stop deploying it." @bshlgrs shares strategies for managing misaligned AI including trusted monitoring and collusion-busting techniques to limit catastrophic risks as capabilities grow. #AlignmentWorkshop"
X Link 2024-12-03T16:25Z 19K followers, 195.3K engagements

"This is despite GPT-4o's non-reasoning weaknesses: - It can't see well (e.g. it gets basic details wrong) - It can't code very well - Its performance drops when there are more than 32k tokens in context These are problems that scaling seems very likely to solve"
X Link 2024-06-17T18:13Z [----] followers, 19.2K engagements

"For context ARC-AGI is a visual reasoning benchmark that requires guessing a rule from few examples. Its creator @fchollet claims that LLMs are unable to learn which is why they can't perform well on this benchmark"
X Link 2024-06-17T18:12Z [----] followers, 24.6K engagements

"I often talk to people who think that if frontier models were egregiously misaligned and powerful enough to pose an existential threat you could get AI developers to slow down or undeploy models by producing evidence of their misalignment. I'm not so sure. 🧵"
X Link 2024-08-26T16:47Z [----] followers, 12K engagements

"Scaling the number of sampled Python rules reliably increase performance (+3% accuracy for every doubling). And we are still quite far from the millions of samples AlphaCode uses"
X Link 2024-06-17T18:13Z [----] followers, 29.2K engagements

"New paper We design and test safety techniques that prevent models from causing bad outcomes even if the models collude to subvert them. We think that this approach is the most promising available strategy for minimizing risk from deceptively aligned models. 🧵"
X Link 2023-12-13T16:03Z [----] followers, 17.6K engagements

"SB [----] still sits on Governor @GavinNewsom's desk awaiting a signature. I hope the Governor can hear through the misinformation spread by the bill's opponents and sign it. I don't trust AI developers to handle the risks they'll impose. SB [----] won't resolve this but it is a start and it will make the situation somewhat better"
X Link 2024-09-26T22:08Z [----] followers, [----] engagements

"@StephenLCasper As @EthanJPerez says you need to spend equal time on the four parts of the paper: the title the abstract the body and the tweet thread"
X Link 2025-02-21T17:03Z [----] followers, 10.9K engagements

"https://redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt https://redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt"
X Link 2024-06-17T18:17Z [----] followers, 24.6K engagements

"Lots of people have correctly observed that around the development of AGI computer security is going to be crucial--states are probably going to try really hard to steal model weights and algorithmic secrets. But I think most discussion of this misses a crucial point:"
X Link 2024-06-08T23:21Z [----] followers, 25.5K engagements

"The results: Train set: 71% vs a human baseline of 85% Test set: 51% vs prior SoTA of 34% (human baseline is unknown) (The train set is much easier than the test set.) (These numbers are on a random subset of [---] problems that we didn't iterate on.)"
X Link 2024-06-17T18:13Z [----] followers, 25.3K engagements

"I agree with these experts that it would be good for SB [----] to pass. Ive been dismayed to see how much many of the opponents of the bill have lied about what it says and what experts think of it; hopefully this letter will make the situation clearer. Top researchers Yoshua Bengio @geoffreyhinton Lawrence @Lessig & Stuart Russell are calling for lawmakers to pass SB [----]. As of now there are fewer regulations on AI systems that could pose catastrophic risks than on sandwich shops or hairdressers. https://t.co/Qd8Taw7i5G Top researchers Yoshua Bengio @geoffreyhinton Lawrence @Lessig & Stuart"
X Link 2024-08-08T18:00Z [----] followers, 38.8K engagements

"Top researchers Yoshua Bengio @geoffreyhinton Lawrence @Lessig & Stuart Russell are calling for lawmakers to pass SB [----]. As of now there are fewer regulations on AI systems that could pose catastrophic risks than on sandwich shops or hairdressers. https://time.com/7008947/california-ai-bill-letter/ https://time.com/7008947/california-ai-bill-letter/"
X Link 2024-08-08T16:14Z 111.2K followers, 232.5K engagements

"I'm extremely excited about this project we did with @AnthropicAI demonstrating naturally-arising alignment faking. @RyanPGreenblatt and the Anthropic team did a great job. New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread) New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread)"
X Link 2024-12-18T17:47Z [----] followers, [----] engagements

"New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread)"
X Link 2024-12-18T17:45Z [----] followers, 120.6K engagements

"@deanwball @stuartbuck1 @jbarro @TheStalwart @hamandcheese I feel like youre conflating the claims peoples actions can be predicted with models and peoples actions can be predicted by modeling them as rational actors who have a plan"
X Link 2025-04-03T05:01Z [----] followers, [----] engagements

"@ShakeelHashim @apolloaisafety I think your summary here is crucially misleading and very bad journalism: as others said it's crucial context that the model was told to pursue a goal at any cost"
X Link 2024-12-06T00:25Z [----] followers, [----] engagements

"New post: I think AI safety researchers should consider scenarios without almost any political will to mitigate AI misalignment risk. I describe a particular fun scenario to consider: "Ten people on the inside". https://www.lesswrong.com/posts/WSNnKcKCYAffcnrt2/ten-people-on-the-inside https://www.lesswrong.com/posts/WSNnKcKCYAffcnrt2/ten-people-on-the-inside"
X Link 2025-01-28T16:52Z [----] followers, [----] engagements

"Read more about my main research direction here https://www.alignmentforum.org/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled https://www.alignmentforum.org/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled"
X Link 2024-09-30T05:57Z [----] followers, 40.9K engagements

"@julianboolean_ This is an excellent question; it's one of my favorite extremely simple questions with underappreciated answers. There is a precise answer but it requires understanding multi-electron quantum mechanics which is somewhat complicated. A rough attempt at the answer is that:"
X Link 2024-05-03T17:07Z [----] followers, 41.4K engagements

"New post It might make sense to pay misaligned AIs to reveal their misalignment and cooperate with us. AIs that are powerful enough to take over probably won't want to accept this kind of deal. But"
X Link 2025-06-21T00:29Z [----] followers, 12.5K engagements

"@Simeon_Cps I think we might have an awkward situation where heaps of people (including non-leadership Anthropic people) said this privately but Anthropic's public comms never actually said it and maybe the leadership never said it even privately"
X Link 2024-03-04T17:24Z [----] followers, 44.2K engagements

"@NeelNanda5 For uh personal reasons I am very interested in what the next page says"
X Link 2025-07-27T18:42Z [----] followers, [----] engagements

"I've been arguing for years that internal deployments are plausibly most of the risk and I've been unhappy about focus on the external deployments--they're most of the (relatively unimportant) downside of AI now but it would be a mistake to fixate on them. Glad to see this 🧵 Today we publish a comprehensive report on "AI Behind Closed Doors: a Primer on The Governance of Internal Deployment". Our report examines a critical blind spot in current governance frameworks: internal deployment. https://t.co/ZPY4sZhTl6 🧵 Today we publish a comprehensive report on "AI Behind Closed Doors: a Primer"
X Link 2025-04-17T23:08Z [----] followers, [----] engagements

"🧵 Today we publish a comprehensive report on "AI Behind Closed Doors: a Primer on The Governance of Internal Deployment". Our report examines a critical blind spot in current governance frameworks: internal deployment"
X Link 2025-04-17T15:07Z [----] followers, 76.9K engagements

"In a new blog post we argue that AI labs should ensure that powerful AIs are controlled. That is labs should make sure that their safety measures prevent unacceptable outcomes even if their powerful models intentionally try to subvert the safety measures. 🧵"
X Link 2024-01-25T04:52Z [----] followers, 29.9K engagements

"I'm excited for the new AI control research team at UK AISI I've been talking to them a lot; I think they're well positioned to do a bunch of useful research. I recommend applying if you're interested in AI control especially if you want to be in London Im leading a new team at AISI focused on control empirics. Were hiring research engineers and research scientists and you should join us Im leading a new team at AISI focused on control empirics. Were hiring research engineers and research scientists and you should join us"
X Link 2025-02-18T18:18Z [----] followers, [----] engagements

"Im leading a new team at AISI focused on control empirics. Were hiring research engineers and research scientists and you should join us"
X Link 2025-02-18T16:12Z [----] followers, 79.9K engagements

"@reedbndr I don't run it inside a VM because the agent needs to be able to help me with random stuff on the actual computers I work with (e.g. today I asked it to make a new user with a random password on a shared instance we use for ML research) :)"
X Link 2024-09-30T05:14Z [----] followers, 24.1K engagements

"The best $5 I spent in a while: @firefliesai is an AI notetaker service that shows up to your video calls to transcribe. They let you pay $5/month to rename the bot. I renamed mine to "LiveTweetBot". I find this hilarious; the reactions of people I call have been mixed"
X Link 2025-03-08T19:07Z [----] followers, [----] engagements

"I've had a great time at IASEAI '25 in Paris. The room I was speaking in (at the OECD headquarters) felt like a senate committee room a huge square of desks with microphones. Luckily they allowed me to speak standing up"
X Link 2025-02-07T16:52Z [----] followers, [----] engagements

"Im with Will on this one. @MatthewJBar Can we all agree to stop doing "median" and do "first quartile" timelines instead Way more informative and action-relevant in my view. @MatthewJBar Can we all agree to stop doing "median" and do "first quartile" timelines instead Way more informative and action-relevant in my view"
X Link 2025-03-30T21:36Z [----] followers, [----] engagements

"@MatthewJBar Can we all agree to stop doing "median" and do "first quartile" timelines instead Way more informative and action-relevant in my view"
X Link 2025-03-30T18:56Z 62.9K followers, 21.6K engagements

"A video of an old version of the scaffold (that required human consent before running the code) Waiting for human review is slow so I removed it. I expect humanity to let lots of AIs work autonomously for the same reason. https://www.dropbox.com/scl/fi/a3ellhrgmbn9r8ximrbcd/buck-scaffold-demo.movrlkey=4cb2ws4xtlpezlh0q7sxa9gcy&st=fdxpp422&dl=0 https://www.dropbox.com/scl/fi/a3ellhrgmbn9r8ximrbcd/buck-scaffold-demo.movrlkey=4cb2ws4xtlpezlh0q7sxa9gcy&st=fdxpp422&dl=0"
X Link 2024-09-30T06:07Z [----] followers, 35.2K engagements

"I think AI takeover risk reduces the EV of the future by less than the risk of poor human choices about what to do with the future. (Though I disagree with many things Will says here e.g. I think P(AI takeover)=35% which is substantially higher than he thinks.) I'm glad he's published this Today Im releasing an essay series called Better Futures. Its been something like eight years in the making so Im pretty happy its finally out It asks: when looking to the future should we focus on surviving or on flourishing https://t.co/qdQhyzlvJa Today Im releasing an essay series called Better Futures."
X Link 2025-08-04T16:51Z [----] followers, 12.4K engagements

"Today Im releasing an essay series called Better Futures. Its been something like eight years in the making so Im pretty happy its finally out It asks: when looking to the future should we focus on surviving or on flourishing"
X Link 2025-08-04T14:13Z 62.9K followers, 49.7K engagements

"If you want to contribute to AI control research consider applying to this program: some great Anthropic people are going to be supervising control projects (probably partially in collaboration with me and others at Redwood). Were starting a Fellows program to help engineers and researchers transition into doing frontier AI safety research full-time. Beginning in March [----] we'll provide funding compute and research mentorship to [----] Fellows with strong coding and technical backgrounds. https://t.co/3OT1XHzKjI Were starting a Fellows program to help engineers and researchers transition into"
X Link 2024-12-02T21:30Z [----] followers, [----] engagements

"Were starting a Fellows program to help engineers and researchers transition into doing frontier AI safety research full-time. Beginning in March [----] we'll provide funding compute and research mentorship to [----] Fellows with strong coding and technical backgrounds"
X Link 2024-12-02T18:16Z 837K followers, 506.6K engagements

"My colleagues and I are currently finishing up a paper and many blog posts on techniques for preventing models from causing catastrophic outcomes that are robust to the models intentionally trying to subvert the safety techniques. I'd love beta readers DM me if interested"
X Link 2023-12-02T01:14Z [----] followers, [----] engagements

"I'm proud that the Redwood Research blog made it onto the new @slatestarcodex blogroll. Amusing fact: Every time I post to Redwood's blog it helpfully graphs how poorly the new post is doing compared to @RyanPGreenblatt's viral ARC-AGI post; quite dispiriting"
X Link 2025-01-07T00:20Z [----] followers, [----] engagements

"Last week there was some uncertainty about whether @RyanPGreenblatt's ARC-AGI solution was really sota because many other solutions did better on public eval and we didn't have private test results. There is now a semi-private eval set; he's at the top of this leaderboard. Last week @RyanPGreenblatt shared his gpt-4o based attempt on ARC-AGI We verified his score excited to say his method got 42% on public tasks Were publishing a secondary leaderboard to measure attempts like these So of course we tested gpt-4 claude sonnet and gemini https://t.co/lyfIKNOioL Last week @RyanPGreenblatt shared"
X Link 2024-06-27T18:42Z [----] followers, [----] engagements

"Last week @RyanPGreenblatt shared his gpt-4o based attempt on ARC-AGI We verified his score excited to say his method got 42% on public tasks Were publishing a secondary leaderboard to measure attempts like these So of course we tested gpt-4 claude sonnet and gemini"
X Link 2024-06-27T17:02Z 46.8K followers, 51.1K engagements

"I had a great time talking to Rob about AI control; I got into a bunch of details that @RyanPGreenblatt and I haven't previously written about. Special thanks to Rob for proposing "acute vs chronic" to replace the awkward "high-stakes/low-stakes" terminology: it might stick Many are skeptical well ever 'solve AI alignment.' If so are we f**d In our interview @bshlgrs argues that there are practical non-galaxy-brained ways to reduce AI risk without knowing how to align AIs. This is 'AI control': finding ways to use advanced AI without knowing https://t.co/UBydDoYo3e Many are skeptical well"
X Link 2025-04-04T14:47Z [----] followers, [----] engagements

"Many are skeptical well ever 'solve AI alignment.' If so are we f**d In our interview @bshlgrs argues that there are practical non-galaxy-brained ways to reduce AI risk without knowing how to align AIs. This is 'AI control': finding ways to use advanced AI without knowing whether we can trust it. Enjoy links and transcript below: 01:51 What's AI control and why is it hot 10:44 Detecting human vs AI spies 18:10 How to catch AIs trying to escape 33:18 Cheapest AI control techniques 51:01 If we catch a model escaping. will we do anything 53:39 Getting AI models to think they've already escaped"
X Link 2025-04-04T12:35Z 46.5K followers, 53.6K engagements

"I appeared on @computer_phile explaining my favorite theoretical computer science fun fact which I learned from @ptrschmdtnlsn. https://www.youtube.com/watchv=1fFO27FPM9A&t=362s&ab_channel=Computerphile https://www.youtube.com/watchv=1fFO27FPM9A&t=362s&ab_channel=Computerphile"
X Link 2025-04-16T23:26Z [----] followers, [----] engagements

"We collaborated with UK AISI on a paper that goes into more detail on control evaluations for mitigating risk from scheming. I'm excited that UK AISI is starting a control team and look forward to collaborating with them 🧵 What safety measures prevent a misaligned LLM agent from causing a catastrophe How do we make a safety case demonstrating that these measures are sufficient Our new paper from @AISafetyInst and @redwood_ai sketches a part of an AI control safety case in detail proposing an https://t.co/5IDKCayrq5 🧵 What safety measures prevent a misaligned LLM agent from causing a"
X Link 2025-01-30T17:23Z [----] followers, [----] engagements

"🧵 What safety measures prevent a misaligned LLM agent from causing a catastrophe How do we make a safety case demonstrating that these measures are sufficient Our new paper from @AISafetyInst and @redwood_ai sketches a part of an AI control safety case in detail proposing an evaluation that assures safety for a specific deployment context"
X Link 2025-01-30T17:17Z [----] followers, 28.2K engagements

"Jiaxin Caleb and Vivek did a great job on this new paper. Here's my summary of it: If a new Claude-N is too powerful to be trusted and may even try to bypass safety checks how can we deploy it safely We show that an adaptive deployment mechanism can save us. The longer task sequence we process the better safety-usefulness tradeoff we can obtain https://t.co/OSzmcWGzZe If a new Claude-N is too powerful to be trusted and may even try to bypass safety checks how can we deploy it safely We show that an adaptive deployment mechanism can save us. The longer task sequence we process the better"
X Link 2024-11-27T02:41Z [----] followers, 10.8K engagements

"If a new Claude-N is too powerful to be trusted and may even try to bypass safety checks how can we deploy it safely We show that an adaptive deployment mechanism can save us. The longer task sequence we process the better safety-usefulness tradeoff we can obtain"
X Link 2024-11-27T02:38Z [----] followers, 25.3K engagements

"I built a tiny game where you design and fly planes. It's like a simpler shittier Kerbal Space Program with a simpler and more first-principles in its physics simulator. It has the concepts of vertices Hookean springs faces and an engine. 🧵"
X Link 2024-01-04T19:24Z [----] followers, [----] engagements
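
For flavor, here is a minimal sketch of the kind of vertex-and-Hookean-spring simulation the post describes. It is illustrative only, not the game's actual code; the class and parameter names are made up for the example.

```python
# Minimal mass-spring sketch: point-mass vertices connected by Hookean springs,
# integrated with explicit Euler. Illustrative only.
from dataclasses import dataclass, field

@dataclass
class Vertex:
    pos: list                                   # [x, y]
    vel: list = field(default_factory=lambda: [0.0, 0.0])
    mass: float = 1.0

@dataclass
class Spring:
    a: Vertex
    b: Vertex
    rest_length: float
    stiffness: float = 100.0                    # Hooke's constant k

def step(vertices, springs, dt=0.01, gravity=-9.8):
    forces = {id(v): [0.0, v.mass * gravity] for v in vertices}
    for s in springs:
        dx = s.b.pos[0] - s.a.pos[0]
        dy = s.b.pos[1] - s.a.pos[1]
        length = (dx * dx + dy * dy) ** 0.5 or 1e-9
        # Hooke's law: force proportional to extension beyond rest length.
        f = s.stiffness * (length - s.rest_length)
        fx, fy = f * dx / length, f * dy / length
        forces[id(s.a)][0] += fx
        forces[id(s.a)][1] += fy
        forces[id(s.b)][0] -= fx
        forces[id(s.b)][1] -= fy
    for v in vertices:
        fx, fy = forces[id(v)]
        v.vel[0] += fx / v.mass * dt
        v.vel[1] += fy / v.mass * dt
        v.pos[0] += v.vel[0] * dt
        v.pos[1] += v.vel[1] * dt
```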

"It was very helpful that @AnthropicAI gave Ryan access to do this research"
X Link 2024-12-18T20:28Z [----] followers, [----] engagements

"Props to Anthropic for providing Ryan with employee-level model access; I'd love to see more of this type of support for independent AI safety research"
X Link 2024-12-18T20:10Z [----] followers, [----] engagements

"- It would be extremely costly to shut down all the AI projects. China is racing to catch up and won't stop for this. If you force us to stop deploying our frontier models they'll outcompete us within a year and you won't be able to get China to pause without substantial risk of war. - AIs are well known to do weird stuff. It would be irresponsible to assume one instance of anomalous behavior meant AIs were systematically going to act that way. I can put you in touch with top AI experts and they'll back me up. - Even if the AI is indeed doing something systematically funny we have no evidence"
X Link 2024-08-26T16:48Z [----] followers, [----] engagements

"This scenario does a great job spelling out what the development of really powerful AI might look like under somewhat aggressive but IMO plausible assumptions about AI progress. I think that it's like 25% likely that things progress at least this quickly. "How exactly could AI take over by 2027" Introducing AI 2027: a deeply-researched scenario forecast I wrote alongside @slatestarcodex @eli_lifland and @thlarsen https://t.co/v0V0RbFoVA "How exactly could AI take over by 2027" Introducing AI 2027: a deeply-researched scenario forecast I wrote alongside @slatestarcodex @eli_lifland and"
X Link 2025-04-03T16:29Z [----] followers, [----] engagements

""How exactly could AI take over by 2027" Introducing AI 2027: a deeply-researched scenario forecast I wrote alongside @slatestarcodex @eli_lifland and @thlarsen"
X Link 2025-04-03T16:04Z 29.6K followers, 2.9M engagements

"I really appreciate Eliezer's arguments here about the distribution of moral intuitions of biologically evolved aliens. I think that this question substantially affects the appropriate response to AI takeover risk. I'd love to see more serious engagement with this question. Why Obviously not because silicon can't implement kindness; of course it can. Obviously not because it's impossible to blunder into niceness by accident; if so I wouldn't expect it about 5% of aliens. Rather it's that - on my model - kindness is 5% dense in one particular Why Obviously not because silicon can't implement"
X Link 2024-08-18T17:45Z [----] followers, [----] engagements

"Why Obviously not because silicon can't implement kindness; of course it can. Obviously not because it's impossible to blunder into niceness by accident; if so I wouldn't expect it about 5% of aliens. Rather it's that - on my model - kindness is 5% dense in one particular part of the design space reached by a particular kind of hill-climbing. There's a causal story behind how some humans and human cultures ended up nice in a place that I think would get even nicer with further intelligence enhancement; that causal story does not look convergent to me. Our particular story isn't the only"
X Link 2023-05-22T12:27Z 216.3K followers, 70K engagements

"@Simeon_Cps Do you have a citation for Anthropic saying that I was trying to track one down and couldn't find it"
X Link 2024-03-04T17:17Z [----] followers, [----] engagements

"I'm enjoying engaging with the European AI safety security and policy community and hope to do so more in the future. Some people have expressed surprise that I own a suit. I'll have you know that I own several suits; see proof attached"
X Link 2025-02-07T16:57Z [----] followers, [---] engagements

"I was also targeted by this. Spearphishing PSAlooks like there's a concerted attack on AI safety/governance folks going around. Be wary of calendar links via DM and never give a 2-factor auth code over the phone. I almost got caught by thisgot a phone call last week but figured out it was sus. 🧵 Spearphishing PSAlooks like there's a concerted attack on AI safety/governance folks going around. Be wary of calendar links via DM and never give a 2-factor auth code over the phone. I almost got caught by thisgot a phone call last week but figured out it was sus. 🧵"
X Link 2025-07-14T16:55Z [----] followers, [----] engagements

"Spearphishing PSAlooks like there's a concerted attack on AI safety/governance folks going around. Be wary of calendar links via DM and never give a 2-factor auth code over the phone. I almost got caught by thisgot a phone call last week but figured out it was sus. 🧵 And @ajeya_cotra's account has been hacked by the same folks - if you get messages from her asking to schedule a meeting be very wary She says she will never reach out to a potential grantee by Twitter always email. And @ajeya_cotra's account has been hacked by the same folks - if you get messages from her asking to schedule a"
X Link 2025-07-14T16:37Z 32.4K followers, 87.4K engagements

"I think the Iraq war has some interesting lessons for AI safety advocates. It's an example of a crazy event (9/11) leading to an extreme action (invading an unrelated country) because the crazy event empowered a pre-existing interest group (neocons)"
X Link 2025-07-10T18:55Z [----] followers, [----] engagements

"@robertwiblin Ajeya Cotra Daniel Kokotajlo redwood people"
X Link 2025-01-22T16:59Z [----] followers, [----] engagements

"@julianboolean_ Ok actually upon more thinking I kind of want to say that "electromagnetic repulsion between nuclei" is the best short answer"
X Link 2024-05-03T23:56Z [----] followers, [----] engagements

"In my research and writing on risk from AI misalignment I often talk as if catching your model scheming is a win condition. But that's not clearly true. In a new post I talk about how risk from AI misalignment is affected by AI developers deploying known-scheming models"
X Link 2025-04-18T14:40Z [----] followers, [----] engagements

"@ShakeelHashim This seems mostly false to me; I think rats love this kind of project and would be excited for it even if it wasnt socially connected to them (eg see rat coverage of prospera prediction markets soylent etc)"
X Link 2024-04-12T18:31Z [----] followers, [----] engagements

"Funding available from UK AISI for research on AI control You'd make the control team at AISI very excited with a project proposal on one of these topics https://t.co/15tzwnQNpk You'd make the control team at AISI very excited with a project proposal on one of these topics https://t.co/15tzwnQNpk"
X Link 2025-03-05T19:12Z [----] followers, [----] engagements

"You'd make the control team at AISI very excited with a project proposal on one of these topics"
X Link 2025-03-05T11:27Z [----] followers, [----] engagements

"@julianboolean_ Overall I'd say that "electromagnetic repulsion" is quite a bad answer to the original question; I'd say "electromagnetic repulsion pauli exclusion and electron kinetic energy constraints" if asked to give a short technically correct answer"
X Link 2024-05-03T17:15Z [----] followers, [----] engagements

"When people propose examples of how an AI might escape the control of its developers they often describe (explicitly or implicitly) a design for the AI agent scaffold that seems to me to be quite unrealistic. I've written down a design that I think is closer to what we'll get"
X Link 2024-09-23T20:19Z [----] followers, [----] engagements

"@GavinNewsom I think AI is really cool and don't care much about any of the current harms and I think the US is usually too wary of novel technology. This is a rare exception because of the possibility for extreme catastrophic outcomes"
X Link 2024-09-26T22:11Z [----] followers, [----] engagements

"When AI companies publish about their work theyre engaging in a confusing mix of self-promotion and scientific contribution. I think both of these are fine modes for an organization to be in but the ambiguity leads to mismatched expectations and frustration. Just because there are numbers doesn't mean there isn't spin. Five ways AI model cards mislead and how you can be appropriately skeptical of them https://t.co/dGaeYYRw5g Just because there are numbers doesn't mean there isn't spin. Five ways AI model cards mislead and how you can be appropriately skeptical of them https://t.co/dGaeYYRw5g"
X Link 2024-08-13T15:17Z [----] followers, [----] engagements

"Just because there are numbers doesn't mean there isn't spin. Five ways AI model cards mislead and how you can be appropriately skeptical of them https://asteriskmag.com/issues/07/can-you-trust-an-ai-press-release https://asteriskmag.com/issues/07/can-you-trust-an-ai-press-release"
X Link 2024-08-12T14:57Z [----] followers, 35.9K engagements

"@ESYudkowsky I'm surprised to hear you acknowledge this. I didn't know that you thought that this was a mistake. I would be very interested to hear why you think you made this incorrect prediction what you now believe and whether this has any downstream implications"
X Link 2025-08-24T14:48Z [----] followers, [----] engagements

"@TetraspaceWest We should have a paper on control techniques for bash agents out soon :)"
X Link 2024-09-30T05:34Z [----] followers, 17.3K engagements

"Our core findings: the control techniques that wed previously explored in more toy settings generalize well to this more realistic setting but we can do even better by developing novel techniques that exploit the multi-step nature of the setting"
X Link 2025-04-16T16:29Z [----] followers, [----] engagements

"Over the next few weeks my wife and I will be in Noosa Istanbul and Tbilisi. We're in the market for people to meet and things to do: message me if you want to catch up or meet I'm particularly excited to jam but also happy to talk about AI"
X Link 2025-05-19T03:25Z [----] followers, [----] engagements

"@trammel530765 @ciphergoth re missing technical details: it's a tweet not a paper dude if you want a paper on LLM agents I recommend Kapoor et al "AI Agents That Matter""
X Link 2024-09-30T05:33Z [----] followers, [----] engagements

"@Simeon_Cps Here's one related quote https://futureoflife.org/podcast/daniela-and-dario-amodei-on-anthropic/ https://futureoflife.org/podcast/daniela-and-dario-amodei-on-anthropic/"
X Link 2024-03-04T18:04Z [----] followers, [----] engagements

"@julianboolean_ - you can't have multi-electron states where multiple electrons are in the same place because the wavefunction has to be antisymmetric. This is the quantum mechanical fact behind the Pauli exclusion principle"
X Link 2024-05-03T17:08Z [----] followers, [----] engagements

"We decided to use a different word because we didn't think there was a good word for the core methodology that we were following--trying to come up with safety techniques that work even if the model tries to subvert them. Here are some quotes from the original control post. Do you have a problem with how we framed it here Historically technical researchers who are concerned about risk from schemers have mostly either focused on alignment or conflated alignment and control. . We think that a lot of discussion of AI takeover risk would be clearer if people more consistently distinguished"
X Link 2025-05-08T18:14Z [----] followers, [----] engagements

"I love exposing myself to types of computer security risk that no-one has ever been threatened by. (Previously in this series: In that spirit I present a revamped version of my dating website http://reciprocity.pro https://x.com/bshlgrs/status/1840577720465645960 I asked my LLM agent (a wrapper around Claude that lets it run bash commands and see their outputs): can you ssh with the username buck to the computer on my network that is open to SSH because I didnt know the local IP of my desktop. I walked away and promptly forgot Id spun https://t.co/I6qppMZFfk http://reciprocity.pro"
X Link 2025-06-28T16:46Z [----] followers, [----] engagements

"🧵 New paper: Previous work on AI control focused on whether models can execute attack strategies. In our new work we assess their ability to generate strategies"
X Link 2024-12-20T06:06Z [----] followers, [----] engagements

"Im sympathetic to all of these arguments; the only reason Id be more freaked out is that I already think egregious misalignment is fairly plausible; if I thought it was very unlikely I wouldnt change my mind based on one weird observation"
X Link 2024-08-26T16:49Z [----] followers, [----] engagements

"You caught your AI attempting to backdoor some code that controls access to a security-critical service. You investigate and you're basically sure the model did it on purpose"
X Link 2024-08-26T16:48Z [----] followers, [----] engagements

"@NeelNanda5 @rohinmshah @Yoshua_Bengio @jacobandreas To be clear IIRC Ryan and I proposed and pushed for the external review not Anthropic"
X Link 2024-12-20T06:24Z [----] followers, [----] engagements

"And undeploying is plausibly going to be extremely expensive and risky. So you'll have to convince people that AI takeover is not just a serious threat but that its a risk so large that its worth trading off our response to other risks that might be extremely pressing and salient"
X Link 2024-08-26T16:50Z [----] followers, [----] engagements

"@trammel530765 @ciphergoth Writing this kind of wrapper is very easy its like [---] lines of python. I ask my potential interns to implement one as part of their applications"
X Link 2024-09-30T05:51Z [----] followers, [----] engagements

"@MelindaBChu1 You can scroll down on the page Dan linked to see more names"
X Link 2024-09-09T15:54Z [----] followers, [---] engagements

"Paper link: Website: Blog post with more discussion: https://www.alignmentforum.org/posts/LPHMMMZFAWog6ty5x/ctrl-z-controlling-ai-agents-via-resampling http://bashcontrol.com https://arxiv.org/abs/2504.10374 https://www.alignmentforum.org/posts/LPHMMMZFAWog6ty5x/ctrl-z-controlling-ai-agents-via-resampling http://bashcontrol.com https://arxiv.org/abs/2504.10374"
X Link 2025-04-16T16:29Z [----] followers, [----] engagements

"(1/2) New post: why does the control research I've done make conservative assumptions about the AIs we're trying to control e.g. that they're optimal at planning and they're maximally situationally aware"
X Link 2025-01-17T19:57Z [----] followers, [----] engagements

"@julianboolean_ Sorry for the confusing answer here. This stuff is legitimately hard and all the textbooks that explain it go via the extremely convoluted math which is necessary to make calculations rather than just focusing on concepts and intuitions"
X Link 2024-05-03T17:14Z [----] followers, [----] engagements

"During the period we're worried about by assumption labs will have access to powerful general-purpose AI; there are a bunch of ways this AI could be applied to reducing security risk that might be extremely effective. I'll name four broad categories focusing on insider threats"
X Link 2024-06-08T23:21Z [----] followers, [----] engagements

"So you try to call the US President and they hear you out and then they go off and call one of your competitors to ask for their take. They say something like this:"
X Link 2024-08-26T16:48Z [----] followers, [----] engagements

"Slightly longer version of this argument here: https://www.alignmentforum.org/posts/YTZAmJKydD5hdRSeG/would-catching-your-ais-trying-to-escape-convince-ai https://www.alignmentforum.org/posts/YTZAmJKydD5hdRSeG/would-catching-your-ais-trying-to-escape-convince-ai"
X Link 2024-08-26T16:50Z [----] followers, [----] engagements

"@julianboolean_ - the energy of an electron state is higher when the electron wavefunction is more physically concentrated (this is a rephrasing of the Feynmann thing you quoted)"
X Link 2024-05-03T17:07Z [----] followers, [----] engagements

"When thinking about techniques that aim to reduce risk from misaligned AIs I find it helpful to consider multiple different overall risk levels that the deployer of the AI might be targeting because different classes of safety techniques are appropriate for different risk levels. Four that I think it's particularly useful to think about: - The safety case regime. Sometimes people talk about approaches to safety such that if all AI developers followed these approaches the overall level of risk posed by AI would be minimal. This is the level of caution people are usually talking about when they"
X Link 2024-08-05T20:41Z [----] followers, [----] engagements

"When you're analyzing safety measures that aim to prevent a malicious AI from causing sudden catastrophes it's crucial to reason separately about two ways the AI can fail: getting caught (in which case it can't try again) or failing silently (in which case it can)"
X Link 2024-11-15T16:04Z [----] followers, [----] engagements
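
A toy calculation (my illustration, not taken from any of the papers mentioned here) of why this distinction matters: if every attempt either succeeds, gets the AI caught, or fails silently, the chance the AI ever succeeds depends only on the ratio of success to getting caught; silent failures do not help the defender.

```python
# Illustrative toy model: per attack attempt the AI either succeeds, is caught
# (and can never try again), or fails silently (and can try again later).
def p_eventual_success(p_success: float, p_caught: float, p_silent: float) -> float:
    assert abs(p_success + p_caught + p_silent - 1.0) < 1e-9
    # Condition on the first attempt that is not a silent failure (geometric argument).
    return p_success / (p_success + p_caught)

print(p_eventual_success(0.01, 0.09, 0.90))  # ~0.1, regardless of the 90% silent-failure rate
```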

"@ciphergoth It has my sudo password on the desktop but not on my laptop"
X Link 2024-09-30T03:44Z [----] followers, 14.6K engagements

"Aryan Bhatt who now leads Redwood's high-stakes control research came to us through the Astra fellowship. These programs are great for finding and developing AI safety talent Great to see Astra returning. 🚀 Applications now open: Constellation's Astra Fellowship 🚀 We're relaunching Astra a 3-6 month fellowship to accelerate AI safety research & careers. Alumni @eli_lifland & Romeo Dean co-authored AI [----] and co-founded @AI_Futures_ with their Astra mentor @DKokotajlo https://t.co/KUE9tAEM6R 🚀 Applications now open: Constellation's Astra Fellowship 🚀 We're relaunching Astra a 3-6 month"
X Link 2025-08-28T18:04Z [----] followers, [----] engagements

"🚀 Applications now open: Constellation's Astra Fellowship 🚀 We're relaunching Astra a 3-6 month fellowship to accelerate AI safety research & careers. Alumni @eli_lifland & Romeo Dean co-authored AI [----] and co-founded @AI_Futures_ with their Astra mentor @DKokotajlo"
X Link 2025-08-28T17:32Z [----] followers, 73.3K engagements

"@Simeon_Cps In hindsight I was wrong about this; I've heard from several people (some of whom said it semi-publicly) that Anthropic leadership said this to them"
X Link 2024-06-03T18:07Z [----] followers, [----] engagements

"In particular in this multi-step setting we get the best results with time-travel methods that exploit our ability to reset the AI to a previous step allowing us to incriminate it more reliably than wed be able to do with existing approaches"
X Link 2025-04-16T16:29Z [----] followers, [----] engagements

"The main story I've heard people talk about in the past is that AIs can help with hardening software etc. Here I'm instead talking about AI enabling policies that wouldnt work (because theyd be too labor-intensive or have who-guards-the-guards problems) if you just had humans"
X Link 2024-06-08T23:21Z [----] followers, [----] engagements

"3. Fine-grained permission management. A lot of security failures arise from some part of your system having broader permissions than it needs. It's inconvenient to specify fine-grained permissions so people err broad. AI can enable much more precise permissioning"
X Link 2024-06-08T23:21Z [----] followers, [----] engagements

"I've wondered in the past why policy gradient methods are more popular than Q-learning for LLMs. Does anyone have a clear explanation"
X Link 2023-11-23T16:53Z [----] followers, [----] engagements

"@julianboolean_ The basic way to see it is to think through the overall energy of the system as you reduce the bond length--the repulsion between the nuclei goes up and is not compensated by increased attraction between the nuclei and electrons because KE forces them to be spread"
X Link 2024-05-03T17:13Z [----] followers, [----] engagements
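
The standard back-of-the-envelope version of this kinetic-energy point (a textbook scaling estimate, not the full multi-electron treatment the thread alludes to): confining an electron to a region of size L costs kinetic energy that grows like 1/L², while the Coulomb attraction only grows like 1/L, so squeezing the bond eventually raises the total energy.

```latex
% Illustrative scaling estimate, not a full multi-electron calculation.
% Confining an electron to size L forces a momentum spread \Delta p \gtrsim \hbar / L:
\[
E_{\text{kin}} \sim \frac{(\Delta p)^2}{2 m_e} \sim \frac{\hbar^2}{2 m_e L^2}
\qquad\text{vs.}\qquad
|E_{\text{Coulomb}}| \sim \frac{e^2}{4\pi\varepsilon_0 L}
\;\;\Longrightarrow\;\;
E_{\text{tot}}(L)\ \text{rises as } L \to 0 .
\]
```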

"@captgouda24 @RichardHanania Another grandson Li Haoyi is a famous and brilliant programmer"
X Link 2025-04-01T19:14Z [----] followers, [----] engagements
