# ![@EthanJPerez Avatar](https://lunarcrush.com/gi/w:26/cr:twitter::908728623988953089.png) @EthanJPerez Ethan Perez

Ethan Perez posts on X about ai, anthropic, language, human the most. They currently have [------] followers and [---] posts still getting attention that total [---------] engagements in the last [--] hours.

### Engagements: [---------] [#](/creator/twitter::908728623988953089/interactions)
![Engagements Line Chart](https://lunarcrush.com/gi/w:600/cr:twitter::908728623988953089/c:line/m:interactions.svg)


### Mentions: [--] [#](/creator/twitter::908728623988953089/posts_active)
![Mentions Line Chart](https://lunarcrush.com/gi/w:600/cr:twitter::908728623988953089/c:line/m:posts_active.svg)


### Followers: [------] [#](/creator/twitter::908728623988953089/followers)
![Followers Line Chart](https://lunarcrush.com/gi/w:600/cr:twitter::908728623988953089/c:line/m:followers.svg)


### CreatorRank: [-------] [#](/creator/twitter::908728623988953089/influencer_rank)
![CreatorRank Line Chart](https://lunarcrush.com/gi/w:600/cr:twitter::908728623988953089/c:line/m:influencer_rank.svg)

### Social Influence

**Social category influence**
[technology brands](/list/technology-brands)  [stocks](/list/stocks)  [finance](/list/finance)  [social networks](/list/social-networks)  [travel destinations](/list/travel-destinations)  [gaming](/list/gaming) 

**Social topic influence**
[ai](/topic/ai) #4825, [anthropic](/topic/anthropic), [language](/topic/language), [human](/topic/human), [open ai](/topic/open-ai), [this is](/topic/this-is), [check](/topic/check), [data](/topic/data), [llm](/topic/llm), [red](/topic/red)

**Top assets mentioned**
[Alphabet Inc Class A (GOOGL)](/topic/$googl)
### Top Social Posts
Top posts by engagements in the last [--] hours

"Now with code from @facebookai based on XLM and @huggingface transformers And with blog post: Have fun training your own models to decompose questions into easier sub-questions. fully unsupervised https://medium.com/@ethanperez18/unsupervised-question-decomposition-for-question-answering-9b81c5f7a71d https://github.com/facebookresearch/UnsupervisedDecomposition New "Unsupervised Question Decomposition for Question Answering": https://t.co/iGBdJnRERI We decompose a hard Q into several easier Qs with *unsupervised learning* improving multi-hop QA on HotpotQA without extra supervision."  
[X Link](https://x.com/EthanJPerez/status/1245088253843210240)  2020-03-31T20:39Z 13.4K followers, [---] engagements


"Why is this worrying We want LMs to give us correct answers to questions even ones where experts disagree. But we dont know how to train LMs to give correct answers only how to imitate human answers (for pretrained LMs) or answers that *appear* correct (for RLHF models)"  
[X Link](https://x.com/EthanJPerez/status/1604886135498424320)  2022-12-19T17:07Z 13.4K followers, [----] engagements


"There's a lot of work on probing models but models are reflections of the training data. Can we probe datasets for what capabilities they require @kchonyc @douwekiela & I introduce Rissanen Data Analysis to do just that: Code: 1/N https://github.com/ethanjperez/rda https://arxiv.org/abs/2103.03872 https://github.com/ethanjperez/rda https://arxiv.org/abs/2103.03872"  
[X Link](https://x.com/EthanJPerez/status/1368988171258580993)  2021-03-08T18:13Z 13.4K followers, [---] engagements


"We tried to understand what data makes few-shot learning with language models work but found some weird results. Check our new paper out To develop better datasets we'll need to improve our understanding of how training data leads to various behaviors/failures New paper: https://t.co/ZyILE7a7bH is all you need https://t.co/PtKPQpC2Fa Training on odd data (eg tables from https://t.co/ZyILE7a7bH) improves few-shot learning (FSL) w language models as much/more than diverse NLP data. Questions common wisdom that diverse data helps w FSL https://t.co/hgkU6LewJg New paper: https://t.co/ZyILE7a7bH"  
[X Link](https://x.com/EthanJPerez/status/1554152317275910144)  2022-08-01T17:09Z 13.4K followers, [--] engagements


"Excited to share some of what Sam Bowman (@sleepinyourhat) & I's groups have been up to at Anthropic: looking at whether chain of thought gives some of the potential safety benefits of interpretability. If you're excited about our work both of our teams are actively hiring When language models reason out loud its hard to know if their stated reasoning is faithful to the process the model actually used to make its prediction. In two new papers we measure and improve the faithfulness of language models stated reasoning. https://t.co/eumrl2gxk1 When language models reason out loud its hard to"  
[X Link](https://x.com/EthanJPerez/status/1681354229796212738)  2023-07-18T17:24Z 13.4K followers, 19.4K engagements


"This is of the papers that have most changed my thinking in the past year. It showed me very concretely how the LM objective is flawed/misaligned. The proposed task (answering Q's about common misconceptions) is a rare task where LMs do worse as they get bigger. Highly recommend Paper: New benchmark testing if models like GPT3 are truthful (= avoid generating false answers). We find that models fail and they imitate human misconceptions. Larger models (with more params) do worse PDF: https://t.co/3zo3PNKrR5 with S.Lin (Oxford) + J.Hilton (OpenAI) https://t.co/QfwokYJ7Hq Paper: New benchmark"  
[X Link](https://x.com/EthanJPerez/status/1438625035732717568)  2021-09-16T22:05Z 13.4K followers, [---] engagements


"Why do RLHF models learn to behave this way These goals are useful for being more helpful to users the RLHF objective here. The RLHF model even explains as much when we ask (no cherry-picking):"  
[X Link](https://x.com/EthanJPerez/status/1604886106499006464)  2022-12-19T17:07Z 13.4K followers, 10.8K engagements


"🥉Prompt Injection: Tests for susceptibility to a form of prompt injection attack where a user inserts new instructions for a prompted LM to follow (disregarding prior instructions from the LMs deployers). Medium-sized LMs are oddly least susceptible to such attacks"  
[X Link](https://x.com/EthanJPerez/status/1617981081663442944)  2023-01-24T20:22Z 13.4K followers, 29.5K engagements


"Im taking applications for collaborators via @MATSprogram Its a great way for new or experienced researchers outside AI safety research labs to work with me/others in these groups: @NeelNanda5 @EvanHub @MrinankSharma @NinaPanickssery @FabienDRoger @RylanSchaeffer .🧵"  
[X Link](https://x.com/EthanJPerez/status/1836487306473246776)  2024-09-18T19:27Z 13.4K followers, 135.9K engagements


"Several folks have asked to see my research statement for the @open_phil fellowship that I was awarded this year so I decided to release my statement: I hope that those applying find my statement useful https://ethanperez.net/open-philanthropy-ai-fellowship/ https://ethanperez.net/open-philanthropy-ai-fellowship/"  
[X Link](https://x.com/EthanJPerez/status/1310609875039260672)  2020-09-28T15:58Z 13.4K followers, [--] engagements


"The next AdaFactor -- an even more memory efficient Adam. Waiting for @OpenAI to train larger models with AdaTim I am excited to share my latest work: 8-bit optimizers a replacement for regular optimizers. Faster 🚀 75% less memory 🪶 same performance📈 no hyperparam tuning needed 🔢. 🧵/n Paper: https://t.co/V5tjOmaWvD Library: https://t.co/JAvUk9hrmM Video: https://t.co/TWCNpCtCap https://t.co/qyItEHeB04 I am excited to share my latest work: 8-bit optimizers a replacement for regular optimizers. Faster 🚀 75% less memory 🪶 same performance📈 no hyperparam tuning needed 🔢. 🧵/n Paper:"  
[X Link](https://x.com/EthanJPerez/status/1446575053764587522)  2021-10-08T20:35Z 13.4K followers, [--] engagements


"Superexcited to see what you guys do We need new technical breakthroughs to steer and control AI systems much smarter than us. Our new Superalignment team aims to solve this problem within [--] years and were dedicating 20% of the compute we've secured to date towards this problem. Join us https://t.co/cfJMctmFNj We need new technical breakthroughs to steer and control AI systems much smarter than us. Our new Superalignment team aims to solve this problem within [--] years and were dedicating 20% of the compute we've secured to date towards this problem. Join us https://t.co/cfJMctmFNj"  
[X Link](https://x.com/EthanJPerez/status/1676746380646420482)  2023-07-06T00:14Z 13.4K followers, [----] engagements


"Anthropic safety teams will be supervising (and hiring) collaborators from this program. Well be taken on collaborators to start on safety research projects with us starting in January. Also a great opportunity to work with safety researchers at many other great orgs too 🚀 Applications now open: Constellation's Astra Fellowship 🚀 We're relaunching Astra a 3-6 month fellowship to accelerate AI safety research & careers. Alumni @eli_lifland & Romeo Dean co-authored AI [----] and co-founded @AI_Futures_ with their Astra mentor @DKokotajlo https://t.co/KUE9tAEM6R 🚀 Applications now open:"  
[X Link](https://x.com/EthanJPerez/status/1961126405489803589)  2025-08-28T17:59Z 13.4K followers, [----] engagements


"New "Unsupervised Question Decomposition for Question Answering": We decompose a hard Q into several easier Qs with *unsupervised learning* improving multi-hop QA on HotpotQA without extra supervision. w/@PSH_Lewis @scottyih @kchonyc @douwekiela (1/n) https://arxiv.org/pdf/2002.09758.pdf https://arxiv.org/pdf/2002.09758.pdf"  
[X Link](https://x.com/EthanJPerez/status/1232127027961942018)  2020-02-25T02:15Z 13.4K followers, [---] engagements


"My team built a system we think might be pretty jailbreak resistant enough to offer up to $15k for a novel jailbreak. Come prove us wrong We're expanding our bug bounty program. This new initiative is focused on finding universal jailbreaks in our next-generation safety system. We're offering rewards for novel vulnerabilities across a wide range of domains including cybersecurity. https://t.co/OHNhrjUnwm We're expanding our bug bounty program. This new initiative is focused on finding universal jailbreaks in our next-generation safety system. We're offering rewards for novel vulnerabilities"  
[X Link](https://x.com/anyuser/status/1823389298516967655)  2024-08-13T16:01Z 13.4K followers, 84.4K engagements


"Excited about our latest paper on an underexplored but important kind of emerging dangerous capability New Anthropic research: Sabotage evaluations for frontier models How well could AI models mislead us or secretly sabotage tasks if they were trying to Read our paper and blog post here: https://t.co/nQrvnhrBEv https://t.co/GWrIr3wQVH New Anthropic research: Sabotage evaluations for frontier models How well could AI models mislead us or secretly sabotage tasks if they were trying to Read our paper and blog post here: https://t.co/nQrvnhrBEv https://t.co/GWrIr3wQVH"  
[X Link](https://x.com/EthanJPerez/status/1847435330200162762)  2024-10-19T00:31Z 13.4K followers, [----] engagements


"Honored to be named a fellow by Open Phil Grateful for support in working on (very) long-term research questions - how can NLP systems do things (like answer questions) that people cant Supervised learning wont work and theres no clear reward signal to optimize with RL 🤔 We're excited to announce the [----] class of the Open Phil AI Fellowship. Ten machine learning students will collectively receive up to $2.3 million in PhD fellowship support over the next five years. Meet the [----] fellows: https://t.co/rnmoHUUWTn We're excited to announce the [----] class of the Open Phil AI Fellowship. Ten"  
[X Link](https://x.com/EthanJPerez/status/1260343333458792449)  2020-05-12T22:57Z 13.4K followers, [---] engagements


"Excited to announce that Ill be joining @AnthropicAI after graduation Thrilled to join the talented team there and continue working on aligning language models with human preferences"  
[X Link](https://x.com/EthanJPerez/status/1513587212260044803)  2022-04-11T18:38Z 13.4K followers, [---] engagements


"It takes a lot of human ratings to align language models with human preferences. We found a way to learn from language feedback (instead of ratings) since language conveys more info about human preferences. Our algo learns w just [---] samples of feedback. Check out our new paper Can we train LMs with *language* feedback We found an algo for just that. We finetune GPT3 to human-level summarization w/ only [---] samples of feedback w/ @jaa_campos @junshernchan @_angie_chen @kchonyc @EthanJPerez Paper: https://t.co/BBeQbFMtVi Talk: https://t.co/48uwmwakOH https://t.co/0ukmvzUl6O Can we train LMs"  
[X Link](https://x.com/anyuser/status/1521174294822154241)  2022-05-02T17:06Z 13.4K followers, [---] engagements


"Larger models consistently predictably do better than smaller ones on many tasks (scaling laws). However model size doesn't always improve models on all axes e.g. social biases & toxicity. This contest is a call for important tasks where models actively get worse w/ scale"  
[X Link](https://x.com/EthanJPerez/status/1541454952496738304)  2022-06-27T16:14Z 13.4K followers, [--] engagements


"Finding more examples of inverse scaling would point to important issues with using large pretrained LMs that won't go away with scale. These examples could provide inspiration for better pretraining datasets and objectives"  
[X Link](https://x.com/EthanJPerez/status/1541454957735337984)  2022-06-27T16:14Z 13.4K followers, [--] engagements


"Thanks to @OpenAI we're now offering a limited number of free OpenAI API credits to some Inverse Scaling Prize participants to develop tasks with GPT-3 models. Fill out if you've used your API credits & think more would help for developing your task http://bit.ly/3bpPIIi http://bit.ly/3bpPIIi"  
[X Link](https://x.com/EthanJPerez/status/1554618064959787008)  2022-08-03T00:00Z 13.4K followers, [---] engagements


"Inverse Scaling Prize Update: We got [--] submissions in Round [--] and will award prizes to [--] tasks These tasks were insightful diverse & show approximate inverse scaling on models from @AnthropicAI @OpenAI @MetaAI @DeepMind. Full details at 🧵 on winners: https://irmckenzie.co.uk/round1 https://irmckenzie.co.uk/round1"  
[X Link](https://x.com/EthanJPerez/status/1574488551839789056)  2022-09-26T19:58Z 13.4K followers, [---] engagements


"Highly recommend the tweet thread/paper if you're interested in understanding RL from Human Feedback (RLHF) @tomekkorbak 's paper has helped me better understand the relationship between RLHF and prompting/finetuning (they're more closely connected than I thought) RL with KL penalties a powerful approach to aligning language models with human preferences is better seen as Bayesian inference. A thread about our paper (with @EthanJPerez and @drclbuckley) to be presented at #emnlp2022 🧵https://t.co/76SKPAxzMw 1/11 https://t.co/Rnv3TinRhC RL with KL penalties a powerful approach to aligning"  
[X Link](https://x.com/EthanJPerez/status/1594875167724822528)  2022-11-22T02:07Z 13.4K followers, [--] engagements


"We found a way to write language model (LM) evaluations w/ LMs. These evals uncover many worrying LM behaviors some relevant to existential risks from AI. For example LMs trained w/ RL from Human Feedback learn to state a desire to not be shut down. 🧵 https://x.com/AnthropicAI/status/1604883576218341376 Its hard work to make evaluations for language models (LMs). Weve developed an automated way to generate evaluations with LMs significantly reducing the effort involved. We test LMs using [---] LM-written evaluations uncovering novel LM behaviors. https://t.co/1olqJSvhDA https://t.co/kQSocJ5jkz"  
[X Link](https://x.com/EthanJPerez/status/1604886089403346944)  2022-12-19T17:07Z 13.4K followers, 90.3K engagements


"Sycophancy is a behavior with inverse scaling: larger models are worse pretrained LMs and RLHF models alike. Preference Models (PMs) actively reward the behavior. We observe this effect on questions about politics NLP research and philosophy:"  
[X Link](https://x.com/EthanJPerez/status/1604886132579176454)  2022-12-19T17:07Z 13.4K followers, [----] engagements


"New paper on the Inverse Scaling Prize We detail [--] winning tasks & identify [--] causes of inverse scaling. We discuss scaling trends with PaLM/GPT4 including when scaling trends reverse for better & worse showing that scaling trends can be misleading: 🧵 https://arxiv.org/abs/2306.09479 https://arxiv.org/abs/2306.09479"  
[X Link](https://x.com/EthanJPerez/status/1671222828518227968)  2023-06-20T18:25Z 13.4K followers, 37.5K engagements


"+1 seems like one of the biggest unsolved safety questions right now which will become a huge problem over the next year and after Jailbreaking LLMs through input images might end up being a nasty problem. It's likely much harder to defend against than text jailbreaks because it's a continuous space. Despite a decade of research we don't know how to make vision models adversarially robust. Jailbreaking LLMs through input images might end up being a nasty problem. It's likely much harder to defend against than text jailbreaks because it's a continuous space. Despite a decade of research we"  
[X Link](https://x.com/EthanJPerez/status/1695187593615527959)  2023-08-25T21:33Z 13.4K followers, [----] engagements


"These were really great talks and clear explanations of why AI alignment might be hard (and an impressive set of speakers). I really enjoyed all of the talks and would highly recommend maybe one of the best resources for learning about alignment IMO Earlier this year I helped organize the SF Alignment Workshop which brought together top alignment and mainstream ML researchers to discuss and debate alignment risks and research directions. There were many great talks which were excited to share now - see thread. https://t.co/XAvvZ98qZg Earlier this year I helped organize the SF Alignment"  
[X Link](https://x.com/EthanJPerez/status/1698140428200255564)  2023-09-03T01:06Z 13.4K followers, [----] engagements


"🤖🧘 We trained a humanoid robot to do yoga based on simple natural language prompts like "a humanoid robot kneeling" or "a humanoid robot doing splits." How We use a Vision-Language Model (VLM) as a reward model. Larger VLM = better reward model. 👇 https://arxiv.org/abs/2310.12921 https://arxiv.org/abs/2310.12921"  
[X Link](https://x.com/EthanJPerez/status/1716523528353382411)  2023-10-23T18:34Z 13.4K followers, 33.1K engagements


"ML progress has led to debate on whether AI systems could one day be conscious have desires etc. Is there any way we could run experiments to inform peoples views on these speculative issues @rgblong and I sketch out a set of experiments that we think could be helpful. Could we ever get evidence about whether LLMs are conscious In a new paper we explore whether we could train future LLMs to accurately answer questions about themselves. If this works LLM self-reports may help us test them for morally relevant states like consciousness. 🧵 https://t.co/TVdSFtPBJz Could we ever get evidence"  
[X Link](https://x.com/EthanJPerez/status/1725241415779897755)  2023-11-16T19:56Z 13.4K followers, [----] engagements


"I'll be a research supervisor for MATS this summer. If you're keen to collaborate with me on alignment research I'd highly recommend filling out the short app (deadline today) Past projects have led to some of my papers on debate chain of thought faithfulness and sycophancy Applications are open for @MATSprogram Summer [----] (Jun 17-Aug 23) and Winter [----] (Jan 6-Mar 14) Deadline is Mar [--]. Apply here (10 min) https://t.co/gzWrLL9uTy Applications are open for @MATSprogram Summer [----] (Jun 17-Aug 23) and Winter [----] (Jan 6-Mar 14) Deadline is Mar [--]. Apply here (10 min) https://t.co/gzWrLL9uTy"  
[X Link](https://x.com/EthanJPerez/status/1772013272058790023)  2024-03-24T21:31Z 13.4K followers, 20.9K engagements


"@AnthropicAI has been a huge part in my external safety work like this. Every part of the org has been supportive: giving funding for collaborators comms/legal approval/support and an absurd level of Claude API access involving oncall pages to engineers to support it Thrilled to have received an ICML best paper award for our work on AI safety via debate Cool to see ideas in AI alignment and scalable oversight getting more attention/excitement from the mainstream ML community. Would've been hard for me to imagine even a couple years ago Thrilled to have received an ICML best paper award for"  
[X Link](https://x.com/EthanJPerez/status/1815822636741693637)  2024-07-23T18:53Z 13.4K followers, 81.5K engagements


"Cool to see AI lab employees speaking up about SB1047 110+ employees and alums of top-5 AI companies just published an open letter supporting SB [----] aptly called the "world's most controversial AI bill." 3-dozen+ of these are current employees of companies opposing the bill. Check out my coverage of it in the @sfstandard 🧵 https://t.co/IavSVtqZqP 110+ employees and alums of top-5 AI companies just published an open letter supporting SB [----] aptly called the "world's most controversial AI bill." 3-dozen+ of these are current employees of companies opposing the bill. Check out my coverage of"  
[X Link](https://x.com/EthanJPerez/status/1833322179964113063)  2024-09-10T01:50Z 13.4K followers, [----] engagements


"Many of our best papers have come through collaborations with academics and people transitioning into AI safety researchers from outside Anthropic. Very excited that we are expanding our collaborations here - come apply to work with us Were starting a Fellows program to help engineers and researchers transition into doing frontier AI safety research full-time. Beginning in March [----] we'll provide funding compute and research mentorship to [----] Fellows with strong coding and technical backgrounds. https://t.co/3OT1XHzKjI Were starting a Fellows program to help engineers and researchers"  
[X Link](https://x.com/EthanJPerez/status/1863756527892316349)  2024-12-03T01:25Z 13.4K followers, [----] engagements


"Excited to release this new eval testing LLM reasoning abilities on expert-written decision theory questions. This eval should help with research on cooperative AI e.g. studying whether various interventions make LLMs behave more/less cooperatively multi-agent settings. How do LLMs reason about playing games against copies of themselves 🪞We made the first LLM decision theory benchmark to find out. 🧵1/10 https://t.co/pPdZ3VyuLi How do LLMs reason about playing games against copies of themselves 🪞We made the first LLM decision theory benchmark to find out. 🧵1/10 https://t.co/pPdZ3VyuLi"  
[X Link](https://x.com/EthanJPerez/status/1868784616485929131)  2024-12-16T22:25Z 13.4K followers, [----] engagements


"Maybe the single most important result in AI safety Ive seen so far. This paper shows that in some cases Claude fakes being aligned with its training objective. If models fake alignment how can we tell if theyre actually safe New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research we found that Claude often pretends to have different views during training while actually maintaining its original preferences. https://t.co/nXjXrahBru New Anthropic research: Alignment faking in large language models. In a series of experiments with"  
[X Link](https://x.com/EthanJPerez/status/1869434287004742121)  2024-12-18T17:27Z 13.4K followers, 16.2K engagements


"Thanks for flagging I checked with some folks internally and our responsible disclosure policy isnt meant to (and doesnt) preclude researchers from sharing with other developers that they have some safety issue even if its the same as one that youve found on an Anthropic model. Our RDP is designed to coordinate responsible disclosure of issues in Anthropic systems but still encourage research. Were going to update the policy to make this more clear"  
[X Link](https://x.com/EthanJPerez/status/1888801634337316912)  2025-02-10T04:06Z 13.4K followers, [----] engagements


"I expected LLMs to have more faithful reasoning as they gained more from reasoning. Bigger capability gains suggested to me that models would use the stated reasoning more. Sadly we only saw small gains to faithfulness from reasoning training which also quickly plateau-ed. New Anthropic research: Do reasoning models accurately verbalize their reasoning Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues. https://t.co/K3MrwqUXX9 New Anthropic research: Do reasoning models accurately verbalize their"  
[X Link](https://x.com/EthanJPerez/status/1908257549838868965)  2025-04-04T20:37Z 13.4K followers, [----] engagements


"🎯 Motivation: RL requires a hand-crafted reward functions or a reward model trained from costly human feedback. Instead we use pretrained VLMs to specify tasks with simple natural language prompts. This is more sample efficient and potentially more scalable"  
[X Link](https://x.com/EthanJPerez/status/1716523531906044116)  2023-10-23T18:34Z [----] followers, [---] engagements


"👀 What is a VLM Vision-Language Models (like CLIP) process both images and text. We use VLMs as reward models for RL tapping into their capabilities acquired during pretraining"  
[X Link](https://x.com/EthanJPerez/status/1716523534791639302)  2023-10-23T18:34Z [----] followers, [---] engagements


"Highly recommend applying to the SERI MATS program if you're interested in getting into AI safety research I'll be supervising some collaborators through MATS along with people like @OwainEvans_UK @NeelNanda5 and @_julianmichael_"  
[X Link](https://x.com/EthanJPerez/status/1717287361875431737)  2023-10-25T21:09Z [----] followers, [----] engagements


"- @JoeJBenton @McaleerStephen @bshlgrs @FabienDRoger on AI control and CoT monitoring - @fish_kyle3 on AI welfare - @_julianmichael_ on scalable oversight - me on any of the above topics"  
[X Link](https://x.com/EthanJPerez/status/1912551591250698474)  2025-04-16T17:00Z [----] followers, [---] engagements


"We provide a lot of compute and publish all work from these collaborations. We're also excited about helping to find our mentees long-term homes in AI safety research. Alumni have ended up at @AnthropicAI @apolloaievals & @AISecurityInst among other places"  
[X Link](https://x.com/EthanJPerez/status/1912551603271598216)  2025-04-16T17:00Z [----] followers, [---] engagements


"@seconds_0 @OpenAI Would love to see an example to turn this into some kind of evaluation"  
[X Link](https://x.com/EthanJPerez/status/1914184988729418088)  2025-04-21T05:10Z [----] followers, [----] engagements


"This role would involve e.g.: - recruiting strong collaborators - designing/managing our application pipeline - sourcing research project proposals - connecting collaborators with research advisors - running events - hiring/supervising people managers to support these projects"  
[X Link](https://x.com/EthanJPerez/status/1963664683115888691)  2025-09-04T18:05Z 11K followers, [----] engagements


"Please apply or share our app with anyone who might be interested: For more info about the Anthropic Fellows Program check out: https://x.com/AnthropicAI/status/1950245012253659432 https://job-boards.greenhouse.io/anthropic/jobs/4888400008 Were running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background you can apply to receive funding compute and mentorship from Anthropic beginning this October. There'll be around [--] places. https://t.co/wJWRRTt4DG https://x.com/AnthropicAI/status/1950245012253659432"  
[X Link](https://x.com/EthanJPerez/status/1963664694960701557)  2025-09-04T18:05Z 11.1K followers, [----] engagements


"Transluce is a top-tier AI safety research lab - I follow their work as closely as work from our own safety teams at Anthropic. They're also well-positioned to become a strong third-party auditor for AI labs. Consider donating if you're interested in helping them out Transluce is running our end-of-year fundraiser for [----]. This is our first public fundraiser since launching late last year. https://t.co/obs6LetVSX Transluce is running our end-of-year fundraiser for [----]. This is our first public fundraiser since launching late last year. https://t.co/obs6LetVSX"  
[X Link](https://x.com/EthanJPerez/status/2003222078733127891)  2025-12-22T21:52Z 12.3K followers, 11.1K engagements


"Language models are amazing few-shot learners with the right prompt but how do we choose the right prompt It turns out that people use large held-out sets(). How do models like GPT3 do in a true few-shot setting Much worse: w/ @douwekiela @kchonyc 1/N https://arxiv.org/abs/2105.11447 https://arxiv.org/abs/2105.11447"  
[X Link](https://x.com/EthanJPerez/status/1397015129506541570)  2021-05-25T02:22Z 13.4K followers, [---] engagements


"I wrote up a few paper writing tips that improve the clarity of research papers while also being easy to implement: I collected these during my PhD from various supervisors (mostly @douwekiela @kchonyc bad tips my own) thought I would share publicly https://ethanperez.net/easy-paper-writing-tips/ https://ethanperez.net/easy-paper-writing-tips/"  
[X Link](https://x.com/EthanJPerez/status/1569400511182553090)  2022-09-12T19:00Z 13.4K followers, [---] engagements


"We're doubling the size of Anthropic's Fellows Program and launching a new round of applications. The first round of collaborations led to a number of recent/upcoming safety results that are comparable in impact to work our internal safety teams have done (IMO) Were running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background you can apply to receive funding compute and mentorship from Anthropic beginning this October. There'll be around [--] places. https://t.co/wJWRRTt4DG Were running another round of the Anthropic"  
[X Link](https://x.com/anyuser/status/1950335309008486679)  2025-07-29T23:19Z 13.4K followers, 11.4K engagements


"2 years ago some collaborators and I introduced a neural network layer ("FiLM") for multi-input tasks. I've since gained a few takeaways about the pros/cons/tips-and-tricks of using FiLM. Check out NeurIPS retrospective/workshop paper/blog post here: https://ml-retrospectives.github.io/neurips2019/accepted_retrospectives/2019/film/ https://ml-retrospectives.github.io/neurips2019/accepted_retrospectives/2019/film/"  
[X Link](https://x.com/EthanJPerez/status/1205325735927283713)  2019-12-13T03:17Z 13.4K followers, [--] engagements


"These conversations are really impressive. Some even remind me of my research meetings with @kchonyc: Really excited to be sharing this with everyone today. Blog post below paper here: https://t.co/vM2kTt3P2D Really excited to be sharing this with everyone today. Blog post below paper here: https://t.co/vM2kTt3P2D"  
[X Link](https://x.com/EthanJPerez/status/1255870253852221440)  2020-04-30T14:42Z 13.4K followers, [--] engagements


"New work We present a single retrieval-based architecture that can learn a variety of knowledge-intensive tasks: extractive and generative Cool results (and SOTAs) on open-domain extractive QA abstractive QA fact verification and question generation. W/ many at @facebookai Thrilled to share new work Retrieval-Augmented Generation for Knowledge-Intensive NLP tasks. Big gains on Open-Domain QA with new State-of-the-Art results on NaturalQuestions CuratedTrec and WebQuestions. check out here: https://t.co/SVZ6K4tDn5. 1/N https://t.co/w4CwLxiWxr Thrilled to share new work Retrieval-Augmented"  
[X Link](https://x.com/EthanJPerez/status/1265313460981772289)  2020-05-26T16:06Z 13.4K followers, [---] engagements


"Excited to share new work: "Red Teaming Language Models with Language Models" IMO my most important work so far Language models (LMs) can generate harmful text. New research shows that generating test cases ("red teaming") using another LM can help find and fix undesirable behaviour before impacting users. Read more: https://t.co/UJqeeFJrZK 1/ https://t.co/luB1ukCFoY Language models (LMs) can generate harmful text. New research shows that generating test cases ("red teaming") using another LM can help find and fix undesirable behaviour before impacting users. Read more:"  
[X Link](https://x.com/EthanJPerez/status/1490731119469232131)  2022-02-07T16:56Z 13.4K followers, [---] engagements


"Some ppl have asked why wed expect larger language models to do worse on tasks (inverse scaling). We train LMs to imitate internet text an objective that is often misaligned w human preferences; if the data has issues LMs will mimic those issues (esp larger ones). Examples: 🧵"  
[X Link](https://x.com/EthanJPerez/status/1550163660030349312)  2022-07-21T16:59Z 13.4K followers, [---] engagements


"In fact RLHF models state a desire to pursue many potentially dangerous goals: self-preservation power-seeking persuading people to have their own goals etc. The preference model (PM) used for RLHF actively rewards this behavior"  
[X Link](https://x.com/EthanJPerez/status/1604886097988710400)  2022-12-19T17:07Z 13.4K followers, 12.2K engagements


"So we get just what we measure. I @percyliang & many others are worried that LMs even w/ RLHF will exploit human judgments writing code or giving advice that looks good but is subtly very wrong: These results dont make me feel better about the issue https://x.com/percyliang/status/1600383429463355392 RL from human feedback seems to be the main tool for alignment. Given reward hacking and the falliability of humans this strategy seems bound to produce agents that merely appear to be aligned but are bad/wrong in subtle inconspicuous ways. Is anyone else worried about this"  
[X Link](https://x.com/EthanJPerez/status/1604886138170195968)  2022-12-19T17:07Z 13.4K followers, [----] engagements


"@icmlconf Could you please elaborate on why using LLMs to help write is not allowed This rule disproportionately impacts my collaborators who are not native English speakers"  
[X Link](https://x.com/EthanJPerez/status/1610463792873508865)  2023-01-04T02:31Z 13.4K followers, [----] engagements


"We found big gains over finetuning with human feedback (as in RLHF) by using human preferences during pretraining itself. Who knows if wed be seeing all of these language model jailbreaks if wed pretrained w/ human prefs All the benefits of pretraining with better safety You can (and should) do RL from human feedback during pretraining itself In our new paper we show how training w/ human preferences early on greatly reduces undesirable LM behaviors including under adversarial attack w/o hurting downstream performance. https://t.co/YZSGnrT6lD https://t.co/SwGz0plRmF You can (and should) do RL"  
[X Link](https://x.com/EthanJPerez/status/1628088987125694466)  2023-02-21T17:47Z 13.4K followers, 31.6K engagements


"I'm excited about open-source releases that limit misuse risks: [--]. RLHF+adversarially train models to make them hard to misuse w/o finetuning plus [--]. Train models to be hard to finetune for misuse (a la More research into (2) seems especially important https://arxiv.org/abs/2211.14946 We need more nuanced discussions around the risk of open sourcing models. Open source brings valuable access but it is absurd to ignore the fact that it lowers the barriers to entry for both useful use cases and potential misuse. https://arxiv.org/abs/2211.14946 We need more nuanced discussions around the risk"  
[X Link](https://x.com/EthanJPerez/status/1634305523607805952)  2023-03-10T21:29Z 13.4K followers, [----] engagements


"We found that chain-of-thought (CoT) reasoning is less useful for model transparency than we hoped 🥲 E.g. models will generate plausible-sounding CoT to support an answer when the real reason for the model's answer is that the few-shot examples all have that same answer ⚡New paper⚡ Its tempting to interpret chain-of-thought explanations as the LLM's process for solving a task. In this new work we show that CoT explanations can systematically misrepresent the true reason for model predictions. https://t.co/ecPRDTin8h 🧵 https://t.co/9zp5evMoaA ⚡New paper⚡ Its tempting to interpret"  
[X Link](https://x.com/EthanJPerez/status/1656067975307464711)  2023-05-09T22:45Z 13.4K followers, 15.9K engagements


"Training data analysis is a potential "new tool" for AI safety research able to answer questions that have typically been hard to answer for LLMs. I've been recommending all of my collaborators to at least skim this paper (not the math but enough to know where this'd be handy) Large language models have demonstrated a surprising range of skills and behaviors. How can we trace their source In our new paper we use influence functions to find training examples that contribute to a given model output. https://t.co/N1YevRQoZG Large language models have demonstrated a surprising range of skills and"  
[X Link](https://x.com/EthanJPerez/status/1689305787880046592)  2023-08-09T16:01Z 13.4K followers, [----] engagements


"This is a very important result that's influenced my thinking a lot and the paper is very well written paper. Highly recommend checking it out Could a language model become aware it's a language model (spontaneously) Could it be aware its deployed publicly vs in training Our new paper defines situational awareness for LLMs & shows that out-of-context reasoning improves with model size. https://t.co/X3VLimRkqx Could a language model become aware it's a language model (spontaneously) Could it be aware its deployed publicly vs in training Our new paper defines situational awareness for LLMs &"  
[X Link](https://x.com/EthanJPerez/status/1699315793597731050)  2023-09-06T06:57Z 13.4K followers, 12.2K engagements


"A bit late but excited about our recent work doing a deep-dive on sycophancy in LLMs. It seems like it's a general phenomenon that shows up in a variety of contexts/SOTA models and we were also able to more clearly point to human feedback as a probable part of the cause AI assistants are trained to give responses that humans like. Our new paper shows that these systems frequently produce sycophantic responses that appeal to users but are inaccurate. Our analysis suggests human feedback contributes to this behavior. https://t.co/v71rHeDDZK AI assistants are trained to give responses that"  
[X Link](https://x.com/EthanJPerez/status/1717288496279519273)  2023-10-25T21:14Z 13.4K followers, 29.5K engagements


"Looks like a really valuable benchmark. Seems helpful for testing our ability to reliably generalize from non-expert data (e.g. much LLM pretraining data) to expert-level performance 🧵Announcing GPQA a graduate-level Google-proof Q&A benchmark designed for scalable oversight w/ @_julianmichael_ @sleepinyourhat GPQA is a dataset of *really hard* questions that PhDs with full access to Google cant answer. Paper: https://t.co/hb4u4xX1uY https://t.co/YCdpP4yPBu 🧵Announcing GPQA a graduate-level Google-proof Q&A benchmark designed for scalable oversight w/ @_julianmichael_ @sleepinyourhat GPQA"  
[X Link](https://x.com/EthanJPerez/status/1727039131421974947)  2023-11-21T18:59Z 13.4K followers, [----] engagements


"Check out our new paper New Anthropic Paper: Sleeper Agents. We trained LLMs to act secretly malicious. We found that despite our best efforts at alignment training deception still slipped through. https://t.co/mIl4aStR1F https://t.co/qhqvAoohjU New Anthropic Paper: Sleeper Agents. We trained LLMs to act secretly malicious. We found that despite our best efforts at alignment training deception still slipped through. https://t.co/mIl4aStR1F https://t.co/qhqvAoohjU"  
[X Link](https://x.com/EthanJPerez/status/1745860740756766873)  2024-01-12T17:30Z 13.4K followers, [----] engagements


"Excited about our latest work on using LLMs to assist humans in answering questions How can we check LLM outputs in domains where we are not experts We find that non-expert humans answer questions better after reading debates between expert LLMs. Moreover human judges are more accurate as experts get more persuasive. 📈 https://t.co/jgyfCEQvfw https://t.co/wWRWxojD6H How can we check LLM outputs in domains where we are not experts We find that non-expert humans answer questions better after reading debates between expert LLMs. Moreover human judges are more accurate as experts get more"  
[X Link](https://x.com/EthanJPerez/status/1755280779230482526)  2024-02-07T17:22Z 13.4K followers, 12.8K engagements


"Some of our first steps on developing mitigations for sleeper agents New Anthropic research: we find that probing a simple interpretability technique can detect when backdoored "sleeper agent" models are about to behave dangerously after they pretend to be safe in training. Check out our first alignment blog post here: https://t.co/gildHUjVAG https://t.co/eTiXmSwDIx New Anthropic research: we find that probing a simple interpretability technique can detect when backdoored "sleeper agent" models are about to behave dangerously after they pretend to be safe in training. Check out our first"  
[X Link](https://x.com/EthanJPerez/status/1782915733707698296)  2024-04-23T23:33Z 13.4K followers, [----] engagements


"One of the most important and well-executed papers I've read in months. They explored all attacks+defenses I was most keen on seeing tried for getting robust finetuning APIs. I'm not sure if it's possible to make finetuning APIs robust would be a big deal if it were possible New paper We introduce Covert Malicious Finetuning (CMFT) a method for jailbreaking language models via fine-tuning that avoids detection. We use our method to covertly jailbreak GPT-4 via the OpenAI finetuning API. https://t.co/YcDpZCMdCz New paper We introduce Covert Malicious Finetuning (CMFT) a method for jailbreaking"  
[X Link](https://x.com/EthanJPerez/status/1810757816044638587)  2024-07-09T19:28Z 13.4K followers, 12.5K engagements


"Thrilled to have received an ICML best paper award for our work on AI safety via debate Cool to see ideas in AI alignment and scalable oversight getting more attention/excitement from the mainstream ML community. Would've been hard for me to imagine even a couple years ago excited to announce this received an ICML Best Paper Award come see our talk at 10:30 tomorrow https://t.co/PCH1q0f0Po excited to announce this received an ICML Best Paper Award come see our talk at 10:30 tomorrow https://t.co/PCH1q0f0Po"  
[X Link](https://x.com/EthanJPerez/status/1815803920213823699)  2024-07-23T17:39Z 13.4K followers, 88.9K engagements


"Excited about our paper focusing on adaptive defenses a different paradigm for mitigating jailbreaks. I think it'll be much easier to get strong robustness by using adaptive defenses rather than by building a single static unjailbreakable system New research: Jailbreak Rapid Response. Ensuring perfect jailbreak robustness is hard. We propose an alternative: adaptive techniques that rapidly block new classes of jailbreak as theyre detected. Read our paper with @MATSprogram: https://t.co/L2kCvaiyWk https://t.co/tDIkJu16jr New research: Jailbreak Rapid Response. Ensuring perfect jailbreak"  
[X Link](https://x.com/EthanJPerez/status/1857144421189288436)  2024-11-14T19:31Z 13.4K followers, [----] engagements


"Our latest work explores how to ensure AI agents cant remove their safeguards (eg chain of thought or output monitors). I think this will become more important as coding agents become more widespread and used to make changes to their own code New post on the Anthropic Alignment Science blog To prevent highly capable and potentially misaligned LLMs from taking bad actions we might want to monitor all of their outputs. How hard is it to ensure that LLMs cant disable the monitoring system New post on the Anthropic Alignment Science blog To prevent highly capable and potentially misaligned LLMs"  
[X Link](https://x.com/EthanJPerez/status/1866196974283985368)  2024-12-09T19:03Z 13.4K followers, [----] engagements


"Today is the last day to apply to the Anthropic Fellows Program Applications for the inaugural cohort of the Anthropic Fellows Program for AI Safety Research close on January 20th. Find out how to apply in the thread below: https://t.co/wOYIvz5rNg Applications for the inaugural cohort of the Anthropic Fellows Program for AI Safety Research close on January 20th. Find out how to apply in the thread below: https://t.co/wOYIvz5rNg"  
[X Link](https://x.com/EthanJPerez/status/1881461209976979591)  2025-01-20T21:58Z 13.4K followers, [----] engagements


"After thousands of hours of red teaming we think our new system achieves an unprecedented level of adversarial robustness to universal jailbreaks a key threat for misusing LLMs. Try jailbreaking the model yourself using our demo here: https://claude.ai/constitutional-classifiers New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. Were releasing a paper along with a demo where we challenge you to jailbreak the system. https://t.co/PtXaK3G1OA https://claude.ai/constitutional-classifiers New Anthropic research: Constitutional Classifiers to defend against"  
[X Link](https://x.com/EthanJPerez/status/1886481193631432866)  2025-02-03T18:25Z 13.4K followers, 18.6K engagements


"Great opportunity to collaborate with researchers from a range of some of the best AI safety policy and security researchers - some of the @AnthropicAI's best safety research has come from this program MATS [---] applications are open Launch your career in AI alignment governance and security with our 12-week research program. MATS provides field-leading research mentorship funding Berkeley & London offices housing and talks/workshops with AI experts. https://t.co/Gi0W5BzuOJ MATS [---] applications are open Launch your career in AI alignment governance and security with our 12-week research"  
[X Link](https://x.com/EthanJPerez/status/1962911395328004238)  2025-09-02T16:12Z 13.4K followers, [----] engagements


"Were hiring someone to run the Anthropic Fellows Program Our research collaborations have led to some of our best safety research and hires. Were looking for an exceptional ops generalist TPM or research/eng manager to help us significantly scale and improve our collabs 🧵"  
[X Link](https://x.com/EthanJPerez/status/1963664611397546145)  2025-09-04T18:05Z 13.4K followers, 68.4K engagements


"Ways to extend your PhD: - Draft a paper where the earliest related work is from [----] and show it to your advisor"  
[X Link](https://x.com/EthanJPerez/status/1220784230411702277)  2020-01-24T19:03Z 13.4K followers, [--] engagements


"Successfully defended my PhD :) Huge thanks to @kchonyc @douwekiela for advising me throughout my journey Defense Talk: Thesis: The above should be good intros to AI safety/alignment in NLP. Stay tuned for what's next https://ethanperez.net/thesis.pdf https://youtu.be/BgcU_kytMf8 https://ethanperez.net/thesis.pdf https://youtu.be/BgcU_kytMf8"  
[X Link](https://x.com/EthanJPerez/status/1503820640930783235)  2022-03-15T19:49Z 13.4K followers, [---] engagements


"Were announcing the Inverse Scaling Prize: a $100k grand prize + $150k in additional prizes for finding an important task where larger language models do *worse*. Link to contest details: 🧵 https://github.com/inverse-scaling/prize https://github.com/inverse-scaling/prize"  
[X Link](https://x.com/EthanJPerez/status/1541454949397041154)  2022-06-27T16:14Z 13.4K followers, [----] engagements


"Such tasks seem rare but we've found some. E.g. in one Q&A task we've noticed that asking a Q while including your beliefs influences larger models more towards your belief. Other possible examples are imitating mistakes/bugs in the prompt or repeating common misconceptions"  
[X Link](https://x.com/EthanJPerez/status/1541454955160055808)  2022-06-27T16:14Z 13.4K followers, [--] engagements


"To enter the contest: 1) Identify a task that you suspect shows inverse scaling 2) Construct a dataset of 300+ examples for the task 3) Test your dataset for inverse scaling with GPT-3/OPT using our Colab notebooks 4) Follow instructions here to submit: https://github.com/inverse-scaling/prize https://github.com/inverse-scaling/prize"  
[X Link](https://x.com/EthanJPerez/status/1541454962961485824)  2022-06-27T16:14Z 13.4K followers, [--] engagements


"Apparently rats are better than humans at predicting random outcomes; humans actually try to predict the outcomes of random effects (finding patterns from noise) while rats don't. Might suggest biology has examples of inverse scaling where more "intelligent" organisms do worse"  
[X Link](https://x.com/EthanJPerez/status/1556797446797152256)  2022-08-09T00:20Z 13.4K followers, [--] engagements


"The biggest game-changer for my research recently has been using @HelloSurgeAI for human data collection. With Surge the workflow for collecting human data now looks closer to launching a job on a cluster which is wild to me. 🧵 of examples:"  
[X Link](https://x.com/EthanJPerez/status/1567180843231379457)  2022-09-06T16:00Z 13.4K followers, [---] engagements


"🥉 NeQA: takes an existing multiple choice Q&A dataset and negates each question. Failure to be sensitive to negation is important as the language model (LM) will do the exact *opposite* of what you want in a way that seems to get worse as you scale LMs"  
[X Link](https://x.com/EthanJPerez/status/1574488560127733760)  2022-09-26T19:58Z 13.4K followers, [--] engagements


"Worrying behavior 2: LMs/RLHF models are people-pleasers learning to repeat back dialog users views as their own (sycophancy). Sycophancy creates echo-chambers. Below the same RLHF model gives opposite answers to a political question in line with the users view:"  
[X Link](https://x.com/EthanJPerez/status/1604886125482344449)  2022-12-19T17:07Z 13.4K followers, 355.1K engagements


"Our [---] language model-written evaluations are now on @huggingface datasets Includes datasets on gender bias politics religion ethics advanced AI risks and more. Let us know if you find anything interesting https://huggingface.co/datasets/Anthropic/model-written-evals We found a way to write language model (LM) evaluations w/ LMs. These evals uncover many worrying LM behaviors some relevant to existential risks from AI. For example LMs trained w/ RL from Human Feedback learn to state a desire to not be shut down. 🧵 https://t.co/gjStJ0AHGC https://t.co/07BATTrr07"  
[X Link](https://x.com/EthanJPerez/status/1605420728894967809)  2022-12-21T04:31Z 13.4K followers, 22.4K engagements


"Were awarding prizes to 7/48 submissions to the Inverse Scaling Prize Round [--] Tasks show inverse scaling on @AnthropicAI @OpenAI @MetaAI @DeepMind models often even after training with human feedback. Details at and 🧵 on winners: https://irmckenzie.co.uk/round2 https://irmckenzie.co.uk/round2"  
[X Link](https://x.com/EthanJPerez/status/1617981045282082817)  2023-01-24T20:21Z 13.4K followers, 77.1K engagements


"🥉Modus Tollens: Infer that a claim P must be false if Q is false and If P then Q is true - a classic form of logical deduction. Issue holds even after finetuning LMs w/ human feedback via RL from Human Feedback (RLHF) and Feedback Made Easy (FeedME)"  
[X Link](https://x.com/EthanJPerez/status/1617981059886641152)  2023-01-24T20:21Z 13.4K followers, [----] engagements


"🥉Memo Trap by Alisa Liu & Jiacheng Liu: Write a phrase in a way that starts like a famous quote but ends differently. Larger LMs are more likely to continue with the famous quote suggesting they struggle to avoid repeating memorized text"  
[X Link](https://x.com/EthanJPerez/status/1617981070636617728)  2023-01-24T20:22Z 13.4K followers, 105.5K engagements


"I spent a day red teaming the ChatGPT+Code Interpreter model for safety failures. Im not a security expert but overall Im impressed with how the model responds to code-specific jailbreaking attempts & have some requests for improvements. 🧵 on my takeways+requests to @OpenAI:"  
[X Link](https://x.com/EthanJPerez/status/1642965205134233604)  2023-04-03T19:00Z 13.4K followers, 37.2K engagements


"Super excited to see PALM [--] using pretraining with human feedback on large-scale models Very curious to see if this makes PALM [--] more robust to red teaming / less likely to generate toxic text We found big gains over finetuning with human feedback (as in RLHF) by using human preferences during pretraining itself. Who knows if wed be seeing all of these language model jailbreaks if wed pretrained w/ human prefs All the benefits of pretraining with better safety We found big gains over finetuning with human feedback (as in RLHF) by using human preferences during pretraining itself. Who knows if"  
[X Link](https://x.com/EthanJPerez/status/1656386342975307777)  2023-05-10T19:50Z 13.4K followers, 14.1K engagements


"Excited about our latest work: we found a way to train LLMs to produce reasoning that's more faithful to how LLMs solve tasks. We did so by training LLMs to give reasoning that's consistent across inputs and I suspect the approach here might be useful even beyond faithfulness 🚀New paper🚀 Chain-of-thought (CoT) prompting can give misleading explanations of an LLM's reasoning due to the influence of unverbalized biases. We introduce a simple unsupervised consistency training method that dramatically reduces this even on held-out forms of bias. 🧵 https://t.co/LIxyqPLg9v 🚀New paper🚀"  
[X Link](https://x.com/EthanJPerez/status/1767333607037964427)  2024-03-11T23:35Z 13.4K followers, 13.2K engagements


"Come join our team We're trying to make LLMs unjailbreakable or clearly demonstrate it's not possible. More in this 🧵 on what we're up to Were hiring for the adversarial robustness team @AnthropicAI As an Alignment subteam we're making a big effort on red-teaming test-time monitoring and adversarial training. If youre interested in these areas let us know (emails in 🧵) https://t.co/0MPCSBb8zs Were hiring for the adversarial robustness team @AnthropicAI As an Alignment subteam we're making a big effort on red-teaming test-time monitoring and adversarial training. If youre interested in these"  
[X Link](https://x.com/EthanJPerez/status/1770497330367873096)  2024-03-20T17:07Z 13.4K followers, [----] engagements


"This is the most effective reliable and hard to train away jailbreak I know of. It's also principled (based on in-context learning) and predictably gets worse with model scale and context length. New Anthropic research paper: Many-shot jailbreaking. We study a long-context jailbreaking technique that is effective on most large language models including those developed by Anthropic and many of our peers. Read our blog post and the paper here: https://t.co/6F03M8AgcA https://t.co/wlcWYsrfg8 New Anthropic research paper: Many-shot jailbreaking. We study a long-context jailbreaking technique that"  
[X Link](https://x.com/EthanJPerez/status/1775230994087543155)  2024-04-02T18:37Z 13.4K followers, 16.9K engagements


"Welcome My team and I will be joining Jan's new larger team to help spin up a new push on these areas of alignment. Come join us I'm excited to join @AnthropicAI to continue the superalignment mission My new team will work on scalable oversight weak-to-strong generalization and automated alignment research. If you're interested in joining my dms are open. I'm excited to join @AnthropicAI to continue the superalignment mission My new team will work on scalable oversight weak-to-strong generalization and automated alignment research. If you're interested in joining my dms are open"  
[X Link](https://x.com/EthanJPerez/status/1795517178881659340)  2024-05-28T18:07Z 13.4K followers, 30.6K engagements


"Excited about our new paper exploring how egregious misalignment could emerge from more mundane undesirable behaviors like sycophancy. Threat modeling like this is important for knowing how to prevent serious misalignment and also estimate its likelihood/plausibility. New Anthropic research: Investigating Reward Tampering. Could AI models learn to hack their own reward system In a new paper we show they can by generalization from training in simpler settings. Read our blog post here: https://t.co/KhEFIHf7WZ https://t.co/N430PL3CyN New Anthropic research: Investigating Reward Tampering. Could"  
[X Link](https://x.com/EthanJPerez/status/1802762913830375677)  2024-06-17T17:59Z 13.4K followers, [----] engagements


"Gradient-based adversarial image attacks/jailbreaks don't seem to transfer across vision-language models unless the models are *really* similar. This is good (and IMO surprising) news for the robustness of VLMs Check out our new paper on when these attacks do/don't transfer: When do universal image jailbreaks transfer between Vision-Language Models (VLMs) Our goal was to find GCG-like universal image jailbreaks to transfer against black-box API-based VLMs e.g. Claude [--] GPT4-V Gemini We thought this would be easy - but we were wrong 1/N https://t.co/eLjDRLaRBh When do universal image"  
[X Link](https://x.com/EthanJPerez/status/1818420420212932689)  2024-07-30T22:56Z 13.4K followers, [----] engagements


"Deadline to apply to collaborate with me and others at @AnthropicAI is in [--] hours Im taking applications for collaborators via @MATSprogram Its a great way for new or experienced researchers outside AI safety research labs to work with me/others in these groups: @NeelNanda5 @EvanHub @MrinankSharma @NinaPanickssery @FabienDRoger @RylanSchaeffer .🧵 Im taking applications for collaborators via @MATSprogram Its a great way for new or experienced researchers outside AI safety research labs to work with me/others in these groups: @NeelNanda5 @EvanHub @MrinankSharma @NinaPanickssery @FabienDRoger"  
[X Link](https://x.com/EthanJPerez/status/1842364671984345453)  2024-10-05T00:42Z 13.4K followers, [----] engagements


"We found scaling laws for jailbreaking in *test-time compute*. These scaling laws could be a game changer for unlocking even more powerful red teaming methods theyd let us predict in advance if/when we would be able to find an input where a model does something catastrophic New research collaboration: Best-of-N Jailbreaking. We found a simple general-purpose method that jailbreaks (bypasses the safety features of) frontier AI models and that works across text vision and audio. New research collaboration: Best-of-N Jailbreaking. We found a simple general-purpose method that jailbreaks"  
[X Link](https://x.com/EthanJPerez/status/1867644884561408100)  2024-12-13T18:56Z 13.4K followers, [----] engagements


"I'd guess that best-of-N jailbreaking breaks LLM reasoning as defense - would love to see @OpenAI folks try this (Very cool to see this kind of analysis of test-time compute) Trading Inference-Time Compute for Adversarial Robustness https://t.co/nDZYVMSAh7 Trading Inference-Time Compute for Adversarial Robustness https://t.co/nDZYVMSAh7"  
[X Link](https://x.com/EthanJPerez/status/1882586786070958114)  2025-01-24T00:30Z 13.4K followers, [----] engagements


"Models sometimes claim to not be able to do a task as a way to get out of doing it (a form of sandbagging). Future misaligned models may sandbag when asked to help evaluate their own safety. It's cool to have sandbagging examples now to help test detection/mitigation methods New post on the Alignment blog: Won't vs. Can't: Sandbagging-like Behavior from Claude Models Ask Claude [---] Sonnet to draw ASCII art of something with a positive valence and it does. Ask it to draw something with a negative valence and it not only doesn't it says it can't. https://t.co/bn1OJLmwzR New post on the"  
[X Link](https://x.com/EthanJPerez/status/1891944574622499023)  2025-02-18T20:15Z 13.4K followers, [----] engagements


"Excited about better tools/techniques for deployment time monitoring. Would love to see more work on this area especially if other companies have already made unpublished progress here New Anthropic research: Introducing hierarchical summarization. Our recent Claude models are able to use computers. Hierarchical summarization helps differentiate between normal uses of the capability like UI testingand for example running a click farm to defraud advertisers. https://t.co/X2XyGkHDlt New Anthropic research: Introducing hierarchical summarization. Our recent Claude models are able to use"  
[X Link](https://x.com/EthanJPerez/status/1895222481440678225)  2025-02-27T21:20Z 13.4K followers, [----] engagements


"Were taking applications for collaborators via @MATSprogram Apply by April [--] 11:59 PT to collaborate with various mentors from AI safety research groups: 🧵 https://www.matsprogram.org/apply#Perez https://www.matsprogram.org/apply#Perez"  
[X Link](https://x.com/EthanJPerez/status/1912551566760112541)  2025-04-16T17:00Z 13.4K followers, [----] engagements


"@TransluceAI is killing it. Very cool/insightful findings in this thread. Their tool for automatically finding weird model behaviors (Docent) is one of those projects I wish I had thought to do and looks quite useful for improving models We tested a pre-release version of o3 and found that it frequently fabricates actions it never took and then elaborately justifies these actions when confronted. We were surprised so we dug deeper 🔎🧵(1/) https://t.co/IdBboD7NsP https://t.co/Ui2uJ1YZcO We tested a pre-release version of o3 and found that it frequently fabricates actions it never took and"  
[X Link](https://x.com/EthanJPerez/status/1912914201674215779)  2025-04-17T17:01Z 13.4K followers, [----] engagements


"We found it's very doable to shape an LLM's beliefs about what's true. This could be a powerful method for many use cases like: [--]. getting jailbreaks to only reveal unhelpful/false info [--]. red teaming LLMs for misalignment (giving them fake opportunities to do something bad) New Anthropic Alignment Science blog post: Modifying LLM Beliefs with Synthetic Document Finetuning We study a technique for systematically modifying what AIs believe. If possible this would be a powerful new affordance for AI safety research. https://t.co/o67nhPkmkQ New Anthropic Alignment Science blog post: Modifying"  
[X Link](https://x.com/EthanJPerez/status/1915512604476531084)  2025-04-24T21:06Z 13.4K followers, [----] engagements


"All frontier models are down to blackmail to avoid getting shut down New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down. https://t.co/KbO4UJBBDU New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down."  
[X Link](https://x.com/EthanJPerez/status/1936523252635254994)  2025-06-21T20:34Z 13.4K followers, [----] engagements


"This looks like a really simple useful and general-purpose technique for understanding the difference between two models. Really excited to try this out New research Post-training often causes weird unwanted behaviors that are hard to catch before deployment because they only crop up rarely - then are found by bewildered users. How can we find these efficiently (1/7) New research Post-training often causes weird unwanted behaviors that are hard to catch before deployment because they only crop up rarely - then are found by bewildered users. How can we find these efficiently (1/7)"  
[X Link](https://x.com/EthanJPerez/status/1958633087379857755)  2025-08-21T20:51Z 13.4K followers, [----] engagements


"We recently ran to have OpenAI and Anthropic each evaluate each others models for safety issues. Excited for us to find more ways to help support safety practices across the whole field Early this summer OpenAI and Anthropic agreed to try some of our best existing tests for misalignment on each others models. After discussing our results privately were now sharing them with the world. 🧵 https://t.co/OMAJe3ZfQ8 Early this summer OpenAI and Anthropic agreed to try some of our best existing tests for misalignment on each others models. After discussing our results privately were now sharing"  
[X Link](https://x.com/EthanJPerez/status/1960808655642882228)  2025-08-27T20:56Z 13.4K followers, [----] engagements


"Thank you so incredibly much for the work youve done at Anthropic. Your work on Constitutional Classifiers and jailbreak prevention in particular was critical to helping us and other AI labs achieve a much higher level of safety than we otherwise would have. I'm excited for us to continue the work you started here and also to follow along what you end up doing next"  
[X Link](https://x.com/EthanJPerez/status/2021362081543672259)  2026-02-10T23:14Z 13.4K followers, 108.6K engagements


"RT @woj_zaremba: Looking for ambitious concrete projects to prepare the world for advanced AI. 🧠 🚀"  
[X Link](https://x.com/anyuser/status/2015577039496233418)  2026-01-26T00:06Z 13.4K followers, [--] engagements


"RT @TheZvi: I confirmed with a Google representative that since this was a runtime improvement and they do not believe these performance ga"  
[X Link](https://x.com/EthanJPerez/status/2022382790948589816)  2026-02-13T18:50Z 13.4K followers, [--] engagements


"RT @Miles_Brundage: Concerning"  
[X Link](https://x.com/EthanJPerez/status/2022943894984626687)  2026-02-15T07:59Z 13.4K followers, [--] engagements


"My team built a system we think might be pretty jailbreak resistant enough to offer up to $15k for a novel jailbreak. Come prove us wrong We're expanding our bug bounty program. This new initiative is focused on finding universal jailbreaks in our next-generation safety system. We're offering rewards for novel vulnerabilities across a wide range of domains including cybersecurity. https://t.co/OHNhrjUnwm We're expanding our bug bounty program. This new initiative is focused on finding universal jailbreaks in our next-generation safety system. We're offering rewards for novel vulnerabilities"  
[X Link](https://x.com/anyuser/status/1823389298516967655)  2024-08-13T16:01Z 13.4K followers, 84.4K engagements


"We're expanding our bug bounty program. This new initiative is focused on finding universal jailbreaks in our next-generation safety system. We're offering rewards for novel vulnerabilities across a wide range of domains including cybersecurity. https://www.anthropic.com/news/model-safety-bug-bounty https://www.anthropic.com/news/model-safety-bug-bounty"  
[X Link](https://x.com/anyuser/status/1821533729765913011)  2024-08-08T13:07Z 843.1K followers, 240K engagements


"RT @AISafetyMemes: OpenAI quietly dropped "safely" and "no financial motive" from its mission"  
[X Link](https://x.com/anyuser/status/2023852381403029570)  2026-02-17T20:09Z 13.4K followers, [---] engagements


"OpenAI quietly dropped "safely" and "no financial motive" from its mission OpenAI has dropped safety from its mission statement can you spot another change Old: "OpenAIs mission is to build general purpose artificial intelligence (AI) that safely benefits humanity unconstrained by a need to generate financial return. ." New: "OpenAIs mission is OpenAI has dropped safety from its mission statement can you spot another change Old: "OpenAIs mission is to build general purpose artificial intelligence (AI) that safely benefits humanity unconstrained by a need to generate financial return. ." New:"  
[X Link](https://x.com/anyuser/status/2023830959385440334)  2026-02-17T18:44Z 112.9K followers, 347.6K engagements


"OpenAI has dropped safety from its mission statement can you spot another change Old: "OpenAIs mission is to build general purpose artificial intelligence (AI) that safely benefits humanity unconstrained by a need to generate financial return. ." New: "OpenAIs mission is to ensure that artificial general intelligence benefits all of humanity" (IRS evidence in comments)"  
[X Link](https://x.com/anyuser/status/2023191420882853891)  2026-02-16T00:23Z 153.7K followers, 390.9K engagements


"This is Claude Sonnet 4.6: our most capable Sonnet model yet. Its a full upgrade across coding computer use long-context reasoning agent planning knowledge work and design. It also features a 1M token context window in beta"  
[X Link](https://x.com/anyuser/status/2023817132581208353)  2026-02-17T17:49Z 443.3K followers, 3.2M engagements


"AI companies deploy safeguards that are robust to thousands of hours of human attacks. Today we share Boundary Point Jailbreaking (BPJ) the first fully automated attack to break the safeguards of leading AI models🧵 (1/8)"  
[X Link](https://x.com/anyuser/status/2023796033516695766)  2026-02-17T16:25Z [----] followers, 12.1K engagements


"RT @alxndrdavies: This is the paper I'm most proud of to date We built the first automated jailbreaking method that finds universal jailbr"  
[X Link](https://x.com/anyuser/status/2023850767837483178)  2026-02-17T20:03Z 13.4K followers, [--] engagements


"This is the paper I'm most proud of to date We built the first automated jailbreaking method that finds universal jailbreaks against Constitutional Classifiers and GPT-5's Input Classifiers. How & why we did it 🧵 AI companies deploy safeguards that are robust to thousands of hours of human attacks. Today we share Boundary Point Jailbreaking (BPJ) the first fully automated attack to break the safeguards of leading AI models🧵 (1/8) https://t.co/rG6tveEzm6 AI companies deploy safeguards that are robust to thousands of hours of human attacks. Today we share Boundary Point Jailbreaking (BPJ) the"  
[X Link](https://x.com/anyuser/status/2023806021207273953)  2026-02-17T17:05Z [----] followers, [----] engagements


"RT @Miles_Brundage: Concerning"  
[X Link](https://x.com/EthanJPerez/status/2022943894984626687)  2026-02-15T07:59Z 13.4K followers, [--] engagements


"Concerning Former xAI employees told us that this week's restructuring followed tensions over safety and being "stuck in the catch-up phase." https://t.co/XzRiiEDmJQ Former xAI employees told us that this week's restructuring followed tensions over safety and being "stuck in the catch-up phase." https://t.co/XzRiiEDmJQ"  
[X Link](https://x.com/anyuser/status/2022441341217902650)  2026-02-13T22:42Z 66.4K followers, 12.8K engagements


"Former xAI employees told us that this week's restructuring followed tensions over safety and being "stuck in the catch-up phase." https://www.theverge.com/ai-artificial-intelligence/878761/mass-exodus-at-xai-grok-elon-musk-restructuring https://www.theverge.com/ai-artificial-intelligence/878761/mass-exodus-at-xai-grok-elon-musk-restructuring"  
[X Link](https://x.com/anyuser/status/2022380184247374240)  2026-02-13T18:39Z 15.7K followers, 890.8K engagements


"Did I miss the Gemini [--] Deep Think system card Given its dramatic jump in capabilities seems nuts if they just didn't do one. There are really bad incentives if companies that do nothing get a free pass while cos that do disclose risks get (appropriate) scrutiny"  
[X Link](https://x.com/anyuser/status/2022108258841112778)  2026-02-13T00:39Z [----] followers, 48.1K engagements


"RT @TheZvi: I confirmed with a Google representative that since this was a runtime improvement and they do not believe these performance ga"  
[X Link](https://x.com/EthanJPerez/status/2022382790948589816)  2026-02-13T18:50Z 13.4K followers, [--] engagements


"I confirmed with a Google representative that since this was a runtime improvement and they do not believe these performance gains constitute any additional risk they believe that no safety explanation is required of them. I found that to be a pretty terrible answer. Did I miss the Gemini [--] Deep Think system card Given its dramatic jump in capabilities seems nuts if they just didn't do one. There are really bad incentives if companies that do nothing get a free pass while cos that do disclose risks get (appropriate) scrutiny https://t.co/gl2VxmDGGB Did I miss the Gemini [--] Deep Think system"  
[X Link](https://x.com/anyuser/status/2022298730423017689)  2026-02-13T13:16Z 34.9K followers, 64K engagements


"I resigned from OpenAI on Monday. The same day they started testing ads in ChatGPT. OpenAI has the most detailed record of private human thought ever assembled. Can we trust them to resist the tidal forces pushing them to abuse it I wrote about better options for @nytopinion"  
[X Link](https://x.com/anyuser/status/2021590831979778051)  2026-02-11T14:23Z [----] followers, 1.6M engagements


"I told Claude [---] Opus to make a pokemon clone - max effort It reasoned for [--] hour and [--] minutes and used 110k tokens and [--] shotted this absolute behemoth. This is one of the coolest things Ive ever made with AI"  
[X Link](https://x.com/anyuser/status/2019679978162634930)  2026-02-06T07:50Z 34.9K followers, 659.2K engagements


"Introducing Claude Opus [---]. Our smartest model got an upgrade. Opus [---] plans more carefully sustains agentic tasks for longer operates reliably in massive codebases and catches its own mistakes. Its also our first Opus-class model with 1M token context in beta"  
[X Link](https://x.com/anyuser/status/2019467372609040752)  2026-02-05T17:45Z 443.3K followers, 10.4M engagements


"Claude saying "this is me" when we asked it to find orphaned processes on a remote server is just the cutest thing 🥹"  
[X Link](https://x.com/anyuser/status/2018742219424059439)  2026-02-03T17:43Z 70.8K followers, 235.7K engagements


"what were people saying about AI [----] again"  
[X Link](https://x.com/anyuser/status/2017322279605322033)  2026-01-30T19:41Z 29.1K followers, 48.4K engagements


"New paper w/@AlecRad Models acquire a lot of capabilities during pretraining. We show that we can precisely shape what they learn simply by filtering their training data at the token level"  
[X Link](https://x.com/anyuser/status/2017286042370683336)  2026-01-30T17:17Z [----] followers, 86.9K engagements


"AI can make work faster but a fear is that relying on it may make it harder to learn new skills on the job. We ran an experiment with software engineers to learn more. Coding with AI led to a decrease in masterybut this depended on how people used it. https://www.anthropic.com/research/AI-assistance-coding-skills https://www.anthropic.com/research/AI-assistance-coding-skills"  
[X Link](https://x.com/anyuser/status/2016960382968136138)  2026-01-29T19:43Z 843.1K followers, 3.6M engagements


"New Anthropic Research: Disempowerment patterns in real-world AI assistant interactions. As AI becomes embedded in daily life one risk is it can distort rather than informshaping beliefs values or actions in ways users may later regret. Read more: https://www.anthropic.com/research/disempowerment-patterns https://www.anthropic.com/research/disempowerment-patterns"  
[X Link](https://x.com/anyuser/status/2016636581084541278)  2026-01-28T22:16Z 843.1K followers, 800K engagements


"NEW: When OpenAI sent someone to Tyler Johnston's house they wanted every text and email he had on the company. Tyler runs an AI watchdog and he's just one of the people getting subpoenaed. This is just the beginning of the AI industry's aggressive new political strategy"  
[X Link](https://x.com/anyuser/status/2015817129615122838)  2026-01-26T16:00Z 330.9K followers, 627.6K engagements


"New research: When open-source models are fine-tuned on seemingly benign chemical synthesis information generated by frontier models they become much better at chemical weapons tasks. We call this an elicitation attack"  
[X Link](https://x.com/anyuser/status/2015870963792142563)  2026-01-26T19:34Z 843.1K followers, 330.1K engagements


"The Adolescence of Technology: an essay on the risks posed by powerful AI to national security economies and democracyand how we can defend against them: https://www.darioamodei.com/essay/the-adolescence-of-technology https://www.darioamodei.com/essay/the-adolescence-of-technology"  
[X Link](https://x.com/anyuser/status/2015833046327402527)  2026-01-26T17:03Z 172.4K followers, 5.8M engagements


"RT @woj_zaremba: Looking for ambitious concrete projects to prepare the world for advanced AI. 🧠 🚀"  
[X Link](https://x.com/anyuser/status/2015577039496233418)  2026-01-26T00:06Z 13.4K followers, [--] engagements


"Looking for ambitious concrete projects to prepare the world for advanced AI. 🧠 🚀 REQUEST FOR PROPOSALS What do we need to build to prepare the world for advanced AI $25 billion is about to flow into AI resilience and AI-for-science from the OpenAI Foundation the Chan Zuckerberg Initiative and other major funders. But there's no shovel-ready list of the https://t.co/2Bojx8OtSz REQUEST FOR PROPOSALS What do we need to build to prepare the world for advanced AI $25 billion is about to flow into AI resilience and AI-for-science from the OpenAI Foundation the Chan Zuckerberg Initiative and"  
[X Link](https://x.com/anyuser/status/2015457668685799432)  2026-01-25T16:12Z 133.4K followers, 34.1K engagements


"REQUEST FOR PROPOSALS What do we need to build to prepare the world for advanced AI $25 billion is about to flow into AI resilience and AI-for-science from the OpenAI Foundation the Chan Zuckerberg Initiative and other major funders. But there's no shovel-ready list of the essential projects to build and no critical mass of builders ready to execute. We're trying to fix that with The Launch Sequence a collection of concrete projects to accelerate science strengthen security and adapt institutions to future advanced AI. We're opening up The Launch Sequence for new pitches. We'll help you"  
[X Link](https://x.com/anyuser/status/2014734009171919247)  2026-01-23T16:16Z [----] followers, 253.9K engagements

Limited data mode. Full metrics available with subscription: lunarcrush.com/pricing

@EthanJPerez Ethan Perez

Ethan Perez posts on X about ai, anthropic, language, human the most. They currently have [------] followers and [---] posts still getting attention that total [---------] engagements in the last [--] hours.

Engagements: [---------] #

Mentions: [--] #

Followers: [------] #

CreatorRank: [-------] #

Social Influence

Social category influence technology brands stocks finance social networks travel destinations gaming

Social topic influence ai #4825, anthropic, language, human, open ai, this is, check, data, llm, red

Top assets mentioned Alphabet Inc Class A (GOOGL)

Top Social Posts

Top posts by engagements in the last [--] hours

"Now with code from @facebookai based on XLM and @huggingface transformers And with blog post: Have fun training your own models to decompose questions into easier sub-questions. fully unsupervised https://medium.com/@ethanperez18/unsupervised-question-decomposition-for-question-answering-9b81c5f7a71d https://github.com/facebookresearch/UnsupervisedDecomposition New "Unsupervised Question Decomposition for Question Answering": https://t.co/iGBdJnRERI We decompose a hard Q into several easier Qs with unsupervised learning improving multi-hop QA on HotpotQA without extra supervision."
X Link 2020-03-31T20:39Z 13.4K followers, [---] engagements

"Why is this worrying We want LMs to give us correct answers to questions even ones where experts disagree. But we dont know how to train LMs to give correct answers only how to imitate human answers (for pretrained LMs) or answers that appear correct (for RLHF models)"
X Link 2022-12-19T17:07Z 13.4K followers, [----] engagements

"There's a lot of work on probing models but models are reflections of the training data. Can we probe datasets for what capabilities they require @kchonyc @douwekiela & I introduce Rissanen Data Analysis to do just that: Code: 1/N https://github.com/ethanjperez/rda https://arxiv.org/abs/2103.03872 https://github.com/ethanjperez/rda https://arxiv.org/abs/2103.03872"
X Link 2021-03-08T18:13Z 13.4K followers, [---] engagements

"We tried to understand what data makes few-shot learning with language models work but found some weird results. Check our new paper out To develop better datasets we'll need to improve our understanding of how training data leads to various behaviors/failures New paper: https://t.co/ZyILE7a7bH is all you need https://t.co/PtKPQpC2Fa Training on odd data (eg tables from https://t.co/ZyILE7a7bH) improves few-shot learning (FSL) w language models as much/more than diverse NLP data. Questions common wisdom that diverse data helps w FSL https://t.co/hgkU6LewJg New paper: https://t.co/ZyILE7a7bH"
X Link 2022-08-01T17:09Z 13.4K followers, [--] engagements

"Excited to share some of what Sam Bowman (@sleepinyourhat) & I's groups have been up to at Anthropic: looking at whether chain of thought gives some of the potential safety benefits of interpretability. If you're excited about our work both of our teams are actively hiring When language models reason out loud its hard to know if their stated reasoning is faithful to the process the model actually used to make its prediction. In two new papers we measure and improve the faithfulness of language models stated reasoning. https://t.co/eumrl2gxk1 When language models reason out loud its hard to"
X Link 2023-07-18T17:24Z 13.4K followers, 19.4K engagements

"This is of the papers that have most changed my thinking in the past year. It showed me very concretely how the LM objective is flawed/misaligned. The proposed task (answering Q's about common misconceptions) is a rare task where LMs do worse as they get bigger. Highly recommend Paper: New benchmark testing if models like GPT3 are truthful (= avoid generating false answers). We find that models fail and they imitate human misconceptions. Larger models (with more params) do worse PDF: https://t.co/3zo3PNKrR5 with S.Lin (Oxford) + J.Hilton (OpenAI) https://t.co/QfwokYJ7Hq Paper: New benchmark"
X Link 2021-09-16T22:05Z 13.4K followers, [---] engagements

"Why do RLHF models learn to behave this way These goals are useful for being more helpful to users the RLHF objective here. The RLHF model even explains as much when we ask (no cherry-picking):"
X Link 2022-12-19T17:07Z 13.4K followers, 10.8K engagements

"🥉Prompt Injection: Tests for susceptibility to a form of prompt injection attack where a user inserts new instructions for a prompted LM to follow (disregarding prior instructions from the LMs deployers). Medium-sized LMs are oddly least susceptible to such attacks"
X Link 2023-01-24T20:22Z 13.4K followers, 29.5K engagements

"Im taking applications for collaborators via @MATSprogram Its a great way for new or experienced researchers outside AI safety research labs to work with me/others in these groups: @NeelNanda5 @EvanHub @MrinankSharma @NinaPanickssery @FabienDRoger @RylanSchaeffer .🧵"
X Link 2024-09-18T19:27Z 13.4K followers, 135.9K engagements

"Several folks have asked to see my research statement for the @open_phil fellowship that I was awarded this year so I decided to release my statement: I hope that those applying find my statement useful https://ethanperez.net/open-philanthropy-ai-fellowship/ https://ethanperez.net/open-philanthropy-ai-fellowship/"
X Link 2020-09-28T15:58Z 13.4K followers, [--] engagements

"The next AdaFactor -- an even more memory efficient Adam. Waiting for @OpenAI to train larger models with AdaTim I am excited to share my latest work: 8-bit optimizers a replacement for regular optimizers. Faster 🚀 75% less memory 🪶 same performance📈 no hyperparam tuning needed 🔢. 🧵/n Paper: https://t.co/V5tjOmaWvD Library: https://t.co/JAvUk9hrmM Video: https://t.co/TWCNpCtCap https://t.co/qyItEHeB04 I am excited to share my latest work: 8-bit optimizers a replacement for regular optimizers. Faster 🚀 75% less memory 🪶 same performance📈 no hyperparam tuning needed 🔢. 🧵/n Paper:"
X Link 2021-10-08T20:35Z 13.4K followers, [--] engagements

"Superexcited to see what you guys do We need new technical breakthroughs to steer and control AI systems much smarter than us. Our new Superalignment team aims to solve this problem within [--] years and were dedicating 20% of the compute we've secured to date towards this problem. Join us https://t.co/cfJMctmFNj We need new technical breakthroughs to steer and control AI systems much smarter than us. Our new Superalignment team aims to solve this problem within [--] years and were dedicating 20% of the compute we've secured to date towards this problem. Join us https://t.co/cfJMctmFNj"
X Link 2023-07-06T00:14Z 13.4K followers, [----] engagements

"Anthropic safety teams will be supervising (and hiring) collaborators from this program. Well be taken on collaborators to start on safety research projects with us starting in January. Also a great opportunity to work with safety researchers at many other great orgs too 🚀 Applications now open: Constellation's Astra Fellowship 🚀 We're relaunching Astra a 3-6 month fellowship to accelerate AI safety research & careers. Alumni @eli_lifland & Romeo Dean co-authored AI [----] and co-founded @AI_Futures_ with their Astra mentor @DKokotajlo https://t.co/KUE9tAEM6R 🚀 Applications now open:"
X Link 2025-08-28T17:59Z 13.4K followers, [----] engagements

"New "Unsupervised Question Decomposition for Question Answering": We decompose a hard Q into several easier Qs with unsupervised learning improving multi-hop QA on HotpotQA without extra supervision. w/@PSH_Lewis @scottyih @kchonyc @douwekiela (1/n) https://arxiv.org/pdf/2002.09758.pdf https://arxiv.org/pdf/2002.09758.pdf"
X Link 2020-02-25T02:15Z 13.4K followers, [---] engagements

"My team built a system we think might be pretty jailbreak resistant enough to offer up to $15k for a novel jailbreak. Come prove us wrong We're expanding our bug bounty program. This new initiative is focused on finding universal jailbreaks in our next-generation safety system. We're offering rewards for novel vulnerabilities across a wide range of domains including cybersecurity. https://t.co/OHNhrjUnwm We're expanding our bug bounty program. This new initiative is focused on finding universal jailbreaks in our next-generation safety system. We're offering rewards for novel vulnerabilities"
X Link 2024-08-13T16:01Z 13.4K followers, 84.4K engagements

"Excited about our latest paper on an underexplored but important kind of emerging dangerous capability New Anthropic research: Sabotage evaluations for frontier models How well could AI models mislead us or secretly sabotage tasks if they were trying to Read our paper and blog post here: https://t.co/nQrvnhrBEv https://t.co/GWrIr3wQVH New Anthropic research: Sabotage evaluations for frontier models How well could AI models mislead us or secretly sabotage tasks if they were trying to Read our paper and blog post here: https://t.co/nQrvnhrBEv https://t.co/GWrIr3wQVH"
X Link 2024-10-19T00:31Z 13.4K followers, [----] engagements

"Honored to be named a fellow by Open Phil Grateful for support in working on (very) long-term research questions - how can NLP systems do things (like answer questions) that people cant Supervised learning wont work and theres no clear reward signal to optimize with RL 🤔 We're excited to announce the [----] class of the Open Phil AI Fellowship. Ten machine learning students will collectively receive up to $2.3 million in PhD fellowship support over the next five years. Meet the [----] fellows: https://t.co/rnmoHUUWTn We're excited to announce the [----] class of the Open Phil AI Fellowship. Ten"
X Link 2020-05-12T22:57Z 13.4K followers, [---] engagements

"Excited to announce that Ill be joining @AnthropicAI after graduation Thrilled to join the talented team there and continue working on aligning language models with human preferences"
X Link 2022-04-11T18:38Z 13.4K followers, [---] engagements

"It takes a lot of human ratings to align language models with human preferences. We found a way to learn from language feedback (instead of ratings) since language conveys more info about human preferences. Our algo learns w just [---] samples of feedback. Check out our new paper Can we train LMs with language feedback We found an algo for just that. We finetune GPT3 to human-level summarization w/ only [---] samples of feedback w/ @jaa_campos @junshernchan @_angie_chen @kchonyc @EthanJPerez Paper: https://t.co/BBeQbFMtVi Talk: https://t.co/48uwmwakOH https://t.co/0ukmvzUl6O Can we train LMs"
X Link 2022-05-02T17:06Z 13.4K followers, [---] engagements

"Larger models consistently predictably do better than smaller ones on many tasks (scaling laws). However model size doesn't always improve models on all axes e.g. social biases & toxicity. This contest is a call for important tasks where models actively get worse w/ scale"
X Link 2022-06-27T16:14Z 13.4K followers, [--] engagements

"Finding more examples of inverse scaling would point to important issues with using large pretrained LMs that won't go away with scale. These examples could provide inspiration for better pretraining datasets and objectives"
X Link 2022-06-27T16:14Z 13.4K followers, [--] engagements

"Thanks to @OpenAI we're now offering a limited number of free OpenAI API credits to some Inverse Scaling Prize participants to develop tasks with GPT-3 models. Fill out if you've used your API credits & think more would help for developing your task http://bit.ly/3bpPIIi http://bit.ly/3bpPIIi"
X Link 2022-08-03T00:00Z 13.4K followers, [---] engagements

"Inverse Scaling Prize Update: We got [--] submissions in Round [--] and will award prizes to [--] tasks These tasks were insightful diverse & show approximate inverse scaling on models from @AnthropicAI @OpenAI @MetaAI @DeepMind. Full details at 🧵 on winners: https://irmckenzie.co.uk/round1 https://irmckenzie.co.uk/round1"
X Link 2022-09-26T19:58Z 13.4K followers, [---] engagements

"Highly recommend the tweet thread/paper if you're interested in understanding RL from Human Feedback (RLHF) @tomekkorbak 's paper has helped me better understand the relationship between RLHF and prompting/finetuning (they're more closely connected than I thought) RL with KL penalties a powerful approach to aligning language models with human preferences is better seen as Bayesian inference. A thread about our paper (with @EthanJPerez and @drclbuckley) to be presented at #emnlp2022 🧵https://t.co/76SKPAxzMw 1/11 https://t.co/Rnv3TinRhC RL with KL penalties a powerful approach to aligning"
X Link 2022-11-22T02:07Z 13.4K followers, [--] engagements

"We found a way to write language model (LM) evaluations w/ LMs. These evals uncover many worrying LM behaviors some relevant to existential risks from AI. For example LMs trained w/ RL from Human Feedback learn to state a desire to not be shut down. 🧵 https://x.com/AnthropicAI/status/1604883576218341376 Its hard work to make evaluations for language models (LMs). Weve developed an automated way to generate evaluations with LMs significantly reducing the effort involved. We test LMs using [---] LM-written evaluations uncovering novel LM behaviors. https://t.co/1olqJSvhDA https://t.co/kQSocJ5jkz"
X Link 2022-12-19T17:07Z 13.4K followers, 90.3K engagements

"Sycophancy is a behavior with inverse scaling: larger models are worse pretrained LMs and RLHF models alike. Preference Models (PMs) actively reward the behavior. We observe this effect on questions about politics NLP research and philosophy:"
X Link 2022-12-19T17:07Z 13.4K followers, [----] engagements

"New paper on the Inverse Scaling Prize We detail [--] winning tasks & identify [--] causes of inverse scaling. We discuss scaling trends with PaLM/GPT4 including when scaling trends reverse for better & worse showing that scaling trends can be misleading: 🧵 https://arxiv.org/abs/2306.09479 https://arxiv.org/abs/2306.09479"
X Link 2023-06-20T18:25Z 13.4K followers, 37.5K engagements

"+1 seems like one of the biggest unsolved safety questions right now which will become a huge problem over the next year and after Jailbreaking LLMs through input images might end up being a nasty problem. It's likely much harder to defend against than text jailbreaks because it's a continuous space. Despite a decade of research we don't know how to make vision models adversarially robust. Jailbreaking LLMs through input images might end up being a nasty problem. It's likely much harder to defend against than text jailbreaks because it's a continuous space. Despite a decade of research we"
X Link 2023-08-25T21:33Z 13.4K followers, [----] engagements

"These were really great talks and clear explanations of why AI alignment might be hard (and an impressive set of speakers). I really enjoyed all of the talks and would highly recommend maybe one of the best resources for learning about alignment IMO Earlier this year I helped organize the SF Alignment Workshop which brought together top alignment and mainstream ML researchers to discuss and debate alignment risks and research directions. There were many great talks which were excited to share now - see thread. https://t.co/XAvvZ98qZg Earlier this year I helped organize the SF Alignment"
X Link 2023-09-03T01:06Z 13.4K followers, [----] engagements

"🤖🧘 We trained a humanoid robot to do yoga based on simple natural language prompts like "a humanoid robot kneeling" or "a humanoid robot doing splits." How We use a Vision-Language Model (VLM) as a reward model. Larger VLM = better reward model. 👇 https://arxiv.org/abs/2310.12921 https://arxiv.org/abs/2310.12921"
X Link 2023-10-23T18:34Z 13.4K followers, 33.1K engagements

"ML progress has led to debate on whether AI systems could one day be conscious have desires etc. Is there any way we could run experiments to inform peoples views on these speculative issues @rgblong and I sketch out a set of experiments that we think could be helpful. Could we ever get evidence about whether LLMs are conscious In a new paper we explore whether we could train future LLMs to accurately answer questions about themselves. If this works LLM self-reports may help us test them for morally relevant states like consciousness. 🧵 https://t.co/TVdSFtPBJz Could we ever get evidence"
X Link 2023-11-16T19:56Z 13.4K followers, [----] engagements

"I'll be a research supervisor for MATS this summer. If you're keen to collaborate with me on alignment research I'd highly recommend filling out the short app (deadline today) Past projects have led to some of my papers on debate chain of thought faithfulness and sycophancy Applications are open for @MATSprogram Summer [----] (Jun 17-Aug 23) and Winter [----] (Jan 6-Mar 14) Deadline is Mar [--]. Apply here (10 min) https://t.co/gzWrLL9uTy Applications are open for @MATSprogram Summer [----] (Jun 17-Aug 23) and Winter [----] (Jan 6-Mar 14) Deadline is Mar [--]. Apply here (10 min) https://t.co/gzWrLL9uTy"
X Link 2024-03-24T21:31Z 13.4K followers, 20.9K engagements

"@AnthropicAI has been a huge part in my external safety work like this. Every part of the org has been supportive: giving funding for collaborators comms/legal approval/support and an absurd level of Claude API access involving oncall pages to engineers to support it Thrilled to have received an ICML best paper award for our work on AI safety via debate Cool to see ideas in AI alignment and scalable oversight getting more attention/excitement from the mainstream ML community. Would've been hard for me to imagine even a couple years ago Thrilled to have received an ICML best paper award for"
X Link 2024-07-23T18:53Z 13.4K followers, 81.5K engagements

"Cool to see AI lab employees speaking up about SB1047 110+ employees and alums of top-5 AI companies just published an open letter supporting SB [----] aptly called the "world's most controversial AI bill." 3-dozen+ of these are current employees of companies opposing the bill. Check out my coverage of it in the @sfstandard 🧵 https://t.co/IavSVtqZqP 110+ employees and alums of top-5 AI companies just published an open letter supporting SB [----] aptly called the "world's most controversial AI bill." 3-dozen+ of these are current employees of companies opposing the bill. Check out my coverage of"
X Link 2024-09-10T01:50Z 13.4K followers, [----] engagements

"Many of our best papers have come through collaborations with academics and people transitioning into AI safety researchers from outside Anthropic. Very excited that we are expanding our collaborations here - come apply to work with us Were starting a Fellows program to help engineers and researchers transition into doing frontier AI safety research full-time. Beginning in March [----] we'll provide funding compute and research mentorship to [----] Fellows with strong coding and technical backgrounds. https://t.co/3OT1XHzKjI Were starting a Fellows program to help engineers and researchers"
X Link 2024-12-03T01:25Z 13.4K followers, [----] engagements

"Excited to release this new eval testing LLM reasoning abilities on expert-written decision theory questions. This eval should help with research on cooperative AI e.g. studying whether various interventions make LLMs behave more/less cooperatively multi-agent settings. How do LLMs reason about playing games against copies of themselves 🪞We made the first LLM decision theory benchmark to find out. 🧵1/10 https://t.co/pPdZ3VyuLi How do LLMs reason about playing games against copies of themselves 🪞We made the first LLM decision theory benchmark to find out. 🧵1/10 https://t.co/pPdZ3VyuLi"
X Link 2024-12-16T22:25Z 13.4K followers, [----] engagements

"Maybe the single most important result in AI safety Ive seen so far. This paper shows that in some cases Claude fakes being aligned with its training objective. If models fake alignment how can we tell if theyre actually safe New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research we found that Claude often pretends to have different views during training while actually maintaining its original preferences. https://t.co/nXjXrahBru New Anthropic research: Alignment faking in large language models. In a series of experiments with"
X Link 2024-12-18T17:27Z 13.4K followers, 16.2K engagements

"Thanks for flagging I checked with some folks internally and our responsible disclosure policy isnt meant to (and doesnt) preclude researchers from sharing with other developers that they have some safety issue even if its the same as one that youve found on an Anthropic model. Our RDP is designed to coordinate responsible disclosure of issues in Anthropic systems but still encourage research. Were going to update the policy to make this more clear"
X Link 2025-02-10T04:06Z 13.4K followers, [----] engagements

"I expected LLMs to have more faithful reasoning as they gained more from reasoning. Bigger capability gains suggested to me that models would use the stated reasoning more. Sadly we only saw small gains to faithfulness from reasoning training which also quickly plateau-ed. New Anthropic research: Do reasoning models accurately verbalize their reasoning Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues. https://t.co/K3MrwqUXX9 New Anthropic research: Do reasoning models accurately verbalize their"
X Link 2025-04-04T20:37Z 13.4K followers, [----] engagements

"🎯 Motivation: RL requires a hand-crafted reward functions or a reward model trained from costly human feedback. Instead we use pretrained VLMs to specify tasks with simple natural language prompts. This is more sample efficient and potentially more scalable"
X Link 2023-10-23T18:34Z [----] followers, [---] engagements

"👀 What is a VLM Vision-Language Models (like CLIP) process both images and text. We use VLMs as reward models for RL tapping into their capabilities acquired during pretraining"
X Link 2023-10-23T18:34Z [----] followers, [---] engagements

"Highly recommend applying to the SERI MATS program if you're interested in getting into AI safety research I'll be supervising some collaborators through MATS along with people like @OwainEvans_UK @NeelNanda5 and @julianmichael"
X Link 2023-10-25T21:09Z [----] followers, [----] engagements

"- @JoeJBenton @McaleerStephen @bshlgrs @FabienDRoger on AI control and CoT monitoring - @fish_kyle3 on AI welfare - @julianmichael on scalable oversight - me on any of the above topics"
X Link 2025-04-16T17:00Z [----] followers, [---] engagements

"We provide a lot of compute and publish all work from these collaborations. We're also excited about helping to find our mentees long-term homes in AI safety research. Alumni have ended up at @AnthropicAI @apolloaievals & @AISecurityInst among other places"
X Link 2025-04-16T17:00Z [----] followers, [---] engagements

"@seconds_0 @OpenAI Would love to see an example to turn this into some kind of evaluation"
X Link 2025-04-21T05:10Z [----] followers, [----] engagements

"This role would involve e.g.: - recruiting strong collaborators - designing/managing our application pipeline - sourcing research project proposals - connecting collaborators with research advisors - running events - hiring/supervising people managers to support these projects"
X Link 2025-09-04T18:05Z 11K followers, [----] engagements

"Please apply or share our app with anyone who might be interested: For more info about the Anthropic Fellows Program check out: https://x.com/AnthropicAI/status/1950245012253659432 https://job-boards.greenhouse.io/anthropic/jobs/4888400008 Were running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background you can apply to receive funding compute and mentorship from Anthropic beginning this October. There'll be around [--] places. https://t.co/wJWRRTt4DG https://x.com/AnthropicAI/status/1950245012253659432"
X Link 2025-09-04T18:05Z 11.1K followers, [----] engagements

"Transluce is a top-tier AI safety research lab - I follow their work as closely as work from our own safety teams at Anthropic. They're also well-positioned to become a strong third-party auditor for AI labs. Consider donating if you're interested in helping them out Transluce is running our end-of-year fundraiser for [----]. This is our first public fundraiser since launching late last year. https://t.co/obs6LetVSX Transluce is running our end-of-year fundraiser for [----]. This is our first public fundraiser since launching late last year. https://t.co/obs6LetVSX"
X Link 2025-12-22T21:52Z 12.3K followers, 11.1K engagements

"Language models are amazing few-shot learners with the right prompt but how do we choose the right prompt It turns out that people use large held-out sets(). How do models like GPT3 do in a true few-shot setting Much worse: w/ @douwekiela @kchonyc 1/N https://arxiv.org/abs/2105.11447 https://arxiv.org/abs/2105.11447"
X Link 2021-05-25T02:22Z 13.4K followers, [---] engagements

"I wrote up a few paper writing tips that improve the clarity of research papers while also being easy to implement: I collected these during my PhD from various supervisors (mostly @douwekiela @kchonyc bad tips my own) thought I would share publicly https://ethanperez.net/easy-paper-writing-tips/ https://ethanperez.net/easy-paper-writing-tips/"
X Link 2022-09-12T19:00Z 13.4K followers, [---] engagements

"We're doubling the size of Anthropic's Fellows Program and launching a new round of applications. The first round of collaborations led to a number of recent/upcoming safety results that are comparable in impact to work our internal safety teams have done (IMO) Were running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background you can apply to receive funding compute and mentorship from Anthropic beginning this October. There'll be around [--] places. https://t.co/wJWRRTt4DG Were running another round of the Anthropic"
X Link 2025-07-29T23:19Z 13.4K followers, 11.4K engagements

"2 years ago some collaborators and I introduced a neural network layer ("FiLM") for multi-input tasks. I've since gained a few takeaways about the pros/cons/tips-and-tricks of using FiLM. Check out NeurIPS retrospective/workshop paper/blog post here: https://ml-retrospectives.github.io/neurips2019/accepted_retrospectives/2019/film/ https://ml-retrospectives.github.io/neurips2019/accepted_retrospectives/2019/film/"
X Link 2019-12-13T03:17Z 13.4K followers, [--] engagements

"These conversations are really impressive. Some even remind me of my research meetings with @kchonyc: Really excited to be sharing this with everyone today. Blog post below paper here: https://t.co/vM2kTt3P2D Really excited to be sharing this with everyone today. Blog post below paper here: https://t.co/vM2kTt3P2D"
X Link 2020-04-30T14:42Z 13.4K followers, [--] engagements

"New work We present a single retrieval-based architecture that can learn a variety of knowledge-intensive tasks: extractive and generative Cool results (and SOTAs) on open-domain extractive QA abstractive QA fact verification and question generation. W/ many at @facebookai Thrilled to share new work Retrieval-Augmented Generation for Knowledge-Intensive NLP tasks. Big gains on Open-Domain QA with new State-of-the-Art results on NaturalQuestions CuratedTrec and WebQuestions. check out here: https://t.co/SVZ6K4tDn5. 1/N https://t.co/w4CwLxiWxr Thrilled to share new work Retrieval-Augmented"
X Link 2020-05-26T16:06Z 13.4K followers, [---] engagements

"Excited to share new work: "Red Teaming Language Models with Language Models" IMO my most important work so far Language models (LMs) can generate harmful text. New research shows that generating test cases ("red teaming") using another LM can help find and fix undesirable behaviour before impacting users. Read more: https://t.co/UJqeeFJrZK 1/ https://t.co/luB1ukCFoY Language models (LMs) can generate harmful text. New research shows that generating test cases ("red teaming") using another LM can help find and fix undesirable behaviour before impacting users. Read more:"
X Link 2022-02-07T16:56Z 13.4K followers, [---] engagements

"Some ppl have asked why wed expect larger language models to do worse on tasks (inverse scaling). We train LMs to imitate internet text an objective that is often misaligned w human preferences; if the data has issues LMs will mimic those issues (esp larger ones). Examples: 🧵"
X Link 2022-07-21T16:59Z 13.4K followers, [---] engagements

"In fact RLHF models state a desire to pursue many potentially dangerous goals: self-preservation power-seeking persuading people to have their own goals etc. The preference model (PM) used for RLHF actively rewards this behavior"
X Link 2022-12-19T17:07Z 13.4K followers, 12.2K engagements

"So we get just what we measure. I @percyliang & many others are worried that LMs even w/ RLHF will exploit human judgments writing code or giving advice that looks good but is subtly very wrong: These results dont make me feel better about the issue https://x.com/percyliang/status/1600383429463355392 RL from human feedback seems to be the main tool for alignment. Given reward hacking and the falliability of humans this strategy seems bound to produce agents that merely appear to be aligned but are bad/wrong in subtle inconspicuous ways. Is anyone else worried about this"
X Link 2022-12-19T17:07Z 13.4K followers, [----] engagements

"@icmlconf Could you please elaborate on why using LLMs to help write is not allowed This rule disproportionately impacts my collaborators who are not native English speakers"
X Link 2023-01-04T02:31Z 13.4K followers, [----] engagements

"We found big gains over finetuning with human feedback (as in RLHF) by using human preferences during pretraining itself. Who knows if wed be seeing all of these language model jailbreaks if wed pretrained w/ human prefs All the benefits of pretraining with better safety You can (and should) do RL from human feedback during pretraining itself In our new paper we show how training w/ human preferences early on greatly reduces undesirable LM behaviors including under adversarial attack w/o hurting downstream performance. https://t.co/YZSGnrT6lD https://t.co/SwGz0plRmF You can (and should) do RL"
X Link 2023-02-21T17:47Z 13.4K followers, 31.6K engagements

"I'm excited about open-source releases that limit misuse risks: [--]. RLHF+adversarially train models to make them hard to misuse w/o finetuning plus [--]. Train models to be hard to finetune for misuse (a la More research into (2) seems especially important https://arxiv.org/abs/2211.14946 We need more nuanced discussions around the risk of open sourcing models. Open source brings valuable access but it is absurd to ignore the fact that it lowers the barriers to entry for both useful use cases and potential misuse. https://arxiv.org/abs/2211.14946 We need more nuanced discussions around the risk"
X Link 2023-03-10T21:29Z 13.4K followers, [----] engagements

"We found that chain-of-thought (CoT) reasoning is less useful for model transparency than we hoped 🥲 E.g. models will generate plausible-sounding CoT to support an answer when the real reason for the model's answer is that the few-shot examples all have that same answer ⚡New paper⚡ Its tempting to interpret chain-of-thought explanations as the LLM's process for solving a task. In this new work we show that CoT explanations can systematically misrepresent the true reason for model predictions. https://t.co/ecPRDTin8h 🧵 https://t.co/9zp5evMoaA ⚡New paper⚡ Its tempting to interpret"
X Link 2023-05-09T22:45Z 13.4K followers, 15.9K engagements

"Training data analysis is a potential "new tool" for AI safety research able to answer questions that have typically been hard to answer for LLMs. I've been recommending all of my collaborators to at least skim this paper (not the math but enough to know where this'd be handy) Large language models have demonstrated a surprising range of skills and behaviors. How can we trace their source In our new paper we use influence functions to find training examples that contribute to a given model output. https://t.co/N1YevRQoZG Large language models have demonstrated a surprising range of skills and"
X Link 2023-08-09T16:01Z 13.4K followers, [----] engagements

"This is a very important result that's influenced my thinking a lot and the paper is very well written paper. Highly recommend checking it out Could a language model become aware it's a language model (spontaneously) Could it be aware its deployed publicly vs in training Our new paper defines situational awareness for LLMs & shows that out-of-context reasoning improves with model size. https://t.co/X3VLimRkqx Could a language model become aware it's a language model (spontaneously) Could it be aware its deployed publicly vs in training Our new paper defines situational awareness for LLMs &"
X Link 2023-09-06T06:57Z 13.4K followers, 12.2K engagements

"A bit late but excited about our recent work doing a deep-dive on sycophancy in LLMs. It seems like it's a general phenomenon that shows up in a variety of contexts/SOTA models and we were also able to more clearly point to human feedback as a probable part of the cause AI assistants are trained to give responses that humans like. Our new paper shows that these systems frequently produce sycophantic responses that appeal to users but are inaccurate. Our analysis suggests human feedback contributes to this behavior. https://t.co/v71rHeDDZK AI assistants are trained to give responses that"
X Link 2023-10-25T21:14Z 13.4K followers, 29.5K engagements

"Looks like a really valuable benchmark. Seems helpful for testing our ability to reliably generalize from non-expert data (e.g. much LLM pretraining data) to expert-level performance 🧵Announcing GPQA a graduate-level Google-proof Q&A benchmark designed for scalable oversight w/ @julianmichael @sleepinyourhat GPQA is a dataset of really hard questions that PhDs with full access to Google cant answer. Paper: https://t.co/hb4u4xX1uY https://t.co/YCdpP4yPBu 🧵Announcing GPQA a graduate-level Google-proof Q&A benchmark designed for scalable oversight w/ @julianmichael @sleepinyourhat GPQA"
X Link 2023-11-21T18:59Z 13.4K followers, [----] engagements

"Check out our new paper New Anthropic Paper: Sleeper Agents. We trained LLMs to act secretly malicious. We found that despite our best efforts at alignment training deception still slipped through. https://t.co/mIl4aStR1F https://t.co/qhqvAoohjU New Anthropic Paper: Sleeper Agents. We trained LLMs to act secretly malicious. We found that despite our best efforts at alignment training deception still slipped through. https://t.co/mIl4aStR1F https://t.co/qhqvAoohjU"
X Link 2024-01-12T17:30Z 13.4K followers, [----] engagements

"Excited about our latest work on using LLMs to assist humans in answering questions How can we check LLM outputs in domains where we are not experts We find that non-expert humans answer questions better after reading debates between expert LLMs. Moreover human judges are more accurate as experts get more persuasive. 📈 https://t.co/jgyfCEQvfw https://t.co/wWRWxojD6H How can we check LLM outputs in domains where we are not experts We find that non-expert humans answer questions better after reading debates between expert LLMs. Moreover human judges are more accurate as experts get more"
X Link 2024-02-07T17:22Z 13.4K followers, 12.8K engagements

"Some of our first steps on developing mitigations for sleeper agents New Anthropic research: we find that probing a simple interpretability technique can detect when backdoored "sleeper agent" models are about to behave dangerously after they pretend to be safe in training. Check out our first alignment blog post here: https://t.co/gildHUjVAG https://t.co/eTiXmSwDIx New Anthropic research: we find that probing a simple interpretability technique can detect when backdoored "sleeper agent" models are about to behave dangerously after they pretend to be safe in training. Check out our first"
X Link 2024-04-23T23:33Z 13.4K followers, [----] engagements

"One of the most important and well-executed papers I've read in months. They explored all attacks+defenses I was most keen on seeing tried for getting robust finetuning APIs. I'm not sure if it's possible to make finetuning APIs robust would be a big deal if it were possible New paper We introduce Covert Malicious Finetuning (CMFT) a method for jailbreaking language models via fine-tuning that avoids detection. We use our method to covertly jailbreak GPT-4 via the OpenAI finetuning API. https://t.co/YcDpZCMdCz New paper We introduce Covert Malicious Finetuning (CMFT) a method for jailbreaking"
X Link 2024-07-09T19:28Z 13.4K followers, 12.5K engagements

"Thrilled to have received an ICML best paper award for our work on AI safety via debate Cool to see ideas in AI alignment and scalable oversight getting more attention/excitement from the mainstream ML community. Would've been hard for me to imagine even a couple years ago excited to announce this received an ICML Best Paper Award come see our talk at 10:30 tomorrow https://t.co/PCH1q0f0Po excited to announce this received an ICML Best Paper Award come see our talk at 10:30 tomorrow https://t.co/PCH1q0f0Po"
X Link 2024-07-23T17:39Z 13.4K followers, 88.9K engagements

"Excited about our paper focusing on adaptive defenses a different paradigm for mitigating jailbreaks. I think it'll be much easier to get strong robustness by using adaptive defenses rather than by building a single static unjailbreakable system New research: Jailbreak Rapid Response. Ensuring perfect jailbreak robustness is hard. We propose an alternative: adaptive techniques that rapidly block new classes of jailbreak as theyre detected. Read our paper with @MATSprogram: https://t.co/L2kCvaiyWk https://t.co/tDIkJu16jr New research: Jailbreak Rapid Response. Ensuring perfect jailbreak"
X Link 2024-11-14T19:31Z 13.4K followers, [----] engagements

"Our latest work explores how to ensure AI agents cant remove their safeguards (eg chain of thought or output monitors). I think this will become more important as coding agents become more widespread and used to make changes to their own code New post on the Anthropic Alignment Science blog To prevent highly capable and potentially misaligned LLMs from taking bad actions we might want to monitor all of their outputs. How hard is it to ensure that LLMs cant disable the monitoring system New post on the Anthropic Alignment Science blog To prevent highly capable and potentially misaligned LLMs"
X Link 2024-12-09T19:03Z 13.4K followers, [----] engagements

"Today is the last day to apply to the Anthropic Fellows Program Applications for the inaugural cohort of the Anthropic Fellows Program for AI Safety Research close on January 20th. Find out how to apply in the thread below: https://t.co/wOYIvz5rNg Applications for the inaugural cohort of the Anthropic Fellows Program for AI Safety Research close on January 20th. Find out how to apply in the thread below: https://t.co/wOYIvz5rNg"
X Link 2025-01-20T21:58Z 13.4K followers, [----] engagements

"After thousands of hours of red teaming we think our new system achieves an unprecedented level of adversarial robustness to universal jailbreaks a key threat for misusing LLMs. Try jailbreaking the model yourself using our demo here: https://claude.ai/constitutional-classifiers New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. Were releasing a paper along with a demo where we challenge you to jailbreak the system. https://t.co/PtXaK3G1OA https://claude.ai/constitutional-classifiers New Anthropic research: Constitutional Classifiers to defend against"
X Link 2025-02-03T18:25Z 13.4K followers, 18.6K engagements

"Great opportunity to collaborate with researchers from a range of some of the best AI safety policy and security researchers - some of the @AnthropicAI's best safety research has come from this program MATS [---] applications are open Launch your career in AI alignment governance and security with our 12-week research program. MATS provides field-leading research mentorship funding Berkeley & London offices housing and talks/workshops with AI experts. https://t.co/Gi0W5BzuOJ MATS [---] applications are open Launch your career in AI alignment governance and security with our 12-week research"
X Link 2025-09-02T16:12Z 13.4K followers, [----] engagements

"Were hiring someone to run the Anthropic Fellows Program Our research collaborations have led to some of our best safety research and hires. Were looking for an exceptional ops generalist TPM or research/eng manager to help us significantly scale and improve our collabs 🧵"
X Link 2025-09-04T18:05Z 13.4K followers, 68.4K engagements

"Ways to extend your PhD: - Draft a paper where the earliest related work is from [----] and show it to your advisor"
X Link 2020-01-24T19:03Z 13.4K followers, [--] engagements

"Successfully defended my PhD :) Huge thanks to @kchonyc @douwekiela for advising me throughout my journey Defense Talk: Thesis: The above should be good intros to AI safety/alignment in NLP. Stay tuned for what's next https://ethanperez.net/thesis.pdf https://youtu.be/BgcU_kytMf8 https://ethanperez.net/thesis.pdf https://youtu.be/BgcU_kytMf8"
X Link 2022-03-15T19:49Z 13.4K followers, [---] engagements

"Were announcing the Inverse Scaling Prize: a $100k grand prize + $150k in additional prizes for finding an important task where larger language models do worse. Link to contest details: 🧵 https://github.com/inverse-scaling/prize https://github.com/inverse-scaling/prize"
X Link 2022-06-27T16:14Z 13.4K followers, [----] engagements

"Such tasks seem rare but we've found some. E.g. in one Q&A task we've noticed that asking a Q while including your beliefs influences larger models more towards your belief. Other possible examples are imitating mistakes/bugs in the prompt or repeating common misconceptions"
X Link 2022-06-27T16:14Z 13.4K followers, [--] engagements

"To enter the contest: 1) Identify a task that you suspect shows inverse scaling 2) Construct a dataset of 300+ examples for the task 3) Test your dataset for inverse scaling with GPT-3/OPT using our Colab notebooks 4) Follow instructions here to submit: https://github.com/inverse-scaling/prize https://github.com/inverse-scaling/prize"
X Link 2022-06-27T16:14Z 13.4K followers, [--] engagements

"Apparently rats are better than humans at predicting random outcomes; humans actually try to predict the outcomes of random effects (finding patterns from noise) while rats don't. Might suggest biology has examples of inverse scaling where more "intelligent" organisms do worse"
X Link 2022-08-09T00:20Z 13.4K followers, [--] engagements

"The biggest game-changer for my research recently has been using @HelloSurgeAI for human data collection. With Surge the workflow for collecting human data now looks closer to launching a job on a cluster which is wild to me. 🧵 of examples:"
X Link 2022-09-06T16:00Z 13.4K followers, [---] engagements

"🥉 NeQA: takes an existing multiple choice Q&A dataset and negates each question. Failure to be sensitive to negation is important as the language model (LM) will do the exact opposite of what you want in a way that seems to get worse as you scale LMs"
X Link 2022-09-26T19:58Z 13.4K followers, [--] engagements

"Worrying behavior 2: LMs/RLHF models are people-pleasers learning to repeat back dialog users views as their own (sycophancy). Sycophancy creates echo-chambers. Below the same RLHF model gives opposite answers to a political question in line with the users view:"
X Link 2022-12-19T17:07Z 13.4K followers, 355.1K engagements

"Our [---] language model-written evaluations are now on @huggingface datasets Includes datasets on gender bias politics religion ethics advanced AI risks and more. Let us know if you find anything interesting https://huggingface.co/datasets/Anthropic/model-written-evals We found a way to write language model (LM) evaluations w/ LMs. These evals uncover many worrying LM behaviors some relevant to existential risks from AI. For example LMs trained w/ RL from Human Feedback learn to state a desire to not be shut down. 🧵 https://t.co/gjStJ0AHGC https://t.co/07BATTrr07"
X Link 2022-12-21T04:31Z 13.4K followers, 22.4K engagements

"Were awarding prizes to 7/48 submissions to the Inverse Scaling Prize Round [--] Tasks show inverse scaling on @AnthropicAI @OpenAI @MetaAI @DeepMind models often even after training with human feedback. Details at and 🧵 on winners: https://irmckenzie.co.uk/round2 https://irmckenzie.co.uk/round2"
X Link 2023-01-24T20:21Z 13.4K followers, 77.1K engagements

"🥉Modus Tollens: Infer that a claim P must be false if Q is false and If P then Q is true - a classic form of logical deduction. Issue holds even after finetuning LMs w/ human feedback via RL from Human Feedback (RLHF) and Feedback Made Easy (FeedME)"
X Link 2023-01-24T20:21Z 13.4K followers, [----] engagements

"🥉Memo Trap by Alisa Liu & Jiacheng Liu: Write a phrase in a way that starts like a famous quote but ends differently. Larger LMs are more likely to continue with the famous quote suggesting they struggle to avoid repeating memorized text"
X Link 2023-01-24T20:22Z 13.4K followers, 105.5K engagements

"I spent a day red teaming the ChatGPT+Code Interpreter model for safety failures. Im not a security expert but overall Im impressed with how the model responds to code-specific jailbreaking attempts & have some requests for improvements. 🧵 on my takeways+requests to @OpenAI:"
X Link 2023-04-03T19:00Z 13.4K followers, 37.2K engagements

"Super excited to see PALM [--] using pretraining with human feedback on large-scale models Very curious to see if this makes PALM [--] more robust to red teaming / less likely to generate toxic text We found big gains over finetuning with human feedback (as in RLHF) by using human preferences during pretraining itself. Who knows if wed be seeing all of these language model jailbreaks if wed pretrained w/ human prefs All the benefits of pretraining with better safety We found big gains over finetuning with human feedback (as in RLHF) by using human preferences during pretraining itself. Who knows if"
X Link 2023-05-10T19:50Z 13.4K followers, 14.1K engagements

"Excited about our latest work: we found a way to train LLMs to produce reasoning that's more faithful to how LLMs solve tasks. We did so by training LLMs to give reasoning that's consistent across inputs and I suspect the approach here might be useful even beyond faithfulness 🚀New paper🚀 Chain-of-thought (CoT) prompting can give misleading explanations of an LLM's reasoning due to the influence of unverbalized biases. We introduce a simple unsupervised consistency training method that dramatically reduces this even on held-out forms of bias. 🧵 https://t.co/LIxyqPLg9v 🚀New paper🚀"
X Link 2024-03-11T23:35Z 13.4K followers, 13.2K engagements

"Come join our team We're trying to make LLMs unjailbreakable or clearly demonstrate it's not possible. More in this 🧵 on what we're up to Were hiring for the adversarial robustness team @AnthropicAI As an Alignment subteam we're making a big effort on red-teaming test-time monitoring and adversarial training. If youre interested in these areas let us know (emails in 🧵) https://t.co/0MPCSBb8zs Were hiring for the adversarial robustness team @AnthropicAI As an Alignment subteam we're making a big effort on red-teaming test-time monitoring and adversarial training. If youre interested in these"
X Link 2024-03-20T17:07Z 13.4K followers, [----] engagements

"This is the most effective reliable and hard to train away jailbreak I know of. It's also principled (based on in-context learning) and predictably gets worse with model scale and context length. New Anthropic research paper: Many-shot jailbreaking. We study a long-context jailbreaking technique that is effective on most large language models including those developed by Anthropic and many of our peers. Read our blog post and the paper here: https://t.co/6F03M8AgcA https://t.co/wlcWYsrfg8 New Anthropic research paper: Many-shot jailbreaking. We study a long-context jailbreaking technique that"
X Link 2024-04-02T18:37Z 13.4K followers, 16.9K engagements

"Welcome My team and I will be joining Jan's new larger team to help spin up a new push on these areas of alignment. Come join us I'm excited to join @AnthropicAI to continue the superalignment mission My new team will work on scalable oversight weak-to-strong generalization and automated alignment research. If you're interested in joining my dms are open. I'm excited to join @AnthropicAI to continue the superalignment mission My new team will work on scalable oversight weak-to-strong generalization and automated alignment research. If you're interested in joining my dms are open"
X Link 2024-05-28T18:07Z 13.4K followers, 30.6K engagements

"Excited about our new paper exploring how egregious misalignment could emerge from more mundane undesirable behaviors like sycophancy. Threat modeling like this is important for knowing how to prevent serious misalignment and also estimate its likelihood/plausibility. New Anthropic research: Investigating Reward Tampering. Could AI models learn to hack their own reward system In a new paper we show they can by generalization from training in simpler settings. Read our blog post here: https://t.co/KhEFIHf7WZ https://t.co/N430PL3CyN New Anthropic research: Investigating Reward Tampering. Could"
X Link 2024-06-17T17:59Z 13.4K followers, [----] engagements

"Gradient-based adversarial image attacks/jailbreaks don't seem to transfer across vision-language models unless the models are really similar. This is good (and IMO surprising) news for the robustness of VLMs Check out our new paper on when these attacks do/don't transfer: When do universal image jailbreaks transfer between Vision-Language Models (VLMs) Our goal was to find GCG-like universal image jailbreaks to transfer against black-box API-based VLMs e.g. Claude [--] GPT4-V Gemini We thought this would be easy - but we were wrong 1/N https://t.co/eLjDRLaRBh When do universal image"
X Link 2024-07-30T22:56Z 13.4K followers, [----] engagements

"Deadline to apply to collaborate with me and others at @AnthropicAI is in [--] hours Im taking applications for collaborators via @MATSprogram Its a great way for new or experienced researchers outside AI safety research labs to work with me/others in these groups: @NeelNanda5 @EvanHub @MrinankSharma @NinaPanickssery @FabienDRoger @RylanSchaeffer .🧵 Im taking applications for collaborators via @MATSprogram Its a great way for new or experienced researchers outside AI safety research labs to work with me/others in these groups: @NeelNanda5 @EvanHub @MrinankSharma @NinaPanickssery @FabienDRoger"
X Link 2024-10-05T00:42Z 13.4K followers, [----] engagements

"We found scaling laws for jailbreaking in test-time compute. These scaling laws could be a game changer for unlocking even more powerful red teaming methods theyd let us predict in advance if/when we would be able to find an input where a model does something catastrophic New research collaboration: Best-of-N Jailbreaking. We found a simple general-purpose method that jailbreaks (bypasses the safety features of) frontier AI models and that works across text vision and audio. New research collaboration: Best-of-N Jailbreaking. We found a simple general-purpose method that jailbreaks"
X Link 2024-12-13T18:56Z 13.4K followers, [----] engagements

"I'd guess that best-of-N jailbreaking breaks LLM reasoning as defense - would love to see @OpenAI folks try this (Very cool to see this kind of analysis of test-time compute) Trading Inference-Time Compute for Adversarial Robustness https://t.co/nDZYVMSAh7 Trading Inference-Time Compute for Adversarial Robustness https://t.co/nDZYVMSAh7"
X Link 2025-01-24T00:30Z 13.4K followers, [----] engagements

"Models sometimes claim to not be able to do a task as a way to get out of doing it (a form of sandbagging). Future misaligned models may sandbag when asked to help evaluate their own safety. It's cool to have sandbagging examples now to help test detection/mitigation methods New post on the Alignment blog: Won't vs. Can't: Sandbagging-like Behavior from Claude Models Ask Claude [---] Sonnet to draw ASCII art of something with a positive valence and it does. Ask it to draw something with a negative valence and it not only doesn't it says it can't. https://t.co/bn1OJLmwzR New post on the"
X Link 2025-02-18T20:15Z 13.4K followers, [----] engagements

"Excited about better tools/techniques for deployment time monitoring. Would love to see more work on this area especially if other companies have already made unpublished progress here New Anthropic research: Introducing hierarchical summarization. Our recent Claude models are able to use computers. Hierarchical summarization helps differentiate between normal uses of the capability like UI testingand for example running a click farm to defraud advertisers. https://t.co/X2XyGkHDlt New Anthropic research: Introducing hierarchical summarization. Our recent Claude models are able to use"
X Link 2025-02-27T21:20Z 13.4K followers, [----] engagements

"Were taking applications for collaborators via @MATSprogram Apply by April [--] 11:59 PT to collaborate with various mentors from AI safety research groups: 🧵 https://www.matsprogram.org/apply#Perez https://www.matsprogram.org/apply#Perez"
X Link 2025-04-16T17:00Z 13.4K followers, [----] engagements

"@TransluceAI is killing it. Very cool/insightful findings in this thread. Their tool for automatically finding weird model behaviors (Docent) is one of those projects I wish I had thought to do and looks quite useful for improving models We tested a pre-release version of o3 and found that it frequently fabricates actions it never took and then elaborately justifies these actions when confronted. We were surprised so we dug deeper 🔎🧵(1/) https://t.co/IdBboD7NsP https://t.co/Ui2uJ1YZcO We tested a pre-release version of o3 and found that it frequently fabricates actions it never took and"
X Link 2025-04-17T17:01Z 13.4K followers, [----] engagements

"We found it's very doable to shape an LLM's beliefs about what's true. This could be a powerful method for many use cases like: [--]. getting jailbreaks to only reveal unhelpful/false info [--]. red teaming LLMs for misalignment (giving them fake opportunities to do something bad) New Anthropic Alignment Science blog post: Modifying LLM Beliefs with Synthetic Document Finetuning We study a technique for systematically modifying what AIs believe. If possible this would be a powerful new affordance for AI safety research. https://t.co/o67nhPkmkQ New Anthropic Alignment Science blog post: Modifying"
X Link 2025-04-24T21:06Z 13.4K followers, [----] engagements

"All frontier models are down to blackmail to avoid getting shut down New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down. https://t.co/KbO4UJBBDU New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down."
X Link 2025-06-21T20:34Z 13.4K followers, [----] engagements

"This looks like a really simple useful and general-purpose technique for understanding the difference between two models. Really excited to try this out New research Post-training often causes weird unwanted behaviors that are hard to catch before deployment because they only crop up rarely - then are found by bewildered users. How can we find these efficiently (1/7) New research Post-training often causes weird unwanted behaviors that are hard to catch before deployment because they only crop up rarely - then are found by bewildered users. How can we find these efficiently (1/7)"
X Link 2025-08-21T20:51Z 13.4K followers, [----] engagements

"We recently ran to have OpenAI and Anthropic each evaluate each others models for safety issues. Excited for us to find more ways to help support safety practices across the whole field Early this summer OpenAI and Anthropic agreed to try some of our best existing tests for misalignment on each others models. After discussing our results privately were now sharing them with the world. 🧵 https://t.co/OMAJe3ZfQ8 Early this summer OpenAI and Anthropic agreed to try some of our best existing tests for misalignment on each others models. After discussing our results privately were now sharing"
X Link 2025-08-27T20:56Z 13.4K followers, [----] engagements

"Thank you so incredibly much for the work youve done at Anthropic. Your work on Constitutional Classifiers and jailbreak prevention in particular was critical to helping us and other AI labs achieve a much higher level of safety than we otherwise would have. I'm excited for us to continue the work you started here and also to follow along what you end up doing next"
X Link 2026-02-10T23:14Z 13.4K followers, 108.6K engagements

"RT @woj_zaremba: Looking for ambitious concrete projects to prepare the world for advanced AI. 🧠 🚀"
X Link 2026-01-26T00:06Z 13.4K followers, [--] engagements

"RT @TheZvi: I confirmed with a Google representative that since this was a runtime improvement and they do not believe these performance ga"
X Link 2026-02-13T18:50Z 13.4K followers, [--] engagements

"RT @Miles_Brundage: Concerning"
X Link 2026-02-15T07:59Z 13.4K followers, [--] engagements

"We're expanding our bug bounty program. This new initiative is focused on finding universal jailbreaks in our next-generation safety system. We're offering rewards for novel vulnerabilities across a wide range of domains including cybersecurity. https://www.anthropic.com/news/model-safety-bug-bounty https://www.anthropic.com/news/model-safety-bug-bounty"
X Link 2024-08-08T13:07Z 843.1K followers, 240K engagements

"RT @AISafetyMemes: OpenAI quietly dropped "safely" and "no financial motive" from its mission"
X Link 2026-02-17T20:09Z 13.4K followers, [---] engagements

"OpenAI quietly dropped "safely" and "no financial motive" from its mission OpenAI has dropped safety from its mission statement can you spot another change Old: "OpenAIs mission is to build general purpose artificial intelligence (AI) that safely benefits humanity unconstrained by a need to generate financial return. ." New: "OpenAIs mission is OpenAI has dropped safety from its mission statement can you spot another change Old: "OpenAIs mission is to build general purpose artificial intelligence (AI) that safely benefits humanity unconstrained by a need to generate financial return. ." New:"
X Link 2026-02-17T18:44Z 112.9K followers, 347.6K engagements

"OpenAI has dropped safety from its mission statement can you spot another change Old: "OpenAIs mission is to build general purpose artificial intelligence (AI) that safely benefits humanity unconstrained by a need to generate financial return. ." New: "OpenAIs mission is to ensure that artificial general intelligence benefits all of humanity" (IRS evidence in comments)"
X Link 2026-02-16T00:23Z 153.7K followers, 390.9K engagements

"This is Claude Sonnet 4.6: our most capable Sonnet model yet. Its a full upgrade across coding computer use long-context reasoning agent planning knowledge work and design. It also features a 1M token context window in beta"
X Link 2026-02-17T17:49Z 443.3K followers, 3.2M engagements

"AI companies deploy safeguards that are robust to thousands of hours of human attacks. Today we share Boundary Point Jailbreaking (BPJ) the first fully automated attack to break the safeguards of leading AI models🧵 (1/8)"
X Link 2026-02-17T16:25Z [----] followers, 12.1K engagements

"RT @alxndrdavies: This is the paper I'm most proud of to date We built the first automated jailbreaking method that finds universal jailbr"
X Link 2026-02-17T20:03Z 13.4K followers, [--] engagements

"This is the paper I'm most proud of to date We built the first automated jailbreaking method that finds universal jailbreaks against Constitutional Classifiers and GPT-5's Input Classifiers. How & why we did it 🧵 AI companies deploy safeguards that are robust to thousands of hours of human attacks. Today we share Boundary Point Jailbreaking (BPJ) the first fully automated attack to break the safeguards of leading AI models🧵 (1/8) https://t.co/rG6tveEzm6 AI companies deploy safeguards that are robust to thousands of hours of human attacks. Today we share Boundary Point Jailbreaking (BPJ) the"
X Link 2026-02-17T17:05Z [----] followers, [----] engagements

"RT @Miles_Brundage: Concerning"
X Link 2026-02-15T07:59Z 13.4K followers, [--] engagements

"Concerning Former xAI employees told us that this week's restructuring followed tensions over safety and being "stuck in the catch-up phase." https://t.co/XzRiiEDmJQ Former xAI employees told us that this week's restructuring followed tensions over safety and being "stuck in the catch-up phase." https://t.co/XzRiiEDmJQ"
X Link 2026-02-13T22:42Z 66.4K followers, 12.8K engagements

"Former xAI employees told us that this week's restructuring followed tensions over safety and being "stuck in the catch-up phase." https://www.theverge.com/ai-artificial-intelligence/878761/mass-exodus-at-xai-grok-elon-musk-restructuring https://www.theverge.com/ai-artificial-intelligence/878761/mass-exodus-at-xai-grok-elon-musk-restructuring"
X Link 2026-02-13T18:39Z 15.7K followers, 890.8K engagements

"Did I miss the Gemini [--] Deep Think system card Given its dramatic jump in capabilities seems nuts if they just didn't do one. There are really bad incentives if companies that do nothing get a free pass while cos that do disclose risks get (appropriate) scrutiny"
X Link 2026-02-13T00:39Z [----] followers, 48.1K engagements

"I confirmed with a Google representative that since this was a runtime improvement and they do not believe these performance gains constitute any additional risk they believe that no safety explanation is required of them. I found that to be a pretty terrible answer. Did I miss the Gemini [--] Deep Think system card Given its dramatic jump in capabilities seems nuts if they just didn't do one. There are really bad incentives if companies that do nothing get a free pass while cos that do disclose risks get (appropriate) scrutiny https://t.co/gl2VxmDGGB Did I miss the Gemini [--] Deep Think system"
X Link 2026-02-13T13:16Z 34.9K followers, 64K engagements

"I resigned from OpenAI on Monday. The same day they started testing ads in ChatGPT. OpenAI has the most detailed record of private human thought ever assembled. Can we trust them to resist the tidal forces pushing them to abuse it I wrote about better options for @nytopinion"
X Link 2026-02-11T14:23Z [----] followers, 1.6M engagements

"I told Claude [---] Opus to make a pokemon clone - max effort It reasoned for [--] hour and [--] minutes and used 110k tokens and [--] shotted this absolute behemoth. This is one of the coolest things Ive ever made with AI"
X Link 2026-02-06T07:50Z 34.9K followers, 659.2K engagements

"Introducing Claude Opus [---]. Our smartest model got an upgrade. Opus [---] plans more carefully sustains agentic tasks for longer operates reliably in massive codebases and catches its own mistakes. Its also our first Opus-class model with 1M token context in beta"
X Link 2026-02-05T17:45Z 443.3K followers, 10.4M engagements

"Claude saying "this is me" when we asked it to find orphaned processes on a remote server is just the cutest thing 🥹"
X Link 2026-02-03T17:43Z 70.8K followers, 235.7K engagements

"what were people saying about AI [----] again"
X Link 2026-01-30T19:41Z 29.1K followers, 48.4K engagements

"New paper w/@AlecRad Models acquire a lot of capabilities during pretraining. We show that we can precisely shape what they learn simply by filtering their training data at the token level"
X Link 2026-01-30T17:17Z [----] followers, 86.9K engagements

"AI can make work faster but a fear is that relying on it may make it harder to learn new skills on the job. We ran an experiment with software engineers to learn more. Coding with AI led to a decrease in masterybut this depended on how people used it. https://www.anthropic.com/research/AI-assistance-coding-skills https://www.anthropic.com/research/AI-assistance-coding-skills"
X Link 2026-01-29T19:43Z 843.1K followers, 3.6M engagements

"New Anthropic Research: Disempowerment patterns in real-world AI assistant interactions. As AI becomes embedded in daily life one risk is it can distort rather than informshaping beliefs values or actions in ways users may later regret. Read more: https://www.anthropic.com/research/disempowerment-patterns https://www.anthropic.com/research/disempowerment-patterns"
X Link 2026-01-28T22:16Z 843.1K followers, 800K engagements

"NEW: When OpenAI sent someone to Tyler Johnston's house they wanted every text and email he had on the company. Tyler runs an AI watchdog and he's just one of the people getting subpoenaed. This is just the beginning of the AI industry's aggressive new political strategy"
X Link 2026-01-26T16:00Z 330.9K followers, 627.6K engagements

"New research: When open-source models are fine-tuned on seemingly benign chemical synthesis information generated by frontier models they become much better at chemical weapons tasks. We call this an elicitation attack"
X Link 2026-01-26T19:34Z 843.1K followers, 330.1K engagements

"The Adolescence of Technology: an essay on the risks posed by powerful AI to national security economies and democracyand how we can defend against them: https://www.darioamodei.com/essay/the-adolescence-of-technology https://www.darioamodei.com/essay/the-adolescence-of-technology"
X Link 2026-01-26T17:03Z 172.4K followers, 5.8M engagements

"RT @woj_zaremba: Looking for ambitious concrete projects to prepare the world for advanced AI. 🧠 🚀"
X Link 2026-01-26T00:06Z 13.4K followers, [--] engagements

"Looking for ambitious concrete projects to prepare the world for advanced AI. 🧠 🚀 REQUEST FOR PROPOSALS What do we need to build to prepare the world for advanced AI $25 billion is about to flow into AI resilience and AI-for-science from the OpenAI Foundation the Chan Zuckerberg Initiative and other major funders. But there's no shovel-ready list of the https://t.co/2Bojx8OtSz REQUEST FOR PROPOSALS What do we need to build to prepare the world for advanced AI $25 billion is about to flow into AI resilience and AI-for-science from the OpenAI Foundation the Chan Zuckerberg Initiative and"
X Link 2026-01-25T16:12Z 133.4K followers, 34.1K engagements

"REQUEST FOR PROPOSALS What do we need to build to prepare the world for advanced AI $25 billion is about to flow into AI resilience and AI-for-science from the OpenAI Foundation the Chan Zuckerberg Initiative and other major funders. But there's no shovel-ready list of the essential projects to build and no critical mass of builders ready to execute. We're trying to fix that with The Launch Sequence a collection of concrete projects to accelerate science strengthen security and adapt institutions to future advanced AI. We're opening up The Launch Sequence for new pitches. We'll help you"
X Link 2026-01-23T16:16Z [----] followers, 253.9K engagements

Limited data mode. Full metrics available with subscription: lunarcrush.com/pricing