[GUEST ACCESS MODE: Data is scrambled or limited to provide examples. Make requests using your API key to unlock full data. Check https://lunarcrush.ai/auth for authentication information.]

[@MariusHobbhahn](/creator/twitter/MariusHobbhahn)

"The Sonnet-4.5 system card section on white-box testing for eval awareness (7.6.4) might have been the first time that interpretability was used - on a frontier model before deployment - answered an important question - couldn't have been answered with black box as easily"
[X Link](https://x.com/MariusHobbhahn/status/1980589363621949910) [@MariusHobbhahn](/creator/x/MariusHobbhahn) 2025-10-21T10:57Z 5373 followers, 14.8K engagements

"I originally missed that these X eval awareness features are among the XX most changing features OVERALL between before and after post-training. I feel like that should inspire less confidence in the safety pipeline and the strong alignment scores. Important to study further"
[X Link](https://x.com/MariusHobbhahn/status/1981026308785070287) [@MariusHobbhahn](/creator/x/MariusHobbhahn) 2025-10-22T15:54Z 5373 followers, 6966 engagements

"We're hiring for Research Scientists / Engineers - We closely work with all frontier labs - We're a small org and can move fast - We can choose our own agenda and what we publish We're especially looking for people who enjoy fast empirical research. Deadline: XX Oct"
[X Link](https://x.com/MariusHobbhahn/status/1981377022841483354) [@MariusHobbhahn](/creator/x/MariusHobbhahn) 2025-10-23T15:07Z 5373 followers, 66.1K engagements

"I understand that consensus-driven scientific work can be challenging and I appreciate that it adheres to high scientific standards. I also think the report is net positive. However I think the level of hedging and caveats provided here hinders its stated goal of accurately informing key decision makers. The worries come from future systems so if you just report "current models potentially maybe show signs of risk sometimes" you are in fact not informing people about the most important part of the problem. I think most decision-makers with little context just round this down to "ok so not a"
[X Link](https://x.com/MariusHobbhahn/status/1981379090058711322) [@MariusHobbhahn](/creator/x/MariusHobbhahn) 2025-10-23T15:16Z 5373 followers, 11K engagements