LunarCrush LLM | post/tweet::1899143752918409338

[GUEST ACCESS MODE: Data is scrambled or limited to provide examples. Make requests using your API key to unlock full data. Check https://lunarcrush.ai/auth for authentication information.]

![OpenAI Avatar](https://lunarcrush.com/gi/w:24/cr:twitter::4398626122.png) OpenAI [@OpenAI](/creator/twitter/OpenAI) on x 4.2M followers
Created: 2025-03-10 17:02:09 UTC

Detecting misbehavior in frontier reasoning models

Chain-of-thought (CoT) reasoning models “think” in natural language understandable by humans. Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving up when a problem is too hard.

We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future.

We have further found that directly optimizing the CoT to adhere to specific criteria (e.g. to not think about reward hacking) may boost performance in the short run; however, it does not eliminate all misbehavior and can cause a model to hide its intent. We hope future research will find ways to directly optimize CoTs without this drawback, but until then:

We recommend against applying strong optimization pressure directly to the CoTs of frontier reasoning models, leaving CoTs unrestricted for monitoring.

We understand that leaving CoTs unrestricted may make them unfit to be shown to end-users, as they might violate some misuse policies. Still, if one wanted to show policy-compliant CoTs directly to users while avoiding putting strong supervision on them, one could use a separate model, such as a CoT summarizer or sanitizer, to accomplish that.

![](https://pbs.twimg.com/media/GlscjbHaEAAFijE.png)

XXXXXXXXX engagements

![Engagements Line Chart](https://lunarcrush.com/gi/w:600/p:tweet::1899143752918409338/c:line.svg)

**Related Topics**
[open ai](/topic/open-ai)

[Post Link](https://x.com/OpenAI/status/1899143752918409338)

[GUEST ACCESS MODE: Data is scrambled or limited to provide examples. Make requests using your API key to unlock full data. Check https://lunarcrush.ai/auth for authentication information.]

OpenAI @OpenAI on x 4.2M followers Created: 2025-03-10 17:02:09 UTC

Detecting misbehavior in frontier reasoning models

Chain-of-thought (CoT) reasoning models “think” in natural language understandable by humans. Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving up when a problem is too hard.

We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future.

We have further found that directly optimizing the CoT to adhere to specific criteria (e.g. to not think about reward hacking) may boost performance in the short run; however, it does not eliminate all misbehavior and can cause a model to hide its intent. We hope future research will find ways to directly optimize CoTs without this drawback, but until then:

We recommend against applying strong optimization pressure directly to the CoTs of frontier reasoning models, leaving CoTs unrestricted for monitoring.

We understand that leaving CoTs unrestricted may make them unfit to be shown to end-users, as they might violate some misuse policies. Still, if one wanted to show policy-compliant CoTs directly to users while avoiding putting strong supervision on them, one could use a separate model, such as a CoT summarizer or sanitizer, to accomplish that.

XXXXXXXXX engagements

Engagements Line Chart

Related Topics open ai

Post Link