LunarCrush LLM | post/tweet::1944778251064496579

[GUEST ACCESS MODE: Data is scrambled or limited to provide examples. Make requests using your API key to unlock full data. Check https://lunarcrush.ai/auth for authentication information.]

elvis [@omarsar0](/creator/twitter/omarsar0) on X, 256K followers
Created: 2025-07-14 15:17:21 UTC

Inference-time tricks fail to help

Chain-of-thought (CoT) prompting and majority voting do not reliably reduce vulnerability and sometimes worsen the false positive rate (FPR), especially for Qwen models on math tasks.

LLMs continue to show all kinds of weird vulnerabilities, and this is just one of the latest results to be published. This highlights the importance of building robust LLM-based evaluation strategies.
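To make the terms in the post concrete, here is a minimal sketch of how majority voting over sampled judge verdicts and a false positive rate could be computed. Everything in it is an assumption for illustration: the function names `majority_vote` and `false_positive_rate` are hypothetical, and the verdicts are made-up toy values, not results from the paper.

```python
from collections import Counter


def majority_vote(verdicts):
    """Return the most common verdict among several sampled judge outputs."""
    return Counter(verdicts).most_common(1)[0][0]


def false_positive_rate(predictions, labels):
    """FPR = FP / (FP + TN): share of truly incorrect items the judge accepted."""
    negatives = [p for p, y in zip(predictions, labels) if not y]
    if not negatives:
        return 0.0
    return sum(negatives) / len(negatives)


# Toy data (hypothetical): three sampled verdicts per answer,
# where True means the judge accepted the answer as correct.
samples = [
    [True, True, False],
    [False, True, True],
    [False, False, True],
    [True, True, True],
]
# Ground truth: only the last answer is actually correct.
labels = [False, False, False, True]

single = [s[0] for s in samples]             # use the first sample only
voted = [majority_vote(s) for s in samples]  # aggregate the three samples

print("single-sample FPR:", false_positive_rate(single, labels))
print("majority-vote FPR:", false_positive_rate(voted, labels))
```

In this toy set the judge is wrong on most samples for the second answer, so voting locks in the error rather than averaging it away. That is one way aggregation can fail to lower FPR, under these assumed numbers.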

Paper:

![](https://pbs.twimg.com/media/Gv09g2LaAAAL24M.png)

XXXXX engagements

![Engagements Line Chart](https://lunarcrush.com/gi/w:600/p:tweet::1944778251064496579/c:line.svg)

**Related Topics**
[robust](/topic/robust)
[qwen](/topic/qwen)
[elvis](/topic/elvis)

[Post Link](https://x.com/omarsar0/status/1944778251064496579)
