elvis (@omarsar0) on X, 254.8K followers
Created: 2025-07-14 15:17:21 UTC
Inference-time tricks fail to help
Chain-of-thought (CoT) prompting and majority voting do not reliably reduce vulnerability, and sometimes make the false positive rate (FPR) worse, especially for Qwen models on math tasks.
LLMs continue to show all kinds of weird vulnerabilities, and this is just one of the latest results to be published. This highlights the importance of building robust LLM-based evaluation strategies.
Paper:
Post Link: https://x.com/omarsar0/status/1944778251064496579