Teknium (e/λ) [@Teknium1](/creator/twitter/Teknium1) on X, 48.3K followers
Created: 2025-07-17 19:22:09 UTC

Just merged a PR for an environment that improves LLM-as-a-Judge and also evaluates how capable models are at making judgements!

Did you know that all verifiable RL environments are nearly equivalent to benchmarks (and vice-versa!)? So we added an evaluate command to Atropos' base and now you can run benchmarks through Atropos environments.
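A minimal sketch of why the two are nearly equivalent, assuming a toy exact-match environment: the same verifier that produces per-sample rewards for training also yields a benchmark score when averaged. All names here (`Item`, `ArithmeticEnv`, `toy_policy`) are hypothetical and for illustration only; this is not the Atropos API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    prompt: str
    answer: str  # ground-truth answer used by the verifier


class ArithmeticEnv:
    """Hypothetical verifiable environment (illustrative, not Atropos)."""

    def __init__(self, items: List[Item]):
        self.items = items

    def score(self, item: Item, completion: str) -> float:
        # Verifiable reward: 1.0 on exact match, else 0.0.
        return 1.0 if completion.strip() == item.answer else 0.0

    def rollout_rewards(self, policy: Callable[[str], str]) -> List[float]:
        # Training view: one reward per sampled completion.
        return [self.score(it, policy(it.prompt)) for it in self.items]

    def evaluate(self, policy: Callable[[str], str]) -> float:
        # Benchmark view: the mean of the very same rewards is accuracy.
        rewards = self.rollout_rewards(policy)
        return sum(rewards) / len(rewards)


def toy_policy(prompt: str) -> str:
    # Stand-in for a model call; deliberately wrong on the last item.
    canned = {"2+2": "4", "3*3": "9", "10-7": "2"}
    return canned[prompt]


env = ArithmeticEnv([Item("2+2", "4"), Item("3*3", "9"), Item("10-7", "3")])
rewards = env.rollout_rewards(toy_policy)
accuracy = env.evaluate(toy_policy)
```

Here `rewards` is what a trainer would consume and `accuracy` is what a leaderboard would report; only the aggregation differs.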

We got frustrated working with so many benchmark frameworks that were outdated or unusable, so we implemented an evaluation-only mode in Atropos, our RL environments framework.

So our first port from outside our existing environments was @natolambert's Reward-Bench! 

Note: it only supports generative reward models (regular LLM Judges) at the moment.
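For context, a rough sketch of how a generative judge can be scored on pairwise preference data, assuming each example is a (prompt, chosen, rejected) triple and the judge answers "A" or "B". The function names and the length-based stand-in judge are hypothetical; this is not the code from the PR or from Reward-Bench.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen, rejected)


def judge_accuracy(
    judge: Callable[[str, str, str], str],
    pairs: List[Pair],
    seed: int = 0,
) -> float:
    """Fraction of pairs where the judge prefers the chosen response."""
    rng = random.Random(seed)
    correct = 0
    for prompt, chosen, rejected in pairs:
        # Randomize the slot of the preferred response so a judge that
        # always answers "A" cannot score above chance.
        if rng.random() < 0.5:
            a, b, gold = chosen, rejected, "A"
        else:
            a, b, gold = rejected, chosen, "B"
        correct += judge(prompt, a, b) == gold
    return correct / len(pairs)


def longer_is_better(prompt: str, a: str, b: str) -> str:
    # Stand-in for an LLM judge call: prefers the longer response.
    return "A" if len(a) >= len(b) else "B"


pairs = [
    ("What is 2+2?", "2+2 equals 4.", "5"),
    ("Capital of France?", "Paris", "It is probably Lyon, I think."),
]
accuracy = judge_accuracy(longer_is_better, pairs)
```

Swapping `longer_is_better` for a function that prompts a real model and parses its verdict would give the evaluation view; the per-pair correctness could equally serve as a training reward for the judge.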

Check out the PR here:

![](https://pbs.twimg.com/media/GwFRj2pWUAMrmhq.jpg)

[Post Link](https://x.com/Teknium1/status/1945927019281478051)
