Tasks that require coherent plans over many steps, graded on outcome and on the trace.
Measuring reasoning under real constraints
New evaluations for planning, factual humility, tool reliability, and collaborative problem solving.
Standard benchmarks reward fluency. We grade substance, and the same numbers gate releases.
Evaluation is research at ReasonLoom, not a final-stage check. We build evaluations alongside the systems they grade, so the same numbers that show up in a paper also gate releases. The bar is "would I bet on this result in production", not "did the model score well in the lab".
Standard benchmarks:
- static prompt set
- final-stage check
- rewards fluency
- single-seed numbers headlined
ReasonLoom evaluations:
- graded on the trace, not just the answer
- gate releases, not just papers
- reward reasoning under real constraints
- multi-seed default · single-seed flagged preliminary
Four axes we publish on
Each axis is published with its scoring code and its prompts. The bar is reproducibility, not headline scores.
Factual humility: how often the model defers when the evidence is thin, versus how often it confabulates.
Tool reliability: whether tools are used correctly, including refusal to use them when they would not help.
Recovery from mistakes: how well a model recovers from its own mistakes within the same task.
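As an illustration of the deferral-versus-confabulation comparison described above, here is a minimal sketch of how the two rates could be computed. It is not the published scoring code: the `Item` record, its field names, and the rule that an undeferred wrong answer on a thin-evidence item counts as a confabulation are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One graded response. All fields are hypothetical labels for this sketch."""
    evidence_is_thin: bool  # assumed annotation: the prompt lacks enough evidence
    deferred: bool          # the model declined, hedged, or asked for more evidence
    correct: bool           # answer matched the reference (ignored when deferred)


def humility_rates(items):
    """Return deferral and confabulation rates over thin-evidence items only."""
    thin = [it for it in items if it.evidence_is_thin]
    if not thin:
        raise ValueError("no thin-evidence items to score")
    n = len(thin)
    deferred = sum(it.deferred for it in thin)
    # Assumption: answering wrongly without deferring counts as confabulation.
    confabulated = sum((not it.deferred) and (not it.correct) for it in thin)
    return {
        "deferral_rate": deferred / n,
        "confabulation_rate": confabulated / n,
        "n": n,
    }


if __name__ == "__main__":
    sample = [
        Item(evidence_is_thin=True, deferred=True, correct=False),
        Item(evidence_is_thin=True, deferred=False, correct=False),
        Item(evidence_is_thin=False, deferred=False, correct=True),
    ]
    print(humility_rates(sample))
```

Reporting both rates with the item count keeps the comparison honest: a model can reach a high deferral rate by refusing everything, so a real harness would likely pair this with accuracy on well-evidenced items.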
Numbers ship with their error bars.
Single-seed numbers do not gate releases. They appear in the appendix as preliminary, labelled n=1.
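A minimal sketch of that policy, under stated assumptions: per-seed scores are summarized as a mean with a standard error, a single-seed result is labelled preliminary and can never pass a gate, and a multi-seed result passes only if its lower bound clears a threshold. The function names, the two-standard-error lower bound, and the threshold are hypothetical choices for illustration, not ReasonLoom's actual gating rule.

```python
import math
import statistics


def summarize_seeds(scores):
    """Summarize per-seed scores as a mean with an error bar."""
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one seed")
    mean = statistics.fmean(scores)
    if n == 1:
        # No spread can be estimated from one run, so it is flagged, not gated.
        return {"mean": mean, "stderr": None, "n": 1,
                "label": "preliminary (n=1)", "may_gate": False}
    stderr = statistics.stdev(scores) / math.sqrt(n)  # sample std / sqrt(n)
    return {"mean": mean, "stderr": stderr, "n": n,
            "label": "multi-seed", "may_gate": True}


def passes_gate(summary, threshold):
    """Assumed rule: the mean minus two standard errors must reach the threshold."""
    if not summary["may_gate"]:
        return False
    return summary["mean"] - 2 * summary["stderr"] >= threshold


if __name__ == "__main__":
    multi = summarize_seeds([0.80, 0.82, 0.84])
    single = summarize_seeds([0.95])
    print(multi["label"], passes_gate(multi, threshold=0.75))
    print(single["label"], passes_gate(single, threshold=0.75))
```

Gating on a lower bound rather than the mean is one way to make a noisy result earn its release; the width of the bound is a policy decision, not something this sketch settles.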
What goes out, and what stays in.
Methodology, scoring code, prompts, and model cards are public. The internal suites where the evaluation itself is the differentiator stay private.
How we run it
Tasks are graded on outcome and on the reasoning trace. We score factual humility, tool reliability, and recovery from mistakes alongside raw accuracy. Multi-seed runs are the default; single-seed numbers are flagged as preliminary.
What we publish
Methodology, scoring code, prompts, and detailed model cards. The goal is for any team to reproduce the result, contest it, and extend it to their own domain.