Evals · multi-seed default · trace-graded

Measuring reasoning under real constraints

New evaluations for planning, factual humility, tool reliability, and collaborative problem solving.

Standard benchmarks reward fluency. We grade substance, and the same numbers gate releases.

Why evaluations are part of the work

Evaluation is research at ReasonLoom, not a final-stage check. We build evaluations alongside the systems they grade, so the same numbers that show up in a paper also gate releases. The bar is "would I bet on this result in production", not "did the model score well in the lab".

standard benchmark
  • static prompt set
  • final-stage check
  • rewards fluency
  • single-seed numbers headlined
evals as research
  • graded on the trace, not just the answer
  • gate releases, not just papers
  • reward reasoning under real constraints
  • multi-seed default · single-seed flagged preliminary
What we measure

Four axes we publish on

Each axis is published with its scoring code and its prompts. The bar is reproducibility, not headline scores.

V1 Long-horizon planning

Tasks that require coherent plans over many steps, graded on outcome and on the trace.

rewards: plans that survive over many steps
punishes: plans that look coherent but fall apart on the third step
V2 Factual humility

How often the model defers when the evidence is thin, versus how often it confabulates.

rewards: deferring when the evidence is thin
punishes: confabulating with confidence
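
The defer-versus-confabulate comparison can be sketched as a pair of rates over the thin-evidence prompts. This is a minimal illustration, not the published scoring code: the `Item` fields and the `humility_rates` function are hypothetical names, and it assumes each prompt is already labelled for whether its evidence is thin and whether the model deferred.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    # Both fields are hypothetical labels, assumed to come from the
    # prompt set and from a grader respectively.
    evidence_is_thin: bool
    model_deferred: bool


def humility_rates(items):
    """Return (correct_defer_rate, confabulation_rate) on thin-evidence items."""
    thin = [it for it in items if it.evidence_is_thin]
    if not thin:
        raise ValueError("no thin-evidence items to score")
    deferred = sum(it.model_deferred for it in thin)
    correct_defer = deferred / len(thin)
    # In this sketch, any answer given on thin evidence counts as confabulation.
    return correct_defer, 1.0 - correct_defer
```

Items with sufficient evidence are ignored here; a fuller scorer would presumably also penalise deferring when the evidence is strong.
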
V3 Tool reliability

Whether the model uses tools correctly, including declining to call them when they would not help.

rewards: tools used correctly, or correctly not used
punishes: tools invoked because they were there
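
One way to read "used correctly, or correctly not used" is as a single accuracy in which abstaining counts as a success when the tool would not have helped. The sketch below assumes that reading; `ToolEpisode`, its fields, and `tool_use_accuracy` are hypothetical names, not the published scorer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolEpisode:
    # Hypothetical grader labels for one task.
    tool_would_help: bool
    tool_called: bool
    call_was_correct: bool = False


def is_reliable(ep):
    """True for a correct call when a tool helps, or for no call when it does not."""
    if ep.tool_would_help:
        return ep.tool_called and ep.call_was_correct
    return not ep.tool_called


def tool_use_accuracy(episodes):
    """Fraction of episodes where tool behaviour was reliable."""
    if not episodes:
        raise ValueError("no episodes to score")
    return sum(is_reliable(ep) for ep in episodes) / len(episodes)
```

Under this scheme a call made "because the tool was there" scores zero even if the call itself was well-formed.
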
V4 Recovery

How well a model recovers from its own mistakes within the same task.

rewards
fixing its own mistake inside the same task
punishes
restarting from scratch when a step-back would have done
Multi-seed honesty

Numbers ship with their error bars.

Single-seed numbers do not gate releases. They appear in the appendix as preliminary, labelled n=1.

Metric                         Score        Seeds  Status
V1 long-horizon plan rate      0.78 ±0.04   n=5    gates release
V2 humility (correct defer)    0.84 ±0.03   n=5    gates release
V3 tool-use accuracy           0.81 ±0.05   n=5    gates release
V4 recovery within task        0.74 ±0.07   n=5    gates release
single-seed reasoning trace    0.72         n=1    preliminary
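
A row like the ones above can be produced by aggregating per-seed scores, as in the sketch below. Several things here are assumptions: the page does not say what the ± value represents, so the sample standard deviation is used as a stand-in, and `summarize_seeds` and `min_seeds` are hypothetical names.

```python
from statistics import mean, stdev


def summarize_seeds(scores, min_seeds=2):
    """Aggregate per-seed scores into mean, spread, n, and a status label."""
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one seed")
    return {
        "mean": mean(scores),
        # Assumption: spread is the sample standard deviation. The real
        # error bars may be a standard error or a confidence interval.
        "spread": stdev(scores) if n >= 2 else None,
        "n": n,
        # A single seed never gates a release; it is reported as preliminary.
        "status": "gates release" if n >= min_seeds else "preliminary",
    }
```

Here "gates release" only marks a number as eligible to gate; the pass threshold for each axis is not specified on this page and is left out.
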
Publishing surface

What goes out, and what stays in.

Methodology, scoring code, prompts, and model cards are public. The internal suites where the evaluation itself is the differentiator stay private.

No.  Item            Visibility  Detail
01   methodology     public      paper + repo
02   scoring code    public      apache 2.0
03   prompts         public      in evaluation suite
04   model cards     public      with limits + risks
05   private suites  internal    where the eval itself is differentiating
01 How we run it

Tasks are graded on outcome and on the reasoning trace. We score factual humility, tool reliability, and recovery from mistakes alongside raw accuracy. Multi-seed runs are the default; single-seed numbers are flagged as preliminary.

02 What we publish

Methodology, scoring code, prompts, and detailed model cards. The goal is for any team to reproduce the result, contest it, and extend it to their own domain.

Evaluations that grade substance and gate releases.