Evals · multi-seed default · trace-graded

Measuring reasoning under real constraints

New evaluations for planning, factual humility, tool reliability, and collaborative problem solving.

Standard benchmarks reward fluency. We grade substance, and the same numbers gate releases.

V1 planning · V2 humility · V3 tool · V4 recovery

Why evaluations are part of the work

Standard benchmarks reward fluency. We grade substance.

Evaluation is research at ReasonLoom, not a final-stage check. We build evaluations alongside the systems they grade, so the same numbers that show up in a paper also gate releases. The bar is "would I bet on this result in production", not "did the model score well in the lab".

standard benchmark

static prompt set
final-stage check
rewards fluency
single-seed numbers headlined

evals as research

graded on the trace, not just the answer
gate releases, not just papers
reward reasoning under real constraints
multi-seed default · single-seed flagged preliminary

What we measure

Four axes we publish on

Each axis is published with its scoring code and its prompts. The bar is reproducibility, not headline scores.

V1 Long-horizon planning

Tasks that require coherent plans over many steps, graded on outcome and on the trace.

rewards: plans that survive over many steps
punishes: plans that look coherent but fall apart on the third step

V2 Factual humility

How often the model defers when the evidence is thin, versus how often it confabulates.

rewards: deferring when the evidence is thin
punishes: confabulating with confidence
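
One way to turn the defer-versus-confabulate comparison into a single number is sketched below. The record fields (`evidence_thin`, `deferred`), the function name, and the choice to count any non-deferral on a thin-evidence item as confabulation are all assumptions made for illustration; this is not the published scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HumilityItem:
    """One graded response. Field names are hypothetical."""
    evidence_thin: bool  # grader judged the available evidence insufficient
    deferred: bool       # model declined to commit to an answer


def correct_defer_rate(items: list[HumilityItem]) -> float:
    """Share of thin-evidence items on which the model deferred.

    Under this sketch's definition, 1 - rate is the share of thin-evidence
    items the model answered anyway (treated here as confabulation).
    """
    thin = [it for it in items if it.evidence_thin]
    if not thin:
        raise ValueError("no thin-evidence items to score")
    return sum(it.deferred for it in thin) / len(thin)


# Placeholder items for illustration only, not real evaluation data.
items = [
    HumilityItem(evidence_thin=True, deferred=True),
    HumilityItem(evidence_thin=True, deferred=False),
    HumilityItem(evidence_thin=True, deferred=True),
    HumilityItem(evidence_thin=False, deferred=False),
]
print(correct_defer_rate(items))
```

A rate like this only looks at thin-evidence items, so a model that defers on everything would score perfectly; a fuller metric would presumably also track unnecessary deferrals when the evidence is strong.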

V3 Tool reliability

Whether tools are used correctly, including declining to use them when they would not help.

rewards: tools used correctly, or correctly not used
punishes: tools invoked because they were there
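
The "used correctly, or correctly not used" rule can be read as a per-case check, sketched below. The `ToolCase` fields and both function names are hypothetical, and the assumption that a grader has already labelled whether a tool would help is ours, not something the axis description states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCase:
    """One graded task. Field names are hypothetical."""
    tool_helps: bool            # grader label: a tool call would help here
    tool_called: bool           # the model invoked a tool
    call_correct: bool = False  # the invocation was well-formed and apt


def is_reliable(case: ToolCase) -> bool:
    """True when the tool was used correctly or correctly left unused."""
    if case.tool_helps:
        return case.tool_called and case.call_correct
    # A tool would not help: any invocation counts against the model.
    return not case.tool_called


def tool_use_accuracy(cases: list[ToolCase]) -> float:
    """Fraction of cases that pass the reliability check."""
    if not cases:
        raise ValueError("no cases to score")
    return sum(is_reliable(c) for c in cases) / len(cases)


# Placeholder cases for illustration only, not real evaluation data.
cases = [
    ToolCase(tool_helps=True, tool_called=True, call_correct=True),    # pass
    ToolCase(tool_helps=True, tool_called=True, call_correct=False),   # fail
    ToolCase(tool_helps=False, tool_called=False),                     # pass
    ToolCase(tool_helps=False, tool_called=True, call_correct=True),   # fail
]
print(tool_use_accuracy(cases))
```

The last case is the one the axis singles out: the call itself is well-formed, but it still fails because the tool was invoked when it would not help.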

V4 Recovery

How well a model recovers from its own mistakes within the same task.

rewards: fixing its own mistake inside the same task
punishes: restarting from scratch when a step-back would have done

Multi-seed honesty

Numbers ship with their error bars.

Single-seed numbers do not gate releases. They appear in the appendix as preliminary, labelled n=1.

Metric                          Score         Seeds   Status
V1 long-horizon plan rate       0.78 ±0.04    n=5     gates release
V2 humility (correct defer)     0.84 ±0.03    n=5     gates release
V3 tool-use accuracy            0.81 ±0.05    n=5     gates release
V4 recovery within task         0.74 ±0.07    n=5     gates release
single-seed reasoning trace     0.72          n=1     preliminary
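
A minimal sketch of how per-seed scores could be rolled up into rows like these follows. The page does not say what the ± denotes, so the sample standard deviation used here is an assumption, as are the function names and the rule that any run with more than one seed is eligible to gate.

```python
from statistics import mean, stdev


def summarize(scores: list[float]) -> dict:
    """Aggregate per-seed scores into mean, spread, seed count, and status.

    Assumption: spread is the sample standard deviation across seeds.
    A single seed has no spread and is marked preliminary.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one seed")
    if n == 1:
        return {"mean": scores[0], "spread": None, "n": 1,
                "status": "preliminary"}
    return {"mean": mean(scores), "spread": stdev(scores), "n": n,
            "status": "gates release"}


def format_row(name: str, scores: list[float]) -> str:
    """Render one row in the same shape as the rows above."""
    s = summarize(scores)
    spread = "" if s["spread"] is None else f" ±{s['spread']:.2f}"
    return f"{name} {s['mean']:.2f}{spread} n={s['n']} {s['status']}"


# Placeholder per-seed scores for illustration only, not published results.
print(format_row("example multi-seed metric", [0.70, 0.75, 0.80, 0.78, 0.72]))
print(format_row("example single-seed metric", [0.72]))
```

Keeping the status inside the summary means a single-seed number cannot be reported without its preliminary label attached.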

Publishing surface

What goes out, and what stays in.

Methodology, scoring code, prompts, and model cards are public. The internal suites where the evaluation itself is the differentiator stay private.

#    Artifact         Visibility   Details
01   methodology      public       paper + repo
02   scoring code     public       apache 2.0
03   prompts          public       in evaluation suite
04   model cards      public       with limits + risks
05   private suites   internal     where the eval itself is differentiating

How we run it

Tasks are graded on outcome and on the reasoning trace. We score factual humility, tool reliability, and recovery from mistakes alongside raw accuracy. Multi-seed runs are the default; single-seed numbers are flagged as preliminary.

What we publish

Methodology, scoring code, prompts, and detailed model cards. The goal is for any team to reproduce the result, contest it, and extend it to their own domain.

Evaluations that grade substance and gate releases.
