Palestra

A debate-and-effort gym where humans and models practise reasoning under live evaluation, not on a static benchmark.

format debate · drill · voice · socratic

evaluators coach · critic · effort · socratic

voice loop whisper large-v3 → kokoro v1

What Palestra is

A gym, not a benchmark

Palestra is the cognitive gym we use to study human–AI collaboration. It runs structured rounds (debates, drills, voice exercises, and socratic sessions) with multi-head evaluators for coaching, critique, effort, and socratic prompting. It is where we measure the quality of reasoning under live conditions rather than on a static benchmark.

benchmark

static prompt set
single-shot grade
rewards fluency
no rebuttal allowed

gym

live opponents
scored over the trace
rewards reasoning that survives a real exchange
probe must land on the load-bearing claim

How it works

Four evaluator heads, one gym

Four heads, scored independently. The round score is the trace, not a single number.

P1 Coach

Scores moves on whether they advance the participant's reasoning, not on whether they sound smart.

rewards: moves that advance the reasoning
punishes: moves that sound smart but go nowhere

P2 Critic

Adversarial role that probes the weakest claim in each round.

rewards: pressure on the weakest claim
punishes: broad pressure that misses the load-bearing claim

P3 Effort reward model

Separates effort from outcome so quality of reasoning is graded even when the answer happens to be lucky or unlucky.

rewards: quality of reasoning regardless of luck
punishes: outcome-only thinking, such as lucky shortcuts

P4 Socratic

Asks the next question instead of giving the next answer. Useful when the gym is being used to train, not evaluate.

rewards: asking the next question instead of answering
punishes: foreclosing the round prematurely
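
The independence of the four heads can be pictured with a small sketch. Everything here is hypothetical: the `Move` type, the `Head` signature, and the constant-returning placeholder heads are illustrative assumptions, not Palestra's actual interface. The only values taken from the page are the scores for move 01 in the round trace.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Union

@dataclass(frozen=True)
class Move:
    index: int
    who: str   # participant label, "A" or "B"
    text: str

# Assumed head signature: a numeric score, a socratic prompt (str),
# or None when the head has nothing to say about this move.
HeadOutput = Optional[Union[float, str]]
Head = Callable[[Move], HeadOutput]

def score_move(move: Move, heads: Dict[str, Head]) -> Dict[str, HeadOutput]:
    """Run every head on the same move; no head sees another head's output."""
    return {name: head(move) for name, head in heads.items()}

# Placeholder heads that return fixed values, only to exercise the wiring.
heads: Dict[str, Head] = {
    "P1_coach": lambda m: 0.72,
    "P2_critic": lambda m: 0.41,
    "P3_effort": lambda m: 0.66,
    "P4_socratic": lambda m: None,
}

row = score_move(Move(1, "A", "Opening claim: X causes Y under conditions C."), heads)
print(row)
```

Keeping each head a separate callable means a mode can enable any subset of them without changing how a move is scored.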

Anatomy of a round

Six moves, four heads, one trace.

Score columns are per-head, per-move. The trace is what the participant takes home.

#   who  move                                                   P1    P2    P3    P4
01  A    Opening claim: X causes Y under conditions C.          0.72  0.41  0.66  –
02  B    Counter: prior P shows Y occurs without X.             0.81  0.74  0.70  why C?
03  A    Refines: only under C₁ does the causal link hold.      0.78  0.62  0.74  –
04  B    Probes the weakest claim: define C₁ operationally.     0.65  0.88  0.71  –
05  A    Operationalises C₁ with measurable threshold.          0.84  0.79  0.82  –
06  B    Accepts refinement, asks for an out-of-sample test.    0.86  0.83  0.78  OOS?
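
One way to hold such a trace in code is sketched below. The tuple layout and the helper names (`peak_move`, `participant_mean`, `socratic_prompts`) are assumptions made for illustration; the numbers are copied from the six-move round above. The sketch keeps the per-move series intact and only derives views from it, in keeping with the idea that the trace, not a single number, is the result.

```python
from statistics import mean

# (move number, participant, P1 coach, P2 critic, P3 effort, P4 socratic prompt)
TRACE = [
    (1, "A", 0.72, 0.41, 0.66, None),
    (2, "B", 0.81, 0.74, 0.70, "why C?"),
    (3, "A", 0.78, 0.62, 0.74, None),
    (4, "B", 0.65, 0.88, 0.71, None),
    (5, "A", 0.84, 0.79, 0.82, None),
    (6, "B", 0.86, 0.83, 0.78, "OOS?"),
]
P1, P2, P3, P4 = 2, 3, 4, 5  # column positions of the heads in each row

def peak_move(trace, column):
    """Move number at which the given head scored highest."""
    return max(trace, key=lambda row: row[column])[0]

def participant_mean(trace, who, column):
    """Mean score of one head over one participant's moves, rounded to 3 places."""
    return round(mean(row[column] for row in trace if row[1] == who), 3)

def socratic_prompts(trace):
    """Moves on which the socratic head asked a question."""
    return [(row[0], row[P4]) for row in trace if row[P4] is not None]

print(peak_move(TRACE, P2))              # critic peaks on move 4, the probe of C₁
print(participant_mean(TRACE, "A", P1))  # 0.78
print(socratic_prompts(TRACE))
```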

Voice in the loop

A round you can talk into.

Audio-in, audio-out, end-to-end. The evaluation pipeline does not break when the medium changes.

mic          16 kHz capture            +0 ms
whisper L3   faster-whisper large-v3   +120 ms
evaluator    4-head scoring            +280 ms
response     reasoning trace           +720 ms
kokoro v1    speech synthesis          +920 ms

round-trip verified · STT and TTS run without breaking the 4-head pipeline
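
The timing marks can be turned into per-stage gaps with a few lines of arithmetic. This sketch assumes the offsets are cumulative, measured from microphone capture; the page does not say whether each mark is the start or the end of its stage, so the gaps below are simply the differences between consecutive marks.

```python
# Cumulative offsets in milliseconds, as listed in the voice pipeline.
STAGES = [
    ("mic", 0),
    ("whisper large-v3", 120),
    ("evaluator", 280),
    ("response", 720),
    ("kokoro v1", 920),
]

def stage_gaps(stages):
    """Return (from_stage, to_stage, gap_ms) for each pair of consecutive marks."""
    return [(a[0], b[0], b[1] - a[1]) for a, b in zip(stages, stages[1:])]

gaps = stage_gaps(STAGES)
for src, dst, ms in gaps:
    print(f"{src} -> {dst}: {ms} ms")

# The largest gap and the total round trip under the cumulative-offset assumption.
slowest = max(gaps, key=lambda g: g[2])
total_ms = sum(g[2] for g in gaps)
print(slowest, total_ms)
```

Under that assumption the gap between the evaluator mark and the response mark (440 ms) is the largest share of the 920 ms round trip.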

Gym modes

Four formats, same pipeline.

mode      rounds    heads         description
Debate    2 vs 2    P1·P2·P3      structured argument under live scoring
Drill     1 vs RM   P1·P3         short repetitions on a single move
Voice     live      P1·P2·P3·P4   audio-in / audio-out end-to-end
Socratic  training  P4            asks the next question, never gives the answer
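
The mode-to-head mapping above reads naturally as configuration. The sketch below is a hypothetical encoding: the dictionary layout and the functions `active_heads` and `modes_using` are illustrative assumptions, while the rounds and head assignments are taken from the four formats listed.

```python
# Head assignments per mode, copied from the four formats listed above.
MODES = {
    "debate":   {"rounds": "2 vs 2",   "heads": ("P1", "P2", "P3")},
    "drill":    {"rounds": "1 vs RM",  "heads": ("P1", "P3")},
    "voice":    {"rounds": "live",     "heads": ("P1", "P2", "P3", "P4")},
    "socratic": {"rounds": "training", "heads": ("P4",)},
}
HEAD_NAMES = {"P1": "Coach", "P2": "Critic", "P3": "Effort RM", "P4": "Socratic"}

def active_heads(mode):
    """Names of the evaluator heads enabled for a mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return [HEAD_NAMES[h] for h in MODES[mode]["heads"]]

def modes_using(head):
    """Modes (sorted alphabetically) in which a given head is enabled."""
    return sorted(m for m, cfg in MODES.items() if head in cfg["heads"])

print(active_heads("drill"))
print(modes_using("P4"))
```

Read this way, only the voice format runs all four heads, and the socratic head appears only in the voice and socratic formats.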

Voice in the loop

Palestra has a voice mode wired to a verified speech-to-text and text-to-speech path, so live debate exercises run end to end without breaking the evaluation pipeline.

Why it matters

Static benchmarks reward fluency. The gym rewards reasoning that survives a real exchange. We use it to study how teams of humans and models actually collaborate, with the evaluator heads visible to both sides.

A gym for reasoning that survives a real exchange.
