Research tools · live round · 4 heads

Palestra

A debate-and-effort gym where humans and models practise reasoning under live evaluation, not on a static benchmark.

format debate · drill · voice · socratic
evaluators coach · critic · effort · socratic
voice loop whisper large-v3 → kokoro v1
What Palestra is

A gym, not a benchmark

Palestra is the cognitive gym we use to study human–AI collaboration. It runs structured rounds — debates, drills, voice exercises — with multi-head evaluators for coaching, critique, effort, and socratic prompting. It is where we measure quality of reasoning under live conditions, not on a static benchmark.

benchmark
  • static prompt set
  • single-shot grade
  • rewards fluency
  • no rebuttal allowed
gym
  • live opponents
  • scored over the trace
  • rewards reasoning that survives a real exchange
  • probe must land on the load-bearing claim
How it works

Four evaluator heads, one gym

Four heads, scored independently. The round score is the trace, not a single number.

P1 Coach

Scores moves on whether they advance the participant's reasoning, not on whether they sound smart.

rewards
moves that advance the reasoning
punishes
moves that sound smart but go nowhere
P2 Critic

Adversarial role that probes the weakest claim in each round.

rewards
pressure on the weakest claim
punishes
broad pressure that misses the load-bearing claim
P3 Effort reward model

Separates effort from outcome so quality of reasoning is graded even when the answer happens to be lucky or unlucky.

rewards
quality of reasoning regardless of luck
punishes
outcome-only thinking — lucky shortcuts
P4 Socratic

Asks the next question instead of giving the next answer. Useful when the gym is being used to train, not evaluate.

rewards
asking the next question instead of answering
punishes
foreclosing the round prematurely
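
A minimal sketch of what "scored independently" could look like in code. Everything here is illustrative: the `Head` type, `TraceRow`, `score_round`, and the constant-valued placeholder heads are assumptions, not Palestra's actual interfaces. The point is only that each head sees the move and nothing else, and that the result is the full trace rather than an aggregate.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type: a head maps the text of one move to a score.
Head = Callable[[str], float]


@dataclass
class TraceRow:
    move: str
    scores: Dict[str, float] = field(default_factory=dict)


def score_round(moves: List[str], heads: Dict[str, Head]) -> List[TraceRow]:
    """Score every move with every head; no head sees another head's output."""
    trace = []
    for move in moves:
        row = TraceRow(move=move)
        for name, head in heads.items():
            row.scores[name] = head(move)
        trace.append(row)
    # The trace is returned as-is: nothing is averaged into a single number.
    return trace


# Placeholder heads that return a constant; real heads would be models.
demo_heads: Dict[str, Head] = {
    "coach": lambda move: 0.5,
    "critic": lambda move: 0.5,
}

demo_trace = score_round(["Opening claim", "Counter"], demo_heads)
```

Keeping the heads as separate callables means one can be swapped or dropped per mode without touching the others.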
Anatomy of a round

Six moves, four heads, one trace.

Score columns are per-head, per-move. The trace is what the participant takes home.

#   who  move                                                 P1    P2    P3    P4
01  A    Opening claim: X causes Y under conditions C.        0.72  0.41  0.66
02  B    Counter: prior P shows Y occurs without X.           0.81  0.74  0.70  why C?
03  A    Refines: only under C₁ does the causal link hold.    0.78  0.62  0.74
04  B    Probes the weakest claim: define C₁ operationally.   0.65  0.88  0.71
05  A    Operationalises C₁ with measurable threshold.        0.84  0.79  0.82
06  B    Accepts refinement, asks for an out-of-sample test.  0.86  0.83  0.78  OOS?
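
One way to read the trace by column instead of by row, sketched under an assumption: the three numbers on each row are taken to be P1, P2, P3 in header order, and the trailing text is taken to be the P4 prompt. The names `ROWS`, `head_column`, and `peak_move` are illustrative only.

```python
HEADS = ("P1", "P2", "P3")

# (move number, participant, (P1, P2, P3) scores, P4 prompt or None)
# Assumption: numeric columns follow the header order P1, P2, P3.
ROWS = [
    (1, "A", (0.72, 0.41, 0.66), None),
    (2, "B", (0.81, 0.74, 0.70), "why C?"),
    (3, "A", (0.78, 0.62, 0.74), None),
    (4, "B", (0.65, 0.88, 0.71), None),
    (5, "A", (0.84, 0.79, 0.82), None),
    (6, "B", (0.86, 0.83, 0.78), "OOS?"),
]


def head_column(rows, head):
    """All scores a single head gave, in move order."""
    i = HEADS.index(head)
    return [scores[i] for _, _, scores, _ in rows]


def peak_move(rows, head):
    """Number of the move on which the given head scored highest."""
    i = HEADS.index(head)
    return max(rows, key=lambda row: row[2][i])[0]


def socratic_prompts(rows):
    """Moves on which a P4 prompt was recorded, as (move number, prompt)."""
    return [(n, prompt) for n, _, _, prompt in rows if prompt is not None]
```

Under that column reading, the P2 column peaks on move 04, the move that probes the weakest claim. These helpers are reading aids for the trace, not a round score.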
Voice in the loop

A round you can talk into.

Audio-in, audio-out, end-to-end. The evaluation pipeline does not break when the medium changes.

01  mic         16 kHz capture           +0 ms
02  whisper L3  faster-whisper large-v3  +120 ms
03  evaluator   4-head scoring           +280 ms
04  response    reasoning trace          +720 ms
05  kokoro v1   speech synthesis         +920 ms
round-trip verified · STT and TTS run without breaking the 4-head pipeline
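
The timing markers can be turned into per-stage gaps with a few lines of arithmetic. This sketch assumes the `+N ms` figures are cumulative offsets from mic capture; the page does not say whether each marker is a stage's start or its end, so the gaps are simply the distance between consecutive markers. `STAGES` and `stage_gaps` are illustrative names.

```python
# (stage, offset in ms), assumed cumulative from mic capture.
STAGES = [
    ("mic", 0),
    ("whisper large-v3", 120),
    ("evaluator", 280),
    ("response", 720),
    ("kokoro v1", 920),
]


def stage_gaps(stages):
    """Milliseconds between each marker and the one before it."""
    return [
        (name, offset - prev_offset)
        for (_, prev_offset), (name, offset) in zip(stages, stages[1:])
    ]


gaps = stage_gaps(STAGES)
# The widest gap under this reading, as (stage, ms).
slowest = max(gaps, key=lambda gap: gap[1])
```

Read this way, the gaps are 120, 160, 440, and 200 ms, with the widest one before the response marker.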
Gym modes

Four formats, same pipeline.

Debate
rounds 2 vs 2
heads P1·P2·P3

structured argument under live scoring

Drill
rounds 1 vs RM
heads P1·P3

short repetitions on a single move

Voice
rounds live
heads P1·P2·P3·P4

audio-in / audio-out end-to-end

Socratic
rounds training
heads P4

asks the next question, never gives the answer
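The four formats can be written down as a small lookup table, which makes the "same pipeline, different heads" idea concrete. The values are copied from the cards above; the dictionary layout and the helper names `heads_for` and `modes_using` are assumptions made for illustration.

```python
HEAD_NAMES = {
    "P1": "Coach",
    "P2": "Critic",
    "P3": "Effort reward model",
    "P4": "Socratic",
}

# Mode -> round shape and active heads, as listed on the mode cards.
MODES = {
    "debate": {"rounds": "2 vs 2", "heads": ("P1", "P2", "P3")},
    "drill": {"rounds": "1 vs RM", "heads": ("P1", "P3")},
    "voice": {"rounds": "live", "heads": ("P1", "P2", "P3", "P4")},
    "socratic": {"rounds": "training", "heads": ("P4",)},
}


def heads_for(mode):
    """Human-readable names of the heads active in a mode."""
    return [HEAD_NAMES[h] for h in MODES[mode]["heads"]]


def modes_using(head):
    """Modes in which a given head is active, in card order."""
    return [mode for mode, cfg in MODES.items() if head in cfg["heads"]]
```

For example, `modes_using("P4")` returns the voice and socratic modes, the only two cards that list P4.
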

01

Voice in the loop

Palestra has a voice mode wired to a verified speech-to-text and text-to-speech path, so live debate exercises run end to end without breaking the evaluation pipeline.

02

Why it matters

Static benchmarks reward fluency. The gym rewards reasoning that survives a real exchange. We use it to study how teams of humans and models actually collaborate, with the evaluator heads visible to both sides.

A gym for reasoning that survives a real exchange.