Scores moves on whether they advance the participant's reasoning, not on whether they sound smart.
Palestra
A debate-and-effort gym where humans and models practise reasoning under live evaluation, not on a static benchmark.
A gym, not a benchmark
Palestra is the cognitive gym we use to study human–AI collaboration. It runs structured rounds — debates, drills, voice exercises — with multi-head evaluators for coaching, critique, effort, and socratic prompting. It is where we measure quality of reasoning under live conditions, not on a static benchmark.
Static benchmark
- static prompt set
- single-shot grade
- rewards fluency
- no rebuttal allowed
Gym
- live opponents
- scored over the trace
- rewards reasoning that survives a real exchange
- probe must land on the load-bearing claim
Four evaluator heads, one gym
Four heads, scored independently. The round score is the trace, not a single number.
Critique: an adversarial role that probes the weakest claim in each round.
Effort: separates effort from outcome, so the quality of reasoning is graded even when the answer happens to be lucky or unlucky.
Socratic: asks the next question instead of giving the next answer. Useful when the gym is being used to train rather than to evaluate.
Six moves, four heads, one trace.
Score columns are per-head, per-move. The trace is what the participant takes home.
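One way to picture the trace is as a small table with one row per move and one column per head. The sketch below is illustrative only: the head names come from the text, but the class names, the move labels, and the 0-to-1 score scale are assumptions, not Palestra's actual schema.

```python
from dataclasses import dataclass, field

# Head names come from the text; everything else is a hypothetical sketch.
HEADS = ("coaching", "critique", "effort", "socratic")


@dataclass
class MoveRow:
    move: str                 # hypothetical move label, e.g. "claim"
    scores: dict[str, float]  # one score per head (0-1 scale is assumed)


@dataclass
class Trace:
    rows: list[MoveRow] = field(default_factory=list)

    def add(self, move: str, scores: dict[str, float]) -> None:
        # Every head scores every move, so a row must cover exactly the four heads.
        if set(scores) != set(HEADS):
            raise ValueError(f"expected scores for {HEADS}, got {sorted(scores)}")
        self.rows.append(MoveRow(move, dict(scores)))

    def column(self, head: str) -> list[float]:
        # One head's scores across the round, in move order.
        return [row.scores[head] for row in self.rows]


trace = Trace()
trace.add("claim", {"coaching": 0.6, "critique": 0.4, "effort": 0.8, "socratic": 0.5})
trace.add("rebuttal", {"coaching": 0.7, "critique": 0.9, "effort": 0.7, "socratic": 0.6})

# Print the trace as a per-move, per-head table.
for row in trace.rows:
    print(row.move, [row.scores[head] for head in HEADS])
```

The sketch deliberately has no method that averages the columns into one figure, in keeping with the statement that the round score is the trace rather than a single number.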
A round you can talk into.
Audio-in, audio-out, end-to-end. The evaluation pipeline does not break when the medium changes.
Four formats, same pipeline.
- Debate: structured argument under live scoring
- Drill: short repetitions on a single move
- Voice: audio-in, audio-out, end to end
- Socratic: asks the next question, never gives the answer
Voice in the loop
Palestra has a voice mode wired to a verified speech-to-text and text-to-speech path, so live debate exercises run end to end without breaking the evaluation pipeline.
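A minimal sketch of how a voice round can reuse the text evaluation path unchanged: speech is transcribed, the transcript goes through the same scoring function a text round would use, and the reply is synthesised back to audio. Every name here is hypothetical, and the stub functions stand in for real speech-to-text, evaluator, and text-to-speech components; this is not Palestra's actual interface.

```python
from typing import Callable

Scores = dict[str, float]


def run_text_move(text: str, evaluate: Callable[[str], Scores]) -> Scores:
    # Text path: score the move directly.
    return evaluate(text)


def run_voice_move(
    audio: bytes,
    stt: Callable[[bytes], str],
    evaluate: Callable[[str], Scores],
    reply: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> tuple[Scores, bytes]:
    # Voice path: transcribe, reuse the text path as-is, then speak the reply.
    text = stt(audio)
    scores = run_text_move(text, evaluate)
    return scores, tts(reply(text))


# Stubs for illustration only (assumptions, not real engines).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def fake_evaluate(text: str) -> Scores:
    # Placeholder scoring rule: checks only whether a reason is offered.
    return {"effort": 1.0 if "because" in text else 0.0}


def fake_reply(text: str) -> str:
    return "What evidence supports that?"


utterance = "Taxes should rise because revenue is short"
voice_scores, spoken = run_voice_move(
    utterance.encode("utf-8"), fake_stt, fake_evaluate, fake_reply, fake_tts
)
text_scores = run_text_move(utterance, fake_evaluate)
```

Because the voice path calls `run_text_move` rather than duplicating it, the same utterance gets the same scores whether it arrives as text or as audio.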
Why it matters
Static benchmarks reward fluency. The gym rewards reasoning that survives a real exchange. We use it to study how teams of humans and models actually collaborate, with the evaluator heads visible to both sides.