Addestrare i modelli come bambini supera la distillazione in efficienza campionaria
Un trainer evolutivo che radica, corregge, lega e consolida apprende i concetti da molte meno esposizioni, e non li dimentica.
You cannot copy a connection from one brain into another, and no parent ever did. When a child learns a first word, they build their own connection from scratch; what makes the meaning converge is not the mechanism but the conditions — a shared world, thousands of closed-loop corrections, a persistent model of that world, real goals, real stakes.
Plain teacher-to-student distillation has none of that. It transfers the teacher’s output tokens, the surface form, not the grounding behind it — which is exactly why a student can match a teacher token-for-token on a corpus and still mean nothing by it. Atelier reproduces the conditions of convergence instead of copying the connections. It is a developmental trainer: a model perceives, produces, is corrected, binds the success into memory, and consolidates.
Two timescales, measured
The spine of the design is a complementary-learning-systems schedule: a fast, one-shot episodic store feeding a slow, generalising backbone. The fast store binds a grounded episode once, at high fidelity; the slow weights learn the invariant by replaying that episode internally, rather than by external repetition.
Measured against a single-timescale baseline, the schedule reaches competence in roughly half the exposures — a 1.9 to 2.0× sample-efficiency gain. The decisive control: switch the fast store off after upbringing, and the slow weights still name held-out, sensor-noised instances. Consolidation built durable competence, not a lookup table.
Lifelong, not amnesiac
A learner that improves by forgetting is not improving. Across multiple seeds, consolidating runs retain every prior world they were reared in, where amnesiac controls forget catastrophically. And the structure pays off where it should: slot-factored relational binding generalises to held-out role-swaps — “the dog chases the cat” versus “the cat chases the dog” — at 0.96 accuracy, against 0.17 for a flat byte-level memory. That is composition the model was never trained on.
Honest about the cause
There is a romantic version of this story in which a brand-new architecture is what makes grounding happen. We tested it, and we falsified it. A plain Transformer reared through the same developmental loop ties on naming and on continuity. The durable advantage is the rearing method and the training objective — grounding instead of shortcut — not the backbone.
A result that wins everywhere is usually graded dishonestly. We publish the axis where our method wins, and the one where the architecture does not.
Atelier design note
Why it matters
Sample efficiency, lifelong retention and grounded composition are exactly the properties you want when training is expensive, data is scarce, and a model has to keep learning after it ships. Atelier is the curriculum that turns the substrate and the structured memory into a usable model — and the place where we measure, honestly, whether a curriculum produces grounded behaviour or just a convincing imitation of it.