Systems · multi-region · resumes through faults

Resilient training across distributed clusters

Techniques for keeping frontier training stable when compute, data, and teams span continents.

Stability, reproducibility, and recovery are properties of the run. We design for them, not around them.

absorb faults deterministic data reproducible recovery
Why training is research too

Frontier training is a systems problem with a research budget

Stability, reproducibility, and recovery are properties of a training run. Treating them as systems problems instead of engineering chores is the difference between a finished model and a half-finished one. Our distributed-training work is published where it is not differentiating, so smaller labs can build on the same foundation.

Three properties we design for

What "resilient" means in practice

D1 Stability across heterogeneous compute

Hardware faults, network jitter, and partial failures are absorbed without restarting from scratch.

D2 Deterministic data

Versioned datasets, deterministic loading, and checkpoints that capture both weights and the training context that produced them.

D3 Reproducible recovery

Restoring from a checkpoint reproduces the same trajectory under the same conditions.

Multi-region cluster

Compute, data, and teams span continents.

A live run holds across four regions and ~46 nodes. The runtime treats partial failure as a re-scheduling problem, not a restart event.

46 total nodes
4 regions
1 live fault absorbed
0 restarts
What the runtime absorbs

Four classes of failure, none of which restart the run.

01
hardware fault
GPU SXM link drop
replicate · resume · continue
no restart
02
network jitter
cross-region latency spike
gradient backpressure · scheduler reslot
no restart
03
partial cluster loss
EU-west rack power event
shard reweight · 2 region failover
no restart
04
data shard skew
one shard yields NaN
shard quarantine · resample
no restart
Checkpoint anatomy

Weights are not enough.

A checkpoint captures weights and the context that produced them. Without that context, a restart is a guess.

weights tensor 1.2 TB
optimizer state tensor 480 GB
rng seeds context 48 KB
data offset context 8 KB
config hash context 64 B
cluster topology context 12 KB
commit sha provenance 40 B
all seven fields written together · loaded together
Open infrastructure

Where the work is not differentiating, we contribute it upstream.

01 fault-tolerant scheduler upstream contributed to ray + nccl ecosystems
02 deterministic data loader upstream sharded streaming · pinned offsets
03 checkpoint format + context upstream spec + reference reader
04 gradient-aware re-slotting internal differentiating · in-house
05 cross-region training runbook internal differentiating · in-house
01

Open infrastructure work

We contribute the parts of this stack that are not differentiating to upstream open-source projects. The differentiating parts stay in-house.

Training as a systems problem with a research budget.