Systems · multi-region · resumes through faults

Resilient training across distributed clusters

Techniques for keeping frontier training stable when compute, data, and teams span continents.

Stability, reproducibility, and recovery are properties of the run. We design for them, not around them.

absorb faults · deterministic data · reproducible recovery

Why training is research too

Frontier training is a systems problem with a research budget

Stability, reproducibility, and recovery are properties of a training run. Treating them as systems problems rather than engineering chores separates a finished model from a half-finished one. Where our distributed-training work is not differentiating, we publish it so smaller labs can build on the same foundation.

Three properties we design for

What "resilient" means in practice

D1 Stability across heterogeneous compute

Hardware faults, network jitter, and partial failures are absorbed without restarting from scratch.

D2 Deterministic data

Versioned datasets, deterministic loading, and checkpoints that capture both weights and the training context that produced them.

D3 Reproducible recovery

Restoring from a checkpoint reproduces the same trajectory under the same conditions.
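As a rough illustration of D2 and D3 together, the sketch below derives each epoch's sample order purely from a seed and an epoch number, so a loader restarted from a saved (epoch, offset) pair yields the same samples an uninterrupted run would have seen. The function names `epoch_order` and `stream` are hypothetical and are not taken from the loader described here; a real loader would also shard across workers.

```python
import random
from itertools import islice

def epoch_order(num_samples: int, seed: int, epoch: int) -> list[int]:
    """Return the sample order for one epoch, derived only from (seed, epoch)."""
    rng = random.Random(seed * 1_000_003 + epoch)  # private RNG, no global state
    order = list(range(num_samples))
    rng.shuffle(order)
    return order

def stream(num_samples: int, seed: int, epoch: int = 0, offset: int = 0):
    """Yield (epoch, offset, sample_id) forever, starting at a saved position."""
    while True:
        order = epoch_order(num_samples, seed, epoch)
        for i in range(offset, num_samples):
            yield epoch, i, order[i]
        epoch, offset = epoch + 1, 0  # the next epoch starts from its beginning

# Uninterrupted run: 25 samples over a 10-sample dataset.
full = [s for _, _, s in islice(stream(10, seed=7), 25)]

# Simulated fault after 12 samples: the checkpoint recorded epoch=1, offset=2.
resumed = [s for _, _, s in islice(stream(10, seed=7, epoch=1, offset=2), 13)]

assert resumed == full[12:]  # same trajectory after recovery
```

The design choice worth noting is that the order is recomputed from the seed rather than stored, so the only data-side state a checkpoint needs is the small (seed, epoch, offset) triple.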

Multi-region cluster

Compute, data, and teams span continents.

A live run spans four regions and 46 nodes. The runtime treats partial failure as a re-scheduling problem, not a restart event.

46 total nodes · 4 regions · 1 live fault absorbed · 0 restarts
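One way to picture "partial failure as a re-scheduling problem" is a function that takes the current shard-to-node assignment plus the set of failed nodes and returns a new assignment over the survivors. This is a minimal sketch under assumed names (`reslot`, and the node labels in the example); it balances by shard count only and does not reflect the in-house gradient-aware policy.

```python
def reslot(assignment: dict[str, list[int]], failed: set[str]) -> dict[str, list[int]]:
    """Move shards owned by failed nodes onto the least-loaded surviving nodes."""
    survivors = {n: list(s) for n, s in assignment.items() if n not in failed}
    if not survivors:
        raise RuntimeError("no surviving nodes; a restart is unavoidable")
    orphans = sorted(s for n in failed for s in assignment.get(n, []))
    for shard in orphans:
        # Ties are broken by node name so the result is deterministic.
        target = min(survivors, key=lambda n: (len(survivors[n]), n))
        survivors[target].append(shard)
    return survivors

# Hypothetical four-node layout, two shards per node.
assignment = {
    "us-east-0": [0, 1],
    "us-east-1": [2, 3],
    "eu-west-0": [4, 5],
    "ap-south-0": [6, 7],
}
after = reslot(assignment, failed={"eu-west-0"})
```

Every shard still has an owner after the call, so the run can continue from its current step instead of restarting.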

What the runtime absorbs

Four classes of failure, none of which restart the run.

failure class          example                      runtime response                           outcome
hardware fault         GPU SXM link drop            replicate · resume · continue              no restart
network jitter         cross-region latency spike   gradient backpressure · scheduler reslot   no restart
partial cluster loss   EU-west rack power event     shard reweight · 2-region failover         no restart
data shard skew        one shard yields NaN         shard quarantine · resample                no restart
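The "shard quarantine · resample" response can be sketched as a reader that cycles over shards, stops reading any shard that produces a non-finite value, and fills that slot from the next healthy shard. The function name `read_with_quarantine` and the toy shard contents are assumptions for illustration; a production loader would check whole tensors and log the quarantine.

```python
import math
from itertools import cycle

def read_with_quarantine(shards: dict[str, list[float]], steps: int):
    """Read `steps` finite values round-robin, quarantining shards that yield NaN/inf."""
    quarantined: set[str] = set()
    cursors = {name: 0 for name in shards}
    out: list[tuple[str, float]] = []
    names = cycle(sorted(shards))
    while len(out) < steps:
        if len(quarantined) == len(shards):
            raise RuntimeError("all shards quarantined")
        name = next(names)
        if name in quarantined:
            continue
        data = shards[name]
        value = data[cursors[name] % len(data)]
        cursors[name] += 1
        if not math.isfinite(value):
            quarantined.add(name)  # stop reading this shard for the rest of the run
            continue               # resample: the slot is filled by the next shard
        out.append((name, value))
    return out, quarantined

# Shard "b" is healthy at first, then yields NaN.
shards = {"a": [1.0, 2.0], "b": [0.5, float("nan")], "c": [3.0]}
values, bad = read_with_quarantine(shards, steps=8)
```

Values read from a shard before it went bad are kept in this sketch; whether to discard them is a policy decision the source does not specify.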

Checkpoint anatomy

Weights are not enough.

A checkpoint captures weights and the context that produced them. Without that context, a restart is a guess.

field              kind         size
weights            tensor       1.2 TB
optimizer state    tensor       480 GB
rng seeds          context      48 KB
data offset        context      8 KB
config hash        context      64 B
cluster topology   context      12 KB
commit sha         provenance   40 B

all seven fields written together · loaded together
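"Written together, loaded together" suggests an all-or-nothing manifest. The sketch below is one possible shape, not the published checkpoint format: it refuses to write unless all seven fields are present, writes to a temporary file, and renames it into place so a reader never sees a partial checkpoint. The key names, the `checkpoint.json` filename, and the use of JSON are assumptions; in practice the tensor fields would be references to separate tensor files rather than inline values.

```python
import json
import os
import tempfile

# Hypothetical key names mirroring the seven fields listed above.
REQUIRED = ("weights", "optimizer_state", "rng_seeds", "data_offset",
            "config_hash", "cluster_topology", "commit_sha")

def _check(fields: dict) -> None:
    missing = [k for k in REQUIRED if k not in fields]
    if missing:
        raise ValueError(f"partial checkpoint, missing: {missing}")

def save_checkpoint(directory: str, fields: dict) -> str:
    """Write all seven fields in one atomic step."""
    _check(fields)
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, "checkpoint.json")
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(fields, f)
        f.flush()
        os.fsync(f.fileno())          # make sure the bytes are on disk first
    os.replace(tmp_path, final_path)  # atomic rename on the same filesystem
    return final_path

def load_checkpoint(path: str) -> dict:
    """Load a checkpoint, rejecting it if any of the seven fields is absent."""
    with open(path) as f:
        fields = json.load(f)
    _check(fields)
    return fields
```

A caller would pass a dict with all seven keys to `save_checkpoint` and get the same dict back from `load_checkpoint`; omitting any key raises `ValueError` before anything touches disk.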

Open infrastructure

Where the work is not differentiating, we contribute it upstream.

01  fault-tolerant scheduler         upstream   contributed to ray + nccl ecosystems
02  deterministic data loader        upstream   sharded streaming · pinned offsets
03  checkpoint format + context      upstream   spec + reference reader
04  gradient-aware re-slotting       internal   differentiating · in-house
05  cross-region training runbook    internal   differentiating · in-house

Open infrastructure work

We contribute the parts of this stack that are not differentiating to upstream open-source projects. The differentiating parts stay in-house.

Training as a systems problem with a research budget.
