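RL-L1: a continuous-time model that matches the clinical state of the art at a fraction of the size
On real ICU data, this continuous-time model matches the standard clinical baseline while running far smaller and faster, which makes the case for treating time as an input rather than an afterthought.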
Almost every model in production today reads a sequence as if every step were one tick of a clock. Reality is not so tidy. A patient’s vitals arrive in bursts and silences; a sensor drops out and comes back; a market prints ten events in a second and then nothing for a minute. The gap between two observations is itself information — and most architectures throw it away.
RL-L1 is the first generation of our continuous-time line. It is built around a single conviction: the interval between two observations should change the computation. Timing is part of the input, not metadata bolted on afterwards. The result is a model that behaves correctly across irregular gaps, missing samples and live streams, and that is small and fast enough to run where a large attention model simply does not fit.
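To make "the interval should change the computation" concrete, here is a minimal sketch of a state that decays over the elapsed gap before absorbing a new observation. It is not RL-L1's architecture, which this post does not specify; the names `decay_step` and `run` and the parameters `tau` and `gain` are hypothetical, chosen only for illustration.

```python
import math

def decay_step(state, value, dt, tau=1.0, gain=0.5):
    """One update of a leaky state across an irregular gap.

    state: previous hidden value
    value: new observation
    dt:    time elapsed since the previous observation (same units as tau)
    tau:   hypothetical time constant; larger means slower forgetting
    gain:  hypothetical weight given to the new observation
    """
    if dt < 0:
        raise ValueError("dt must be non-negative")
    # Exact solution of dh/dt = -h / tau over a gap of length dt.
    keep = math.exp(-dt / tau)
    return keep * state + gain * value

def run(observations, tau=1.0, gain=0.5):
    """observations: list of (timestamp, value) pairs sorted by timestamp."""
    state, prev_t = 0.0, None
    for t, v in observations:
        dt = 0.0 if prev_t is None else t - prev_t
        state = decay_step(state, v, dt, tau, gain)
        prev_t = t
    return state

# Same three values, different timing: a burst versus long silences.
burst = [(0.0, 1.0), (0.1, 1.0), (0.2, 1.0)]
sparse = [(0.0, 1.0), (5.0, 1.0), (10.0, 1.0)]
print(run(burst), run(sparse))
```

The two sequences carry identical values, yet the final state differs because earlier evidence fades during the long gaps. A model that indexes by step count alone would treat them as the same input.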
The result on real clinical data
We benchmarked RL-L1 on PhysioNet 2012 in-hospital mortality — the canonical irregular-time-series task, built from the real, irregularly-sampled vitals of roughly four thousand ICU patients. It is the benchmark the GRU-D and latent-ODE literature is measured against, so it is the fair place to make a claim.
RL-L1 reaches 0.875 AUC as an 18-seed ensemble, with a 95% confidence interval of [0.868, 0.883]. That is a statistical tie with the strong GRU-D baseline at 0.874, which falls inside that interval. The result is reached with roughly five times fewer parameters than a fairly tuned Transformer, which lands at 0.77 on the same encoding. Matching the established clinical baseline is the honest framing; beating it outright is not something the interval supports, and we will not claim it.
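For readers who want to see how an ensemble AUC with a confidence interval can be produced, the sketch below averages per-seed scores and computes a percentile bootstrap interval. The post does not say how its interval was computed, so the percentile bootstrap is an assumption, and the data here is synthetic, not PhysioNet. The function names are hypothetical.

```python
import random

def auc(labels, scores):
    """ROC AUC via the pairwise (Mann-Whitney) identity; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos) * len(neg))

def ensemble(per_seed_scores):
    """Average scores across seeds, example by example."""
    n_seeds = len(per_seed_scores)
    return [sum(col) / n_seeds for col in zip(*per_seed_scores)]

def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC, resampling examples."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that miss a class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic stand-in: 200 examples, 18 noisy "seeds" of the same signal.
rng = random.Random(1)
labels = [1 if i % 4 == 0 else 0 for i in range(200)]
per_seed = [[0.4 * y + rng.gauss(0.0, 0.5) for y in labels] for _ in range(18)]
avg = ensemble(per_seed)
point = auc(labels, avg)
lo, hi = bootstrap_ci(labels, avg, n_boot=300)
print(f"AUC {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Under this kind of procedure, a baseline whose score falls inside the interval cannot be called beaten, which is the reading applied to GRU-D above.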
Timing is doing the work
The cleanest evidence that a model uses a signal is what happens when you remove it. Zero out the timing — feed RL-L1 the same observations on a regular clock — and accuracy collapses toward chance. The architecture is not getting its result from the values alone; it is getting it from when those values arrived. That ablation is the load-bearing experiment behind the whole line.
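The ablation itself is simple to state in code. The sketch below, with hypothetical helper names, keeps the values and their order but replaces the real timestamps with a regular clock. It illustrates the idea only and is not the evaluation harness used for RL-L1.

```python
def regular_clock(observations, step=1.0):
    """Keep values and order; replace timestamps with 0, step, 2*step, ..."""
    return [(i * step, v) for i, (_, v) in enumerate(observations)]

def gaps(observations):
    """Intervals between consecutive observations."""
    return [b[0] - a[0] for a, b in zip(observations, observations[1:])]

# Two hypothetical records with the same readings but different timing.
a = [(0.0, 98.0), (0.5, 97.0), (6.0, 91.0)]
b = [(0.0, 98.0), (3.0, 97.0), (3.2, 91.0)]

print(gaps(a), gaps(b))                      # the gaps differ before ablation
print(regular_clock(a) == regular_clock(b))  # the records coincide after it
```

After the ablation the two records are indistinguishable, so any accuracy that survives must come from the values alone. A large drop is evidence that the model was reading the gaps.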
"A model that quietly ignores time will look fine on a benchmark and fail at the bedside. We would rather measure the thing that matters." (RL-L1 design note)
Small, fast, and built for the edge
The same continuous-time front-end generalises beyond healthcare. On a frequency-sensitive detection task — spotting camcorder-style screen capture from the beat of a display refresh — a resonant variant of RL-L1 reaches 0.998 AUC with 2.7 times fewer parameters than the attention baseline it matches.
That efficiency is the point. On real hardware the runtime kernel runs 8.2 times faster than the reference loop at exact numerical parity, and the deployed model fits in tens of kilobytes. RL-L1 is designed for routers, set-top boxes, wearables and sensors — places where every parameter and every millijoule is counted, and where timing is usually the whole signal.
Where it does not win — and why we say so
RL-L1 is not a frontier-scale language model, and we do not present it as one. At large parameter counts the efficiency edge fades and a Transformer pulls ahead; on language, which has no irregular-time structure to exploit, the Transformer simply wins. We publish those negatives next to the positives on purpose.
Every claim here sits behind a pass/fail test with a threshold fixed before the run: a continuous-time model only “wins” if it wins where timing matters and loses honestly where it does not, and a clinical number only counts on a held-out, leakage-safe split. That discipline is what lets the 0.875 be trusted.
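One common way to get a leakage-safe split is to assign each patient, rather than each record, to a partition using a deterministic hash. The post does not describe how its split was built, so the sketch below is an assumption, and `assign_split`, `split_records` and the `salt` value are hypothetical.

```python
import hashlib

def assign_split(patient_id, test_fraction=0.2, salt="v1"):
    """Map a patient ID to 'train' or 'test' deterministically."""
    digest = hashlib.sha256(f"{salt}:{patient_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "test" if bucket < test_fraction else "train"

def split_records(records, test_fraction=0.2):
    """records: iterable of (patient_id, payload) pairs."""
    out = {"train": [], "test": []}
    for pid, payload in records:
        out[assign_split(pid, test_fraction)].append((pid, payload))
    return out

# Hypothetical data: 100 records from 20 patients, several records each.
records = [(f"p{i % 20}", i) for i in range(100)]
parts = split_records(records)
train_ids = {pid for pid, _ in parts["train"]}
test_ids = {pid for pid, _ in parts["test"]}
print(len(parts["train"]), len(parts["test"]), train_ids & test_ids)
```

Because the assignment depends only on the patient ID, every record from one patient lands on the same side, and the split can be fixed before any model is trained.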
What comes next
RL-L1 is in research preview, open to selected partners working on clinical monitoring, sensor intelligence and edge deployment. It is also the model half of a longer bet: a substrate program where the physics of the silicon is designed to reproduce the model’s continuous-time flow exactly, so computation and hardware converge on the same operator. RL-L1 is where that program first meets a usable model.