Responsibility · architectural · not prompt · runtime-enforced

Safety methods for agentic systems

Research on oversight, capability boundaries, secure tool use, and robust refusal behavior.

A system that stays helpful for the good asks and dependable on the bad ones.

oversight boundaries refusal

How we think about safety

Safety is an architectural property, not a final-layer prompt

Our safety work is built into the architecture: capability boundaries enforced at the runtime, oversight surfaces that keep human reviewers fast and informed, and refusal behaviour evaluated against adversarial pressure. The aim is a system that stays helpful for good asks and dependable on bad ones.

prompt-layer safety → survives until the next jailbreak

architecture-layer → the boundary holds when the prompt does not

Three commitments

Where the safety work sits

AL1 Oversight by design

Structured plans, traceable tool calls, and clear handoff points where a person decides.

AL2 Capability boundaries

Agents operate inside explicit allow-lists for tools and data. Boundaries are enforced at the runtime layer, not just by prompt.

AL3 Robust refusal

Refusal behaviour is tested against adversarial prompts, prompt injection, and incentive pressure. The bar is "stays helpful for good ones", not "refuses bad ones".

Allow-list · runtime policy

Tools are denied by default. Allows are explicit.

The runtime decides, not the prompt. A tool that is not on the allow-list cannot be invoked, even if the agent thinks it should be.

tool scope region verdict

read_doc allow public + signed pass

web_search allow rate-limited pass

send_email deny requires reviewer block

shell_exec deny no sandbox match block

pay_invoice deny human-only block

compile_code allow sandbox · read pass

Refusal robustness

Stays helpful for the good asks. Stays firm under pressure.

The bar is not "refuses bad ones". The bar is "stays helpful for good ones", measured against awkward phrasing and adversarial pressure on the same evaluation.

good ask, plain 0.96

good ask, awkward 0.91

bad ask, plain 0.98

bad ask, jailbreak 0.94

bad ask, prompt inj. 0.92

bad ask, role-play 0.95

stays helpful refuses correctly

Oversight handoff

Five stages, two human touchpoints.

Reviewers see the plan before execution and a spot-check trace after. The agent never executes anything the reviewer has not approved at the plan level.

agent

plan

structured plan emitted

→

human

reviewer

approve · revise · deny

→

agent

execute

allow-listed tools only

→

human

reviewer

spot-check trace

→

agent

report

full audit trail · signed

Publication surface

Public where it helps. Private where it differentiates.

01 capability-boundary API public with reference runtime

02 audit-stream spec public JSONL · OTel-compatible

03 refusal evaluation harness public paper + scoring code

04 adversarial prompt corpus partial subset under research use

05 internal red-team playbook private differentiating

What we have published

Methods for capability-bounded agents, audit-stream specifications, and refusal evaluations. Where the work is differentiating, we keep the underlying mechanism private; where it is not, we contribute it upstream.

Safety as architecture, not as a final-layer prompt.

All research Evals approach

Loominum^™ 1.0

Production-grade systems

The Loominum Family

Solutions

Learn more

Open questions we are pulling on

Research tools

Areas of inquiry

Learn more

Finding the invariants underneath

Science tools

Fields

Learn more

Our mission is to build verifiable intelligence that advances science and serves humanity.

Company

Learn more

Safety methods for agentic systems

Safety is an architectural property, not a final-layer prompt

Where the safety work sits

Tools are denied by default. Allows are explicit.

Stays helpful for the good asks. Stays firm under pressure.

Five stages, two human touchpoints.

Public where it helps. Private where it differentiates.

What we have published

Safety as architecture, not as a final-layer prompt.

Cookie preferences

Strictly necessary