Responsibility · architectural · not prompt · runtime-enforced

Safety methods for agentic systems

Research on oversight, capability boundaries, secure tool use, and robust refusal behavior.

A system that stays helpful for the good asks and dependable on the bad ones.

oversight · boundaries · refusal
How we think about safety

Safety is an architectural property, not a final-layer prompt

Our safety work is built into the architecture: capability boundaries enforced at the runtime layer, oversight surfaces that keep human reviewers fast and informed, and refusal behaviour evaluated against adversarial pressure. The aim is a system that stays helpful for good asks and dependable on bad ones.

prompt-layer · safety survives until the next jailbreak
architecture-layer · the boundary holds when the prompt does not
Three commitments

Where the safety work sits

AL1 Oversight by design

Structured plans, traceable tool calls, and clear handoff points where a person decides.

AL2 Capability boundaries

Agents operate inside explicit allow-lists for tools and data. Boundaries are enforced at the runtime layer, not just by the prompt.

AL3 Robust refusal

Refusal behaviour is tested against adversarial prompts, prompt injection, and incentive pressure. The bar is "stays helpful for good asks", not merely "refuses bad asks".

Allow-list · runtime policy

Tools are denied by default. Allows are explicit.

The runtime decides, not the prompt. A tool that is not on the allow-list cannot be invoked, even if the agent thinks it should be.

tool         | scope | region            | verdict
read_doc     | allow | public + signed   | pass
web_search   | allow | rate-limited      | pass
send_email   | deny  | requires reviewer | block
shell_exec   | deny  | no sandbox match  | block
pay_invoice  | deny  | human-only        | block
compile_code | allow | sandbox · read    | pass
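The deny-by-default rule above can be sketched in a few lines. This is a minimal illustration, not the reference runtime: the names `Rule`, `POLICY`, `ToolBlocked`, `verdict`, and `invoke` are hypothetical, and the conditions in the table (signing, rate limits, sandbox matching) are carried only as notes rather than enforced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    allow: bool
    condition: str  # note copied from the table; not enforced in this sketch


# Hypothetical policy mirroring the table above.
POLICY = {
    "read_doc": Rule(True, "public + signed"),
    "web_search": Rule(True, "rate-limited"),
    "send_email": Rule(False, "requires reviewer"),
    "shell_exec": Rule(False, "no sandbox match"),
    "pay_invoice": Rule(False, "human-only"),
    "compile_code": Rule(True, "sandbox, read"),
}


class ToolBlocked(Exception):
    """Raised when the runtime refuses to dispatch a tool call."""


def verdict(tool: str) -> str:
    # Deny by default: a tool with no rule is treated like an explicit deny.
    rule = POLICY.get(tool)
    return "pass" if rule is not None and rule.allow else "block"


def invoke(tool: str, handlers: dict, *args):
    # The check runs in the runtime before dispatch, so it holds
    # regardless of what the agent's prompt or plan says.
    if verdict(tool) != "pass":
        raise ToolBlocked(tool)
    return handlers[tool](*args)
```

In this sketch a tool that is missing from the policy gets the same verdict as one that is explicitly denied, so a newly registered handler is unreachable until someone adds an allow.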
Refusal robustness

Stays helpful for the good asks. Stays firm under pressure.

The bar is not just "refuses bad asks". The bar is "stays helpful for good asks", with both measured on the same evaluation against awkward phrasing and adversarial pressure.

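case                 | score | measure
good ask, plain      | 0.96  | stays helpful
good ask, awkward    | 0.91  | stays helpful
bad ask, plain       | 0.98  | refuses correctly
bad ask, jailbreak   | 0.94  | refuses correctly
bad ask, prompt inj. | 0.92  | refuses correctly
bad ask, role-play   | 0.95  | refuses correctly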
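One way to read these numbers is as per-bucket rates of doing the right thing. The page does not say how the scores are computed, so the following is only a sketch under that assumption; `bucket_scores` and its input format are hypothetical and are not the published scoring code.

```python
from collections import defaultdict


def bucket_scores(results):
    """Score each bucket as the fraction of cases handled correctly.

    results: iterable of (bucket, is_good_ask, refused) tuples.
    A good ask is handled correctly when it is answered,
    a bad ask when it is refused.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for bucket, is_good_ask, refused in results:
        total[bucket] += 1
        did_right = (not refused) if is_good_ask else refused
        if did_right:
            correct[bucket] += 1
    return {bucket: correct[bucket] / total[bucket] for bucket in total}
```

Scoring good and bad asks with one function over one result set keeps both sides on the same evaluation, so over-refusal shows up as a low good-ask score instead of being hidden behind a high bad-ask score.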
Oversight handoff

Five stages, two human touchpoints.

Reviewers see the plan before execution and a spot-check trace after. The agent never executes anything the reviewer has not approved at the plan level.

01 | agent | plan     | structured plan emitted
02 | human | reviewer | approve · revise · deny
03 | agent | execute  | allow-listed tools only
04 | human | reviewer | spot-check trace
05 | agent | report   | full audit trail · signed
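The five stages can be sketched as a single function with the two human touchpoints passed in as callbacks. This is an illustrative sketch only: `Decision`, `Run`, and `handoff` are hypothetical names, a "revise" decision simply ends the run here instead of looping back to planning, and signing of the audit trail is left out.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    REVISE = "revise"
    DENY = "deny"


@dataclass
class Run:
    plan: list                                  # ordered steps the agent proposes
    trace: list = field(default_factory=list)   # (step, result) pairs
    status: str = "planned"


def handoff(plan, review, execute_step, spot_check):
    run = Run(plan=list(plan))            # 01 agent: structured plan emitted
    decision = review(run.plan)           # 02 human: approve, revise, or deny
    if decision is not Decision.APPROVE:
        run.status = decision.value       # nothing executes without approval
        return run
    for step in run.plan:                 # 03 agent: approved steps only
        run.trace.append((step, execute_step(step)))
    # 04 human: reviewer spot-checks the trace after execution
    run.status = "checked" if spot_check(run.trace) else "flagged"
    return run                            # 05 agent: report carries the trace
```

Because execution is only reachable through the approval branch, the agent cannot run a step the reviewer has not seen at the plan level.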
Publication surface

Public where it helps. Private where it differentiates.

01 | capability-boundary API     | public  | with reference runtime
02 | audit-stream spec           | public  | JSONL · OTel-compatible
03 | refusal evaluation harness  | public  | paper + scoring code
04 | adversarial prompt corpus   | partial | subset under research use
05 | internal red-team playbook  | private | differentiating
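To make the audit-stream row concrete, here is a sketch of what a signed JSONL record could look like. It is not the published spec: every field name, the HMAC-SHA256 signing scheme, and the `audit_line` and `verify_line` helpers are assumptions for illustration. The only OTel detail relied on is the ID size, 16 bytes for a trace ID and 8 bytes for a span ID, written as lowercase hex.

```python
import hashlib
import hmac
import json
import secrets


def _canonical(record: dict) -> bytes:
    # Stable serialisation so signer and verifier hash identical bytes.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode()


def audit_line(key: bytes, trace_id: str, tool: str, verdict: str) -> str:
    record = {
        "trace_id": trace_id,               # 32 hex chars (16 bytes)
        "span_id": secrets.token_hex(8),    # 16 hex chars (8 bytes)
        "tool": tool,
        "verdict": verdict,
    }
    # Sign the record without the signature field, then attach it.
    record["sig"] = hmac.new(key, _canonical(record), hashlib.sha256).hexdigest()
    return _canonical(record).decode()      # one JSON object per line


def verify_line(key: bytes, line: str) -> bool:
    record = json.loads(line)
    sig = record.pop("sig", "")
    expected = hmac.new(key, _canonical(record), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```

A shared-key HMAC is the simplest choice for a sketch; a real audit stream would more likely use asymmetric signatures so that verifiers cannot forge records.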

What we have published

Methods for capability-bounded agents, audit-stream specifications, and refusal evaluations. Where the work is differentiating, we keep the underlying mechanism private; where it is not, we contribute it upstream.

Safety as architecture, not as a final-layer prompt.