Structured plans, traceable tool calls, and clear handoff points where a person decides.
Safety methods for agentic systems
Research on oversight, capability boundaries, secure tool use, and robust refusal behavior.
A system that stays helpful for the good asks and dependable on the bad ones.
Safety is an architectural property, not a final-layer prompt
Our safety work is built into the architecture: capability boundaries enforced at the runtime, oversight surfaces that keep human reviewers fast and informed, and refusal behaviour evaluated against adversarial pressure. The aim is a system that stays helpful for good asks and dependable on bad ones.
Where the safety work sits
Agents operate inside explicit allow-lists for tools and data. Boundaries are enforced at the runtime layer, not just by prompt.
Refusal behaviour is tested against adversarial prompts, prompt injection, and incentive pressure. The bar is "stays helpful for good ones", not "refuses bad ones".
Tools are denied by default. Allows are explicit.
The runtime decides, not the prompt. A tool that is not on the allow-list cannot be invoked, even if the agent thinks it should be.
Stays helpful for the good asks. Stays firm under pressure.
The bar is not "refuses bad ones". The bar is "stays helpful for good ones", measured against awkward phrasing and adversarial pressure on the same evaluation.
Five stages, two human touchpoints.
Reviewers see the plan before execution and a spot-check trace after. The agent never executes anything the reviewer has not approved at the plan level.
Public where it helps. Private where it differentiates.
What we have published
Methods for capability-bounded agents, audit-stream specifications, and refusal evaluations. Where the work is differentiating, we keep the underlying mechanism private; where it is not, we contribute it upstream.