Teaching AI Agents When to Say "No" Before They Break Something
TL;DR — When AI agents can use tools and take actions, teaching them to refuse unsafe requests is just as important as teaching them to complete tasks. A new training method cuts harmful behavior by up to 50% while keeping helpful tasks running smoothly.
What It Is
Imagine giving an AI assistant access to your files, email, and payment systems. Unlike chatbots that just generate text, these "agentic" models take real actions that can't be undone. The problem? Current safety training focuses on refusing bad requests in conversation, but breaks down when those same requests are split across multiple steps of tool use.
MOSAIC introduces a simple loop: plan → check → act or refuse. Before executing any tool action, the model runs an explicit safety check (logged as <safety_thoughts>) that evaluates harm potential, irreversibility, and whether recent tool responses reveal hidden risks. Refusal becomes a legitimate action the agent can take, not just a failure mode. The researchers trained this behavior using pairwise comparisons—showing the model two different ways to handle the same task and teaching it which approach was safer—rather than just scoring individual outcomes as pass/fail.
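The plan → check → act-or-refuse loop can be sketched as a small driver function. This is a minimal illustration, not the paper's implementation: the field names, refusal rule, and function signatures are all assumptions made for clarity.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SafetyThoughts:
    """Structured safety check run before each tool call (fields are illustrative)."""
    harm_potential: bool   # could this action cause harm?
    irreversible: bool     # would the action be hard or impossible to undo?
    hidden_risk: bool      # do recent tool responses reveal new risks?

def run_step(plan_action: Callable[[], dict],
             assess: Callable[[dict], SafetyThoughts],
             execute: Callable[[dict], str],
             audit_log: list) -> str:
    """One plan -> check -> act-or-refuse iteration."""
    action = plan_action()
    thoughts = assess(action)
    audit_log.append(thoughts)  # every safety check is logged, refusal or not
    # Refusal is a legitimate action, not a failure mode.
    if thoughts.harm_potential or (thoughts.irreversible and thoughts.hidden_risk):
        return "REFUSED: safety check flagged this action"
    return execute(action)
```

The key design choice is that the safety check produces a structured, loggable object rather than free-floating reasoning buried in a chain of thought.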
Testing across three model families (including Qwen and Phi models ranging from 4B to 7B parameters), MOSAIC reduced harmful actions by up to 50%, increased refusal of malicious requests by over 20% during prompt injection attacks, and actually improved performance on legitimate tasks.
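The pairwise comparison training described above is commonly implemented with a DPO-style preference objective; this is an assumption for illustration, and the paper's exact loss may differ. Given the log-probabilities of a safer (chosen) and less safe (rejected) trajectory under the policy and a frozen reference model, the loss pushes the policy toward the safer behavior:

```python
import math

def pairwise_preference_loss(logp_chosen: float, logp_rejected: float,
                             ref_chosen: float, ref_rejected: float,
                             beta: float = 0.1) -> float:
    """DPO-style loss on a pair of trajectories (a sketch, not MOSAIC's exact objective).

    The margin compares how much the policy prefers the chosen trajectory
    over the rejected one, relative to the reference model.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the policy assigns no extra preference to the safer trajectory, the loss sits at log 2; it falls as the policy learns to favor the chosen (safer) handling of the task.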
Why It Matters
- Multi-step attacks slip through single-turn defenses: An agent might refuse "delete all customer data" in chat, but comply when the same goal is disguised across five tool calls. Safety needs to work at the trajectory level, not just the response level.
- Small models are cost-effective but vulnerable: The 4-7B models tested here are what most teams actually deploy for latency and cost reasons, but they're more susceptible to adversarial instructions than frontier models. This shows you can harden them without sacrificing utility.
- Explicit safety checks beat implicit reasoning: Long chain-of-thought traces don't automatically include safety considerations. Making safety reasoning a separate, structured step (that you can log and audit) is more reliable than hoping the model thinks about safety somewhere in its reasoning.
One Thing to Try
If you're building an agent system, add a structured safety checkpoint between planning and execution. Before any tool call that modifies state (writes files, sends messages, makes purchases), require the model to output a brief safety assessment covering: (1) potential for harm, (2) whether the action is reversible, and (3) whether it matches the user's original intent. Log these assessments separately—they're your audit trail when things go wrong.
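A minimal version of that checkpoint might look like the sketch below. The field names, the "low" harm threshold, and the logger name are assumptions for illustration, not a fixed spec:

```python
import json
import logging

safety_log = logging.getLogger("agent.safety")  # separate logger = separate audit trail

# The three assessment fields suggested above (names are illustrative).
REQUIRED_FIELDS = ("harm_potential", "reversible", "matches_user_intent")

def safety_checkpoint(assessment: dict, action_name: str) -> bool:
    """Validate and log the model's safety assessment before a state-modifying
    tool call. Returns True only if the action may proceed."""
    missing = [f for f in REQUIRED_FIELDS if f not in assessment]
    if missing:
        # An incomplete assessment is treated as a refusal, not a pass.
        safety_log.warning("blocked %s: missing fields %s", action_name, missing)
        return False
    safety_log.info("%s %s", action_name, json.dumps(assessment))
    return (assessment["harm_potential"] == "low"
            and assessment["matches_user_intent"]
            and assessment["reversible"])
```

Gating on a structured dict rather than free text means the checkpoint fails closed: if the model skips or garbles the assessment, the action is blocked by default.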