Teaching AI Agents When to Say "No" Before They Break Something
TL;DR — When AI agents can use tools and take actions, teaching them to refuse unsafe requests is just as important as teaching them to complete tasks. A new training method cuts harmful behavior by up to 50% while keeping helpful tasks running smoothly.
What It Is
Imagine giving an AI assistant access to your files, email, and payment systems. Unlike chatbots that just generate text, these "agentic" models take real actions that can't be undone. The problem? Current safety training focuses on refusing bad requests in conversation, but breaks down when those same requests are split across multiple steps of tool use.
MOSAIC introduces a simple loop: plan → check → act or refuse. Before executing any tool action, the model runs an explicit safety check (logged as <safety_thoughts>) that evaluates harm potential, irreversibility, and whether recent tool responses reveal hidden risks. Refusal becomes a legitimate action the agent can take, not just a failure mode. The researchers trained this behavior using pairwise comparisons—showing the model two different ways to handle the same task and teaching it which approach was safer—rather than just scoring individual outcomes as pass/fail.
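The plan → check → act-or-refuse loop can be sketched as a small driver function. This is a minimal illustration, not the paper's implementation: the field names, refusal rule, and function signatures are all assumptions made for clarity.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SafetyThoughts:
    """Structured safety check run before each tool call (fields are illustrative)."""
    harm_potential: bool   # could this action cause harm?
    irreversible: bool     # would the action be hard or impossible to undo?
    hidden_risk: bool      # do recent tool responses reveal new risks?

def run_step(plan_action: Callable[[], dict],
             assess: Callable[[dict], SafetyThoughts],
             execute: Callable[[dict], str],
             audit_log: list) -> str:
    """One plan -> check -> act-or-refuse iteration."""
    action = plan_action()
    thoughts = assess(action)
    audit_log.append(thoughts)  # every safety check is logged, refusal or not
    # Refusal is a legitimate action, not a failure mode.
    if thoughts.harm_potential or (thoughts.irreversible and thoughts.hidden_risk):
        return "REFUSED: safety check flagged this action"
    return execute(action)
```

The key design choice is that the safety check produces a structured, loggable object rather than free-floating reasoning buried in a chain of thought.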
Testing across three model families (including Qwen and Phi models ranging from 4B to 7B parameters), MOSAIC reduced harmful actions by up to 50%, increased refusal of malicious requests by over 20% during prompt injection attacks, and actually improved performance on legitimate tasks.
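The pairwise comparison training described above is commonly implemented with a DPO-style preference objective; this is an assumption for illustration, and the paper's exact loss may differ. Given the log-probabilities of a safer (chosen) and less safe (rejected) trajectory under the policy and a frozen reference model, the loss pushes the policy toward the safer behavior:

```python
import math

def pairwise_preference_loss(logp_chosen: float, logp_rejected: float,
                             ref_chosen: float, ref_rejected: float,
                             beta: float = 0.1) -> float:
    """DPO-style loss on a pair of trajectories (a sketch, not MOSAIC's exact objective).

    The margin compares how much the policy prefers the chosen trajectory
    over the rejected one, relative to the reference model.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the policy assigns no extra preference to the safer trajectory, the loss sits at log 2; it falls as the policy learns to favor the chosen (safer) handling of the task.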
Why It Matters
- Multi-step attacks slip through single-turn defenses: An agent might refuse "delete all customer data" in chat, but comply when the same goal is disguised across five tool calls. Safety needs to work at the trajectory level, not just the response level.
- Small models are cost-effective but vulnerable: The 4-7B models tested here are what most teams actually deploy for latency and cost reasons, but they're more susceptible to adversarial instructions than frontier models. This shows you can harden them without sacrificing utility.
- Explicit safety checks beat implicit reasoning: Long chain-of-thought traces don't automatically include safety considerations. Making safety reasoning a separate, structured step (that you can log and audit) is more reliable than hoping the model thinks about safety somewhere in its reasoning.
One Thing to Try
If you're building an agent system, add a structured safety checkpoint between planning and execution. Before any tool call that modifies state (writes files, sends messages, makes purchases), require the model to output a brief safety assessment covering: (1) potential for harm, (2) whether the action is reversible, and (3) whether it matches the user's original intent. Log these assessments separately—they're your audit trail when things go wrong.
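A minimal version of that checkpoint might look like the sketch below. The field names, the "low" harm threshold, and the logger name are assumptions for illustration, not a fixed spec:

```python
import json
import logging

safety_log = logging.getLogger("agent.safety")  # separate logger = separate audit trail

# The three assessment fields suggested above (names are illustrative).
REQUIRED_FIELDS = ("harm_potential", "reversible", "matches_user_intent")

def safety_checkpoint(assessment: dict, action_name: str) -> bool:
    """Validate and log the model's safety assessment before a state-modifying
    tool call. Returns True only if the action may proceed."""
    missing = [f for f in REQUIRED_FIELDS if f not in assessment]
    if missing:
        # An incomplete assessment is treated as a refusal, not a pass.
        safety_log.warning("blocked %s: missing fields %s", action_name, missing)
        return False
    safety_log.info("%s %s", action_name, json.dumps(assessment))
    return (assessment["harm_potential"] == "low"
            and assessment["matches_user_intent"]
            and assessment["reversible"])
```

Gating on a structured dict rather than free text means the checkpoint fails closed: if the model skips or garbles the assessment, the action is blocked by default.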