Teaching AI Agents to Explore Like They Have a Notebook: How Memory Unlocks Better Problem-Solving
LLM Agents Keep Getting Stuck. Giving Them Memory and Learning Fixed That.

TL;DR — A new training approach lets AI agents both remember past attempts and learn from them simultaneously, roughly doubling performance on a complex exploration benchmark and letting agents tackle new problems without any retraining.

What It Is

When you train an LLM to act as an agent (think: booking flights, running science experiments, shopping online), it needs to explore and discover what works. Current approaches hit a wall: they either give agents a notebook to remember past attempts or they update the model's parameters through reinforcement learning, but not both effectively.

EMPO² (Exploratory Memory-Augmented On- and Off-Policy Optimization) does both at once. During training, the agent sometimes uses its memory of past experiences and sometimes doesn't. It then learns in two ways: from fresh experiences it just collected (on-policy) and from older memories (off-policy). This dual approach means the agent gets better at using memory when it's available, but also becomes robust enough to work without it.
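The core mechanic can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual algorithm: the mixing probability, the memory format, and the success criterion are all placeholder assumptions.

```python
import random

def collect_rollouts(agent_policy, memory, n_rollouts, p_memory=0.5):
    """Collect trajectories, conditioning on memory only part of the time.

    `agent_policy` stands in for a single LLM-agent episode; `p_memory=0.5`
    is an arbitrary illustrative mixing rate, not a paper hyperparameter.
    """
    fresh = []  # on-policy data gathered this round
    for _ in range(n_rollouts):
        use_memory = random.random() < p_memory
        context = memory if use_memory else []   # sometimes hide the notebook
        traj = agent_policy(context)             # run one episode
        fresh.append((traj, use_memory))
        if traj.get("success"):
            memory.append(traj)                  # keep successes for replay
    return fresh, memory
```

Because some rollouts see the memory and some don't, the policy gradient gets credit signals for both modes, which is what makes the trained agent usable with or without its notebook at deployment time.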

On benchmark tasks like navigating a science simulation and online shopping, EMPO² improved performance by 128% and 11% respectively over the previous best method. More impressively, when thrown into completely new tasks, the trained agent could figure them out in just a few tries using only memory—no retraining needed.

Why It Matters

  • Your agents won't plateau as fast. Most RL training for LLM agents converges too early because they stop exploring. This framework keeps agents curious longer, finding solutions that pure exploitation misses.
  • Better transfer to new tasks. An agent trained this way learns how to use memory to explore, not just what worked before. That means when users throw unexpected requests at your agent, it adapts faster.
  • Memory becomes a feature, not a crutch. The training ensures agents work decently without memory but excel with it—giving you deployment flexibility based on latency and cost constraints.

One Thing to Try

If you're training agents with RL, run some rollouts with access to a context window containing past successful trajectories and some without. Use both sets of experiences to update your model. Even a simple version of this hybrid approach can prevent premature convergence on tasks requiring exploration.
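A minimal version of that hybrid update might look like the following. The `policy_update_fn` is a stand-in for whatever RL step you already run (e.g. a PPO or GRPO update); the replay size and batch layout are assumptions for illustration only.

```python
import random

def hybrid_update(policy_update_fn, fresh_batch, memory, k_replay=8):
    """One training step mixing on-policy and off-policy experience.

    `fresh_batch` holds trajectories just collected by the current policy;
    `memory` holds older successful trajectories. `k_replay=8` is an
    arbitrary illustrative replay size.
    """
    replay = random.sample(memory, min(k_replay, len(memory)))
    batch = [(t, "on_policy") for t in fresh_batch] + \
            [(t, "off_policy") for t in replay]
    random.shuffle(batch)           # avoid ordering bias in the update
    return policy_update_fn(batch)  # your existing RL step goes here
```

Tagging each trajectory with its origin lets the update weight the two sources differently (for instance, importance-weighting the off-policy replay), which is usually where the tuning effort goes.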
