Teaching AI Agents to Explore Like They Have a Notebook: How Memory Unlocks Better Problem-Solving
LLM Agents Keep Getting Stuck. Giving Them Memory and Learning Fixed That.

TL;DR — A new training approach lets AI agents both remember past attempts and learn from them simultaneously, roughly doubling performance on a complex exploration benchmark and letting agents tackle new problems without any retraining.

What It Is

When you train an LLM to act as an agent (think: booking flights, running science experiments, shopping online), it needs to explore and discover what works. Current approaches hit a wall: they either give agents a notebook to remember past attempts or they update the model's parameters through reinforcement learning, but not both effectively.

EMPO² (Exploratory Memory-Augmented On- and Off-Policy Optimization) does both at once. During training, the agent sometimes uses its memory of past experiences and sometimes doesn't. It then learns in two ways: from fresh experiences it just collected (on-policy) and from older memories (off-policy). This dual approach means the agent gets better at using memory when it's available, but also becomes robust enough to work without it.
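The core mechanic can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual algorithm: the mixing probability, the memory format, and the success criterion are all placeholder assumptions.

```python
import random

def collect_rollouts(agent_policy, memory, n_rollouts, p_memory=0.5):
    """Collect trajectories, conditioning on memory only part of the time.

    `agent_policy` stands in for a single LLM-agent episode; `p_memory=0.5`
    is an arbitrary illustrative mixing rate, not a paper hyperparameter.
    """
    fresh = []  # on-policy data gathered this round
    for _ in range(n_rollouts):
        use_memory = random.random() < p_memory
        context = memory if use_memory else []   # sometimes hide the notebook
        traj = agent_policy(context)             # run one episode
        fresh.append((traj, use_memory))
        if traj.get("success"):
            memory.append(traj)                  # keep successes for replay
    return fresh, memory
```

Because some rollouts see the memory and some don't, the policy gradient gets credit signals for both modes, which is what makes the trained agent usable with or without its notebook at deployment time.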

On benchmark tasks like navigating a science simulation and online shopping, EMPO² improved performance by 128% and 11% respectively over the previous best method. More impressively, when thrown into completely new tasks, the trained agent could figure them out in just a few tries using only memory—no retraining needed.

Why It Matters

  • Your agents won't plateau as fast. Most RL training for LLM agents converges too early because they stop exploring. This framework keeps agents curious longer, finding solutions that pure exploitation misses.
  • Better transfer to new tasks. An agent trained this way learns how to use memory to explore, not just what worked before. That means when users throw unexpected requests at your agent, it adapts faster.
  • Memory becomes a feature, not a crutch. The training ensures agents work decently without memory but excel with it—giving you deployment flexibility based on latency and cost constraints.

One Thing to Try

If you're training agents with RL, run some rollouts with access to a context window containing past successful trajectories and some without. Use both sets of experiences to update your model. Even a simple version of this hybrid approach can prevent premature convergence on tasks requiring exploration.
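A minimal version of that hybrid update might look like the following. The `policy_update_fn` is a stand-in for whatever RL step you already run (e.g. a PPO or GRPO update); the replay size and batch layout are assumptions for illustration only.

```python
import random

def hybrid_update(policy_update_fn, fresh_batch, memory, k_replay=8):
    """One training step mixing on-policy and off-policy experience.

    `fresh_batch` holds trajectories just collected by the current policy;
    `memory` holds older successful trajectories. `k_replay=8` is an
    arbitrary illustrative replay size.
    """
    replay = random.sample(memory, min(k_replay, len(memory)))
    batch = [(t, "on_policy") for t in fresh_batch] + \
            [(t, "off_policy") for t in replay]
    random.shuffle(batch)           # avoid ordering bias in the update
    return policy_update_fn(batch)  # your existing RL step goes here
```

Tagging each trajectory with its origin lets the update weight the two sources differently (for instance, importance-weighting the off-policy replay), which is usually where the tuning effort goes.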
