distilled

Language Models That Learn from Their Mistakes in the Real World

Santthosh Selvadurai

20 Mar 2026

Your LLM Could Learn From Its Mistakes — If You Let It

TL;DR — Most language models are frozen after training, wasting all the experience they gain from real users. Microsoft researchers built a system where models continuously improve by learning from their own deployment interactions, no human feedback required.

What It Is

Right now, when you deploy an LLM, it's basically frozen in time. It makes mistakes, gets corrections, sees what works — then forgets everything when the conversation ends. Online Experiential Learning (OEL) changes this by creating a learning loop that runs during deployment.

Here's how it works: As your model interacts with users or environments, it collects transcripts of what happened. Then, instead of just storing raw logs, the system extracts "experiential knowledge" — essentially, lessons learned from those interactions. Finally, it trains the model to internalize these lessons so it doesn't need to look them up every time. The improved model goes back into production, collects better experiences, extracts better lessons, and the cycle continues.

The clever bit? This all happens without reward scores, human labelers, or access to the original environment where the model was deployed. It just needs text transcripts of what the model tried and what happened.
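
The loop described above can be sketched in a few lines. This is a minimal illustration, not the researchers' implementation: `run_oel_rounds` and the three callables (`collect`, `extract_lessons`, `consolidate`) are hypothetical names, and since this post does not say how extraction or training is actually performed, they are left as pluggable functions.

```python
from typing import Callable, List, TypeVar

Model = TypeVar("Model")


def run_oel_rounds(
    model: Model,
    collect: Callable[[Model], List[str]],
    extract_lessons: Callable[[List[str]], List[str]],
    consolidate: Callable[[Model, List[str]], Model],
    rounds: int,
) -> Model:
    """Repeat the collect -> extract -> consolidate cycle `rounds` times.

    All three callables are hypothetical placeholders:
      collect(model)               -> text transcripts gathered in deployment
      extract_lessons(transcripts) -> lessons distilled from those transcripts
      consolidate(model, lessons)  -> a new model trained to internalize them
    """
    for _ in range(rounds):
        transcripts = collect(model)          # plain text only, no reward scores
        lessons = extract_lessons(transcripts)
        model = consolidate(model, lessons)   # the updated model is redeployed
    return model
```

Each round feeds the updated model back into `collect`, which is what lets later rounds gather better experiences than earlier ones.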

Why It Matters

Your deployment data becomes training data automatically. Instead of the standard "train once, deploy forever" approach, you can continuously improve models using the interactions they're already having. No need to periodically collect human annotations or build elaborate simulation environments.
It gets more efficient over time, not just more accurate. In the researchers' tests, models did not only improve task performance; they also generated shorter, more focused responses as they learned. Shorter responses mean fewer generated tokens, so inference cost drops as the model learns what works.
It works without the usual RL infrastructure. You don't need reward models, human raters scoring outputs, or verifiable scoring functions. If your environment gives any kind of text feedback (error messages, success confirmations, user corrections), that's enough signal to learn from.

One Thing to Try

Start logging interaction trajectories for your deployed models right now, even if you're not ready to implement online learning. Capture the model's outputs alongside any signal of success or failure — whether that's explicit user feedback, downstream system responses, or task completion indicators. When you're ready to retrain, you'll have a goldmine of real-world experience to learn from instead of relying solely on synthetic data or expensive human annotations.
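
As one possible starting point, the sketch below appends each interaction to a JSON Lines file. The `log_trajectory` and `load_trajectories` helpers, the field names (`prompt`, `output`, `feedback`, `outcome`), and the file layout are all assumptions made for illustration; adapt them to whatever your stack already records.

```python
import json
import time
from pathlib import Path
from typing import Optional


def log_trajectory(
    path: Path,
    prompt: str,
    output: str,
    feedback: Optional[str] = None,
    outcome: Optional[str] = None,
) -> dict:
    """Append one interaction record to a JSON Lines file and return it.

    feedback: raw text signal, e.g. an error message or a user correction.
    outcome:  coarse label such as "success" or "failure", if one is known.
    """
    record = {
        "timestamp": time.time(),
        "prompt": prompt,
        "output": output,
        "feedback": feedback,
        "outcome": outcome,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def load_trajectories(path: Path) -> list:
    """Read all records back, one JSON object per line."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Writing one record per line means production code only ever appends, and a later retraining job can read the file back without any special parsing. Keeping `feedback` as raw text, rather than reducing it to a score, preserves the kind of signal the approach described here relies on.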

