Why 0.01% of Tokens Are Breaking Your LLM Training (And How to Fix It)

TL;DR — When training language models with reinforcement learning, roughly 1 in 10,000 tokens receives a wildly oversized gradient update that destabilizes everything. Masking just these tokens stabilizes training and improves accuracy.

What It Is

Researchers at Tsinghua and DiDi figured out why reinforcement learning for LLMs often crashes late in training. They traced the problem to "spurious tokens" — rare, low-confidence words that sneak into otherwise correct answers. Think of a model writing "The answer is 42... uh... definitely 42" where that "uh" is so unexpected it gets a massive gradient update, even though it contributed nothing to getting the right answer.

The math shows these tokens get disproportionate updates because gradient magnitude scales inversely with both token probability and entropy (a measure of how uncertain the model is — low entropy means high confidence). When a rare token appears in a correct response, it inherits the full reward for that response, producing an amplified update that throws training off course. The solution, called STAPO, simply masks updates for tokens that are both rare AND low-entropy, then redistributes the learning signal across the remaining tokens. This touches only 0.01% of tokens yet improves performance by 7% on average.
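The masking-and-redistribution idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name and both thresholds (`prob_threshold`, `entropy_threshold`) are hypothetical placeholders you would tune on your own runs.

```python
import numpy as np

def stapo_style_mask(token_probs, token_entropies, advantages,
                     prob_threshold=1e-4, entropy_threshold=0.1):
    """Sketch of spurious-token masking: zero out the update for tokens
    that are both very rare (low probability) AND low-entropy, then
    rescale the surviving advantages so the total learning signal is
    preserved. Thresholds are illustrative, not the paper's values."""
    token_probs = np.asarray(token_probs, dtype=float)
    token_entropies = np.asarray(token_entropies, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # A token is "spurious" when it is rare AND the model was confident.
    spurious = (token_probs < prob_threshold) & (token_entropies < entropy_threshold)
    masked = np.where(spurious, 0.0, advantages)

    # Redistribute: rescale kept tokens so total signal mass is unchanged.
    kept_mass = np.abs(masked).sum()
    orig_mass = np.abs(advantages).sum()
    if kept_mass > 0:
        masked = masked * (orig_mass / kept_mass)
    return masked, spurious
```

In a real PPO/GRPO-style loop you would apply this per response before computing the policy-gradient loss, so the handful of masked tokens contribute nothing while the rest absorb their share of the signal.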

Why It Matters

  • Your RL training might be failing for a fixable reason — if you're seeing late-stage collapse where your model suddenly starts generating garbage, this tiny fraction of tokens could be the culprit, not your learning rate or regularization scheme
  • Existing stability tricks are Band-Aids — methods like entropy regularization or advantage clipping try to smooth over the whole distribution, but the real problem is localized to a vanishingly small set of pathological tokens
  • This scales across model sizes — the technique worked consistently on 1.7B, 8B, and 14B parameter models, suggesting it's addressing a fundamental issue in how RL credit assignment works at the token level

One Thing to Try

If you're doing RL fine-tuning and tracking entropy during training, plot token-level gradient magnitudes against token probability. If you see a spike in gradient size for very low-probability tokens in your positive-reward samples, you've found your spurious tokens. Start by logging what percentage of your tokens fall into the low-probability, low-entropy regime — if it's around 0.01-0.1%, those are prime candidates for masking.
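A minimal diagnostic for that logging step might look like the following. It assumes you can get raw logits for each sampled token; the function names and thresholds are hypothetical starting points, not anything prescribed by the paper.

```python
import numpy as np

def token_stats_from_logits(logits, token_ids):
    """Per-token probability and full-distribution entropy from raw logits.
    logits: [seq_len, vocab_size]; token_ids: [seq_len] sampled tokens."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    token_probs = probs[np.arange(len(token_ids)), token_ids]
    entropies = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return token_probs, entropies

def spurious_rate(token_probs, entropies, p_thresh=1e-4, h_thresh=0.1):
    """Fraction of tokens in the low-probability, low-entropy regime."""
    flagged = (np.asarray(token_probs) < p_thresh) & (np.asarray(entropies) < h_thresh)
    return flagged.mean()
```

Logging `spurious_rate` per batch over training gives you the 0.01–0.1% signal described above; a sudden rise in that rate on positive-reward samples is a warning sign worth investigating before collapse sets in.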
