Why 0.01% of Tokens Are Breaking Your LLM Training (And How to Fix It)

TL;DR — When training language models with reinforcement learning, roughly 1 in 10,000 tokens receives a wildly oversized gradient update that destabilizes everything. Masking just these tokens stabilizes training and improves accuracy.

What It Is

Researchers at Tsinghua and DiDi figured out why reinforcement learning for LLMs often crashes late in training. They traced the problem to "spurious tokens" — rare, low-confidence words that sneak into otherwise correct answers. Think of a model writing "The answer is 42... uh... definitely 42" where that "uh" is so unexpected it gets a massive gradient update, even though it contributed nothing to getting the right answer.

The math shows these tokens get disproportionate updates because gradient magnitude scales inversely with both token probability and entropy (a measure of how uncertain the model is — low entropy means high confidence). When a rare token appears in a correct response, it inherits the full reward for that response, producing an amplified update that throws training off course. The solution, called STAPO, simply masks updates for tokens that are both rare AND low-entropy, then redistributes the learning signal across the remaining tokens. This touches only 0.01% of tokens yet improves performance by 7% on average.
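The masking-and-redistribution idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name and both thresholds (`prob_threshold`, `entropy_threshold`) are hypothetical placeholders you would tune on your own runs.

```python
import numpy as np

def stapo_style_mask(token_probs, token_entropies, advantages,
                     prob_threshold=1e-4, entropy_threshold=0.1):
    """Sketch of spurious-token masking: zero out the update for tokens
    that are both very rare (low probability) AND low-entropy, then
    rescale the surviving advantages so the total learning signal is
    preserved. Thresholds are illustrative, not the paper's values."""
    token_probs = np.asarray(token_probs, dtype=float)
    token_entropies = np.asarray(token_entropies, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # A token is "spurious" when it is rare AND the model was confident.
    spurious = (token_probs < prob_threshold) & (token_entropies < entropy_threshold)
    masked = np.where(spurious, 0.0, advantages)

    # Redistribute: rescale kept tokens so total signal mass is unchanged.
    kept_mass = np.abs(masked).sum()
    orig_mass = np.abs(advantages).sum()
    if kept_mass > 0:
        masked = masked * (orig_mass / kept_mass)
    return masked, spurious
```

In a real PPO/GRPO-style loop you would apply this per response before computing the policy-gradient loss, so the handful of masked tokens contribute nothing while the rest absorb their share of the signal.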

Why It Matters

  • Your RL training might be failing for a fixable reason — if you're seeing late-stage collapse where your model suddenly starts generating garbage, this tiny fraction of tokens could be the culprit, not your learning rate or regularization scheme
  • Existing stability tricks are Band-Aids — methods like entropy regularization or advantage clipping try to smooth over the whole distribution, but the real problem is localized to a vanishingly small set of pathological tokens
  • This scales across model sizes — the technique worked consistently on 1.7B, 8B, and 14B parameter models, suggesting it's addressing a fundamental issue in how RL credit assignment works at the token level

One Thing to Try

If you're doing RL fine-tuning and tracking entropy during training, plot token-level gradient magnitudes against token probability. If you see a spike in gradient size for very low-probability tokens in your positive-reward samples, you've found your spurious tokens. Start by logging what percentage of your tokens fall into the low-probability, low-entropy regime — if it's around 0.01-0.1%, those are prime candidates for masking.
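A minimal diagnostic for that logging step might look like the following. It assumes you can get raw logits for each sampled token; the function names and thresholds are hypothetical starting points, not anything prescribed by the paper.

```python
import numpy as np

def token_stats_from_logits(logits, token_ids):
    """Per-token probability and full-distribution entropy from raw logits.
    logits: [seq_len, vocab_size]; token_ids: [seq_len] sampled tokens."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    token_probs = probs[np.arange(len(token_ids)), token_ids]
    entropies = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return token_probs, entropies

def spurious_rate(token_probs, entropies, p_thresh=1e-4, h_thresh=0.1):
    """Fraction of tokens in the low-probability, low-entropy regime."""
    flagged = (np.asarray(token_probs) < p_thresh) & (np.asarray(entropies) < h_thresh)
    return flagged.mean()
```

Logging `spurious_rate` per batch over training gives you the 0.01–0.1% signal described above; a sudden rise in that rate on positive-reward samples is a warning sign worth investigating before collapse sets in.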
