Teaching AI to Debug Code Like a Real Developer

Teaching LLMs to Think Like a Debugger, Not Just an Interpreter

TL;DR: Researchers trained language models to simulate debugger commands such as breakpoints and "step over," rather than just executing code line by line. This lets the models jump around in a program's execution and even work backwards from outputs to infer inputs.

What It Is

Most LLMs trained on code execution learn to predict what happens line by line, like watching a movie from start to finish. But that's not how developers actually debug: you set breakpoints, skip over uninteresting functions, and jump straight to the parts that matter.

The researchers built "neural debuggers" by training models on execution traces that include debugger actions. The models learn to predict program state after commands like "step into this function" or "run until line 47." They can even run execution in reverse: given a program's output, they can infer what input produced it. The researchers fine-tuned a 32B-parameter model and trained a smaller 1.8B model from scratch; both achieved over 90% accuracy at predicting the next program state after a debugger command.
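To make the idea concrete, here is a minimal sketch of what one such training example might look like: a program, a variable state, a debugger command, and the state after that command. The field names and command vocabulary are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical training example for a "neural debugger":
# (program, state, debugger command) -> next state.
from dataclasses import dataclass, field

@dataclass
class DebugStep:
    source: str       # the program being debugged
    state: dict       # variable bindings before the command
    command: str      # e.g. "step_over", "step_into", "continue_to:47"
    next_state: dict  # variable bindings after the command

example = DebugStep(
    source="def f(x):\n    y = x * 2\n    return y + 1",
    state={"x": 3},
    command="step_over",  # execute `y = x * 2` as one atomic step
    next_state={"x": 3, "y": 6},
)

# A model trained on such tuples must learn the *effect* of the command,
# not merely the next line of source text.
assert example.next_state == {"x": 3, "y": 6}
```

The key point is that the supervision target is the post-command state, so "step over" forces the model to summarize an entire function call's effect in one jump.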

Why It Matters

  • Faster debugging assistance: Instead of re-running entire programs to test fixes, an AI assistant could instantly simulate "what if I change this variable here?" without actual execution, which is useful for slow integration tests or cloud deployments.
  • Works on broken code: Traditional debuggers need executable code. Neural debuggers can simulate execution even for incomplete or buggy programs, making them useful for code completion and repair scenarios where you're working with fragments.
  • Inverse execution unlocks new capabilities: The ability to work backwards from outputs to inputs could power new features like "generate test inputs that produce this specific error" or "what input would make this function return null?"
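To illustrate the last point, here is a hedged sketch of what inverse execution asks of a model. In this toy version we verify a candidate input by forward execution; the trained model would instead propose the candidate directly, without running anything. The function `f` and helper `check` are invented for illustration.

```python
# Toy framing of inverse execution: given a function and a target output,
# find an input that produces it.
def f(x):
    return (x * 2) + 1

target = 7

def check(candidate):
    # Ground-truth verification by actually running the code.
    # A neural debugger would skip this and *predict* the candidate.
    return f(candidate) == target

# A model might propose 3; forward execution confirms the guess.
assert check(3)
assert not check(2)
```

The same framing covers queries like "what input would make this function return null?": the target output changes, the task shape does not.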

One Thing to Try

If you're building coding agents, consider adding a "simulation" step before actual execution. Have your LLM predict what will happen when code runs (especially for specific edge cases or error conditions) before burning compute on real execution. This is especially valuable in multi-step debugging workflows where you're iterating on fixes.
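A minimal sketch of that simulate-first step, under stated assumptions: `predict_outcome` stands in for a call to a model that guesses the result of running `code` on `inputs`. Here the prediction is hard-coded so the example is self-contained; the interface and names are hypothetical.

```python
# Sketch of a "simulate before you execute" gate in a coding agent.
def predict_outcome(code: str, inputs: dict) -> dict:
    # Placeholder for an LLM prediction of the execution result.
    # Hard-coded guess: dividing by zero will raise.
    if inputs.get("d") == 0:
        return {"raises": "ZeroDivisionError"}
    return {"ok": True}

def run_with_simulation(code: str, inputs: dict) -> str:
    guess = predict_outcome(code, inputs)
    if "raises" in guess:
        # Skip the expensive real run and go straight to proposing a fix.
        return f"predicted failure: {guess['raises']}"
    return "proceed to real execution"

print(run_with_simulation("result = n / d", {"n": 1, "d": 0}))
```

In a real agent loop, the "proceed" branch would trigger actual execution, and disagreements between prediction and reality become useful signal about where the model's understanding of the code is wrong.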

Link to paper
