GLM-5: Teaching AI to Actually Build Software, Not Just Suggest Code

TL;DR — GLM-5 shifts from helping you write code to actually acting as a software engineer that can handle complex, multi-hour development tasks autonomously.

What It Is

The team at Zhipu AI built GLM-5 to solve a problem we've all felt: current AI coding assistants are great at writing functions but terrible at being actual engineering partners. They point to "vibe coding" as the status quo—AI that produces plausible-looking code from loose prompts but can't carry a real software project end to end.

GLM-5 uses three key innovations to bridge this gap. First, it implements DSA, a sparse attention mechanism that spends compute only where it matters, making 200,000-token contexts cheaper to run. Second, they built a new reinforcement learning system that separates the "trying things" (rollout) phase from the "learning from mistakes" (update) phase, letting them train on complex, multi-step tasks much faster. Third, they trained the model in stages—reasoning first, then agent behavior, then general skills—using distillation between stages to keep it from forgetting what it learned earlier.
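The paper doesn't spell out DSA's internals here, but the core idea of sparse attention—scoring all keys yet attending only to the most relevant few—can be sketched in a few lines. This toy NumPy version (single query, top-k selection; all names and shapes are illustrative, not from the paper) shows where the compute savings come from: the softmax and value mixing run over k positions instead of the full context.

```python
import numpy as np

def sparse_attention(q, K, V, k=4):
    """Toy sketch of top-k sparse attention for one query vector.
    q: (d,) query; K: (n, d) keys; V: (n, d) values.
    Scores every key cheaply, then mixes values for only the
    k highest-scoring positions instead of all n."""
    scores = K @ q / np.sqrt(q.shape[0])         # similarity of query to every key
    top = np.argsort(scores)[-k:]                # indices of the k most relevant keys
    w = np.exp(scores[top] - scores[top].max())  # softmax over the selected subset only
    w /= w.sum()
    return w @ V[top]                            # weighted sum of the selected values

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 8))
q = rng.normal(size=8)
out = sparse_attention(q, K, V, k=4)  # attends to 4 of 16 positions
```

With k equal to the sequence length this reduces to ordinary softmax attention; the production version presumably learns which positions to keep rather than using raw dot-product top-k.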

The results are striking: GLM-5 scores 50 on the Intelligence Index (first open-weights model to hit this milestone), ranks #1 among open models on LMArena for both text and code, and can run a simulated vending machine business for a full year while managing resources effectively.

Why It Matters

  • Long-horizon tasks are now viable: If you've been frustrated by AI losing the thread after a few turns, GLM-5's ability to maintain coherence over hours (not minutes) opens up entirely new use cases—think automated refactoring, multi-file feature implementation, or ongoing codebase maintenance.
  • Open weights at frontier performance: GLM-5 matches Claude Opus 4.5 and GPT-5.2 on many benchmarks while shipping open weights, meaning you can deploy it on your own infrastructure without per-token API costs eating your budget.
  • The training recipe matters more than ever: Their staged RL approach (reasoning → agentic → general) with cross-stage distillation is a blueprint for anyone fine-tuning models—it shows you can teach specialized skills without destroying general capabilities.
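The cross-stage distillation idea in that last bullet is worth making concrete. A common way to stop a new training stage from erasing earlier skills is to blend the new-stage loss with a KL penalty toward a frozen snapshot of the previous stage's model. The sketch below (NumPy; `alpha`, `T`, and the function names are illustrative assumptions, not details from the paper) shows that blended objective for a single output distribution.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, task_loss, alpha=0.5, T=2.0):
    """Illustrative cross-stage distillation objective:
    alpha weights the new-stage task loss; the remaining weight goes to
    a temperature-softened KL term pulling the student toward a frozen
    previous-stage teacher, so new skills don't overwrite old ones."""
    p_t = softmax(teacher_logits / T)            # teacher's softened distribution
    p_s = softmax(student_logits / T)            # student's softened distribution
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))
    return alpha * task_loss + (1 - alpha) * (T * T) * kl

# When the student still matches the teacher, the KL term vanishes
# and only the weighted task loss remains.
logits = np.array([1.0, 2.0, 3.0])
loss = distill_loss(logits, logits, task_loss=2.0, alpha=0.5)
```

The T² factor is the standard correction that keeps gradient magnitudes comparable across temperatures; the real recipe likely distills over full token sequences rather than a single distribution.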

One Thing to Try

If you're evaluating coding assistants for your team, test them on a real multi-hour task from your backlog—something requiring reading multiple files, making a plan, and executing across several PRs. Don't just measure "can it write this function" but "can it actually complete this feature ticket." GLM-5's performance on Vending-Bench 2 (managing a business for a full year) suggests the bar for "agentic" capability should be much higher than we've been setting it.
