Can AI Agents Create Harder Math Problems By Writing Code?
Teaching AI to Write Its Own Math Homework (And Make It Harder)
TL;DR — Researchers built a system where AI coding agents take existing math problems and automatically generate harder versions that are still solvable, potentially solving the shortage of challenging problems needed to train advanced math AI.
What It Is
We're running into a strange problem: our best AI models are getting so good at math that we're running out of hard problems to train and test them on. Creating IMO-level math problems requires serious expertise, and humans can't make them fast enough.
The Code2Math team had an idea: what if AI agents could use code to explore mathematical spaces and evolve existing problems into harder ones? They built a multi-agent system that takes a seed problem, writes code to explore variations computationally, then generates new problems based on what it discovers. For example, starting with a problem about finding a list of numbers that sum to 30 with specific properties, the agent explored thousands of configurations through code and created a harder version asking for the maximum list length given a sum of 323.
The key insight is that many hard math problems come from computational exploration—trying lots of examples, finding patterns, searching for edge cases. Code agents can do this exploration automatically and at scale.
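To make the exploration idea concrete, here is a toy stand-in for the kind of search a code agent might run on the seed problem above. The article doesn't spell out the "specific properties" of the original lists, so "distinct positive integers" is a hypothetical constraint used purely for illustration:

```python
def max_length(total: int) -> int:
    """Longest list of distinct positive integers summing to `total`:
    greedily take 1, 2, 3, ... while the next term still fits."""
    length, running = 0, 0
    while running + (length + 1) <= total:
        length += 1
        running += length
    return length

# Exploring the difficulty landscape: how does the answer scale with the sum?
for s in (30, 100, 323):
    print(s, "->", max_length(s))  # 30 -> 7, 100 -> 13, 323 -> 24
```

Running a sweep like this over many target sums is cheap, and the resulting pattern (the answer grows roughly like the square root of the sum) is exactly the kind of structure an agent can mine to pose a harder variant with a confidently known answer.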
Why It Matters
- Training data bottleneck eased: If you're building or fine-tuning math reasoning models, you can generate challenging problems programmatically instead of waiting for human experts to write them
- Automatic difficulty scaling: You can take problems your model already solves and systematically generate harder variants, creating a curriculum that grows with your model's capabilities
- Validation built-in: Unlike purely language-based problem generation (which often creates unsolvable or trivial problems), code execution provides automatic verification that evolved problems are actually solvable and structurally sound
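The verification point can be sketched in a few lines. Again using the hypothetical "distinct positive integers summing to S" constraint (not from the paper), a verifier constructs an explicit witness before an evolved problem is accepted:

```python
def witness(total: int, length: int):
    """Return a list of `length` distinct positive integers summing to
    `total`, or None if impossible: take 1..length, add the slack on top."""
    base = list(range(1, length + 1))
    slack = total - sum(base)
    if slack < 0:
        return None       # even the minimal distinct list overshoots the sum
    base[-1] += slack     # elements stay distinct because slack >= 0
    return base

# Accept the evolved problem only if its claimed answer checks out in code:
evolved = {"sum": 323, "claimed_max_length": 24}
w = witness(evolved["sum"], evolved["claimed_max_length"])
assert w is not None and sum(w) == evolved["sum"]                  # solvable
assert witness(evolved["sum"], evolved["claimed_max_length"] + 1) is None  # tight
```

A language model can assert that a variant "should be solvable"; executed code either produces a witness or it doesn't.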
One Thing to Try
If you're evaluating a math-capable model, take 10 problems it solves correctly and run them through a code agent with instructions to "find a harder variant by exploring edge cases computationally." Test whether your model still succeeds on the evolved versions—this gives you a quick difficulty calibration and might reveal capability gaps that standard benchmarks miss.
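A minimal harness for this experiment might look like the following. `evolve_harder` and `model_solves` are placeholder names, not API calls from the paper; in practice the first would prompt your code agent and the second would run your model and check its answer:

```python
def evolve_harder(problem: str) -> str:
    # Stand-in: a real version would prompt a code agent to "find a harder
    # variant by exploring edge cases computationally".
    return problem + " (evolved)"

def model_solves(problem: str) -> bool:
    # Stand-in for running your model and grading its answer. This stub
    # pretends the model fails on every evolved variant.
    return not problem.endswith("(evolved)")

solved = [f"problem {i}" for i in range(10)]  # 10 problems the model gets right
evolved = [evolve_harder(p) for p in solved]
survival = sum(model_solves(p) for p in evolved) / len(evolved)
print(f"accuracy on evolved variants: {survival:.0%}")
```

The gap between 100% on the seeds and the survival rate on evolved variants is your difficulty calibration.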