Smart AI Routing: How to Pick the Right Model Without Breaking the Bank

Stop Paying for GPT-4 When GPT-3.5 Would Work Just Fine

TL;DR — Researchers built a system that learns which AI model to use for each question, cutting costs by up to 70% while keeping answer quality high. It learns from experience instead of needing expensive training data.

What It Is

When you send a query to an LLM, you face a tradeoff: powerful models like GPT-4 give great answers but cost more, while cheaper models work fine for simple questions but fail on hard ones. This paper treats that choice as a learning problem.

Their system watches what happens each time it picks a model, then gets smarter about routing future queries. Under the hood it uses NeuralUCB, a neural contextual-bandit algorithm from the same family of techniques that powers online ad selection. It balances two goals: picking models it expects to do well, while occasionally trying other options to keep learning. The key innovation is defining "success" as a blend of answer quality and cost, so a cheap model that nails a simple question scores better than an expensive one that barely does better.

Unlike existing routers that need labeled examples showing which model works best for thousands of queries, this approach learns on the job. You only pay to run one model per query, and the system improves as it sees more traffic.
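That learn-on-the-job loop can be sketched with a linear contextual bandit (LinUCB) standing in for the paper's NeuralUCB; the model names, reward values, and `alpha` below are all illustrative, not taken from the paper:

```python
import numpy as np

class LinUCBRouter:
    """Contextual-bandit router: a linear stand-in for NeuralUCB.

    Each candidate model is an arm; the query embedding is the context.
    The score for each arm is predicted reward plus an exploration bonus
    that shrinks as the arm accumulates data.
    """

    def __init__(self, models, dim, alpha=1.0):
        self.models = models
        self.alpha = alpha                              # exploration strength
        self.A = {m: np.eye(dim) for m in models}       # per-arm covariance
        self.b = {m: np.zeros(dim) for m in models}     # per-arm reward sums

    def pick(self, x):
        # Upper-confidence score = predicted reward + exploration bonus.
        best, best_score = None, -np.inf
        for m in self.models:
            A_inv = np.linalg.inv(self.A[m])
            theta = A_inv @ self.b[m]
            score = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            if score > best_score:
                best, best_score = m, score
        return best

    def update(self, model, x, reward):
        # Only the chosen arm is updated -- one model run per query.
        self.A[model] += np.outer(x, x)
        self.b[model] += reward * x
```

In production, `x` would be the query embedding, and `reward` the blended quality-minus-cost score observed after serving the chosen model's answer. Because only the picked arm is updated, the system never pays to run more than one model per query.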

Why It Matters

  • You can optimize your LLM spending without collecting expensive training data. Most routing systems require running multiple models on the same questions to build training sets—this one learns from whatever choices it makes in production.
  • The cost-quality tradeoff is explicit and tunable. A single parameter (λ in their formula) lets you dial between "minimize cost" and "maximize quality" based on your product needs, without retraining.
  • It handles distribution shift naturally. When your users start asking different types of questions, the exploration mechanism helps it adapt, rather than getting stuck on outdated routing rules.
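The tunable cost-quality tradeoff in the second bullet can be written as a one-line reward. The linear blend and the default λ here are an assumed form for illustration, not necessarily the paper's exact formula:

```python
def routing_reward(quality, cost, lam=0.5):
    """Blend answer quality and cost into one scalar reward.

    quality: judged answer quality in [0, 1]
    cost:    per-query cost normalized to [0, 1]
    lam:     the tradeoff knob -- 0 ignores cost entirely,
             larger values increasingly favor cheap models.
    (Illustrative linear blend; the paper's formula may differ.)
    """
    return quality - lam * cost

# Dialing lam changes which model "wins" without any retraining:
cheap = routing_reward(0.80, cost=0.05, lam=0.5)   # ~0.775
big   = routing_reward(0.90, cost=0.60, lam=0.5)   # ~0.600 -> cheap wins
```

At λ = 0 the big model's higher quality would win; at λ = 0.5 the cheap model's near-zero cost tips the balance. That single knob is the "minimize cost" vs. "maximize quality" dial.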

One Thing to Try

If you're routing between multiple LLM providers or model tiers today, instrument your system to log three things per request: the embedding of the user's question, which model you used, and a combined score that rewards good answers but penalizes cost (try quality_score * exp(-0.5 * normalized_cost) as a starting formula). Even without changing your routing logic yet, this data lets you analyze whether you're overpaying for quality you don't need.
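A minimal sketch of that instrumentation, assuming a JSON-lines log sink; the function and field names are hypothetical, and the score uses the starting formula from the text:

```python
import json
import math
import time

def combined_score(quality_score, normalized_cost):
    # Starting formula from the text: reward quality,
    # penalize cost exponentially.
    return quality_score * math.exp(-0.5 * normalized_cost)

def log_routing_event(embedding, model_name, quality_score,
                      normalized_cost, sink):
    """Append one JSON line per request: context, action, blended outcome.

    embedding:       the query's embedding vector (list of floats)
    model_name:      which model served the request
    quality_score:   judged answer quality in [0, 1]
    normalized_cost: per-query cost normalized to [0, 1]
    sink:            any object with .append(), standing in for your
                     real log pipeline
    """
    sink.append(json.dumps({
        "ts": time.time(),
        "embedding": embedding,
        "model": model_name,
        "score": combined_score(quality_score, normalized_cost),
    }))
```

With a few weeks of these records you can replay them offline: group by embedding similarity and compare scores across models to see where a cheaper tier would have scored just as well.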

Link to paper
