Smart AI Routing: How to Pick the Right Model Without Breaking the Bank

Stop Paying for GPT-4 When GPT-3.5 Would Work Just Fine

TL;DR — Researchers built a system that learns which AI model to use for each question, cutting costs by up to 70% while keeping answer quality high. It learns from experience instead of needing expensive training data.

What It Is

When you send a query to an LLM, you face a tradeoff: powerful models like GPT-4 give great answers but cost more, while cheaper models work fine for simple questions but fail on hard ones. This paper treats that choice as a learning problem.

Their system watches what happens each time it picks a model, then gets smarter about routing future queries. Under the hood it uses NeuralUCB, a neural contextual-bandit algorithm from the same family of techniques that powers online ad selection. It balances two goals: picking models it expects to do well, while occasionally trying other options to keep learning. The key innovation is defining "success" as a blend of answer quality and cost, so a cheap model that nails a simple question scores better than an expensive one that barely does better.

Unlike existing routers that need labeled examples showing which model works best for thousands of queries, this approach learns on the job. You only pay to run one model per query, and the system improves as it sees more traffic.
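That learn-on-the-job loop can be sketched with a linear contextual bandit (LinUCB) standing in for the paper's NeuralUCB; the model names, reward values, and `alpha` below are all illustrative, not taken from the paper:

```python
import numpy as np

class LinUCBRouter:
    """Contextual-bandit router: a linear stand-in for NeuralUCB.

    Each candidate model is an arm; the query embedding is the context.
    The score for each arm is predicted reward plus an exploration bonus
    that shrinks as the arm accumulates data.
    """

    def __init__(self, models, dim, alpha=1.0):
        self.models = models
        self.alpha = alpha                              # exploration strength
        self.A = {m: np.eye(dim) for m in models}       # per-arm covariance
        self.b = {m: np.zeros(dim) for m in models}     # per-arm reward sums

    def pick(self, x):
        # Upper-confidence score = predicted reward + exploration bonus.
        best, best_score = None, -np.inf
        for m in self.models:
            A_inv = np.linalg.inv(self.A[m])
            theta = A_inv @ self.b[m]
            score = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            if score > best_score:
                best, best_score = m, score
        return best

    def update(self, model, x, reward):
        # Only the chosen arm is updated -- one model run per query.
        self.A[model] += np.outer(x, x)
        self.b[model] += reward * x
```

In production, `x` would be the query embedding, and `reward` the blended quality-minus-cost score observed after serving the chosen model's answer. Because only the picked arm is updated, the system never pays to run more than one model per query.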

Why It Matters

  • You can optimize your LLM spending without collecting expensive training data. Most routing systems require running multiple models on the same questions to build training sets—this one learns from whatever choices it makes in production.
  • The cost-quality tradeoff is explicit and tunable. A single parameter (λ in their formula) lets you dial between "minimize cost" and "maximize quality" based on your product needs, without retraining.
  • It handles distribution shift naturally. When your users start asking different types of questions, the exploration mechanism helps it adapt, rather than getting stuck on outdated routing rules.
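The tunable cost-quality tradeoff in the second bullet can be written as a one-line reward. The linear blend and the default λ here are an assumed form for illustration, not necessarily the paper's exact formula:

```python
def routing_reward(quality, cost, lam=0.5):
    """Blend answer quality and cost into one scalar reward.

    quality: judged answer quality in [0, 1]
    cost:    per-query cost normalized to [0, 1]
    lam:     the tradeoff knob -- 0 ignores cost entirely,
             larger values increasingly favor cheap models.
    (Illustrative linear blend; the paper's formula may differ.)
    """
    return quality - lam * cost

# Dialing lam changes which model "wins" without any retraining:
cheap = routing_reward(0.80, cost=0.05, lam=0.5)   # ~0.775
big   = routing_reward(0.90, cost=0.60, lam=0.5)   # ~0.600 -> cheap wins
```

At λ = 0 the big model's higher quality would win; at λ = 0.5 the cheap model's near-zero cost tips the balance. That single knob is the "minimize cost" vs. "maximize quality" dial.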

One Thing to Try

If you're routing between multiple LLM providers or model tiers today, instrument your system to log three things per request: the embedding of the user's question, which model you used, and a combined score that rewards good answers but penalizes cost (try quality_score * exp(-0.5 * normalized_cost) as a starting formula). Even without changing your routing logic yet, this data lets you analyze whether you're overpaying for quality you don't need.
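A minimal sketch of that instrumentation, assuming a JSON-lines log sink; the function and field names are hypothetical, and the score uses the starting formula from the text:

```python
import json
import math
import time

def combined_score(quality_score, normalized_cost):
    # Starting formula from the text: reward quality,
    # penalize cost exponentially.
    return quality_score * math.exp(-0.5 * normalized_cost)

def log_routing_event(embedding, model_name, quality_score,
                      normalized_cost, sink):
    """Append one JSON line per request: context, action, blended outcome.

    embedding:       the query's embedding vector (list of floats)
    model_name:      which model served the request
    quality_score:   judged answer quality in [0, 1]
    normalized_cost: per-query cost normalized to [0, 1]
    sink:            any object with .append(), standing in for your
                     real log pipeline
    """
    sink.append(json.dumps({
        "ts": time.time(),
        "embedding": embedding,
        "model": model_name,
        "score": combined_score(quality_score, normalized_cost),
    }))
```

With a few weeks of these records you can replay them offline: group by embedding similarity and compare scores across models to see where a cheaper tier would have scored just as well.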

Link to paper
