Vision AI That Only Looks When It Needs To: Cutting Inference Costs Without Losing Detail
Vision-Language Models Don't Need to Throw Away Pixels to Run Faster
TL;DR: Instead of compressing images to speed up vision-language models, VISOR keeps all the pixels but makes the model look at them less often, delivering 3-4x FLOPs reductions while actually improving accuracy on hard visual tasks.
What It Is
Most attempts to speed up vision-language models work by throwing away visual information—merging similar image patches or pruning "unimportant" tokens before feeding them to the language model. VISOR takes the opposite approach: keep all the high-resolution image tokens, but make the model interact with them more selectively.
The key insight is that models don't need to deeply process visual information at every layer. VISOR uses cheap "cross-attention" layers that let text tokens peek at image tokens without updating them, then strategically places a few expensive "self-attention" layers that actually refine the visual representations. Think of it like skimming a document most of the time, but occasionally stopping to read carefully when needed.
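The skim-vs-read split above can be sketched in a few lines of numpy. This is a toy illustration, not the paper's architecture: the single-head, projection-free attention, the residual updates, and the fixed `self_attn_every` schedule are all simplifying assumptions.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: queries attend over keys/values.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def visor_like_forward(text, image, num_layers=8, self_attn_every=4):
    # Hypothetical schedule: cheap cross-attention in most layers
    # (text reads the image tokens, image tokens stay frozen), with
    # an expensive joint self-attention layer every few layers that
    # refines the image tokens too.
    for layer in range(num_layers):
        if (layer + 1) % self_attn_every == 0:
            # Expensive "read carefully" layer: full self-attention
            # over text + image, so image representations are updated.
            joint = np.concatenate([text, image], axis=0)
            joint = joint + attention(joint, joint, joint)
            text, image = joint[: len(text)], joint[len(text):]
        else:
            # Cheap "skim" layer: text peeks at image; image unchanged.
            text = text + attention(text, image, image)
    return text, image
```

Note that in the cheap layers the (many) image tokens never act as queries, which is where the savings come from when image tokens vastly outnumber text tokens.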
The system even learns to adapt per sample: easy questions get minimal visual processing, while complex visual reasoning tasks automatically trigger more computation, with extra layers that update and refine the image tokens.
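One simple way to picture that per-sample adaptation is a learned gate per expensive layer: a layer only executes when its gate score clears a threshold. The function name, the scores, and the thresholding rule below are all illustrative assumptions, not the paper's actual router.

```python
def executed_self_layers(gate_scores, threshold=0.5):
    """Hypothetical per-sample router: each expensive self-attention
    layer has a gate score in [0, 1]; the layer only runs when its
    score clears the threshold, so easy samples do less refinement."""
    return [i for i, g in enumerate(gate_scores) if g >= threshold]

# An "easy" sample's gates stay low; a hard one triggers more layers.
easy = executed_self_layers([0.1, 0.2, 0.6, 0.1])  # -> [2]
hard = executed_self_layers([0.7, 0.9, 0.6, 0.8])  # -> [0, 1, 2, 3]
```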
Why It Matters
- Solves the hard problems: Token-reduction methods work fine on simple tasks but degrade sharply on detailed visual understanding (like counting objects or reading small text), because the discarded tokens are exactly the ones those tasks need. VISOR excels precisely where compression methods fail, without the information bottleneck.
- Actually practical speedups: You get 3-4x FLOPs reduction on challenging high-resolution tasks, not just on toy benchmarks. The method also stacks with existing token-reduction approaches if you need even more efficiency.
- One model, multiple budgets: Instead of training separate models for different speed/accuracy tradeoffs, you train once and deploy with dynamic computation—the model automatically uses more layers when the question demands it.
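A rough back-of-envelope calculation shows why replacing joint self-attention with text-to-image cross-attention saves so much. The token counts below are assumptions for illustration, and the formula counts only the attention score and value math, ignoring projections and MLPs, which is why the end-to-end savings the post cites (3-4x) are much smaller than the per-layer attention ratio printed here.

```python
def attn_flops(num_queries, num_keys, dim):
    # Rough cost of one attention op: the QK^T scores plus the
    # weighted sum over values (projections and MLPs ignored).
    return 2 * num_queries * num_keys * dim

T, V, d = 64, 4096, 1024  # assumed counts: text tokens, high-res image tokens, hidden dim

self_attn = attn_flops(T + V, T + V, d)  # joint self-attention over everything
cross_attn = attn_flops(T, V, d)         # text queries over image keys only

print(f"per-layer attention cost ratio: {self_attn / cross_attn:.0f}x")
```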
One Thing to Try
If you're currently using token pruning or merging to speed up your vision-language pipeline, benchmark it specifically on tasks requiring fine-grained visual detail (small text, counting, spatial relationships). You might discover your "optimization" is silently failing on exactly the cases users care about most, and that sparse layer execution could help instead.
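The quickest way to run that check is to break your eval results down by task category instead of reporting one aggregate number. A minimal sketch, where the category names and the `(category, correct)` result format are illustrative, not from any particular benchmark harness:

```python
from collections import defaultdict

def per_category_accuracy(results):
    # results: iterable of (category, correct) pairs from an eval run.
    totals, hits = defaultdict(int), defaultdict(int)
    for category, correct in results:
        totals[category] += 1
        hits[category] += int(correct)
    return {c: hits[c] / totals[c] for c in totals}

# Synthetic illustration: a strong aggregate score can hide a
# collapse on fine-grained categories after token pruning.
pruned_run = (
    [("caption", True)] * 90 + [("ocr", False)] * 8 + [("ocr", True)] * 2
)
print(per_category_accuracy(pruned_run))  # caption looks fine; ocr is at 0.2
```

If the per-category numbers diverge sharply like this, the aggregate benchmark was masking exactly the failure mode described above.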