Vision AI That Only Looks When It Needs To: Cutting Inference Costs Without Losing Detail
Vision-Language Models Don't Need to Throw Away Pixels to Run Faster
TL;DR: Instead of compressing images to speed up vision-language models, VISOR keeps all the pixels but makes the model look at them less often, delivering 3-4x FLOPs reductions while actually improving accuracy on hard visual tasks.
What It Is
Most attempts to speed up vision-language models work by throwing away visual information—merging similar image patches or pruning "unimportant" tokens before feeding them to the language model. VISOR takes the opposite approach: keep all the high-resolution image tokens, but make the model interact with them more selectively.
The key insight is that models don't need to deeply process visual information at every layer. VISOR uses cheap "cross-attention" layers that let text tokens peek at image tokens without updating them, then strategically places a few expensive "self-attention" layers that actually refine the visual representations. Think of it like skimming a document most of the time, but occasionally stopping to read carefully when needed.
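The skim-vs-read split above can be sketched in a few lines of numpy. This is a toy illustration, not the paper's architecture: the single-head, projection-free attention, the residual updates, and the fixed `self_attn_every` schedule are all simplifying assumptions.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: queries attend over keys/values.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def visor_like_forward(text, image, num_layers=8, self_attn_every=4):
    # Hypothetical schedule: cheap cross-attention in most layers
    # (text reads the image tokens, image tokens stay frozen), with
    # an expensive joint self-attention layer every few layers that
    # refines the image tokens too.
    for layer in range(num_layers):
        if (layer + 1) % self_attn_every == 0:
            # Expensive "read carefully" layer: full self-attention
            # over text + image, so image representations are updated.
            joint = np.concatenate([text, image], axis=0)
            joint = joint + attention(joint, joint, joint)
            text, image = joint[: len(text)], joint[len(text):]
        else:
            # Cheap "skim" layer: text peeks at image; image unchanged.
            text = text + attention(text, image, image)
    return text, image
```

Note that in the cheap layers the (many) image tokens never act as queries, which is where the savings come from when image tokens vastly outnumber text tokens.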
The system even learns to adapt per sample: easy questions get minimal visual processing, while complex visual reasoning tasks automatically trigger more computation, with extra layers that update and refine the image tokens.
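One simple way to picture that per-sample adaptation is a learned gate per expensive layer: a layer only executes when its gate score clears a threshold. The function name, the scores, and the thresholding rule below are all illustrative assumptions, not the paper's actual router.

```python
def executed_self_layers(gate_scores, threshold=0.5):
    """Hypothetical per-sample router: each expensive self-attention
    layer has a gate score in [0, 1]; the layer only runs when its
    score clears the threshold, so easy samples do less refinement."""
    return [i for i, g in enumerate(gate_scores) if g >= threshold]

# An "easy" sample's gates stay low; a hard one triggers more layers.
easy = executed_self_layers([0.1, 0.2, 0.6, 0.1])  # -> [2]
hard = executed_self_layers([0.7, 0.9, 0.6, 0.8])  # -> [0, 1, 2, 3]
```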
Why It Matters
- Solves the hard problems: Token-reduction methods work fine on simple tasks but degrade sharply on detailed visual understanding (like counting objects or reading small text), because the discarded tokens are exactly the ones those tasks need. VISOR excels precisely where compression methods fail, without the information bottleneck.
- Actually practical speedups: You get 3-4x FLOPs reduction on challenging high-resolution tasks, not just on toy benchmarks. The method also stacks with existing token-reduction approaches if you need even more efficiency.
- One model, multiple budgets: Instead of training separate models for different speed/accuracy tradeoffs, you train once and deploy with dynamic computation—the model automatically uses more layers when the question demands it.
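A rough back-of-envelope calculation shows why replacing joint self-attention with text-to-image cross-attention saves so much. The token counts below are assumptions for illustration, and the formula counts only the attention score and value math, ignoring projections and MLPs, which is why the end-to-end savings the post cites (3-4x) are much smaller than the per-layer attention ratio printed here.

```python
def attn_flops(num_queries, num_keys, dim):
    # Rough cost of one attention op: the QK^T scores plus the
    # weighted sum over values (projections and MLPs ignored).
    return 2 * num_queries * num_keys * dim

T, V, d = 64, 4096, 1024  # assumed counts: text tokens, high-res image tokens, hidden dim

self_attn = attn_flops(T + V, T + V, d)  # joint self-attention over everything
cross_attn = attn_flops(T, V, d)         # text queries over image keys only

print(f"per-layer attention cost ratio: {self_attn / cross_attn:.0f}x")
```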
One Thing to Try
If you're currently using token pruning or merging to speed up your vision-language pipeline, benchmark it specifically on tasks requiring fine-grained visual detail (small text, counting, spatial relationships). You might discover your "optimization" is silently failing on exactly the cases users care about most, and that sparse layer execution could help instead.
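The quickest way to run that check is to break your eval results down by task category instead of reporting one aggregate number. A minimal sketch, where the category names and the `(category, correct)` result format are illustrative, not from any particular benchmark harness:

```python
from collections import defaultdict

def per_category_accuracy(results):
    # results: iterable of (category, correct) pairs from an eval run.
    totals, hits = defaultdict(int), defaultdict(int)
    for category, correct in results:
        totals[category] += 1
        hits[category] += int(correct)
    return {c: hits[c] / totals[c] for c in totals}

# Synthetic illustration: a strong aggregate score can hide a
# collapse on fine-grained categories after token pruning.
pruned_run = (
    [("caption", True)] * 90 + [("ocr", False)] * 8 + [("ocr", True)] * 2
)
print(per_category_accuracy(pruned_run))  # caption looks fine; ocr is at 0.2
```

If the per-category numbers diverge sharply like this, the aggregate benchmark was masking exactly the failure mode described above.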