Why Vision-Language AI Models Can "See" Images But Fail at Basic Reasoning (And How to Fix It)
Your Multimodal Model Can See the Math Problem But Forgets How to Solve It
TL;DR — Multimodal AI models with Mixture-of-Experts architecture can read text from images perfectly but somehow fail at reasoning tasks they'd solve easily if you just typed the same text. The problem? The image distracts the routing system from activating the right "expert" sub-networks.
What It Is
Picture this: you show a vision-language model a screenshot of a math word problem. It reads every number correctly, understands all the words, but then... gets the answer wrong. Now paste that exact same text into a chat window, and suddenly it solves the problem perfectly.
The researchers found this happens because of how Mixture-of-Experts (MoE) models work. These models contain specialized sub-networks called "experts" — some handle visual processing, others handle math reasoning, others handle language tasks. A routing mechanism decides which experts to activate for each input.
Here's the issue: when processing images, the router gets distracted. It activates the vision experts (good!) but then fails to properly activate the reasoning experts that live in the middle layers of the network. The model can see but can't think because the wrong cognitive modules are firing. When the researchers forced the model to activate the right reasoning experts, performance jumped by up to 3.17% on visual reasoning tasks.
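To make the routing idea concrete, here's a minimal sketch of a top-k MoE layer in NumPy. Everything here (the weight shapes, the expert count, top-2 routing) is an illustrative assumption, not the architecture of any specific model: a learned gate scores each expert for the input, only the k highest-scoring experts run, and their outputs are mixed by the softmax-normalized gate scores. "Routing distraction" is what happens when those gate scores pick the wrong experts.

```python
import numpy as np

rng = np.random.default_rng(0)

D, N_EXPERTS, TOP_K = 8, 4, 2  # hidden size, expert count, experts activated per token

# Hypothetical weights: one gating matrix plus one linear "expert" per slot.
W_gate = rng.normal(size=(D, N_EXPERTS))
experts = [rng.normal(size=(D, D)) for _ in range(N_EXPERTS)]

def moe_layer(x):
    """Route a token vector x to its top-k experts and mix their outputs."""
    logits = x @ W_gate                      # router score per expert
    chosen = np.argsort(logits)[-TOP_K:]     # keep only the k highest-scoring experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                 # softmax over the selected experts only
    out = sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))
    return out, chosen

token = rng.normal(size=D)
out, chosen = moe_layer(token)
print("experts chosen:", sorted(chosen))
```

The key design point: the experts that don't win the top-k never run at all, so a router biased by visual tokens can silently skip the reasoning experts even though they'd produce the right answer.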
Why It Matters
- Your multimodal MoE might be underperforming for fixable reasons — it's not that the model lacks reasoning capability, it's that images cause suboptimal expert selection. The smarts are there; they're just not being activated.
- Text-only benchmarks hide real-world problems — a model that aces text-based reasoning tests might struggle when that same reasoning is needed for visual inputs, even when perception is perfect. Test your models with both modalities for tasks that matter to your use case.
- Simple routing interventions can recover lost performance — you don't need to retrain the entire model. Identifying and boosting the right domain experts at inference time provides consistent gains across different tasks and model architectures.
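The bullet above — boosting the right domain experts at inference time — can be sketched as an additive bias on the router logits. This is a toy illustration under assumed names and weights, not the researchers' exact procedure: it presumes you've already identified which expert indices handle reasoning (here, a hypothetical `REASONING_EXPERTS` list found offline) and simply nudges the router toward them before the top-k selection.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N_EXPERTS, TOP_K = 8, 4, 2
W_gate = rng.normal(size=(D, N_EXPERTS))

# Hypothetical: experts 1 and 3 were identified offline as reasoning experts.
REASONING_EXPERTS = [1, 3]
BOOST = 2.0  # additive bias on router logits; tune on a validation set

def route(x, boost=0.0):
    """Return the indices of the top-k experts, optionally boosting reasoning experts."""
    logits = x @ W_gate
    logits[REASONING_EXPERTS] += boost   # nudge the router toward reasoning experts
    return np.argsort(logits)[-TOP_K:]

x = rng.normal(size=D)
print("default routing:", sorted(int(i) for i in route(x)))
print("boosted routing:", sorted(int(i) for i in route(x, boost=BOOST)))
```

Because the intervention only edits logits at inference, no weights change and no retraining is needed; the trade-off is that an over-aggressive boost can crowd out experts the input genuinely needs.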
One Thing to Try
If you're using multimodal MoE models for reasoning tasks (math, logic, analysis), run an A/B test: extract text from images with OCR and feed it as pure text versus feeding the original image. If you see a significant accuracy gap on identical content, you're experiencing routing distraction. Consider using the text pathway for critical reasoning tasks, or explore routing intervention techniques if you must process visual inputs.
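The A/B test above fits in a few lines of harness code. This is a skeleton, not a working evaluation: `answer_from_image`, `answer_from_text`, and `ocr` are placeholders you'd replace with your model's vision pathway, its text pathway, and a real OCR step; the demo stand-ins below exist only so the harness runs end to end.

```python
from typing import Callable

def ab_test(problems, answer_from_image: Callable, answer_from_text: Callable, ocr: Callable):
    """Compare accuracy on (image, ground_truth) pairs via the image vs. OCR-text pathway."""
    img_hits = txt_hits = 0
    for image, truth in problems:
        if answer_from_image(image) == truth:
            img_hits += 1
        if answer_from_text(ocr(image)) == truth:
            txt_hits += 1
    n = len(problems)
    return img_hits / n, txt_hits / n

# Toy demo with stand-in functions: the "image" is already the problem text.
problems = [("2+2", "4"), ("3*3", "9"), ("10-7", "3")]
ocr = lambda img: img                         # placeholder for a real OCR call
answer_from_text = lambda t: str(eval(t))     # stand-in solver for the text pathway
answer_from_image = lambda img: "?"           # stand-in for a distracted vision pathway
img_acc, txt_acc = ab_test(problems, answer_from_image, answer_from_text, ocr)
print(f"image accuracy: {img_acc:.0%}, text accuracy: {txt_acc:.0%}")
```

A large gap between the two numbers on identical content is the signature of routing distraction rather than a perception failure.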