How We Made AI Reasoning Run Fast Enough for Your Phone

Chain-of-Thought Reasoning Doesn't Have to Break Your Phone

TL;DR — Researchers got a 7B model to do complex reasoning on a smartphone by using small add-on modules that turn on only when needed, cutting the verbose thinking process down to size without killing accuracy.

What It Is

You know how models like o1 "think out loud" through problems, generating hundreds of tokens of reasoning before answering? That works great in the cloud but murders your phone's battery and memory. This team figured out how to make a Qwen 7B model reason efficiently on mobile devices using three clever tricks.

First, they added lightweight LoRA adapters (small bolt-on modules) that activate a "reasoning mode" only when the question actually needs it—simple queries skip the whole chain-of-thought overhead. Second, they trained these adapters with "budget forcing," essentially teaching the model to think more concisely through reinforcement learning that penalizes rambling. Third, they share the KV-cache (the memory of what's been processed) between normal and reasoning modes, so switching between them doesn't double your memory usage.
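The gating idea can be sketched in a few lines. This is a toy illustration of the general LoRA mechanism, not the paper's code: a frozen base weight `W` plus a low-rank pair `B @ A` that only contributes when reasoning mode is switched on. The 2x2 matrices and the `forward` helper are made up for illustration; in the real model the delta is a rank-r update on a 7B-parameter network.

```python
def matvec(M, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

# Frozen 2x2 base weight, plus a rank-1 LoRA pair (A: 1x2 down, B: 2x1 up).
W = [[1.0, 0.0],
     [0.0, 1.0]]
A = [[0.5, 0.5]]          # down-projection to rank 1
B = [[0.2], [0.4]]        # up-projection back to hidden size

def forward(x, reasoning=False):
    """Base projection; the low-rank delta is added only in reasoning mode."""
    y = matvec(W, x)
    if reasoning:
        delta = matvec(B, matvec(A, x))   # B @ (A @ x): tiny extra compute
        y = [y_i + d_i for y_i, d_i in zip(y, delta)]
    return y
```

With the adapter off, the forward pass is byte-for-byte the base model, which is exactly why simple queries pay no chain-of-thought tax.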

The result runs on actual smartphones with INT4 quantization (4-bit weights, roughly a quarter of the memory of 16-bit floats), generating multiple reasoning attempts in parallel to improve accuracy while staying within brutal mobile constraints.
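The "multiple attempts in parallel" trick is the familiar self-consistency vote: run the same prompt several times, then keep the most common final answer. A minimal sketch (the trace strings here are invented placeholders for real decoding runs):

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most common final answer across parallel reasoning traces."""
    return Counter(answers).most_common(1)[0][0]

# e.g. five parallel INT4 decoding runs of the same prompt; four agree:
traces = ["42", "42", "17", "42", "42"]
winner = majority_vote(traces)
```

One noisy trace gets outvoted, which is how parallel sampling buys back accuracy that aggressive quantization and shortened reasoning would otherwise cost.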

Why It Matters

  • You can ship reasoning without shipping to the cloud — Privacy-sensitive applications (medical advice, financial planning, personal assistants) can now think through complex problems without sending data to servers.
  • The LoRA adapter pattern solves the "one model or many" dilemma — Instead of loading entirely different models for different tasks, you load one base model once and hot-swap tiny adapters. Far more memory-efficient than keeping a separate fine-tuned model resident per task.
  • Budget forcing during RL is the move for production reasoning — Those verbose o1-style traces aren't just slow, they're redundant. Training models to reason concisely (not just correctly) is becoming table stakes for deployment.
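To see why the hot-swap pattern wins on memory, here's a back-of-the-envelope sketch. The class, names, and sizes are all hypothetical (a 7B model at INT4 is ballpark 3.5 GB; a LoRA adapter is tens of MB), but the accounting shows the point: switching tasks moves a pointer, not gigabytes of weights.

```python
class AdapterRouter:
    """One frozen base model resident in memory; tiny adapters swapped per task."""

    def __init__(self, base_mb):
        self.base_mb = base_mb     # base weights stay loaded once
        self.adapters = {}         # name -> adapter size in MB
        self.active = None

    def register(self, name, size_mb):
        self.adapters[name] = size_mb

    def activate(self, name):
        self.active = name         # swap the active adapter, not the base

    def resident_mb(self):
        """Total memory: base plus only the currently active adapter."""
        return self.base_mb + self.adapters.get(self.active, 0)

router = AdapterRouter(base_mb=3500)   # ~7B weights at INT4 (illustrative)
router.register("reasoning", 40)       # LoRA adapters are tiny by comparison
router.activate("reasoning")
```

Loading a second full fine-tuned model would double the 3,500 MB; activating the adapter adds about 1% instead.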

One Thing to Try

If you're fine-tuning smaller models for reasoning tasks, add a length penalty to your reward function during RL training. Start with a soft penalty (reward decreases gradually after 200 tokens) rather than hard cutoffs, and measure the accuracy/brevity tradeoff on your specific domain—the sweet spot varies wildly between math problems and code generation.
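A soft length penalty like the one described above can be written directly into the reward function. This is a generic sketch of the idea, not the paper's exact objective; the budget and slope values are the kind of starting points you'd then tune against your own accuracy/brevity curve.

```python
def length_penalized_reward(correct, n_tokens, budget=200, slope=0.002):
    """Full reward up to `budget` tokens, then a gradual linear decay.

    Soft penalty: the model is nudged toward brevity rather than
    hard-truncated, so it can still spend tokens when the problem needs them.
    """
    base = 1.0 if correct else 0.0
    overrun = max(0, n_tokens - budget)
    return max(0.0, base - slope * overrun)
```

A correct 150-token answer keeps the full reward; a correct 300-token answer loses a fifth of it; a wrong answer earns nothing regardless of length, so brevity never outranks correctness.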

Link to paper
