Can AI Really Beat Wall Street? Testing LLMs on Real Trading Decisions
Your LLM Can Read Balance Sheets, But Can't Read a Stock Chart
TL;DR: When researchers tested 14 LLMs on financial questions requiring both company fundamentals and trading signals, they found a surprising gap: retrieval helps models understand earnings reports, but barely helps them reason about price movements and market timing.
What It Is
Financial analysts need to think about two things at once: what a company's financial statements say (revenue, profit margins, debt levels) and what the market is doing (price momentum, volatility, trading patterns). Most AI benchmarks only test the first part.
FinTradeBench fills this gap with 1,400 questions across NASDAQ-100 companies spanning ten years. Questions fall into three categories: fundamentals-only (pure balance sheet reasoning), trading-signal-only (pure market behavior), and hybrid questions that require both. For example: "Is NVIDIA's July 2025 pullback a buying opportunity?" requires understanding both the company's financial health and what the price chart is actually showing.
The researchers tested models in two settings: zero-shot (just the question) and retrieval-augmented (giving the model access to relevant data first).
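The two settings amount to different prompt constructions. A minimal sketch of the contrast (the prompt templates, question, and retrieved snippets below are illustrative assumptions, not the benchmark's actual format):

```python
def zero_shot_prompt(question: str) -> str:
    # Setting 1: the model sees only the question.
    return f"Answer the following financial question.\n\nQ: {question}\nA:"

def rag_prompt(question: str, retrieved_docs: list[str]) -> str:
    # Setting 2: relevant data is retrieved and prepended as context.
    context = "\n".join(f"- {d}" for d in retrieved_docs)
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\nQ: {question}\nA:"
    )

question = "Is NVIDIA's July 2025 pullback a buying opportunity?"
docs = [
    "Most recent quarterly revenue grew sharply year over year.",   # fundamentals
    "Price fell from the July high on above-average volume.",       # trading signal
]

print(zero_shot_prompt(question))
print(rag_prompt(question, docs))
```

The benchmark's finding is that the extra context in the second prompt pays off mainly when the answer lives in the fundamentals snippet, not the trading-signal one.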
Why It Matters
- Retrieval doesn't fix everything. Adding context improved fundamentals reasoning by 37%, but barely helped (or even hurt) trading signal questions. If you're building financial copilots, don't assume RAG solves numerical reasoning.
- Time-series data is still a weak spot. LLMs struggle to interpret patterns in price movements, volatility, and momentum indicators: the kind of sequential numerical reasoning that matters for market timing decisions.
- Real financial decisions need both signals. When fundamentals and market behavior conflict (like Tesla rallying 20% despite missing earnings), models need to reason about why the disconnect exists. Current benchmarks don't test this, but real users need it.
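The fundamentals-versus-market conflict in the last bullet can be made concrete. Here is a hypothetical sketch (the thresholds and helper names are assumptions, not from the benchmark) of flagging the Tesla-style disconnect that models are asked to reason about:

```python
def fundamentals_signal(eps_actual: float, eps_estimate: float) -> str:
    # Crude fundamentals read: did the company beat or miss estimates?
    return "beat" if eps_actual >= eps_estimate else "miss"

def market_signal(price_change_pct: float, threshold: float = 5.0) -> str:
    # Crude market read: a large move in either direction, else flat.
    if price_change_pct >= threshold:
        return "rally"
    if price_change_pct <= -threshold:
        return "selloff"
    return "flat"

def disconnect(eps_actual: float, eps_estimate: float,
               price_change_pct: float) -> bool:
    # True when the two signals point in opposite directions --
    # exactly the case that requires cross-signal reasoning.
    f = fundamentals_signal(eps_actual, eps_estimate)
    m = market_signal(price_change_pct)
    return (f == "miss" and m == "rally") or (f == "beat" and m == "selloff")

# Tesla-style example from the post: earnings miss, yet a 20% rally.
print(disconnect(eps_actual=0.45, eps_estimate=0.60, price_change_pct=20.0))
```

A rule like this only detects the conflict; explaining *why* the disconnect exists is the part the post argues current models and benchmarks don't handle.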
One Thing to Try
If your LLM application handles financial data, test it separately on textual reasoning (extracting facts from reports) and on numerical time-series reasoning (identifying trends in price data). The second will likely fail more often. If it does, consider specialized prompting, fine-tuning, or even a separate model for time-series tasks, rather than assuming one general-purpose LLM handles both equally well.
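The split evaluation above can be sketched as a small harness that scores each category separately, so a gap shows up instead of being averaged away (the category labels and toy results below are hypothetical):

```python
from collections import defaultdict

def accuracy_by_category(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute per-category accuracy from (category, is_correct) pairs."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for category, ok in results:
        total[category] += 1
        correct[category] += int(ok)
    return {c: correct[c] / total[c] for c in total}

# Toy results illustrating the pattern: textual questions mostly pass,
# time-series questions mostly fail.
results = [
    ("textual", True), ("textual", True), ("textual", False),
    ("time_series", False), ("time_series", True), ("time_series", False),
]
print(accuracy_by_category(results))
```

Reporting one blended accuracy number would hide exactly the weakness this benchmark surfaces; per-category scores make it visible.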