Can AI Really Beat Wall Street? Testing LLMs on Real Trading Decisions
Your LLM Can Read Balance Sheets, But Can't Read a Stock Chart
TL;DR: When researchers tested 14 LLMs on financial questions requiring both company fundamentals and trading signals, they found a surprising gap: retrieval helps models understand earnings reports, but barely helps them reason about price movements and market timing.
What It Is
Financial analysts need to think about two things at once: what a company's financial statements say (revenue, profit margins, debt levels) and what the market is doing (price momentum, volatility, trading patterns). Most AI benchmarks only test the first part.
FinTradeBench fills this gap with 1,400 questions across NASDAQ-100 companies spanning ten years. Questions fall into three categories: fundamentals-only (pure balance sheet reasoning), trading-signal-only (pure market behavior), and hybrid questions that require both. For example: "Is NVIDIA's July 2025 pullback a buying opportunity?" requires understanding both the company's financial health and what the price chart is actually showing.
The researchers tested models in two settings: zero-shot (just the question) and retrieval-augmented (giving the model access to relevant data first).
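The two settings amount to different prompt constructions. A minimal sketch of the contrast (the prompt templates, question, and retrieved snippets below are illustrative assumptions, not the benchmark's actual format):

```python
def zero_shot_prompt(question: str) -> str:
    # Setting 1: the model sees only the question.
    return f"Answer the following financial question.\n\nQ: {question}\nA:"

def rag_prompt(question: str, retrieved_docs: list[str]) -> str:
    # Setting 2: relevant data is retrieved and prepended as context.
    context = "\n".join(f"- {d}" for d in retrieved_docs)
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\nQ: {question}\nA:"
    )

question = "Is NVIDIA's July 2025 pullback a buying opportunity?"
docs = [
    "Most recent quarterly revenue grew sharply year over year.",   # fundamentals
    "Price fell from the July high on above-average volume.",       # trading signal
]

print(zero_shot_prompt(question))
print(rag_prompt(question, docs))
```

The benchmark's finding is that the extra context in the second prompt pays off mainly when the answer lives in the fundamentals snippet, not the trading-signal one.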
Why It Matters
- Retrieval doesn't fix everything. Adding context improved fundamentals reasoning by 37%, but barely helped (or even hurt) trading signal questions. If you're building financial copilots, don't assume RAG solves numerical reasoning.
- Time-series data is still a weak spot. LLMs struggle to interpret patterns in price movements, volatility, and momentum indicators: the kind of sequential numerical reasoning that matters for market timing decisions.
- Real financial decisions need both signals. When fundamentals and market behavior conflict (like Tesla rallying 20% despite missing earnings), models need to reason about why the disconnect exists. Current benchmarks don't test this, but real users need it.
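The fundamentals-versus-market conflict in the last bullet can be made concrete. Here is a hypothetical sketch (the thresholds and helper names are assumptions, not from the benchmark) of flagging the Tesla-style disconnect that models are asked to reason about:

```python
def fundamentals_signal(eps_actual: float, eps_estimate: float) -> str:
    # Crude fundamentals read: did the company beat or miss estimates?
    return "beat" if eps_actual >= eps_estimate else "miss"

def market_signal(price_change_pct: float, threshold: float = 5.0) -> str:
    # Crude market read: a large move in either direction, else flat.
    if price_change_pct >= threshold:
        return "rally"
    if price_change_pct <= -threshold:
        return "selloff"
    return "flat"

def disconnect(eps_actual: float, eps_estimate: float,
               price_change_pct: float) -> bool:
    # True when the two signals point in opposite directions --
    # exactly the case that requires cross-signal reasoning.
    f = fundamentals_signal(eps_actual, eps_estimate)
    m = market_signal(price_change_pct)
    return (f == "miss" and m == "rally") or (f == "beat" and m == "selloff")

# Tesla-style example from the post: earnings miss, yet a 20% rally.
print(disconnect(eps_actual=0.45, eps_estimate=0.60, price_change_pct=20.0))
```

A rule like this only detects the conflict; explaining *why* the disconnect exists is the part the post argues current models and benchmarks don't handle.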
One Thing to Try
If your LLM application handles financial data, test it separately on textual reasoning (extracting facts from reports) and on numerical time-series reasoning (identifying trends in price data). The second will likely fail more often. If it does, consider specialized prompting, fine-tuning, or even a separate model for time-series tasks, rather than assuming one general-purpose LLM handles both equally well.
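The split evaluation above can be sketched as a small harness that scores each category separately, so a gap shows up instead of being averaged away (the category labels and toy results below are hypothetical):

```python
from collections import defaultdict

def accuracy_by_category(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute per-category accuracy from (category, is_correct) pairs."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for category, ok in results:
        total[category] += 1
        correct[category] += int(ok)
    return {c: correct[c] / total[c] for c in total}

# Toy results illustrating the pattern: textual questions mostly pass,
# time-series questions mostly fail.
results = [
    ("textual", True), ("textual", True), ("textual", False),
    ("time_series", False), ("time_series", True), ("time_series", False),
]
print(accuracy_by_category(results))
```

Reporting one blended accuracy number would hide exactly the weakness this benchmark surfaces; per-category scores make it visible.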