[D] Tested model routing on financial AI datasets — good savings and curious what benchmarks others use.
Ran a benchmark evaluating whether prompt complexity-based routing delivers meaningful savings. Used public HuggingFace datasets. Here's what I found.
**Setup**
Baseline: Claude Opus for everything. Tested two strategies:
- Intra-provider — routes within the same provider by complexity: Simple → Haiku, Medium → Sonnet, Complex → Opus
- Flexible — medium prompts go to self-hosted Qwen 3.5 27B / Gemma 3 27B. Complex always stays on Opus
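The two strategies above can be sketched as a complexity score mapped to a model tier. This is a minimal illustrative sketch, not the benchmark's actual code — the scoring heuristic, thresholds, and model names are all assumptions:

```python
# Toy complexity scorer + the two routing tables described above.
# Thresholds and cue words are illustrative assumptions.

def complexity(prompt: str) -> str:
    """Classify a prompt as simple / medium / complex."""
    words = len(prompt.split())
    reasoning_cues = any(k in prompt.lower()
                         for k in ("explain", "derive", "compare", "implied"))
    if words < 40 and not reasoning_cues:
        return "simple"
    if words < 400:
        return "medium"
    return "complex"

# Strategy 1: stay within one provider's tiers.
INTRA_PROVIDER = {"simple": "claude-haiku",
                  "medium": "claude-sonnet",
                  "complex": "claude-opus"}

# Strategy 2: medium goes to a self-hosted OSS model; complex stays on Opus.
FLEXIBLE = {"simple": "claude-haiku",
            "medium": "qwen-27b-selfhosted",
            "complex": "claude-opus"}

def route(prompt: str, strategy: dict) -> str:
    return strategy[complexity(prompt)]
```

The key design point is that both strategies share the same scorer; only the medium tier's destination changes, which is why the complex-heavy ConvFinQA numbers diverge between the two columns below.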
**Datasets used**
All from AdaptLLM/finance-tasks on HuggingFace:
- FiQA-SA — financial tweet sentiment
- Financial Headlines — yes/no classification
- FPB — formal financial news sentiment
- ConvFinQA — multi-turn Q&A on real 10-K filings
**Results**
| Task | Intra-provider (Δ cost vs. Opus) | Flexible (OSS) (Δ cost vs. Opus) |
|---|---|---|
| FiQA Sentiment | -78% | -89% |
| Headlines | -57% | -71% |
| FPB Sentiment | -37% | -45% |
| ConvFinQA | -58% | -40% |
Blended average: ~60% savings.
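The post doesn't specify blend weights, but a plain unweighted mean over the table reproduces the ~60% figure (this is my reconstruction, not the author's stated methodology):

```python
# Savings percentages from the table (sign flipped: 78 means -78% cost).
intra = {"FiQA": 78, "Headlines": 57, "FPB": 37, "ConvFinQA": 58}
flexible = {"FiQA": 89, "Headlines": 71, "FPB": 45, "ConvFinQA": 40}

def mean(vals):
    vals = list(vals)
    return sum(vals) / len(vals)

intra_avg = mean(intra.values())        # 57.5
flexible_avg = mean(flexible.values())  # 61.25
blended = mean(list(intra.values()) + list(flexible.values()))  # 59.375 ≈ 60%
```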
**Most interesting finding**
ConvFinQA showed 58% intra-provider savings despite being a complex multi-turn QA dataset. The scorer correctly identified that many questions inside long 10-K documents are simple lookups even when the surrounding document is complex.
"What was operating cash flow in 2014?" → answer is in the table → Haiku
"What is the implied effective tax rate adjustment across three years?" → multi-step reasoning → Opus
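The finding above amounts to scoring the question itself rather than the surrounding 10-K. A hedged sketch of that idea — the cue words and model names are assumptions, chosen only to separate the two example questions:

```python
# Route per-question: a lookup question goes to the cheap model even when
# the document it sits in is long and complex. Purely illustrative.

def route_convfinqa(question: str) -> str:
    multi_step = any(cue in question.lower()
                     for cue in ("implied", "adjustment", "across", "change in"))
    return "claude-opus" if multi_step else "claude-haiku"
```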
**Caveats**
- Financial vertical only
- ECTSum transcripts at ~5K tokens scored complex every time — didn't route. Still tuning for long-form tasks
- Quality was verified on representative samples, not via a full automated eval
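The ECTSum caveat is the familiar failure mode where input length alone drags every prompt to "complex". One mitigation (a sketch under my own assumptions, not something the benchmark does) is to score the instruction separately from the attached context, escalating only for tasks that genuinely need the whole document:

```python
# Score the instruction on its own; a long transcript attached as context
# shouldn't force "complex" unless the task requires reading all of it.
# Thresholds and cue words are illustrative assumptions.

def base_score(text: str) -> str:
    words = len(text.split())
    if words < 40:
        return "simple"
    return "medium" if words < 400 else "complex"

def score_prompt(instruction: str, context: str) -> str:
    needs_full_context = any(k in instruction.lower()
                             for k in ("summarize", "summary"))
    if needs_full_context and len(context.split()) > 3000:
        return "complex"
    return base_score(instruction)
```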
What datasets do you use for evaluating task-specific LLM routing decisions? I'm specifically looking for benchmarks that span simple classification through complex multi-step reasoning.