[D] Tested model routing on financial AI datasets — good savings and curious what benchmarks others use.
Ran a benchmark evaluating whether prompt complexity-based routing delivers meaningful savings. Used public HuggingFace datasets. Here's what I found.
**Setup**
Baseline: Claude Opus for everything. Tested two strategies:
- Intra-provider — routes within the same provider by complexity: Simple → Haiku, Medium → Sonnet, Complex → Opus
- Flexible — medium prompts go to self-hosted Qwen 3.5 27B / Gemma 3 27B. Complex always stays on Opus
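The two strategies above can be sketched as a complexity score mapped to a model tier. This is a minimal illustrative sketch, not the benchmark's actual code — the scoring heuristic, thresholds, and model names are all assumptions:

```python
# Toy complexity scorer + the two routing tables described above.
# Thresholds and cue words are illustrative assumptions.

def complexity(prompt: str) -> str:
    """Classify a prompt as simple / medium / complex."""
    words = len(prompt.split())
    reasoning_cues = any(k in prompt.lower()
                         for k in ("explain", "derive", "compare", "implied"))
    if words < 40 and not reasoning_cues:
        return "simple"
    if words < 400:
        return "medium"
    return "complex"

# Strategy 1: stay within one provider's tiers.
INTRA_PROVIDER = {"simple": "claude-haiku",
                  "medium": "claude-sonnet",
                  "complex": "claude-opus"}

# Strategy 2: medium goes to a self-hosted OSS model; complex stays on Opus.
FLEXIBLE = {"simple": "claude-haiku",
            "medium": "qwen-27b-selfhosted",
            "complex": "claude-opus"}

def route(prompt: str, strategy: dict) -> str:
    return strategy[complexity(prompt)]
```

The key design point is that both strategies share the same scorer; only the medium tier's destination changes, which is why the complex-heavy ConvFinQA numbers diverge between the two columns below.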
**Datasets used**
All from AdaptLLM/finance-tasks on HuggingFace:
- FiQA-SA — financial tweet sentiment
- Financial Headlines — yes/no classification
- FPB — formal financial news sentiment
- ConvFinQA — multi-turn Q&A on real 10-K filings
**Results**
| Task | Intra-provider (Δ cost vs. Opus) | Flexible (OSS) (Δ cost vs. Opus) |
|---|---|---|
| FiQA Sentiment | -78% | -89% |
| Headlines | -57% | -71% |
| FPB Sentiment | -37% | -45% |
| ConvFinQA | -58% | -40% |
Blended average: ~60% savings.
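The post doesn't specify blend weights, but a plain unweighted mean over the table reproduces the ~60% figure (this is my reconstruction, not the author's stated methodology):

```python
# Savings percentages from the table (sign flipped: 78 means -78% cost).
intra = {"FiQA": 78, "Headlines": 57, "FPB": 37, "ConvFinQA": 58}
flexible = {"FiQA": 89, "Headlines": 71, "FPB": 45, "ConvFinQA": 40}

def mean(vals):
    vals = list(vals)
    return sum(vals) / len(vals)

intra_avg = mean(intra.values())        # 57.5
flexible_avg = mean(flexible.values())  # 61.25
blended = mean(list(intra.values()) + list(flexible.values()))  # 59.375 ≈ 60%
```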
**Most interesting finding**
ConvFinQA showed 58% intra-provider savings despite being a complex multi-turn QA dataset. The scorer correctly identified that many questions inside long 10-K documents are simple lookups even when the surrounding document is complex.
"What was operating cash flow in 2014?" → answer is in the table → Haiku
"What is the implied effective tax rate adjustment across three years?" → multi-step reasoning → Opus
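The finding above amounts to scoring the question itself rather than the surrounding 10-K. A hedged sketch of that idea — the cue words and model names are assumptions, chosen only to separate the two example questions:

```python
# Route per-question: a lookup question goes to the cheap model even when
# the document it sits in is long and complex. Purely illustrative.

def route_convfinqa(question: str) -> str:
    multi_step = any(cue in question.lower()
                     for cue in ("implied", "adjustment", "across", "change in"))
    return "claude-opus" if multi_step else "claude-haiku"
```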
**Caveats**
- Financial vertical only
- ECTSum transcripts at ~5K tokens scored complex every time — didn't route. Still tuning for long-form tasks
- Quality was verified on representative samples, not via a full automated eval
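The ECTSum caveat is the familiar failure mode where input length alone drags every prompt to "complex". One mitigation (a sketch under my own assumptions, not something the benchmark does) is to score the instruction separately from the attached context, escalating only for tasks that genuinely need the whole document:

```python
# Score the instruction on its own; a long transcript attached as context
# shouldn't force "complex" unless the task requires reading all of it.
# Thresholds and cue words are illustrative assumptions.

def base_score(text: str) -> str:
    words = len(text.split())
    if words < 40:
        return "simple"
    return "medium" if words < 400 else "complex"

def score_prompt(instruction: str, context: str) -> str:
    needs_full_context = any(k in instruction.lower()
                             for k in ("summarize", "summary"))
    if needs_full_context and len(context.split()) > 3000:
        return "complex"
    return base_score(instruction)
```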
What datasets do you use for evaluating task-specific LLM routing decisions? I'm specifically looking for benchmarks that span simple classification through complex multi-step reasoning.