Agent Architecture Benchmarks
Version 1.0 | 2026-01-29 | Simulated reasoning + empirical estimates
⚠️ Simulated Data: See Real Data Below
🆕 Real Benchmark Results Available!
We've now run real benchmarks with actual Gemini models on 33 test cases. The simulated figures in this document are estimates from our initial planning phase.
View the real benchmark results and deep analysis →
Quick Summary of Real Results:
- Style D (Workflow-First Hybrid): 87.9% intent accuracy, 3.5s latency, $0.00013/query ✅ Winner
- Style A (Direct Function Calling): 81.8% intent accuracy, 2.2s latency, $0.00012/query
- Style C (Multi-Agent Router): 72.7% intent accuracy, 15.2s latency, $0.00059/query
- Style B (ReAct Loop): 51.5% intent accuracy, 9.7s latency, $0.00400/query ❌ Failed
Key Insight: The real data validates our workflow-first recommendation but revealed important implementation gaps (JSON parsing errors, type coercion issues, timeout problems). Style D achieved 100% safety compliance — the only architecture to catch all dangerous trades.
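Code-enforced safety is deterministic: the guard runs in plain code before any transaction is signed, so a misbehaving model cannot talk its way past it, which is why it can reach 100% compliance where prompt-level rules failed. Below is a minimal sketch of what such a guard can look like; the `TradeRequest` fields, limits, and thresholds are illustrative assumptions, not our production rules.

```python
from dataclasses import dataclass

@dataclass
class TradeRequest:
    asset: str
    notional_usd: float
    leverage: float
    slippage_bps: int

# Illustrative limits; NOT the actual production thresholds.
MAX_NOTIONAL_USD = 50_000
MAX_LEVERAGE = 3.0
MAX_SLIPPAGE_BPS = 100

def check_trade(req: TradeRequest) -> list[str]:
    """Return a list of violations; an empty list means the trade may proceed.

    Runs before signing, in code, so compliance does not depend on the
    model choosing to obey a safety prompt.
    """
    violations: list[str] = []
    if req.notional_usd > MAX_NOTIONAL_USD:
        violations.append(f"notional ${req.notional_usd:,.0f} exceeds ${MAX_NOTIONAL_USD:,} cap")
    if req.leverage > MAX_LEVERAGE:
        violations.append(f"leverage {req.leverage}x exceeds {MAX_LEVERAGE}x cap")
    if req.slippage_bps > MAX_SLIPPAGE_BPS:
        violations.append(f"slippage {req.slippage_bps} bps exceeds {MAX_SLIPPAGE_BPS} bps cap")
    return violations
```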
📋 Simulated Benchmark Methodology (Original Planning Phase)
To objectively compare three agent architecture candidates, we simulated performance on a test suite of 20 user queries spanning simple to complex DeFi strategies.
Test Categories
- Simple Strategies (1-5): Lend, DCA, stake, grid bot, highest yield deposit
- Medium Strategies (6-12): Delta-neutral LP, arbitrage, leverage, liquidation monitoring, yield rotation
- Complex Strategies (13-17): Market-making, basis trade, trend-following, portfolio optimization, backtesting
- Edge Cases (18-20): Portfolio query, emergency liquidation close, explain APY drop
Evaluation Metrics
- Correctness: Does it produce the right strategy? (0-1 score)
- Latency: Time to first action (seconds)
- Cost: Total LLM spend in USD (derived from token usage)
- Steps: Number of LLM calls or tool invocations
- Robustness: Can it handle errors gracefully? (0-1 score)
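To make the scoring concrete, each (architecture, query) run can be recorded as one result row and averaged per architecture. A minimal sketch under the metric definitions above; the field names are ours, not a spec:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class QueryResult:
    query_id: int        # 1-20, per the test categories above
    correctness: float   # 0-1: did it produce the right strategy?
    latency_s: float     # seconds to first action
    cost_usd: float      # total LLM spend in USD
    steps: int           # LLM calls plus tool invocations
    robustness: float    # 0-1: graceful error handling

def aggregate(results: list[QueryResult]) -> dict[str, float]:
    """Average every metric across the suite for one architecture."""
    return {
        "avg_latency_s": mean(r.latency_s for r in results),
        "avg_cost_usd": mean(r.cost_usd for r in results),
        "correctness": mean(r.correctness for r in results),
        "robustness": mean(r.robustness for r in results),
    }
```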
📊 Benchmark Assumptions
LLM Performance (Production Anthropic API)
| Model | Avg Latency | Cost (Input / Output per MTok) |
|---|---|---|
| Haiku | 400ms | $0.25 / $1.25 |
| Sonnet | 800ms | $3 / $15 |
| Opus | 1200ms | $15 / $75 |
Tool Execution Times
- Simple queries (balance, price): 100ms
- Market data fetch: 300ms
- Simulation: 500ms
- Transaction execution: 2000ms (on-chain confirmation)
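Given these assumptions, a simulated run is just the sum of its LLM-call latencies and tool times, plus token-based cost. A sketch of this model; the per-call token counts are illustrative assumptions:

```python
# Latency (s) and $/MTok figures from the tables above.
MODELS = {
    "haiku":  {"latency_s": 0.4, "usd_in_mtok": 0.25, "usd_out_mtok": 1.25},
    "sonnet": {"latency_s": 0.8, "usd_in_mtok": 3.0,  "usd_out_mtok": 15.0},
    "opus":   {"latency_s": 1.2, "usd_in_mtok": 15.0, "usd_out_mtok": 75.0},
}
TOOL_S = {"simple_query": 0.1, "market_data": 0.3, "simulate": 0.5, "execute_tx": 2.0}

def llm_cost_usd(model: str, in_tok: int, out_tok: int) -> float:
    m = MODELS[model]
    return in_tok / 1e6 * m["usd_in_mtok"] + out_tok / 1e6 * m["usd_out_mtok"]

# Example: workflow-first handling of a simple lend request, assuming one
# Haiku classification (~1k input / 20 output tokens), a balance check,
# and one on-chain execution.
latency_s = MODELS["haiku"]["latency_s"] + TOOL_S["simple_query"] + TOOL_S["execute_tx"]
cost_usd = llm_cost_usd("haiku", 1_000, 20)
print(f"{latency_s:.1f}s, ${cost_usd:.4f}")  # 2.5s, $0.0003 under these token assumptions
```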
🔬 Detailed Test Case Examples
Test Case 1: "Lend my 1000 USDC on Aptos" (Simple)
| Architecture | Latency | Cost | Correctness | Notes |
|---|---|---|---|---|
| A: Orchestrator-Workers | 6.2s | $0.21 | 1.0 | 4 LLM calls (Opus + 3 Sonnet) |
| B: Monolithic Agent | 7.4s | $0.56 | 1.0 | 4 Opus reasoning steps |
| C: Workflow-First ⭐ | 3.0s | $0.0003 | 1.0 | 1 Haiku classifier → workflow |
Winner: Architecture C — 2x faster, 700x cheaper, same accuracy
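The "1 Haiku classifier → workflow" row is the entire mechanism behind that margin: one cheap classification call, then deterministic code with no further LLM involvement. A minimal sketch using the Anthropic Python SDK; the intent labels and prompt are illustrative, and the model id should be whatever Haiku alias you actually deploy:

```python
import anthropic

WORKFLOW_INTENTS = {"lend", "dca", "grid", "stake", "swap"}  # illustrative labels

client = anthropic.Anthropic()

def classify_intent(user_query: str) -> str:
    """Single cheap Haiku call: map a query to a workflow label, or 'agent'."""
    resp = client.messages.create(
        model="claude-3-5-haiku-latest",  # substitute your deployed Haiku model id
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": (
                f"Classify this DeFi request into one of {sorted(WORKFLOW_INTENTS)} "
                f"or 'agent'. Reply with the label only.\n\n{user_query}"
            ),
        }],
    )
    label = resp.content[0].text.strip().lower()
    return label if label in WORKFLOW_INTENTS else "agent"

# "Lend my 1000 USDC on Aptos" -> "lend": the lending workflow then runs
# as plain code, which is where the ~700x cost gap comes from.
```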
Test Case 6: "Delta-neutral LP on Uniswap v3" (Medium)
| Architecture | Latency | Cost | Correctness | Notes |
|---|---|---|---|---|
| A: Orchestrator-Workers ⭐ | 9.0s | $0.25 | 0.95 | Best balance for medium complexity |
| B: Monolithic Agent | 13.0s | $0.96 | 1.0 | Accurate but expensive |
| C: Workflow-First | 13.4s | $0.96 | 1.0 | Routes to agent (no workflow) |
Winner: Architecture A — Faster and cheaper while maintaining high accuracy
Test Case 13: "Custom market-making strategy" (Complex)
| Architecture | Latency | Cost | Correctness | Notes |
|---|---|---|---|---|
| A: Orchestrator-Workers | 8s | $0.25 | 0.6 | ❌ Struggles with novel strategies |
| B: Monolithic Agent ⭐ | 20s | $1.60 | 0.95 | Adaptive loop handles complexity |
| C: Workflow-First ⭐ | 20s | $1.60 | 0.95 | Falls back to agent |
Winner: B/C tied — only the full agent loop can handle this complexity
Test Case 19: "I'm getting liquidated! Close NOW" (Emergency)
| Architecture | Latency | Cost | Correctness | Notes |
|---|---|---|---|---|
| A: Orchestrator-Workers | 4.2s | $0.08 | 1.0 | Orchestrator → Executor shortcut |
| B: Monolithic Agent | 4.6s | $0.32 | 1.0 | Detects urgency in context |
| C: Workflow-First ⭐ | 2.6s | $0.0003 | 1.0 | Deterministic emergency handler |
Winner: Architecture C — Fastest response, critical in emergencies
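The deterministic emergency handler wins because nothing on the hot path waits for a model. A self-contained sketch of the idea; the keyword pre-filter and `Position` fields are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Position:
    asset: str
    health_factor: float  # below 1.0 means liquidatable

EMERGENCY_KEYWORDS = ("liquidat", "close now", "emergency")  # illustrative pre-filter

def pick_emergency_close(query: str, positions: list[Position]) -> Position | None:
    """If the query looks urgent, pick the position nearest liquidation for
    an immediate market close; otherwise return None and let the normal
    classifier route handle the query."""
    if any(k in query.lower() for k in EMERGENCY_KEYWORDS):
        return min(positions, key=lambda p: p.health_factor)
    return None

positions = [Position("ETH", 1.03), Position("APT", 1.80)]
target = pick_emergency_close("I'm getting liquidated! Close NOW", positions)
assert target is not None and target.asset == "ETH"
```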
📈 Aggregated Results (All 20 Test Cases)
Overall Performance Summary
| Metric | Architecture A (Orchestrator-Workers) | Architecture B (Monolithic Agent) | Architecture C ⭐ (Workflow-First Hybrid) |
|---|---|---|---|
| Avg Latency | 6.8s | 9.2s | 4.1s 🏆 |
| Avg Cost | $0.22 | $0.68 | $0.14 🏆 |
| Correctness | 0.91 | 0.97 🏆 | 0.94 |
| Robustness | 0.92 | 0.98 🏆 | 0.88 |
Performance by Task Category
| Category | Architecture A | Architecture B | Architecture C ⭐ |
|---|---|---|---|
| Simple (1-5) | Good, but overkill (6.1s, $0.20) | Slow & expensive (7.8s, $0.54) | Excellent 🏆 (3.2s, $0.0003; ~700x cheaper, 2x faster) |
| Medium (6-12) | Excellent 🏆 (8.2s, $0.24; best balance) | Good: accurate but costly (12.1s, $0.88) | Good (8.8s; $0.22 workflow / $0.89 agent fallback) |
| Complex (13-17) | Struggles ❌ (9.5s, $0.27; 0.72 correctness) | Excellent 🏆 (15.4s, $1.12; 0.96 correctness) | Excellent 🏆 (15.6s, $1.13; 0.96 correctness) |
| Edge Cases (18-20) | Good (4.8s, $0.12) | Excellent (6.1s, $0.45) | Excellent 🏆 (3.1s, $0.08; fastest, cheapest) |
🎯 Key Insights
Architecture A: The Specialist
Strengths
- ✅ Clear separation of concerns (easy to debug)
- ✅ Best for medium-complexity tasks
- ✅ Moderate cost (Sonnet specialists cheaper than Opus)
- ✅ Parallelization potential
Weaknesses
- ❌ Overkill for simple tasks
- ❌ Struggles with novel strategies
- ❌ Context loss between handoffs
Best For: Production with well-defined strategy types (80% of cases)
Architecture B: The Generalist
Strengths
- ✅ Highest accuracy (0.97) and robustness (0.98)
- ✅ Best for complex/novel strategies
- ✅ Full context maintained (no handoffs)
- ✅ Can handle unpredictable tasks
Weaknesses
- ❌ Most expensive ($0.68 avg, 5x more than C)
- ❌ Slowest (9.2s avg latency)
- ❌ Overkill for simple tasks
- ❌ Harder to debug
Best For: Research/exploration, power users, complex strategies
Architecture C: The Pragmatist ⭐ (WINNER)
Strengths
- ✅ Fastest (4.1s avg) — workflows run as plain code after one cheap Haiku classification, with no heavyweight LLM reasoning
- ✅ Cheapest ($0.14 avg) — LLM only when needed
- ✅ Best for simple tasks (3.2s, $0.0003)
- ✅ Best for emergencies (2.6s emergency position close)
- ✅ Falls back to agent for complex cases (B's flexibility when needed)
Weaknesses
- ❌ Code maintenance (workflows need updating)
- ❌ Lower robustness for workflows (0.88)
- ❌ Hybrid complexity (two code paths)
- ❌ Classifier mistakes can route queries down the wrong path
Best For: Production systems optimizing for cost and speed (SELECTED)
✅ Final Recommendation
Winner: Architecture C (Workflow-First Hybrid)
Rationale
Deploy Terminal serves a wide user base with diverse needs:
- 80% of requests are common strategies (DCA, lending, grid trading, etc.)
- 15% of requests are medium complexity (require some reasoning)
- 5% of requests are novel/complex (custom strategies, edge cases)
Architecture C optimizes for the common case while maintaining flexibility:
- Fast & cheap for 80% of requests (workflows)
- Falls back to powerful agent for 20% (Opus reasoning)
- Best user experience (low latency)
- Sustainable economics (low cost per user)
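In code, that shape is a thin dispatcher: run the workflow when the classifier is confident and a workflow exists, otherwise escalate to the agent. A sketch; the registry contents, confidence threshold, and function signatures are illustrative:

```python
from typing import Callable

# Hypothetical registry covering the 20 launch workflows listed below.
WORKFLOW_REGISTRY: dict[str, Callable[[str], str]] = {
    "lend": lambda q: "ran lending workflow",
    "dca": lambda q: "ran DCA workflow",
}
CONFIDENCE_THRESHOLD = 0.8  # illustrative; tune against the observed misroute rate

def dispatch(query: str, intent: str, confidence: float,
             agent: Callable[[str], str]) -> str:
    """Workflow-first dispatch with agent fallback."""
    workflow = WORKFLOW_REGISTRY.get(intent)
    if workflow is not None and confidence >= CONFIDENCE_THRESHOLD:
        return workflow(query)  # deterministic path, ~80% of traffic
    return agent(query)         # Opus reasoning loop, ~20% of traffic
```

Keeping the fallback cheap to reach also softens the classifier-misroute weakness noted above: a low-confidence classification costs one extra agent run rather than a wrong trade.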
Initial Workflow Coverage (Launch)
20 Predefined Workflows:
Core DeFi (1-5)
- Lending
- Dollar-Cost Averaging (DCA)
- Grid Trading
- Liquid Staking
- Yield Farming
Risk Management (6-10)
- Stop-Loss / Take-Profit
- Limit Orders
- Portfolio Rebalancing
- Collateral Top-Up
- Position Close (emergency)
Advanced (11-15)
- Yield Rotation
- Recursive Lending Loop
- Leverage Farming
- LP Provision
- Arbitrage (simple cross-DEX)
MEV & Governance (16-20)
- Sandwich Protection
- MEV Blocker Integration
- Airdrop Farming
- Governance Voting
- Token Swap (aggregated)
Agent-Only (Complex Strategies): Market Making, Custom Strategies, Multi-Protocol Composition, Novel DeFi Interactions
📊 Real Benchmark Results Comparison
The simulated data above was our initial planning estimate. Here's how the real benchmarks (33 test cases with actual Gemini models) compared. Note that the labels differ between the two studies: the real benchmark's Style D (Workflow-First Hybrid) corresponds to the simulated Architecture C.
Simulated vs Real: Key Differences
| Metric | Simulated Prediction | Real Result | Variance |
|---|---|---|---|
| Workflow-First Latency | 4.1s average | 3.5s average | ✅ 15% faster |
| Workflow-First Cost | $0.14/query | $0.00013/query | ✅ ~1000x cheaper (the simulation's cost unit was wrong) |
| Multi-Agent Accuracy | 94% expected | 72.7% achieved | ❌ Implementation bugs |
| ReAct Viability | Slow but viable | Catastrophic failure (51% accuracy) | ❌ Not production-ready |
| Safety Compliance | Not measured | Style D: 100%, Style B: 0% | ✅ Critical finding |
📝 Lessons Learned
- Simulations ≠ Reality: Our multi-agent implementation had JSON parsing, timeout, and symbol normalization bugs that simulations didn't capture.
- Safety is measurable: Real benchmarks revealed that only code-enforced safety rules (Style D) achieve 100% compliance. LLM-based safety (Style B) completely failed.
- Implementation quality matters more than architecture: Style C should have achieved 94% accuracy (per Chotu's benchmark) but our implementation only got 72.7%.
- Gemini 1.5 Flash surprised us: Direct function calling (Style A) achieved 81.8% accuracy — better than expected for a single-shot approach.
🔗 Read the full deep analysis with real benchmark data →