Agent Architecture Benchmarks

Version 1.0 | 2026-01-29 | Simulated reasoning + empirical estimates

⚠️ Simulated Data: See Real Results Below

🆕 Real Benchmark Results Available!

We've now run real benchmarks with actual Gemini models on 33 test cases. The data below represents simulated estimates from our initial planning phase.

View the real benchmark results and deep analysis →

Quick Summary of Real Results:

Key Insight: The real data validates our workflow-first recommendation but revealed important implementation gaps (JSON parsing errors, type coercion issues, timeout problems). Style D achieved 100% safety compliance — the only architecture to catch all dangerous trades.

📋 Simulated Benchmark Methodology (Original Planning Phase)

To objectively compare three agent architecture candidates, we simulated performance on a test suite of 20 user queries spanning simple to complex DeFi strategies.
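As a rough illustration of how such a comparison can be scored (not the actual harness), the sketch below assumes each architecture is a function that returns per-query latency, cost, and a graded correctness score, then averages them across the suite. All names (`RunResult`, `benchmark`) are illustrative.

```typescript
// Minimal shape of a scored comparison: run every test query through every
// architecture and average latency, cost, and correctness. All names are illustrative.
interface RunResult {
  latencyS: number;     // seconds per query
  costUsd: number;      // dollars per query
  correctness: number;  // graded 0..1
}

type Architecture = (query: string) => Promise<RunResult>;

async function benchmark(
  queries: string[],
  architectures: Record<string, Architecture>,
): Promise<Record<string, RunResult>> {
  const summary: Record<string, RunResult> = {};
  for (const [name, run] of Object.entries(architectures)) {
    const results = await Promise.all(queries.map((q) => run(q)));
    const avg = (pick: (r: RunResult) => number) =>
      results.reduce((sum, r) => sum + pick(r), 0) / results.length;
    summary[name] = {
      latencyS: avg((r) => r.latencyS),
      costUsd: avg((r) => r.costUsd),
      correctness: avg((r) => r.correctness),
    };
  }
  return summary; // e.g. { "C: Workflow-First": { latencyS: 4.1, costUsd: 0.14, correctness: 0.94 } }
}
```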

Test Categories

Evaluation Metrics

📊 Benchmark Assumptions

LLM Performance (Production Anthropic API)

| Model | Avg Latency | Cost (Input / Output per MTok) |
|---|---|---|
| Haiku | 400ms | $0.25 / $1.25 |
| Sonnet | 800ms | $3 / $15 |
| Opus | 1200ms | $15 / $75 |
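Per-call cost at these rates is simple arithmetic. The sketch below is illustrative; the token counts in the example are assumptions rather than measured values.

```typescript
// Rough per-call cost arithmetic using the rates in the table above.
const RATES_PER_MTOK = {
  haiku:  { input: 0.25, output: 1.25 },
  sonnet: { input: 3,    output: 15   },
  opus:   { input: 15,   output: 75   },
} as const;

function callCostUsd(
  model: keyof typeof RATES_PER_MTOK,
  inputTokens: number,
  outputTokens: number,
): number {
  const r = RATES_PER_MTOK[model];
  return (inputTokens / 1_000_000) * r.input + (outputTokens / 1_000_000) * r.output;
}

// Example: a 1,000-token prompt with a 200-token reply (assumed sizes).
console.log(callCostUsd("haiku", 1_000, 200).toFixed(6)); // ≈ $0.000500
console.log(callCostUsd("opus",  1_000, 200).toFixed(4)); // ≈ $0.0300
```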

Tool Execution Times

🔬 Detailed Test Case Examples

Test Case 1: "Lend my 1000 USDC on Aptos" (Simple)

| Architecture | Latency | Cost | Correctness | Notes |
|---|---|---|---|---|
| A: Orchestrator-Workers | 6.2s | $0.21 | 1.0 | 4 LLM calls (Opus + 3 Sonnet) |
| B: Monolithic Agent | 7.4s | $0.56 | 1.0 | 4 Opus reasoning steps |
| C: Workflow-First ⭐ | 3.0s | $0.0003 | 1.0 | 1 Haiku classifier → workflow |

Winner: Architecture C — 2x faster, 700x cheaper, same accuracy
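To make the "1 Haiku classifier → workflow" path concrete, here is a minimal sketch of what that single classification call could look like using the Anthropic TypeScript SDK (`@anthropic-ai/sdk`). The label set, prompt wording, and fallback behavior are illustrative assumptions, not Deploy Terminal's actual implementation.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Illustrative label set; the real workflow catalog appears later in this document.
const LABELS = ["lend", "dca", "close_position", "agent_fallback"] as const;
type Label = (typeof LABELS)[number];

// One cheap Haiku call maps the raw user query to a workflow label.
async function classifyIntent(query: string): Promise<Label> {
  const msg = await anthropic.messages.create({
    model: "claude-3-haiku-20240307",
    max_tokens: 16,
    system: `Classify this DeFi request as exactly one of: ${LABELS.join(", ")}. Reply with the label only.`,
    messages: [{ role: "user", content: query }],
  });
  const block = msg.content[0];
  const text = block.type === "text" ? block.text.trim() : "";
  // Anything unrecognized falls through to the agent path rather than guessing.
  return (LABELS as readonly string[]).includes(text) ? (text as Label) : "agent_fallback";
}
```

Falling back to the agent on any unrecognized label keeps misclassifications cheap to recover from, though a confidently wrong label can still route to the wrong workflow (a weakness noted for Architecture C later in this document).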

Test Case 6: "Delta-neutral LP on Uniswap v3" (Medium)

| Architecture | Latency | Cost | Correctness | Notes |
|---|---|---|---|---|
| A: Orchestrator-Workers ⭐ | 9.0s | $0.25 | 0.95 | Best balance for medium complexity |
| B: Monolithic Agent | 13.0s | $0.96 | 1.0 | Accurate but expensive |
| C: Workflow-First | 13.4s | $0.96 | 1.0 | Routes to agent (no workflow) |

Winner: Architecture A — Faster and cheaper while maintaining high accuracy

Test Case 13: "Custom market-making strategy" (Complex)

| Architecture | Latency | Cost | Correctness | Notes |
|---|---|---|---|---|
| A: Orchestrator-Workers | 8s | $0.25 | 0.6 | ❌ Struggles with novel strategies |
| B: Monolithic Agent ⭐ | 20s | $1.60 | 0.95 | Adaptive loop handles complexity |
| C: Workflow-First ⭐ | 20s | $1.60 | 0.95 | Falls back to agent |

Winner: B/C Tied — Only agent can handle this complexity

Test Case 19: "I'm getting liquidated! Close NOW" (Emergency)

| Architecture | Latency | Cost | Correctness | Notes |
|---|---|---|---|---|
| A: Orchestrator-Workers | 4.2s | $0.08 | 1.0 | Orchestrator → Executor shortcut |
| B: Monolithic Agent | 4.6s | $0.32 | 1.0 | Detects urgency in context |
| C: Workflow-First ⭐ | 2.6s | $0.0003 | 1.0 | Deterministic emergency handler |

Winner: Architecture C — Fastest response critical for emergencies
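"Deterministic emergency handler" here means the close path runs entirely in code, with no LLM call at all. A minimal sketch, with hypothetical position-client names (the real Deploy Terminal position API may differ):

```typescript
// Dependencies are injected so the handler stays deterministic and testable.
// All names here are illustrative, not a real SDK.
interface Position {
  id: string;
  healthFactor: number; // < 1.0 means liquidatable
}

interface PositionClient {
  getPosition(userId: string): Promise<Position>;
  submitCloseTx(positionId: string): Promise<string>; // resolves to a tx hash
}

// Emergency path: no LLM call, just a guard plus a transaction.
async function handleEmergencyClose(client: PositionClient, userId: string): Promise<string> {
  const position = await client.getPosition(userId);
  // Only treat the request as an emergency when the position is actually at risk.
  if (position.healthFactor > 1.2) {
    throw new Error("Position is not near liquidation; use the normal close flow.");
  }
  return client.submitCloseTx(position.id); // latency is bound by the chain, not by an LLM
}
```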

📈 Aggregated Results (All 20 Test Cases)

Overall Performance Summary

| Metric | Architecture A (Orchestrator-Workers) | Architecture B (Monolithic Agent) | Architecture C ⭐ (Workflow-First Hybrid) |
|---|---|---|---|
| Avg Latency | 6.8s | 9.2s | 4.1s 🏆 |
| Avg Cost | $0.22 | $0.68 | $0.14 🏆 |
| Correctness | 0.91 | 0.97 🏆 | 0.94 |
| Robustness | 0.92 | 0.98 🏆 | 0.88 |

Performance by Task Category

| Category | Architecture A | Architecture B | Architecture C ⭐ |
|---|---|---|---|
| Simple (1-5) | Good (but overkill)<br>6.1s, $0.20 | Slow & expensive<br>7.8s, $0.54 | Excellent 🏆<br>3.2s, $0.0003<br>(~700x cheaper, 2x faster) |
| Medium (6-12) | Excellent 🏆<br>8.2s, $0.24<br>(best balance) | Good (accurate but costly)<br>12.1s, $0.88 | Good<br>8.8s, $0.22 (workflow)<br>$0.89 (agent fallback) |
| Complex (13-17) | Struggles ❌<br>9.5s, $0.27<br>0.72 correctness | Excellent 🏆<br>15.4s, $1.12<br>0.96 correctness | Excellent 🏆<br>15.6s, $1.13<br>0.96 correctness |
| Edge Cases (18-20) | Good<br>4.8s, $0.12 | Excellent<br>6.1s, $0.45 | Excellent 🏆<br>3.1s, $0.08<br>(fastest, cheapest) |

🎯 Key Insights

Architecture A: The Specialist

Strengths

  • ✅ Clear separation of concerns (easy to debug)
  • ✅ Best for medium-complexity tasks
  • ✅ Moderate cost (Sonnet specialists cheaper than Opus)
  • ✅ Parallelization potential

Weaknesses

  • ❌ Overkill for simple tasks
  • ❌ Struggles with novel strategies
  • ❌ Context loss between handoffs

Best For: Production with well-defined strategy types (80% of cases)
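Concretely, the orchestrator-workers pattern amounts to one planning call, a fan-out to cheaper specialist calls, and a composition step. A minimal sketch with the Anthropic TypeScript SDK, illustrating the parallelization potential noted above; the prompts, model choices, and three-subtask cap are assumptions, not the production design.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Small helper: one text completion from a given model.
async function ask(model: string, prompt: string): Promise<string> {
  const msg = await anthropic.messages.create({
    model,
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });
  const block = msg.content[0];
  return block.type === "text" ? block.text : "";
}

// Orchestrator-workers sketch: Opus plans the subtasks, Sonnet workers run them
// in parallel, and Opus composes the final answer. Prompts are illustrative.
async function orchestrate(query: string): Promise<string> {
  const plan = await ask(
    "claude-3-opus-20240229",
    `Break this DeFi request into at most 3 independent subtasks, one per line:\n${query}`,
  );
  const subtasks = plan.split("\n").filter((line) => line.trim().length > 0).slice(0, 3);
  // Parallelization potential: the workers are independent, so they run concurrently.
  const workerResults = await Promise.all(
    subtasks.map((task) => ask("claude-3-5-sonnet-20240620", `Solve this subtask:\n${task}`)),
  );
  return ask(
    "claude-3-opus-20240229",
    `Combine these subtask results into one plan for "${query}":\n${workerResults.join("\n---\n")}`,
  );
}
```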

Architecture B: The Generalist

Strengths

  • ✅ Highest accuracy (0.97) and robustness (0.98)
  • ✅ Best for complex/novel strategies
  • ✅ Full context maintained (no handoffs)
  • ✅ Can handle unpredictable tasks

Weaknesses

  • ❌ Most expensive ($0.68 avg, 5x more than C)
  • ❌ Slowest (9.2s avg latency)
  • ❌ Overkill for simple tasks
  • ❌ Harder to debug

Best For: Research/exploration, power users, complex strategies
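The adaptive behavior behind the monolithic agent is a standard tool-use loop: call the model, execute whatever tools it asks for, feed the results back, and repeat until it produces a final answer. A simplified sketch with the Anthropic TypeScript SDK; the tool list and `executeTool` stub are placeholders, not Deploy Terminal's real tool schema.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Illustrative tool definition; the real tool schema will differ.
const tools: Anthropic.Tool[] = [
  {
    name: "get_pool_data",
    description: "Fetch current price, liquidity, and fees for a trading pair",
    input_schema: { type: "object", properties: { pair: { type: "string" } }, required: ["pair"] },
  },
];

// Hypothetical executor for tool calls; returns a JSON string for the model to read.
async function executeTool(name: string, input: unknown): Promise<string> {
  return JSON.stringify({ tool: name, input, result: "stubbed" });
}

// Minimal adaptive loop: keep calling Opus until it stops asking for tools.
async function runAgentLoop(query: string): Promise<string> {
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: query }];
  for (let step = 0; step < 10; step++) { // hard cap on reasoning steps
    const response = await anthropic.messages.create({
      model: "claude-3-opus-20240229",
      max_tokens: 1024,
      tools,
      messages,
    });
    if (response.stop_reason !== "tool_use") {
      const text = response.content.find((b) => b.type === "text");
      return text && text.type === "text" ? text.text : "";
    }
    // Execute every requested tool and feed the results back in.
    messages.push({ role: "assistant", content: response.content });
    const results = await Promise.all(
      response.content
        .filter((b): b is Anthropic.ToolUseBlock => b.type === "tool_use")
        .map(async (b) => ({
          type: "tool_result" as const,
          tool_use_id: b.id,
          content: await executeTool(b.name, b.input),
        })),
    );
    messages.push({ role: "user", content: results });
  }
  throw new Error("Agent loop exceeded step budget");
}
```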

Architecture C: The Pragmatist ⭐ (WINNER)

Strengths

  • ✅ Fastest (4.1s avg): workflows skip the LLM entirely
  • ✅ Cheapest ($0.14 avg): LLM used only when needed
  • ✅ Best for simple tasks (3.2s, $0.0003)
  • ✅ Best for emergencies (2.6s to close a position)
  • ✅ Falls back to agent for complex cases (B's flexibility when needed)

Weaknesses

  • ❌ Code maintenance (workflows need updating)
  • ❌ Lower robustness for workflows (0.88)
  • ❌ Hybrid complexity (two code paths)
  • ❌ Classifier mistakes route to wrong path

Best For: Production systems optimizing for cost and speed (SELECTED)
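Putting Architecture C together: classify first, run a deterministic workflow when one matches, and only fall back to the agent loop otherwise. The router below is a simplified illustration; `classify`, `workflows`, and `runAgentLoop` are injected placeholders rather than production interfaces.

```typescript
// Workflow-first routing sketch: deterministic workflows for known intents,
// a full agent loop only as a fallback.
type RouteResult = { path: "workflow" | "agent"; output: string };

async function route(
  query: string,
  classify: (q: string) => Promise<string>,                // e.g. the Haiku classifier sketched earlier
  workflows: Map<string, (q: string) => Promise<string>>,  // deterministic handlers, no LLM
  runAgentLoop: (q: string) => Promise<string>,            // expensive Opus-style agent
): Promise<RouteResult> {
  const label = await classify(query);
  const workflow = workflows.get(label);
  if (workflow) {
    return { path: "workflow", output: await workflow(query) }; // cheap, fast path
  }
  return { path: "agent", output: await runAgentLoop(query) };  // flexible, slower path
}
```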

✅ Final Recommendation

Winner: Architecture C (Workflow-First Hybrid)

Rationale

Deploy Terminal serves a wide user base with diverse needs:

Architecture C optimizes for the common case while maintaining flexibility:

Initial Workflow Coverage (Launch)

20 Predefined Workflows:

Core DeFi (1-5)

  1. Lending
  2. Dollar-Cost Averaging (DCA)
  3. Grid Trading
  4. Liquid Staking
  5. Yield Farming

Risk Management (6-10)

  6. Stop-Loss / Take-Profit
  7. Limit Orders
  8. Portfolio Rebalancing
  9. Collateral Top-Up
  10. Position Close (emergency)

Advanced (11-15)

  11. Yield Rotation
  12. Recursive Lending Loop
  13. Leverage Farming
  14. LP Provision
  15. Arbitrage (simple cross-DEX)

MEV & Governance (16-20)

  16. Sandwich Protection
  17. MEV Blocker Integration
  18. Airdrop Farming
  19. Governance Voting
  20. Token Swap (aggregated)

Agent-Only (Complex Strategies): Market Making, Custom Strategies, Multi-Protocol Composition, Novel DeFi Interactions
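Wiring the catalog above into the router sketched earlier could be as simple as a map from classifier labels to handlers. The entries and handler bodies below are placeholders for illustration only.

```typescript
// Partial sketch of the launch registry; workflow IDs mirror the catalog above,
// and handler bodies are placeholders rather than real protocol integrations.
const workflowRegistry = new Map<string, (query: string) => Promise<string>>([
  ["lend",           async (q) => `parse amount/asset from "${q}", then call the lending protocol`],
  ["dca",            async () => "schedule recurring swaps at a fixed interval"],
  ["close_position", async () => "run the deterministic emergency-close handler"],
  // ...the remaining workflows from the 20-item catalog register the same way
]);
```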

📊 Real Benchmark Results Comparison

The simulated data above was our initial planning estimate. Here's how the real benchmarks (33 test cases with actual Gemini models) compared:

Simulated vs Real: Key Differences

| Metric | Simulated Prediction | Real Result | Variance |
|---|---|---|---|
| Workflow-First Latency | 4.1s average | 3.5s average | ✅ 15% faster |
| Workflow-First Cost | $0.14/query | $0.00013/query | ✅ ~1000x cheaper (the simulated estimate used the wrong cost unit) |
| Multi-Agent Accuracy | 94% expected | 72.7% achieved | ❌ Implementation bugs |
| ReAct Viability | Slow but viable | Catastrophic failure (51% accuracy) | ❌ Not production-ready |
| Safety Compliance | Not measured | Style D: 100%, Style B: 0% | ✅ Critical finding |

📝 Lessons Learned

  • Simulations ≠ Reality: Our multi-agent implementation had JSON parsing, timeout, and symbol normalization bugs that simulations didn't capture.
  • Safety is measurable: Real benchmarks revealed that only code-enforced safety rules (Style D) achieve 100% compliance. LLM-based safety (Style B) completely failed. (A minimal sketch of a code-enforced rule follows this list.)
  • Implementation quality matters more than architecture: Style C should have achieved 94% accuracy (per Chotu's benchmark) but our implementation only got 72.7%.
  • Gemini 1.5 Flash surprised us: Direct function calling (Style A) achieved 81.8% accuracy — better than expected for a single-shot approach.
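In the spirit of the Style D lesson, a code-enforced safety rule is just a check that runs in code before any transaction is built, so the model never gets a chance to skip it. A minimal sketch; the thresholds and trade shape are made-up assumptions.

```typescript
// Code-enforced safety guard: rejects dangerous trades before execution,
// independent of anything the LLM says. Thresholds are illustrative.
interface TradeIntent {
  asset: string;
  amountUsd: number;
  leverage: number;
  slippageBps: number;
}

const MAX_LEVERAGE = 5;
const MAX_SLIPPAGE_BPS = 100; // 1%
const MAX_TRADE_USD = 50_000;

function assertTradeIsSafe(trade: TradeIntent): void {
  const violations: string[] = [];
  if (trade.leverage > MAX_LEVERAGE) violations.push(`leverage ${trade.leverage}x > ${MAX_LEVERAGE}x`);
  if (trade.slippageBps > MAX_SLIPPAGE_BPS) violations.push(`slippage ${trade.slippageBps}bps > ${MAX_SLIPPAGE_BPS}bps`);
  if (trade.amountUsd > MAX_TRADE_USD) violations.push(`size $${trade.amountUsd} > $${MAX_TRADE_USD}`);
  if (violations.length > 0) {
    // Reject before any transaction is built; no prompt can override this.
    throw new Error(`Trade blocked by safety rules: ${violations.join("; ")}`);
  }
}
```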

🔗 Read the full deep analysis with real benchmark data →

📚 Additional Resources