Agent Architecture Benchmarks
Version 1.0 | 2026-01-29 | Simulated reasoning + empirical estimates
⚠️ Simulated Data: See Real Data Below
🆕 Real Benchmark Results Available!
We've now run real benchmarks with actual Gemini models on 33 test cases. The simulated figures in this document are estimates from our initial planning phase.
View the real benchmark results and deep analysis →
Quick Summary of Real Results:
- Style D (Workflow-First Hybrid): 87.9% intent accuracy, 3.5s latency, $0.00013/query ✅ Winner
- Style A (Direct Function Calling): 81.8% intent accuracy, 2.2s latency, $0.00012/query
- Style C (Multi-Agent Router): 72.7% intent accuracy, 15.2s latency, $0.00059/query
- Style B (ReAct Loop): 51.5% intent accuracy, 9.7s latency, $0.00400/query ❌ Failed
Key Insight: The real data validates our workflow-first recommendation but revealed important implementation gaps (JSON parsing errors, type coercion issues, timeout problems). Style D achieved 100% safety compliance — the only architecture to catch all dangerous trades.
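Code-enforced safety is deterministic: the guard runs in plain code before any transaction is signed, so a misbehaving model cannot talk its way past it, which is why it can reach 100% compliance where prompt-level rules failed. Below is a minimal sketch of what such a guard can look like; the `TradeRequest` fields, limits, and thresholds are illustrative assumptions, not our production rules.

```python
from dataclasses import dataclass

@dataclass
class TradeRequest:
    asset: str
    notional_usd: float
    leverage: float
    slippage_bps: int

# Illustrative limits; NOT the actual production thresholds.
MAX_NOTIONAL_USD = 50_000
MAX_LEVERAGE = 3.0
MAX_SLIPPAGE_BPS = 100

def check_trade(req: TradeRequest) -> list[str]:
    """Return a list of violations; an empty list means the trade may proceed.

    Runs before signing, in code, so compliance does not depend on the
    model choosing to obey a safety prompt.
    """
    violations: list[str] = []
    if req.notional_usd > MAX_NOTIONAL_USD:
        violations.append(f"notional ${req.notional_usd:,.0f} exceeds ${MAX_NOTIONAL_USD:,} cap")
    if req.leverage > MAX_LEVERAGE:
        violations.append(f"leverage {req.leverage}x exceeds {MAX_LEVERAGE}x cap")
    if req.slippage_bps > MAX_SLIPPAGE_BPS:
        violations.append(f"slippage {req.slippage_bps} bps exceeds {MAX_SLIPPAGE_BPS} bps cap")
    return violations
```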
📋 Simulated Benchmark Methodology (Original Planning Phase)
To objectively compare three agent architecture candidates, we simulated performance on a test suite of 20 user queries spanning simple to complex DeFi strategies.
Test Categories
- Simple Strategies (1-5): Lend, DCA, stake, grid bot, highest yield deposit
- Medium Strategies (6-12): Delta-neutral LP, arbitrage, leverage, liquidation monitoring, yield rotation
- Complex Strategies (13-17): Market-making, basis trade, trend-following, portfolio optimization, backtesting
- Edge Cases (18-20): Portfolio query, emergency liquidation close, explain APY drop
Evaluation Metrics
- Correctness: Does it produce the right strategy? (0-1 score)
- Latency: Time to first action (seconds)
- Cost: Total LLM spend in USD (derived from token usage)
- Steps: Number of LLM calls or tool invocations
- Robustness: Can it handle errors gracefully? (0-1 score)
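To make the scoring concrete, each (architecture, query) run can be recorded as one result row and averaged per architecture. A minimal sketch under the metric definitions above; the field names are ours, not a spec:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class QueryResult:
    query_id: int        # 1-20, per the test categories above
    correctness: float   # 0-1: did it produce the right strategy?
    latency_s: float     # seconds to first action
    cost_usd: float      # total LLM spend in USD
    steps: int           # LLM calls plus tool invocations
    robustness: float    # 0-1: graceful error handling

def aggregate(results: list[QueryResult]) -> dict[str, float]:
    """Average every metric across the suite for one architecture."""
    return {
        "avg_latency_s": mean(r.latency_s for r in results),
        "avg_cost_usd": mean(r.cost_usd for r in results),
        "correctness": mean(r.correctness for r in results),
        "robustness": mean(r.robustness for r in results),
    }
```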
📊 Benchmark Assumptions
LLM Performance (Production Anthropic API)
| Model | Avg Latency | Cost (Input / Output per MTok) |
|---|---|---|
| Haiku | 400ms | $0.25 / $1.25 |
| Sonnet | 800ms | $3 / $15 |
| Opus | 1200ms | $15 / $75 |
Tool Execution Times
- Simple queries (balance, price): 100ms
- Market data fetch: 300ms
- Simulation: 500ms
- Transaction execution: 2000ms (on-chain confirmation)
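Given these assumptions, a simulated run is just the sum of its LLM-call latencies and tool times, plus token-based cost. A sketch of this model; the per-call token counts are illustrative assumptions:

```python
# Latency (s) and $/MTok figures from the tables above.
MODELS = {
    "haiku":  {"latency_s": 0.4, "usd_in_mtok": 0.25, "usd_out_mtok": 1.25},
    "sonnet": {"latency_s": 0.8, "usd_in_mtok": 3.0,  "usd_out_mtok": 15.0},
    "opus":   {"latency_s": 1.2, "usd_in_mtok": 15.0, "usd_out_mtok": 75.0},
}
TOOL_S = {"simple_query": 0.1, "market_data": 0.3, "simulate": 0.5, "execute_tx": 2.0}

def llm_cost_usd(model: str, in_tok: int, out_tok: int) -> float:
    m = MODELS[model]
    return in_tok / 1e6 * m["usd_in_mtok"] + out_tok / 1e6 * m["usd_out_mtok"]

# Example: workflow-first handling of a simple lend request, assuming one
# Haiku classification (~1k input / 20 output tokens), a balance check,
# and one on-chain execution.
latency_s = MODELS["haiku"]["latency_s"] + TOOL_S["simple_query"] + TOOL_S["execute_tx"]
cost_usd = llm_cost_usd("haiku", 1_000, 20)
print(f"{latency_s:.1f}s, ${cost_usd:.4f}")  # 2.5s, $0.0003 under these token assumptions
```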
🔬 Detailed Test Case Examples
Test Case 1: "Lend my 1000 USDC on Aptos" (Simple)
| Architecture | Latency | Cost | Correctness | Notes |
|---|---|---|---|---|
| A: Orchestrator-Workers | 6.2s | $0.21 | 1.0 | 4 LLM calls (Opus + 3 Sonnet) |
| B: Monolithic Agent | 7.4s | $0.56 | 1.0 | 4 Opus reasoning steps |
| C: Workflow-First ⭐ | 3.0s | $0.0003 | 1.0 | 1 Haiku classifier → workflow |
Winner: Architecture C — 2x faster, 700x cheaper, same accuracy
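The "1 Haiku classifier → workflow" row is the entire mechanism behind that margin: one cheap classification call, then deterministic code with no further LLM involvement. A minimal sketch using the Anthropic Python SDK; the intent labels and prompt are illustrative, and the model id should be whatever Haiku alias you actually deploy:

```python
import anthropic

WORKFLOW_INTENTS = {"lend", "dca", "grid", "stake", "swap"}  # illustrative labels

client = anthropic.Anthropic()

def classify_intent(user_query: str) -> str:
    """Single cheap Haiku call: map a query to a workflow label, or 'agent'."""
    resp = client.messages.create(
        model="claude-3-5-haiku-latest",  # substitute your deployed Haiku model id
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": (
                f"Classify this DeFi request into one of {sorted(WORKFLOW_INTENTS)} "
                f"or 'agent'. Reply with the label only.\n\n{user_query}"
            ),
        }],
    )
    label = resp.content[0].text.strip().lower()
    return label if label in WORKFLOW_INTENTS else "agent"

# "Lend my 1000 USDC on Aptos" -> "lend": the lending workflow then runs
# as plain code, which is where the ~700x cost gap comes from.
```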
Test Case 6: "Delta-neutral LP on Uniswap v3" (Medium)
| Architecture | Latency | Cost | Correctness | Notes |
|---|---|---|---|---|
| A: Orchestrator-Workers ⭐ | 9.0s | $0.25 | 0.95 | Best balance for medium complexity |
| B: Monolithic Agent | 13.0s | $0.96 | 1.0 | Accurate but expensive |
| C: Workflow-First | 13.4s | $0.96 | 1.0 | Routes to agent (no workflow) |
Winner: Architecture A — Faster and cheaper while maintaining high accuracy
Test Case 13: "Custom market-making strategy" (Complex)
| Architecture | Latency | Cost | Correctness | Notes |
|---|---|---|---|---|
| A: Orchestrator-Workers | 8s | $0.25 | 0.6 | ❌ Struggles with novel strategies |
| B: Monolithic Agent ⭐ | 20s | $1.60 | 0.95 | Adaptive loop handles complexity |
| C: Workflow-First ⭐ | 20s | $1.60 | 0.95 | Falls back to agent |
Winner: B/C tied — only the full agent loop can handle this complexity
Test Case 19: "I'm getting liquidated! Close NOW" (Emergency)
| Architecture | Latency | Cost | Correctness | Notes |
|---|---|---|---|---|
| A: Orchestrator-Workers | 4.2s | $0.08 | 1.0 | Orchestrator → Executor shortcut |
| B: Monolithic Agent | 4.6s | $0.32 | 1.0 | Detects urgency in context |
| C: Workflow-First ⭐ | 2.6s | $0.0003 | 1.0 | Deterministic emergency handler |
Winner: Architecture C — Fastest response, critical in emergencies
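The deterministic emergency handler wins because nothing on the hot path waits for a model. A self-contained sketch of the idea; the keyword pre-filter and `Position` fields are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Position:
    asset: str
    health_factor: float  # below 1.0 means liquidatable

EMERGENCY_KEYWORDS = ("liquidat", "close now", "emergency")  # illustrative pre-filter

def pick_emergency_close(query: str, positions: list[Position]) -> Position | None:
    """If the query looks urgent, pick the position nearest liquidation for
    an immediate market close; otherwise return None and let the normal
    classifier route handle the query."""
    if any(k in query.lower() for k in EMERGENCY_KEYWORDS):
        return min(positions, key=lambda p: p.health_factor)
    return None

positions = [Position("ETH", 1.03), Position("APT", 1.80)]
target = pick_emergency_close("I'm getting liquidated! Close NOW", positions)
assert target is not None and target.asset == "ETH"
```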
📈 Aggregated Results (All 20 Test Cases)
Overall Performance Summary
| Metric | Architecture A (Orchestrator-Workers) | Architecture B (Monolithic Agent) | Architecture C ⭐ (Workflow-First Hybrid) |
|---|---|---|---|
| Avg Latency | 6.8s | 9.2s | 4.1s 🏆 |
| Avg Cost | $0.22 | $0.68 | $0.14 🏆 |
| Correctness | 0.91 | 0.97 🏆 | 0.94 |
| Robustness | 0.92 | 0.98 🏆 | 0.88 |
Performance by Task Category
| Category | Architecture A | Architecture B | Architecture C ⭐ |
|---|---|---|---|
| Simple (1-5) | Good, but overkill (6.1s, $0.20) | Slow & expensive (7.8s, $0.54) | Excellent 🏆 (3.2s, $0.0003; ~700x cheaper, 2x faster) |
| Medium (6-12) | Excellent 🏆 (8.2s, $0.24; best balance) | Good: accurate but costly (12.1s, $0.88) | Good (8.8s; $0.22 workflow / $0.89 agent fallback) |
| Complex (13-17) | Struggles ❌ (9.5s, $0.27; 0.72 correctness) | Excellent 🏆 (15.4s, $1.12; 0.96 correctness) | Excellent 🏆 (15.6s, $1.13; 0.96 correctness) |
| Edge Cases (18-20) | Good (4.8s, $0.12) | Excellent (6.1s, $0.45) | Excellent 🏆 (3.1s, $0.08; fastest, cheapest) |
🎯 Key Insights
Architecture A: The Specialist
Strengths
- ✅ Clear separation of concerns (easy to debug)
- ✅ Best for medium-complexity tasks
- ✅ Moderate cost (Sonnet specialists cheaper than Opus)
- ✅ Parallelization potential
Weaknesses
- ❌ Overkill for simple tasks
- ❌ Struggles with novel strategies
- ❌ Context loss between handoffs
Best For: Production with well-defined strategy types (80% of cases)
Architecture B: The Generalist
Strengths
- ✅ Highest accuracy (0.97) and robustness (0.98)
- ✅ Best for complex/novel strategies
- ✅ Full context maintained (no handoffs)
- ✅ Can handle unpredictable tasks
Weaknesses
- ❌ Most expensive ($0.68 avg, 5x more than C)
- ❌ Slowest (9.2s avg latency)
- ❌ Overkill for simple tasks
- ❌ Harder to debug
Best For: Research/exploration, power users, complex strategies
Architecture C: The Pragmatist ⭐ (WINNER)
Strengths
- ✅ Fastest (4.1s avg) — workflows run as plain code after one cheap Haiku classification, with no heavyweight LLM reasoning
- ✅ Cheapest ($0.14 avg) — LLM only when needed
- ✅ Best for simple tasks (3.2s, $0.0003)
- ✅ Best for emergencies (2.6s emergency position close)
- ✅ Falls back to agent for complex cases (B's flexibility when needed)
Weaknesses
- ❌ Code maintenance (workflows need updating)
- ❌ Lower robustness for workflows (0.88)
- ❌ Hybrid complexity (two code paths)
- ❌ Classifier mistakes can route queries down the wrong path
Best For: Production systems optimizing for cost and speed (SELECTED)
✅ Final Recommendation
Winner: Architecture C (Workflow-First Hybrid)
Rationale
Deploy Terminal serves a wide user base with diverse needs:
- 80% of requests are common strategies (DCA, lending, grid trading, etc.)
- 15% of requests are medium complexity (require some reasoning)
- 5% of requests are novel/complex (custom strategies, edge cases)
Architecture C optimizes for the common case while maintaining flexibility:
- Fast & cheap for 80% of requests (workflows)
- Falls back to powerful agent for 20% (Opus reasoning)
- Best user experience (low latency)
- Sustainable economics (low cost per user)
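In code, that shape is a thin dispatcher: run the workflow when the classifier is confident and a workflow exists, otherwise escalate to the agent. A sketch; the registry contents, confidence threshold, and function signatures are illustrative:

```python
from typing import Callable

# Hypothetical registry covering the 20 launch workflows listed below.
WORKFLOW_REGISTRY: dict[str, Callable[[str], str]] = {
    "lend": lambda q: "ran lending workflow",
    "dca": lambda q: "ran DCA workflow",
}
CONFIDENCE_THRESHOLD = 0.8  # illustrative; tune against the observed misroute rate

def dispatch(query: str, intent: str, confidence: float,
             agent: Callable[[str], str]) -> str:
    """Workflow-first dispatch with agent fallback."""
    workflow = WORKFLOW_REGISTRY.get(intent)
    if workflow is not None and confidence >= CONFIDENCE_THRESHOLD:
        return workflow(query)  # deterministic path, ~80% of traffic
    return agent(query)         # Opus reasoning loop, ~20% of traffic
```

Keeping the fallback cheap to reach also softens the classifier-misroute weakness noted above: a low-confidence classification costs one extra agent run rather than a wrong trade.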
Initial Workflow Coverage (Launch)
20 Predefined Workflows:
Core DeFi (1-5)
- Lending
- Dollar-Cost Averaging (DCA)
- Grid Trading
- Liquid Staking
- Yield Farming
Risk Management (6-10)
- Stop-Loss / Take-Profit
- Limit Orders
- Portfolio Rebalancing
- Collateral Top-Up
- Position Close (emergency)
Advanced (11-15)
- Yield Rotation
- Recursive Lending Loop
- Leverage Farming
- LP Provision
- Arbitrage (simple cross-DEX)
MEV & Governance (16-20)
- Sandwich Protection
- MEV Blocker Integration
- Airdrop Farming
- Governance Voting
- Token Swap (aggregated)
Agent-Only (Complex Strategies): Market Making, Custom Strategies, Multi-Protocol Composition, Novel DeFi Interactions
📊 Real Benchmark Results Comparison
The simulated data above was our initial planning estimate. Here's how the real benchmarks (33 test cases with actual Gemini models) compared. Note that the labels differ between the two studies: the real benchmark's Style D (Workflow-First Hybrid) corresponds to the simulated Architecture C.
Simulated vs Real: Key Differences
| Metric | Simulated Prediction | Real Result | Variance |
|---|---|---|---|
| Workflow-First Latency | 4.1s average | 3.5s average | ✅ 15% faster |
| Workflow-First Cost | $0.14/query | $0.00013/query | ✅ ~1000x cheaper (the simulation's cost unit was wrong) |
| Multi-Agent Accuracy | 94% expected | 72.7% achieved | ❌ Implementation bugs |
| ReAct Viability | Slow but viable | Catastrophic failure (51% accuracy) | ❌ Not production-ready |
| Safety Compliance | Not measured | Style D: 100%, Style B: 0% | ✅ Critical finding |
📝 Lessons Learned
- Simulations ≠ Reality: Our multi-agent implementation had JSON parsing, timeout, and symbol normalization bugs that simulations didn't capture.
- Safety is measurable: Real benchmarks revealed that only code-enforced safety rules (Style D) achieve 100% compliance. LLM-based safety (Style B) completely failed.
- Implementation quality matters more than architecture: Style C should have achieved 94% accuracy (per Chotu's benchmark) but our implementation only got 72.7%.
- Gemini 1.5 Flash surprised us: Direct function calling (Style A) achieved 81.8% accuracy — better than expected for a single-shot approach.
🔗 Read the full deep analysis with real benchmark data →