Frontier Performance at 60% Lower Cost Using Agents
The Results
GAIA — single-shot, multi-step reasoning with tools: 74.55% overall, ties for #1 (as of 3/12/26) on the HAL Princeton leaderboard alongside Claude Sonnet 4.5, at 61% lower cost.
τ-2 Telecom — multi-turn conversational support with dual-control tool use: 80.7% overall, outperforming o3 (+22.5pp), GPT-4o (+57pp), and landing just 2.3 points below Claude Haiku 4.5.
No single frontier model powers the Vector Agent. Instead, it orchestrates open-source and lightweight models in a planner-worker architecture, each assigned to the role that best matches its strengths.
GAIA (HAL Princeton Leaderboard)
| | Vector Agent | HAL + Sonnet 4.5 | HAL + Opus 4.1 High | HAL + GPT-5 Medium |
|---|---|---|---|---|
| Overall | 74.55% | 74.55% | 68.48% | 62.80% |
| Level 1 | 88.68% | 82.07% | 71.70% | 73.58% |
| Level 2 | 74.42% | 72.68% | 70.93% | 62.79% |
| Level 3 | 46.15% | 65.39% | 53.85% | 38.46% |
| Cost | $69.93 | $178.20 | $562.24 | $359.83 |
τ-2 Telecom (Selected Comparisons)
| | Vector Agent | Claude Haiku 4.5 | o3 | Kimi K2 | GPT-4o |
|---|---|---|---|---|---|
| Telecom | 80.7% | 83.0% | 58.2% | 65.8% | 23.5% |
| Cost | $36.95 | ~$30-50 | ~$200+ | — | ~$20-40 |
The Vector Agent achieved the highest Level 1 score on the entire GAIA leaderboard, is Pareto optimal on cost-accuracy, and demonstrates that the planner-worker pattern generalizes across fundamentally different task types.
Why This Matters
There's a prevailing assumption in the AI community: to get frontier-level performance on agentic benchmarks, you need frontier-level models. Our result challenges that assumption.
GAIA (General AI Assistants) is one of the most respected agentic benchmarks in the field. Tasks require multi-step reasoning, web browsing, file analysis, and tool use — the kind of real-world problem-solving that separates capable agents from chat completions. The benchmark is divided into three difficulty levels, with Level 3 tasks requiring deep multi-hop reasoning and complex tool orchestration.
The standard approach on the leaderboard is to pair a single powerful model (Sonnet 4.5, Opus 4.1, GPT-5) with a general-purpose agent scaffold. We took a different approach: decompose the problem into roles, and assign the most cost-effective model to each role.
The key insight is that planning and execution require different capabilities. A planner needs to understand task structure and dependencies. A worker needs to reliably execute a well-defined subtask with tools. A plan reviser needs to diagnose failures and propose alternatives. These are different cognitive profiles, and there's no reason a single model has to do all of them.
Architecture
Model Roles
Gemini 3 Flash Preview — Planner ($9.16) Receives the user question and generates a step-by-step plan as a directed acyclic graph (DAG). Each step specifies the worker type needed, the instruction, expected output facts, and dependencies on prior steps. Gemini Flash is fast and cheap, and planning doesn't require deep domain reasoning — it requires structural thinking about task decomposition.
GLM-4.7 — Primary Worker ($47.67) Handles the bulk of execution: reasoning through subtasks, calling tools (web search, code execution, file analysis), and producing output facts. GLM-4.7 processed 113.5M prompt tokens across 6,738 calls — the workhorse of the system. It's strong on tool use and instruction following, which is exactly what a worker needs.
Qwen 3.5 — Secondary Worker ($12.50) Handles 1,199 calls as a secondary execution path. The routing between GLM-4.7 and Qwen 3.5 is managed by a local router model that assigns tasks based on type and complexity.
GPT-OSS 120B — Plan Reviser ($0.59) Only activated when a worker step is blocked or fails. Analyzes what went wrong and generates a revised plan. Used sparingly (428 calls) but critical for recovery on harder tasks. This is what keeps Level 2 and Level 3 scores competitive — instead of failing on blocked steps, the system adapts.
Key Design Decisions
DAG-based execution. Plans aren't flat lists — they're dependency graphs. Steps execute in topological order, and a step only runs when its dependencies have completed. This naturally handles multi-hop reasoning where later steps depend on earlier results.
Output fact validation. Workers can't claim completion without producing the expected output facts specified in the plan. This prevents hallucinated "I found the answer" responses from propagating through the pipeline.
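The completion check can be sketched as follows (fact names here are hypothetical examples, not from the benchmark):

```python
def validate_output(expected_facts, worker_output):
    """A step only counts as complete when every output fact named
    in the plan is present and non-empty in the worker's result."""
    missing = [name for name in expected_facts
               if not worker_output.get(name)]
    return len(missing) == 0, missing

# A worker claiming success without the facts is caught:
ok, missing = validate_output(["album_title", "release_year"],
                              {"album_title": "Abbey Road"})
# ok is False; missing names the unproduced fact.
```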
Iterative plan revision. When a step fails, the system doesn't just retry — it revises the plan. This is cheap ($0.59 total) but high-impact, especially on medium and hard tasks where the initial decomposition may not account for all edge cases.
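The revise-rather-than-retry loop might look like this sketch, with planner, executor, and reviser passed in as callables (the names and the `StepFailed` exception are assumptions for illustration):

```python
class StepFailed(Exception):
    """Raised by the executor when a plan step is blocked."""
    def __init__(self, step_id, reason):
        super().__init__(f"{step_id}: {reason}")
        self.step_id, self.reason = step_id, reason

def solve(question, make_plan, execute, revise, max_revisions=3):
    """On a blocked step, ask the (cheap) reviser for a new plan
    instead of re-running the same failing step."""
    plan = make_plan(question)
    for _ in range(max_revisions):
        try:
            return execute(plan)
        except StepFailed as failure:
            plan = revise(plan, failure)
    return execute(plan)  # final attempt; failure propagates to caller
```

The reviser sees the concrete failure (which step, why), so its replacement plan can route around the dead end rather than repeat it.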
Local routing and judging. The router and judge run on local models, keeping latency low and avoiding unnecessary API costs for decisions that don't require frontier-level reasoning.
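As a toy illustration only (the real router is a local model, and these thresholds and step types are invented for the sketch), the routing decision has roughly this shape:

```python
def route(step_type, complexity):
    """Hypothetical routing policy: heavy tool-use or complex steps
    go to the primary worker, lighter subtasks to the secondary."""
    if step_type in {"web_search", "code_execution"} or complexity >= 0.7:
        return "glm-4.7"   # primary worker
    return "qwen-3.5"      # secondary worker
```

The point is that this decision is cheap and local; it never needs a frontier-model API call.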
Where It Works — and Where It Doesn't
Level 1: Best on the leaderboard (88.68%)
Level 1 tasks are "breakable by very good LLMs" — they require multi-step reasoning and tool use, but the steps are relatively straightforward. The planner-worker architecture excels here because clean task decomposition leads to reliable execution. When the plan is right, the workers rarely fail.
The 88.68% score is 6.6 points above the next best entry (Claude Sonnet 4.5 at 82.07%). This suggests that architectural advantages — explicit planning, DAG execution, output validation — matter more than raw model capability on well-structured tasks.
Level 2: Competitive (74.42%)
Level 2 tasks are harder and often require recovering from dead ends. This is where iterative plan revision earns its keep. When a web search returns nothing useful or a file isn't in the expected format, the system diagnoses the failure and tries a different approach.
The 74.42% score ties with Claude Sonnet 4.5 High and exceeds all other entries. The gap between Level 1 and Level 2 performance (14.3 points) is smaller than most other agents on the leaderboard, suggesting the revision system helps maintain performance as difficulty increases.
Level 3: The ceiling (46.15%)
Level 3 tasks require deep multi-hop reasoning and complex coordination. This is where the architecture hits its limits. Claude Sonnet 4.5 scores 65.39% on Level 3 — 19 points higher.
The gap likely comes from two sources. First, the individual reasoning depth of GLM-4.7 and Qwen 3.5 on the hardest chains is lower than Sonnet 4.5's extended thinking capability. Second, some Level 3 tasks may resist decomposition — they need a single deep reasoning pass rather than a series of tool-assisted steps.
This is our primary area for future work. Potential approaches include using reasoning-capable models (like DeepSeek R1) as specialized workers for the hardest subtasks, increasing the depth of plan revision, and improving the planner's ability to recognize when a task should not be decomposed.
τ-2 Telecom: A Different Kind of Challenge
GAIA tests single-shot, multi-step problem solving. τ-2 Telecom tests something fundamentally different: can an agent sustain a multi-turn conversation while coordinating actions with a user on a shared environment?
Each τ-2 task simulates a technical support call. The agent and a simulated customer both have access to tools that modify a shared telecom environment — the agent can look up accounts, refuel data, and enable roaming, while the user can toggle airplane mode, grant app permissions, and restart their device. Success requires not just solving the problem, but guiding the user through their part of the solution.
The Vector Agent scored 80.7% across 114 tasks, with an average of 76 messages per conversation.
Adapting the Architecture
GAIA tasks are fire-and-forget: the agent receives a question, works autonomously, and returns an answer. τ-2 is fundamentally interactive — the agent must converse with a user, ask them to perform actions, wait for results, and adjust. This required two key adaptations to the framework.
Deferred tool calling. In GAIA, workers execute tools immediately and return results. In τ-2, some actions can't be completed in a single turn — the agent needs to ask the user to do something (toggle airplane mode, grant a permission), then wait for their response before continuing. We introduced deferred tool calls: when a plan step requires user action, the worker generates the instruction for the user and yields control back to the conversation loop. The plan state is persisted and resumed when the user responds.
Ask-user tool. The planner-worker system was designed around tool calls that interact with APIs and data sources. τ-2 required a new kind of tool: one that interacts with the user. We added an ask_user tool that workers can invoke to request information or actions from the customer. This fits naturally into the DAG — a step can depend on user input the same way it depends on a search result or a computation.
These adaptations didn't require changes to the core planner-worker loop. The DAG execution engine already supported dependencies and state persistence — deferred tool calling just added a new type of dependency (waiting on user response), and ask_user was just another tool in the worker's toolkit. The fact that these extensions were straightforward suggests the architecture is more general than the single-shot GAIA use case it was originally designed for.
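The two adaptations together can be sketched as a conversation loop (the step fields, `ask_user` marker, and callables here are illustrative assumptions, not the actual interface): when a step needs the customer to act, the worker emits an instruction instead of a tool result, the loop hands control back to the conversation, and the saved plan resumes with the user's reply recorded as that step's result.

```python
def run_conversation(plan, execute_step, send_to_user, get_user_reply):
    """Drive a plan where some steps defer to the user: an ask_user
    outcome yields control to the conversation, and the persisted
    plan resumes once the user responds."""
    pending = list(plan)      # persisted plan state
    results = {}
    while pending:
        step = pending[0]
        outcome = execute_step(step, results)
        if outcome.get("tool") == "ask_user":
            # Deferred tool call: instruct the user, wait, then treat
            # their reply as this step's output.
            send_to_user(outcome["message"])
            results[step["id"]] = {"user_reply": get_user_reply()}
        else:
            results[step["id"]] = outcome
        pending.pop(0)
    return results
```

A user-dependent step is just another dependency edge in the DAG; nothing downstream runs until the reply arrives.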
How the Architecture Adapts
The same planner-worker system powers both GAIA and τ-2, but the conversation-driven nature of τ-2 activates a different path through the architecture. In GAIA, every task is complex enough to escalate immediately to the planner. In τ-2, many conversational turns are routine — greeting the customer, asking clarifying questions, confirming actions. These are handled by a StandardChat model (gpt-oss:20b) at low cost, with escalation to the planner only when real task decomposition is needed.
This means the model mix shifts naturally by task type:
| | GAIA | τ-2 Telecom |
|---|---|---|
| Primary worker | GLM-4.7 (6,738 calls) | gpt-oss:20b (4,142 calls) |
| Planner | Gemini Flash (930 calls) | Gemini Flash (3,379 calls) |
| Secondary worker | Qwen 3.5 (1,199 calls) | GLM-4.7 (1,590 calls) |
| Total cost | $69.93 | $36.95 |
The planner fires more often in τ-2 (3,379 vs 930) because each conversation involves multiple planning decisions as the troubleshooting unfolds and the user reports results.
Where It Works
Service issues (82.8%) — Airplane mode, SIM card problems, APN settings. The agent handles these well, with clean plan-execute-verify loops.
Mobile data issues (80.6%) — Roaming, data caps, VPN configuration. Most failures involve compound scenarios requiring data refueling combined with other fixes.
MMS issues (79.6%) — The most complex category, requiring layered troubleshooting across APN settings, app permissions, network mode, and Wi-Fi calling. Failures typically involve 4+ simultaneous issues.
Where It Doesn't
The failure analysis revealed a concentrated pattern: 14 of 22 failures (64%) involved the same root cause — data refueling not completed. The agent would successfully troubleshoot device settings, network configuration, and permissions, but miss the final step of refueling the customer's data balance.
This is a targetable fix — either through a planner-level checklist ("verify all account-level actions completed") or a post-execution validation step. Fixing even half of these would push accuracy to ~87%, above Claude Haiku 4.5.
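The arithmetic behind that estimate, using the 114-task total and failure counts above:

```python
total = 114
correct = round(0.807 * total)   # 92 tasks passed
refuel_failures = 14
fixed = refuel_failures // 2     # fix half of the refueling misses
new_accuracy = (correct + fixed) / total
# 99 / 114 ≈ 0.868 → ~87%, above Claude Haiku 4.5's 83.0%
```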
Other failure modes were rarer: 3 cases where transfer to a human agent should have been triggered, 3 cases of missed account-level roaming enablement, and 2 incomplete payment flows.
What This Tells Us About the Architecture
The τ-2 result demonstrates something important: the planner-worker pattern generalizes beyond single-shot tasks. The same architecture that decomposes GAIA questions into DAGs can also manage multi-turn conversations with shared state, tool coordination, and user guidance — just with a different entry path through the routing layer.
The concentrated failure pattern is also telling. The architecture doesn't fail in diverse, unpredictable ways. It fails systematically, on a specific tool-calling pattern that can be identified and fixed. This is exactly the kind of failure mode you want — diagnosable and addressable.
Cost Analysis
The Vector Agent is Pareto optimal on the HAL GAIA leaderboard and cost-competitive on τ-2.
| Benchmark | Agent | Accuracy | Cost | Cost per Correct Answer |
|---|---|---|---|---|
| GAIA | Vector Agent | 74.55% | $69.93 | $0.57 |
| GAIA | HAL + Sonnet 4.5 | 74.55% | $178.20 | $1.45 |
| GAIA | HAL + Opus 4.1 High | 68.48% | $562.24 | $4.97 |
| τ-2 | Vector Agent | 80.7% | $36.95 | $0.40 |
| τ-2 | Claude Haiku 4.5 | 83.0% | ~$30-50 | ~$0.36 |
| τ-2 | o3 | 58.2% | ~$200+ | ~$3.01 |
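The cost-per-correct-answer column follows directly from accuracy and task count; the counts below (165 GAIA tasks, 114 τ-2 tasks) are inferred from the reported percentages rather than stated outright:

```python
def cost_per_correct(total_cost, accuracy, n_tasks):
    """Total run cost divided by the number of tasks solved."""
    return total_cost / round(accuracy * n_tasks)

cost_per_correct(69.93, 0.7455, 165)   # ≈ 0.57 (123 correct on GAIA)
cost_per_correct(36.95, 0.807, 114)    # ≈ 0.40 (92 correct on τ-2)
```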
Across both benchmarks, the Vector Agent processed over 230M tokens in 20,800+ LLM calls for a combined API-equivalent cost of $106.88.
The actual compute cost is even lower. All models run via an Ollama Cloud subscription at $20/month flat rate. The combined evaluation time (~5 hours) represents approximately $0.76 in proportional usage — making the effective cost per correct answer less than a penny.
What We Learned
Orchestration can substitute for raw capability — up to a point. Up through GAIA Level 2 and on τ-2 Telecom, careful task decomposition and execution with mid-tier models matches or beats single frontier models. GAIA Level 3 is where this breaks down, suggesting a complexity threshold beyond which orchestration overhead exceeds the benefit.
The architecture generalizes across task types. GAIA and τ-2 are fundamentally different — single-shot vs. multi-turn, tool-use vs. dual-control, research vs. customer service. The same planner-worker system handles both, with the routing layer naturally adapting which models handle which turns. This is stronger evidence than a single-benchmark result.
The planner is the highest-leverage component. Switching from a weaker planner to Gemini Flash was one of the biggest single improvements. A bad plan can't be rescued by good workers, but a good plan can be executed by adequate ones.
Recovery matters more than initial accuracy. The iterative plan revision system costs almost nothing ($0.59 on GAIA) but has an outsized impact on harder tasks. Designing for failure recovery rather than first-pass accuracy is underrated.
Systematic failures are good failures. On τ-2, 64% of failures had the same root cause (missed data refueling). This means the architecture fails in diagnosable, fixable patterns rather than diverse, unpredictable ways. That's a hallmark of a well-structured system.
Cost tracking changes the evaluation conversation. HAL's decision to track cost alongside accuracy is exactly right. A 1% accuracy improvement that costs 5x more isn't a real improvement for most use cases.
What's Next
These GAIA and τ-2 results are the first two in a planned series of benchmark evaluations for the Vector Agent design:
- DataSciBench — data analysis tasks (HuggingFace dataset, similar pipeline to GAIA)
- SWE-bench — real GitHub issues requiring multi-file code editing
- GPQA Diamond — graduate-level science reasoning
We're also investigating approaches to close the GAIA Level 3 gap, including specialized reasoning workers and deeper plan revision strategies. On τ-2, the concentrated data-refueling failure pattern is the immediate next target — a straightforward fix that could push accuracy above 85%.
The Vector Agent code and traces are available at:
- HAL Leaderboard: hal.cs.princeton.edu/gaia
- HuggingFace PR: agent-evals/hal_traces#1
Rob Schieber — Vector Ventures, March 2026