Every MCP tool call was flooding our context window with raw JSON. By shifting to server-side code execution, we reduced token consumption by 95-99% across our energy industry API integrations.
## The Problem: Death by Documentation
If you're building AI agents that talk to APIs, you've probably run into the same wall we did: token bloat. Every tool call returns massive JSON payloads that flood your context window, burn through your budget, and slow everything down. We fixed it by fundamentally rethinking how our MCP server handles API interactions — shifting from traditional tool calling to server-side code execution. The results were dramatic.
PatchOps is our unified MCP server that gives AI agents access to energy industry APIs — Corva for drilling operations, Enverus for market intelligence, WellDatabase for well data, GeoForce for device tracking, and Microsoft 365 tools like Outlook, Teams, and Planner.
In the original architecture, every MCP tool call followed the standard pattern: Claude sends a request, the server calls the API, and the full response gets stuffed back into the context window. Simple enough. Except it was hemorrhaging tokens.
The worst offender? Every single PatchOps response included an "LLM Connector Guide" — documentation for all connectors — regardless of which API you were actually querying. Ask for one Outlook email? Here's 8,000+ tokens of documentation for WellDatabase, GeoForce, Corva, Enverus, and Snowflake too.
Wake-up call: In one real conversation, retrieving a few emails consumed 68,000 tokens. Only about 2,000 of those were actually useful data. That's a useful-data ratio of about 2.9%.
A single GeoForce query returning 118 devices would dump ~47KB of raw JSON into the context — roughly 12,000 tokens of coordinates, battery statuses, timestamps, and metadata that Claude then had to parse through just to answer "how many active devices do I have?"
## The Fix: Let the Server Do the Work
Instead of returning raw API responses for Claude to digest, we moved to a code execution model. Claude writes a small JavaScript snippet describing what it wants, the PatchOps server executes it against the API, and only the processed result comes back.
Before:

```
Claude → "list all GeoForce devices" → PatchOps → GeoForce API → 47KB raw JSON → Claude context
Result: ~12,000 tokens consumed
```

After:

```javascript
const devices = await geoforce.listDevices();
return {
  total: devices.length,
  active: devices.filter(d => d.state === 'ACTIVE').length
};
// Result: ~150 tokens consumed
```
The AI generates targeted code that fetches, filters, and aggregates on the server side. Only the answer comes back — not the raw material.
## The Numbers
We ran extensive benchmarks comparing both approaches across real workloads.
### Token Reduction by Scenario
| Scenario | Tool Calling | Code Execution | Reduction |
|---|---|---|---|
| GeoForce: 118 devices | 12,000 tokens | 150 tokens | 98.75% |
| Outlook: Email retrieval | 68,000 tokens | 1,726 tokens | 97.5% |
| Corva: Rig list | 8,000 tokens | 1,107 tokens | 86% |
| Enverus: Basin query | ~8,000 tokens | 1,361 tokens | 83% |
| Multi-API dashboard | 24,000 tokens | 300 tokens | 98.75% |
The GeoForce case is the clearest illustration. A query that previously returned 47KB of device telemetry now returns a 464-byte summary. That's a 100x reduction in context size.
For the Outlook case, the improvement was even more striking in absolute terms. We went from 68,000 tokens — where 48,000 were wasted on connector documentation repeated six times — down to 1,726 tokens of actual email content. A 39x improvement in efficiency.
### Cost Impact
At Claude API pricing ($0.80/1M input tokens), the per-call savings look small but scale fast:
- Per call: $0.0096 → $0.00012 (99% reduction)
- Monthly (30 calls/user): $0.288 → $0.0036
- Annual across 3 APIs, 50 users: ~$512/year saved
These numbers get more interesting when you factor in that the reduced context also means faster responses and fewer cases where conversations hit context limits — which means fewer retries and restarts.
### Speed Tradeoff
There is one tradeoff: latency. Direct tool calls return in ~100-120ms. Code execution takes 6-7 seconds because it includes runtime boot, API call, and server-side processing.
| Method | Latency | Tokens | Best For |
|---|---|---|---|
| Direct tool call | ~120ms | 12,000 | Single lookups, real-time dashboards |
| Code execution | ~6,000ms | 150 | Bulk operations, analytics, reports |
For a single device lookup where a user is waiting, 120ms wins. For pulling a fleet summary, generating a report, or orchestrating across multiple APIs, the 6-second wait is negligible compared to the 98% token savings.
## The Hybrid Approach
We didn't go all-in on either method. PatchOps now uses a hybrid strategy:
Code execution for bulk data, aggregations, multi-API orchestration, filtering, and any operation where you don't need every field from every record. This is the default path for analytics-style queries.
Direct tool calling for single-item lookups, time-critical operations, and cases where complete raw data is needed for downstream processing.
Routing heuristic: If the query smells like "get me this specific thing," use a direct call. If it smells like "tell me about all the things," use code execution.
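That heuristic can be sketched as a small classifier. This is an illustrative router, not the published PatchOps logic; the signal keywords and the `connectors` field are assumptions:

```javascript
// Classify a request as a direct tool call ("this specific thing")
// or code execution ("all the things").
function chooseExecutionPath(request) {
  const bulkSignals = [
    /\ball\b/i, /\bevery\b/i, /\bsummar/i, /\breport\b/i,
    /\bcount\b/i, /\baverage\b/i, /\bdashboard\b/i,
  ];
  const isBulk = bulkSignals.some(re => re.test(request.query));
  const spansMultipleApis = (request.connectors ?? []).length > 1;

  if (isBulk || spansMultipleApis) {
    return 'code-execution'; // aggregate server-side, return only the answer
  }
  return 'direct-tool-call'; // single lookup, ~120ms round trip
}

console.log(chooseExecutionPath({
  query: 'get device D-118 location', connectors: ['geoforce'],
})); // → direct-tool-call

console.log(chooseExecutionPath({
  query: 'summarize all active devices', connectors: ['geoforce'],
})); // → code-execution
```

A production router would likely also consider expected result size and whether the caller needs raw fields downstream, per the criteria above.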
## What Actually Changed Architecturally
The shift required three things:
- Wrapping each API connector as an executable context. Each connector (Corva, Enverus, GeoForce, Outlook, etc.) is available as a pre-authenticated client inside the code execution sandbox. Claude doesn't need to know about auth tokens, endpoints, or pagination — it just calls `await corva.getRigs()`.
- Eliminating the connector guide from responses. The biggest single win was removing the 8,000-token documentation payload that shipped with every response. Connector method signatures are now discovered through a lightweight `getConnectorDocs` call (~50 tokens) that returns only the relevant connector.
- Server-side JavaScript execution via Azure Functions. The code runs in a sandboxed runtime with access to pre-configured API clients. Boot time is ~5ms, runtime initialization ~150ms, and the actual API call typically ~90ms. The bulk of the 6-second latency is data processing for large result sets.
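The discovery step can be as simple as a per-connector lookup. In this sketch, `getConnectorDocs` is the call named in the text, while the registry contents and method signatures are invented for illustration:

```javascript
// Illustrative registry of connector method signatures; a real server
// would generate these from the actual client interfaces.
const CONNECTOR_DOCS = {
  geoforce: ['listDevices(): Device[]', 'getDevice(id: string): Device'],
  corva:    ['getRigs(): Rig[]'],
  outlook:  ['listEmails(folder: string, limit: number): Email[]'],
};

// Return docs for one connector only (~50 tokens), instead of shipping
// the full multi-connector guide (~8,000 tokens) with every response.
function getConnectorDocs(connector) {
  const docs = CONNECTOR_DOCS[connector];
  if (!docs) throw new Error(`Unknown connector: ${connector}`);
  return docs.join('\n');
}

console.log(getConnectorDocs('geoforce'));
```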
## Lessons Learned
Token cost is the real bottleneck, not latency. We initially worried about the 6-second execution time. In practice, nobody cares about 6 seconds when they're asking for a fleet summary or generating a report. But everyone cares when their conversation hits the context limit halfway through a workflow because earlier API calls ate 68,000 tokens returning documentation nobody asked for.
The 35ms MCP protocol overhead is irrelevant. We spent time benchmarking whether calling functions directly vs. through MCP tool protocol made a difference. It's 35ms. Not worth optimizing when your real savings are in the 98% token reduction column.
Useful data ratio matters more than total token count. The Outlook conversation was the wake-up call. 68,000 tokens consumed, 2,000 tokens of useful data. That's a 2.9% efficiency rate. After code execution, we're consistently above 90% useful data in every response.
Let the server aggregate. The instinct with LLM tool use is to bring data to the model and let it reason over raw information. For structured API data, that's backwards. The server can filter, count, and summarize far more efficiently than burning tokens to have the model parse JSON.
## Bottom Line
Moving from traditional MCP tool calling to server-side code execution gave us a 95-99% reduction in token consumption across our energy industry API integrations. The architecture is simple: pre-authenticated API clients in a sandboxed runtime, lightweight method discovery, and letting Claude write targeted extraction code instead of drowning in raw payloads.
If you're building MCP servers that wrap data-heavy APIs, the code execution pattern is worth serious consideration. The token savings alone justify the architectural shift — and your users will thank you when their conversations stop hitting context limits mid-workflow.
