PatchOps is building a connector eval system that applies the same task-and-verify loop behind modern AI training to continuously improve the quality of our MCP tools for the energy industry.
The models powering AI coding assistants got dramatically better over the last two years. The reason isn't bigger datasets or more parameters — it's reinforcement learning. Give a model a task, let it try, check whether the output actually works, and use that signal to get better. Repeat millions of times.
Code is the ideal domain for this because verification is cheap and unambiguous. Does it compile? Do the tests pass? That binary signal — pass or fail — is all you need to close the loop.
At PatchOps, we're not training models. We're building the tools that models use — MCP connectors that give AI agents access to real-world data across the energy industry. Regulatory filings from the Texas Railroad Commission. Well records from WellDatabase. Production analytics from Corva. Market data from ERCOT, CAISO, and EIA. Environmental monitoring from USGS, EPA, and NOAA. Pipeline intelligence from Enverus. Fleet tracking from Samsara and Geoforce. Over 50 connectors and growing.
The same RL principle applies to all of them: if you can define what "correct" looks like and verify it automatically, quality only goes up.
The Problem: Scale Creates a Quality Challenge
When you have a handful of tools, you can test them manually. When you have hundreds of tools spread across 50+ connectors — each with different APIs, auth models, data formats, and domain semantics — manual testing doesn't scale.
Every connector has its own quality questions:
- Does WellDatabase's `searchWells` return the fields an AI agent needs — well name, API number, operator, coordinates, lateral length?
- Does Corva's `getWellDetails` include real-time drilling data when a well is active?
- Does ERCOT's `getGridConditions` return current load data within 5 seconds?
- Does the RRC connector handle 977,000+ well records without timeouts?
- Does Enverus return production data with the right units and date formats?
- Do environmental connectors (USGS water, EPA air quality, NOAA weather) return valid GeoJSON?
- When you add a new dataset to one connector, did you break another?
Without a systematic way to answer these questions across every connector, you're relying on user reports to find problems. That's reactive. We wanted to be proactive.
The Solution: Eval Suites for Every Connector
We built a connector eval system that borrows directly from the RL playbook: task + verifier, pointed at our tools instead of at model outputs.
An eval suite is a collection of test cases grouped by connector. Each case specifies:
- A tool to call — e.g., `searchWells` on WellDatabase, `getGridConditions` on ERCOT, `searchInspections` on RRC
- Input arguments — the exact parameters an AI agent would pass
- Assertions — verifiable claims about what the response should look like
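Concretely, a case can be written as a small declarative record. The shape below is a sketch — the field names (`connector`, `tool`, `args`, `assertions`) and the sample arguments are illustrative, not PatchOps' actual schema:

```typescript
// Hypothetical eval-case shape. Field names and sample values are
// illustrative, not the real PatchOps schema.
type Assertion =
  | { type: "no_error" }
  | { type: "has_data"; path: string }
  | { type: "field_exists"; path: string }
  | { type: "count_gte"; path: string; min: number }
  | { type: "response_time_ms"; max: number };

interface EvalCase {
  connector: string;
  tool: string;
  args: Record<string, unknown>;
  assertions: Assertion[];
}

// A WellDatabase search case: call the tool with the same arguments an
// agent would pass, then make five verifiable claims about the response.
const searchWellsCase: EvalCase = {
  connector: "welldatabase",
  tool: "searchWells",
  args: { operator: "XTO Energy", state: "TX", limit: 25 },
  assertions: [
    { type: "no_error" },
    { type: "has_data", path: "data" },
    { type: "field_exists", path: "data[0].wellName" },
    { type: "count_gte", path: "data", min: 1 },
    { type: "response_time_ms", max: 5000 },
  ],
};
```

Because a case is plain data, suites can be stored, diffed, and generated without touching connector code.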
The assertion engine supports nine verification types that cover the patterns we care about across all connectors:
- `no_error` — the tool call succeeded
- `has_data` — a path in the response has non-empty data
- `field_exists` — a specific field exists (e.g., `data[0].wellName`)
- `field_equals` — a field matches an expected value
- `field_contains` — a field contains a substring
- `count_gte` — an array has at least N items
- `count_lte` — an array has at most N items
- `response_time_ms` — the response completed within a time budget
- `response_shape` — the response has the expected top-level keys
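Several of these types reduce to resolving a path into the response and testing the value. Here is a minimal sketch of how a few of them could be evaluated — the helper names (`checkAssertion`, `getPath`) and result shape are assumptions, not the real engine:

```typescript
// Minimal sketch of an assertion checker covering a few of the nine
// types. checkAssertion and getPath are illustrative names.
type Result = { pass: boolean; detail: string };

// Resolve dotted/indexed paths like "data[0].wellName".
function getPath(obj: unknown, path: string): unknown {
  return path
    .replace(/\[(\d+)\]/g, ".$1")
    .split(".")
    .reduce<any>((cur, key) => (cur == null ? undefined : cur[key]), obj);
}

function checkAssertion(a: any, response: any, elapsedMs: number): Result {
  switch (a.type) {
    case "no_error":
      return { pass: !response.error, detail: String(response.error ?? "ok") };
    case "has_data": {
      const v = getPath(response, a.path);
      const nonEmpty = Array.isArray(v) ? v.length > 0 : v != null;
      return { pass: nonEmpty, detail: `${a.path}: ${nonEmpty ? "has data" : "empty"}` };
    }
    case "field_exists":
      return { pass: getPath(response, a.path) !== undefined, detail: a.path };
    case "count_gte": {
      const arr = getPath(response, a.path);
      const n = Array.isArray(arr) ? arr.length : 0;
      return { pass: n >= a.min, detail: `count ${n} >= ${a.min}` };
    }
    case "response_time_ms":
      return { pass: elapsedMs <= a.max, detail: `${elapsedMs}ms <= ${a.max}ms` };
    default:
      return { pass: false, detail: `unknown assertion type: ${a.type}` };
  }
}
```

The `detail` string is what surfaces in the diagnostics: a failing case tells you which claim broke and by how much, not just that something went wrong.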
When you hit "Run All," each case executes against the real connector — same code path as a Claude or ChatGPT user calling the tool. The assertion engine checks every claim and records pass/fail with detailed diagnostics.
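The "Run All" step itself is a simple loop: time each call, catch failures as error responses, and apply every assertion. A self-contained sketch, where `callTool` stands in for the real connector dispatch and the assertion functions are placeholders:

```typescript
// Hypothetical "Run All" loop. callTool stands in for the real
// connector dispatch; assertions are predicate functions here.
interface RunCase {
  tool: string;
  args: object;
  assertions: ((response: any, elapsedMs: number) => boolean)[];
}

async function runSuite(
  cases: RunCase[],
  callTool: (tool: string, args: object) => Promise<any>,
) {
  const results: { tool: string; passed: boolean; elapsedMs: number }[] = [];
  for (const c of cases) {
    const start = Date.now();
    let response: any;
    try {
      // Same code path an AI agent would hit when calling the tool.
      response = await callTool(c.tool, c.args);
    } catch (e) {
      // A thrown error becomes an error response so no_error-style
      // assertions can fail cleanly instead of crashing the run.
      response = { error: String(e) };
    }
    const elapsedMs = Date.now() - start;
    const passed = c.assertions.every((assert) => assert(response, elapsedMs));
    results.push({ tool: c.tool, passed, elapsedMs });
  }
  return results;
}
```

Running cases sequentially keeps timing assertions honest; a real runner might parallelize across connectors but not within one, to avoid rate-limit noise.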
What This Looks Like Across the Platform
Every connector category has its own quality concerns. The eval system handles all of them with the same framework.
Oil & Gas Data Connectors
WellDatabase — search returns wells with complete metadata; operator queries match known operators; production data has valid numeric fields
Corva — real-time drilling data includes WIT streams; well details have survey and completion data; rig assignments resolve correctly
Enverus — production queries return data with correct units; lease records include all required regulatory fields
RRC — 37 tools covering wells, permits, inspections, pipelines, gas plants, and production across 977K+ records
Energy Markets
ERCOT — grid conditions return current load and frequency; LMP prices have valid node identifiers; fuel mix totals sum correctly
CAISO — day-ahead prices return for requested dates; renewable curtailment data includes MW values
EIA — petroleum supply data matches known reporting periods; natural gas storage reports have regional breakdowns
Environmental & Geospatial
USGS Water — streamflow queries return valid gauge data with timestamps; site lookups resolve by HUC code
EPA / AirNow — air quality index returns current readings; monitoring station data includes lat/lng
NOAA / NWS — forecast data returns for valid coordinates; historical weather has temperature and precipitation fields
Wetlands / Floodzone / Soils — spatial queries return valid GeoJSON with proper feature properties
Enterprise & Productivity
Snowflake — query execution returns results with correct column types; schema introspection lists all tables
Samsara / Geoforce — fleet queries return vehicle positions with timestamps; geofence lookups resolve correctly
GitHub — repository queries return valid issue and PR data; project board cards have correct status fields
The same nine assertion types work everywhere. `has_data` verifies a Corva drilling response just as well as an EPA air quality reading. `response_shape` validates an ERCOT grid response the same way it validates a WellDatabase search.
The Feedback Loop
The real power isn't in running evals once — it's in the loop they create across the entire platform:
Today this loop is manual — you see a failure, open the handler code, fix it, re-run. But the architecture supports automation at every step. A failing eval can be fed to an AI agent along with the handler source code to propose a fix. Evals can run on every pull request so connector quality never regresses.
When we recently audited our RRC connector, the eval system immediately surfaced that 26 of 37 tools returned empty results (ETL still loading), 2 had a date parsing bug, and 9 were fully operational. That's the kind of visibility you need when you're scaling a platform — not guessing, knowing.
Where We're Headed
We're building eval suites for every connector on the platform, starting with the highest-traffic ones and expanding outward. The roadmap:
- CI-gated evals — every PR that touches a connector handler must pass its eval suite before merging. Break a WellDatabase search? The PR is blocked.
- AI-assisted diagnosis — failing evals automatically generate fix proposals by analyzing the handler code against the expected output
- Coverage dashboard — track which connectors and tools have evals, which don't, and prioritize the gaps
- Cross-connector consistency — ensure that patterns like pagination, error handling, field naming, and GeoJSON output are consistent across all 50+ connectors
- Regression detection — when an upstream API changes (ERCOT updates their schema, WellDatabase adds a field), the eval catches it before users do
The insight from RL applies directly: if you can define what "correct" looks like and check it automatically, you can improve continuously. We're applying that principle to every connector on the PatchOps platform — oil and gas, energy markets, environmental data, enterprise tools — and the result is an ecosystem of AI tools that gets more reliable with every iteration.
The eval system is live on PatchOps today. If you're building MCP tools for any industry and want to see how assertion-based verification can improve your connector quality, we'd love to talk.
