The MCP eval gap: why 87% of MCP servers fail high-trust thresholds

The protocol won. The trust layer is missing. The companies that build it before the LLM-eval incumbents pivot will define a category for the next decade.

Imagine you are the CTO at a $50M ARR Indian fintech.

Six weeks ago, your agents team rolled out fourteen MCP servers across the stack: KYC pulls, ledger reads, broker integrations, ops automations, a couple of internal RAG endpoints. Fourteen sounds modest. Your team is happy. Your CEO retweeted the launch. Your compliance officer asked a polite question and was politely answered.

You have just inherited the highest concentration of unreviewed automation in your company’s history. The published research says roughly one in three of those servers will fail a routine reliability trial, and that the median one exposes at least one sensitive capability you do not know about.

This is not a hypothetical. The numbers are public. Most of the people running fourteen-server deployments are not looking at them.

I. The protocol that ate the world in eighteen months

In November 2024, Anthropic released the Model Context Protocol with a deceptively simple pitch: USB-C for AI agents. Any compliant host plugs into any compliant server, capabilities are discovered at runtime, and the integration burden between agents and tools collapses to near zero.

Most protocol releases settle into the long tail of standards that may or may not matter in five years. MCP did not.

By March 2026, researchers had catalogued 67,057 MCP servers across six public registries.¹ Microsoft, Google, OpenAI, and every other major AI company adopted the standard. Cloudflare built an entire agent infrastructure stack around it. Every major coding agent (Claude Code, Cursor, Cline, Windsurf, VS Code Copilot) wired MCP in as a primary integration mechanism. Registries like Smithery, Glama, and Anthropic’s own reference list now function as the App Stores of the agentic web.

The adoption pace was unprecedented for an infrastructure standard. The user-facing concept (“an agent that can do things”) had product-market fit. The developer-facing concept (“a protocol to expose tools”) had developer-tools fit. The two reinforced each other in a way that few standards in history have managed.

And then the security research started coming in.

II. The numbers that should have been on every front page

The first signal arrived quietly. Hasan and colleagues at Queen’s University published the first large-scale empirical study of MCP server quality in mid-2025.² Within twelve months, five more independent research efforts had published in the same space (Pynt, Knostic, two arXiv groups, a dedicated stress-test from a reliability-engineering shop), and the picture they sketched between them was consistent.

87% · Fail high-trust thresholds
72% · Expose sensitive capabilities³
52% · Developer-facing servers insecure⁴
100% · Success rate on metadata-spoof attacks⁵

The combined finding, compressed: pick any MCP server at random from the public ecosystem, and there is roughly a one-in-three chance it will fail a routine tool-invocation trial⁶ and a much higher chance it carries a meaningful security flaw. Of 120 deliberately malicious servers generated by one research team, one of the leading open-source scanners caught four.⁵

Six different studies, six different samples, six different methodologies. The aggregate picture is uniform enough that arguing with it is no longer credible.

If we define a “high-trust threshold” as a server that (a) passes 90 percent of standardised reliability trials and (b) implements the four hardening patterns the top-decile servers share (typed I/O schemas, idempotency, explicit cancellation and timeout handling, exponential backoff on transient errors), then approximately 87 percent of public MCP servers fail to meet it.
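The two-part definition above can be written down as a simple check. This is a minimal sketch, not a published scoring method: the `ServerAudit` record, its field names, and the pattern identifiers are hypothetical, and "passes 90 percent" is read here as "at least 90 percent".

```python
from dataclasses import dataclass

# Thresholds follow the definition in the text; all names are illustrative.
MIN_PASS_RATE = 0.90
HARDENING_PATTERNS = (
    "typed_io_schemas",
    "idempotency",
    "cancellation_and_timeouts",
    "exponential_backoff",
)


@dataclass
class ServerAudit:
    name: str
    trials_passed: int
    trials_run: int
    patterns: frozenset  # hardening patterns observed in the server


def meets_high_trust(audit: ServerAudit) -> bool:
    """True only if both conditions (a) and (b) hold."""
    if audit.trials_run == 0:
        return False  # no evidence, no trust
    pass_rate = audit.trials_passed / audit.trials_run
    has_all_patterns = all(p in audit.patterns for p in HARDENING_PATTERNS)
    return pass_rate >= MIN_PASS_RATE and has_all_patterns


good = ServerAudit("ledger-read", 19, 20, frozenset(HARDENING_PATTERNS))
flaky = ServerAudit("kyc-pull", 17, 20, frozenset(HARDENING_PATTERNS))
unhardened = ServerAudit("broker", 20, 20, frozenset({"typed_io_schemas"}))

for audit in (good, flaky, unhardened):
    print(audit.name, meets_high_trust(audit))
```

Note that a server with a perfect trial record still fails if any of the four patterns is missing, which is what makes the threshold stricter than a reliability test alone.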

This is the number that should have been a banner headline. It was not. Adoption ran faster than the audit.

III. The economic cost of the trust gap

It is tempting to read this as a technical problem that will get fixed by version 2 of the protocol, or by maturing best practices, or by the next wave of implementations. That framing is wrong, and dangerously so. The gap is an economic problem with technical symptoms, and the cost is already showing up in three measurable ways.

First, agent accuracy collapses as servers are added. Published research on tool overload shows agent accuracy dropping from 87 percent to 54 percent when context fills with tool definitions. Tool-selection accuracy can drop from 43 percent to below 14 percent as the available tool count climbs.⁷ The agent does not become more capable as it acquires more tools. It becomes less capable.

Second, real exploits are now documented. CVE-2025-49596 (CVSS 9.4) involved arbitrary command execution through unauthenticated MCP Inspector instances. CVE-2026-33032 (CVSS 9.8) allowed authentication bypass on the nginx-ui MCP message endpoint.⁸ These are not theoretical attacks. They are CVEs assigned to working exploits with public PoCs.

Third, the supply chain is unguarded. Researchers found 34 percent of MCP-related incidents involved typosquatting in npm and pip, 28 percent involved upstream dependency compromise, 23 percent involved social engineering through tutorials directing users to malicious repositories, and 15 percent involved marketplace poisoning in IDE-bundled extensions.⁵

Stack the three and you get the actual operating picture: enterprises adopting MCP at scale are getting worse agent performance, accumulating documented security exposures, and inheriting a supply chain that has no meaningful integrity guarantees. The cost is not hypothetical. It shows up as failed deployments, leaked credentials, exfiltrated data, and the slow grind of teams ripping out MCP integrations they shipped six months ago.

This is the gap. It is monetisable.

IV. The land-grab that already happened

If you are looking at the gap and thinking “great, large unsolved problem, I will build the obvious solution”, slow down. The obvious adjacent businesses are already taken.

On April 17, 2026, Cloudflare quietly killed an entire category of potential MCP security startups when it shipped isitagentready.com, a free public scanner that evaluates websites on their readiness for AI agents.⁹ It checks sixteen specific implementations across four scored dimensions: discoverability, content, bot access, and capabilities (including MCP Server Cards, WebMCP, OAuth discovery, A2A and ACP). Free scanner. Free API. Integrated with Cloudflare’s URL Scanner. Updated weekly with Cloudflare Radar adoption data.

If you were planning to launch “the agent-readiness scanner” or “the MCP security scoring service” as a SaaS, that business is now structurally non-viable. The basic capability is a free utility from one of the largest internet infrastructure companies on earth.

On the auto-generation side, the category is similarly locked. Stainless and Speakeasy ship MCP server generation as part of their broader SDK suites, both with substantial venture backing and existing customer relationships. The “build a tool that converts OpenAPI specs to MCP servers” startup angle was contested in 2025 and is now a feature of incumbent tooling.

This is the disappointing part of the analysis. The two most obvious opportunities, scanning and generation, are already addressed. The trust gap remains. Most of the easy ways to monetise it do not.

But the gap is enormous, and the next layer of opportunity is harder to see and harder to build, which is the better kind of opportunity if you can build it.

V. Where the surviving whitespace is

The whitespace is continuous eval and observability for production MCP deployments. Not point-in-time scanning. Not code generation. Continuous instrumentation, scoring, and remediation of the agent–tool interaction surface, the way Datadog does for cloud services and the way Braintrust is starting to do for LLM prompts.

Here is why this is the surviving whitespace.

The scanning problem is one-shot. The eval problem is continuous. isitagentready.com tells you whether a website implements certain agent standards at a point in time. That is a useful diagnostic, but it answers the wrong question for the enterprise running MCP servers in production. The production question is: does this server behave correctly when our agents call it, under our load, with our tool combinations, across our users, today?

That question requires lossless capture of the full agent-tool interaction (request, response, the agent’s reasoning trace, downstream effects), scored against reliability and security baselines, with anomalies surfaced before they become incidents.

The interaction model breaks both APM and LLM eval. Traditional APM assumes deterministic service behaviour. MCP is non-deterministic by design: the agent decides which tools to call, in what order, with what arguments, based on natural language. LLM eval, conversely, assumes you control the model and the prompts. With MCP you control neither the model nor the prompts. You control the tool layer, and the eval has to work with everything else as a black box.

This is a genuinely new shape of problem. It does not fit cleanly into any existing category. Which is exactly the configuration in which a new category gets built.

The buyer is sophisticated and budget-rich. Enterprises rolling out MCP at scale are large companies (financial services, healthcare, SaaS, regulated platforms) with substantial compliance obligations and substantial budgets. Pricing is enterprise-shaped: $50,000 to $500,000 ARR per customer. Customer count is small but unit economics are excellent.

The data moat compounds. Every eval deployment generates telemetry that improves the next deployment. Benchmarks sharpen, anomaly detection sharpens, recommendation engines sharpen. This is the same dynamic that made Datadog defensible. The first observability vendor to scale meaningfully will accumulate a data advantage that is hard for new entrants to match.

VI. The product, specifically

Sketching the architecture clarifies the bet.

Layer 1 · Telemetry capture

A lightweight SDK or proxy between the agent host and the MCP server. Captures every tool invocation with full context, reasoning trace, prompt, response, downstream effects. Lossless and async. Cloudflare Workers + Durable Objects + R2 at scale.
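As a rough illustration of the capture layer, the sketch below wraps a tool callable so that every invocation leaves one record behind, whether it succeeds or fails. It is a synchronous, in-process simplification: the `capture` function, the record fields, and the `get_balance` tool are all hypothetical, and a production version would ship records asynchronously and would also need the agent's reasoning trace, which this wrapper cannot see.

```python
import time
import uuid
from typing import Any, Callable


def capture(server: str, tool: str, call: Callable[..., Any], sink: list) -> Callable[..., Any]:
    """Wrap a tool callable so each invocation appends one record to sink."""

    def wrapped(**arguments: Any) -> Any:
        record = {
            "id": str(uuid.uuid4()),
            "server": server,
            "tool": tool,
            "arguments": arguments,
            "started_at": time.time(),
        }
        start = time.perf_counter()
        try:
            result = call(**arguments)
            record.update(ok=True, response=result)
            return result
        except Exception as exc:
            # Record the failure, then re-raise so the caller sees it unchanged.
            record.update(ok=False, error=repr(exc))
            raise
        finally:
            record["latency_ms"] = (time.perf_counter() - start) * 1000
            sink.append(record)

    return wrapped


# Hypothetical tool standing in for a real MCP server call.
def get_balance(account_id: str) -> dict:
    return {"account_id": account_id, "balance": 100}


events: list = []
traced_get_balance = capture("ledger", "get_balance", get_balance, events)
traced_get_balance(account_id="A1")
print(events[0]["tool"], events[0]["ok"])
```

Writing the record in the `finally` branch is the design choice that matters: failed calls are the ones an eval layer most needs to keep.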

Layer 2 · Reliability + security scoring

Schema conformance, response time, error rate, security posture compressed into a per-server score. The four hardening patterns. Tool-selection correctness. Aggregated, this is the metric the customer pays for.
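One way to compress those signals into a single number is a weighted sum. The weights, the latency budget, and the `ServerStats` fields below are assumptions chosen for illustration; the essay does not specify a formula, and a real product would calibrate these against observed outcomes.

```python
from dataclasses import dataclass


@dataclass
class ServerStats:
    schema_conformance: float  # fraction of responses matching the declared schema, 0..1
    error_rate: float          # fraction of calls that failed, 0..1
    p95_latency_ms: float
    security_posture: float    # 0..1, assumed to come from a separate security review


# Hypothetical weights (sum to 1) and latency budget.
WEIGHTS = {"schema": 0.3, "errors": 0.3, "latency": 0.2, "security": 0.2}
LATENCY_BUDGET_MS = 1000.0


def server_score(s: ServerStats) -> float:
    """Return a 0-100 score; higher is better."""
    # Latency earns full credit at 0 ms and none at or beyond the budget.
    latency_component = max(0.0, 1.0 - s.p95_latency_ms / LATENCY_BUDGET_MS)
    score = (
        WEIGHTS["schema"] * s.schema_conformance
        + WEIGHTS["errors"] * (1.0 - s.error_rate)
        + WEIGHTS["latency"] * latency_component
        + WEIGHTS["security"] * s.security_posture
    )
    return round(100 * score, 1)


typical = ServerStats(schema_conformance=0.9, error_rate=0.1, p95_latency_ms=500, security_posture=0.5)
print(server_score(typical))
```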

Layer 3 · Anomaly + alerting

Baseline drift detection. A 95 percent success rate slipping to 78. A 200ms tool slowing to 1.4s. A user generating tool-call patterns outside the population norm. PagerDuty + Slack + dashboards.
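The simplest form of baseline drift detection compares a current value against a stored baseline and flags changes beyond a tolerance in the bad direction. The sketch below uses the figures from the paragraph above; the function name and the tolerance values are assumptions, and a real detector would work on rolling windows with statistical tests rather than two point values.

```python
def drifted(baseline: float, current: float, tolerance: float, higher_is_better: bool) -> bool:
    """Flag a metric whose relative change from baseline is worse than tolerance."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    change = (current - baseline) / baseline
    # For success rate a drop is bad; for latency a rise is bad.
    worsening = -change if higher_is_better else change
    return worsening > tolerance


# Success rate slipping from 95 percent to 78 percent, with a 10 percent tolerance.
print(drifted(0.95, 0.78, tolerance=0.10, higher_is_better=True))

# A 200 ms tool slowing to 1.4 s, with a 50 percent tolerance.
print(drifted(200.0, 1400.0, tolerance=0.50, higher_is_better=False))
```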

Layer 4 · Compliance + audit

For regulated industries, audit-ready reports of what MCP servers were called, what data they accessed, what was returned, who authorised it. The layer that justifies the price.

Layer 5 · Recommendations + remediation

“Server X has a 38 percent failure rate; here is the schema fix.” “Replace this server with Y for 40 percent better reliability.” Where passive monitoring becomes active improvement, and where lock-in begins.

Layer 6 · Marketplace ratings

Aggregate telemetry across customers (privacy-bounded) becomes public reliability and security ratings of MCP servers. G2 for MCP. The eval vendor sits in the middle of the trust graph and compounds.

Each layer is buildable today. None of them is shipped by anyone at the integration depth the buyer needs. The first vendor that ships layers 1–3 cleanly, on real customer traffic, owns the wedge. Layers 4–6 are where the moat compounds.

VII. The Indian fintech wedge

There is a specific vertical that deserves to be called out separately, because it is where the eval product has its sharpest commercial fit and where I have personally watched the gap open from the inside.

Indian financial services operate under a layered regulatory regime: RBI for banks and NBFCs, SEBI for capital markets, IRDAI for insurance, and the Digital Personal Data Protection Act (DPDPA) cutting across all of them. Each regulator has specific requirements around data handling, auditability, system integrity, and breach notification.

When an Indian fintech starts deploying MCP, it collides with this regime in specific ways:

RBI’s IT framework requires logging of all access to customer data, including by automated systems. An MCP server calling a customer data store is, in regulatory terms, an automated access event that must be logged in tamper-evident form.
SEBI’s framework for automated trading systems treats any system that can place trades, including agent-driven systems, as requiring specific licensing and audit infrastructure.
DPDPA requires explicit consent management for personal data processing, with specific provisions for automated decision-making. An MCP-mediated agent making decisions on customer data must produce consent traceability.
The IT Act and CERT-In rules impose breach notification timelines that require operational telemetry sufficient to detect breaches in hours, not weeks.
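"Tamper-evident" logging is commonly implemented as a hash chain, where each entry commits to the one before it, so editing any past record breaks every later hash. The sketch below shows that general technique only; the function names and event fields are hypothetical, and nothing here is a claim about what any regulator accepts as sufficient.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes the same way.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log: list, event: dict) -> None:
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev_hash": prev_hash, "hash": _digest(prev_hash, event)})


def verify(log: list) -> bool:
    """Recompute the chain from the start; any edited entry makes this False."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: list = []
append_event(audit_log, {"server": "kyc", "tool": "fetch_pan", "actor": "agent-7"})
append_event(audit_log, {"server": "ledger", "tool": "read_balance", "actor": "agent-7"})
print(verify(audit_log))
```

A chain like this only detects tampering if the latest hash is also stored somewhere the log writer cannot rewrite, which is why real deployments anchor it externally.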

None of the global MCP eval products that will eventually emerge will be tuned to these specifics. The category is small enough that international vendors will not vertical-specialise here. The compliance complexity is high enough that horizontal products will struggle to serve regulated Indian customers without an integration layer that is, itself, a separate product.

The math: India has roughly 2,000 NBFCs, 30 commercial banks, 150 broking firms, 50 insurance companies, and a long tail of fintech startups. Even capturing 5 percent of regulated entities at $50,000 ARR each produces $5–7M ARR within eighteen months. At $150,000 ARR for larger institutions the addressable revenue is materially higher. This is not a moonshot; it is a focused vertical SaaS play with a clear ICP and a regulatory tailwind.
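The arithmetic behind that estimate, using only the entity counts and price point quoted above (the long tail of fintech startups is left out because the text gives no count for it):

```python
# Entity counts as quoted in the text.
entities = {
    "nbfcs": 2000,
    "commercial_banks": 30,
    "broking_firms": 150,
    "insurers": 50,
}

capture_rate = 0.05      # 5 percent of regulated entities
arr_per_customer = 50_000  # USD

total_entities = sum(entities.values())
customers = total_entities * capture_rate
base_arr = customers * arr_per_customer

print(f"{total_entities} entities, {customers:.1f} customers, ${base_arr:,.0f} ARR")
```

That works out to 2,230 entities, about 112 customers, and roughly $5.6M ARR, which sits inside the $5–7M range.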

I am writing this from inside two adjacent infra builds, Findable (agent SEO) and AIGateway (unified LLM gateway), and the MCP trust gap is the third side of the same problem I keep running into. Indian fintech buyers are especially price-conscious about every new compliance line item right now, because twin energy and monsoon shocks are compressing margins from both ends; that macro context is the subject of a separate piece that landed this week. Read together, the two pieces explain why a focused MCP eval product in Indian fintech has both the technical justification and the commercial timing right now.

VIII. The competitive landscape, mapped honestly

Before recommending anyone build in this space, it is worth being honest about who is moving.

Anthropic has interest at the protocol layer (security guidance, OWASP MCP Top 10 contributions, reference servers) but not at the application layer: shipping an MCP observability product would put them in conflict with their own model-API customers.

Cloudflare has shipped isitagentready.com (scanning), the AI Gateway (inference observability), and the broader agent infrastructure stack. They will eventually enter MCP eval directly; the current focus is the primitives layer. Two-to-four-quarter window before they move.

Datadog, New Relic, Honeycomb. Traditional APM. Each has shipped LLM observability in 2026. MCP eval is a natural extension. Advantage: enterprise procurement relationships. Disadvantage: APM vendors are notoriously slow to adapt to new paradigms; Datadog took four years to ship credible serverless support after Lambda launched.

Braintrust, Helicone, Langfuse, Arize Phoenix. LLM eval startups. Most natural adjacent move. Braintrust is closest to MCP-aware product strategy. Advantage: existing customer overlap with MCP adopters. Disadvantage: optimising for “improve your prompts” rather than “evaluate your agent-tool integrations”. Different shape of problem.

Pynt, Knostic, AppOmni. Security research firms with published MCP work. Each could productise. Advantage: security expertise, existing buyer relationships. Disadvantage: security-first positioning may cap them at a subset of the eval use case.

The honest read: the space is open but not empty. The likely winners are LLM eval startups extending into MCP, with Cloudflare and Datadog entering later. A focused startup that ships before the LLM eval incumbents fully turn their attention has a 12–18 month window. A verticalised play (Indian fintech, healthcare, financial services compliance) has a longer window because horizontal competitors will be slow to verticalise.

IX. Why the incumbents will be slow

A pattern worth internalising: the most valuable observability windows in software history have opened when a new paradigm makes old observability tools structurally inadequate. Datadog won the cloud-native observability transition because legacy APM was built for static infrastructure. Sentry won frontend error tracking because backend-first APM did not see browser errors as a priority.

MCP creates a similar structural inadequacy. The existing observability stack is built for one of two paradigms: deterministic service calls or stateless LLM completions. Neither captures the MCP pattern, where an agent decides to call a tool based on natural-language reasoning, the tool is selected from a set of dozens, the invocation has both deterministic and non-deterministic success criteria, and the response is consumed by the agent in ways that recursively trigger further tool calls.

Legacy APM vendors will treat MCP as “just another service call” and miss the structural specificity. LLM eval vendors will treat MCP as “just another integration to instrument” and miss the depth of what needs to be captured. Both will eventually catch up. Both will lose 12–18 months in the catching up. Any startup that ships during that window with a clearly differentiated product can build the data moat that compounds.

This is the same pattern that has played out in every observability category. Recognising it early is the entire bet.

X. What to build, in priority order

For someone considering a serious bet, this is the build order.

Phase 1 · Months 1–3

Traffic capture + single high-value metric

Ship the SDK/proxy. Compute one MCP reliability score nobody else computes well. Five to ten design partners. Charge nothing. Build the data foundation.

Phase 2 · Months 4–6

The compliance wedge

Indian fintech first: RBI/SEBI/DPDPA reporting templates. Five customers at $40–80K ARR each. $200–400K ARR. Commercial model proven.

Phase 3 · Months 7–12

Marketplace + recommendations

Public reliability ratings of major MCP servers. Recommendation engine. $80–250K ARR per customer. Month-12 target: $1.5–3M ARR.

Phase 4 · Months 13–24

Horizontal expansion

Global FS compliance (SOC2, EU AI Act, UK PRA) and healthcare (HIPAA, GDPR-health). $150–500K ARR per customer. Month-24 target: $8–15M ARR.

Phase 5 · Months 25–36

The platform play: trust layer for the entire MCP ecosystem

Marketplace ratings for 1,000+ servers. Eval coverage for the major hosts. Compliance templates for 5+ regulatory regimes. Recommendation engines tuned against real outcomes. Verisign for agent-tool integrations. Aggressive target: $50M+ ARR. Conservative: $15–20M ARR with substantial enterprise contracts.

XI. The contrarian summary

Compressing the thesis:

MCP has produced extraordinary protocol adoption (67,000+ servers in 18 months) while accumulating an extraordinary trust gap (87 percent failing high-trust thresholds, 72 percent exposing sensitive capabilities, 52 percent of developer-facing servers insecure).
The obvious adjacent businesses are already locked up: Cloudflare owns scanning, Stainless and Speakeasy own auto-generation. The easy startup wedge is gone.
The surviving whitespace is continuous eval and observability for production deployments, especially in regulated verticals where compliance reporting creates real willingness to pay. A focused startup that ships in the 12–18 month window before LLM eval incumbents and APM vendors pivot can build a category-defining business with strong data moats.

This is the bet.

Every new protocol layer in computing history has produced a trust crisis followed by a trust infrastructure boom. TCP/IP produced the firewall industry. HTTP produced TLS/SSL infrastructure. DNS produced the registrar industry. Email produced spam filtering and DKIM/DMARC. Mobile apps produced MAM/MDM. MCP is producing the same crisis now. The infrastructure layer that emerges to address it will be one of the most valuable categories in agent infrastructure. Most of it has not been built yet. Some of it cannot be built yet: the protocols are still maturing, the ecosystems are still forming, the regulatory frameworks are still being drafted.

But the surface area is increasingly visible. Go back to the CTO at the start of this essay, the one running fourteen MCP servers without an eval layer. That CTO is the buyer. The CFO behind that CTO is the budget. The compliance officer behind both of them is the urgency.

The MCP ecosystem cracked open eighteen months ago. The trust gap is being measured by researchers but not yet addressed by markets.

That gap is the opportunity. Someone will fill it.

Sources

1. “Toward Understanding Security Issues in the Model Context Protocol Ecosystem.” arXiv 2510.16558. Catalogue of 67,057 servers across six registries.
2. Hasan, M. M., Li, H., Fallahzadeh, E., Rajbahadur, G. K., Adams, B., & Hassan, A. E. (2025, updated April 2026). “Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers.” arXiv 2506.13538.
3. Yosef, G. at Pynt. Analysis of 280+ popular MCP servers, late 2025. Coverage in deeplearning.ai “The Batch.”
4. “Give Them an Inch and They Will Take a Mile: Understanding and Measuring Caller Identity Confusion in MCP-Based AI Systems.” arXiv 2603.07473. 87 widely-used open-source MCP projects analysed.
5. “Breaking the Protocol: Security Analysis of the Model Context Protocol Specification and Prompt Injection Vulnerabilities in Tool-Integrated LLM Agents.” arXiv 2601.17549. See also Help Net Security, “When trusted AI connections turn hostile,” October 16, 2025.
6. Digital Applied. “100 MCP Servers Stress-Tested: Reliability Findings.” April 2026. digitalapplied.com.
7. Datastealth. “MCP Security: 6 Risks Enterprise Teams Face in 2026.” Citing published research on tool overload effects in LLM agents.
8. SentinelOne. “Model Context Protocol (MCP) Security: Complete Guide.” April 2026; AuthZed, “A Timeline of Model Context Protocol (MCP) Security Breaches,” updated April 2026.
9. Cloudflare blog. “Introducing the Agent Readiness score.” April 17, 2026. blog.cloudflare.com/agent-readiness.