MCP Server Design and Token Efficiency Report

A practical view of MCP server performance for software teams

As MCP moves from experiment to production consideration, software teams need more than protocol awareness. They need to know how server design affects cost, reliability and the quality of AI-assisted outcomes in practice.

This benchmark explores a simple but important question: what happens when identical business questions are asked through different connection methods? Across 30 controlled configurations, Cyclr tested three SaaS systems, two models and three connection approaches to measure the one metric AI product teams feel fastest: token consumption. The result is clear. Server design has a major impact on cost, and the wrong design can also reduce reliability.

This report compares Thick and Thin MCP servers with Direct API calls (CLI), looking at token counts and result accuracy.

This report is built for product, platform and engineering teams evaluating how to expose actions and data through MCP without creating unnecessary token overhead, bloated tool surfaces or fragile AI behavior. It shows why a well-scoped, typed MCP server can outperform both thicker MCP designs and raw direct API access.


The Report Below Looks Into:

Comparison of Thick MCP, Thin MCP and Direct API connection methods

Evidence that connection method can swing token cost by up to 4×

Why output tokens are not the main cost driver in MCP workflows

Benchmark findings on tool count, schema overhead and response bloat

Why raw Direct API access was not the cheapest option in practice

Evidence that trimming tool sets reduced cost without reducing accuracy

Independent research supporting the link between tool surface area, token usage and model performance

Nine practical rules for designing a more efficient MCP server

1. Executive Summary

Anthropic’s Model Context Protocol (MCP) has become the default way to connect language models to business systems, but the way an MCP server is designed has a first-order effect on cost, speed, and reliability. To quantify that effect, we ran 30 controlled tests against three widely used business applications — HubSpot, Oracle NetSuite, and QuickBooks — using two leading models and three different ways of connecting them to the underlying APIs.

The headline finding is unambiguous: how many tools you expose matters more than which model you choose. Scoping an MCP server to only the endpoints a task needs (a “Thin” server) cut token consumption by roughly 75% versus exposing the full API surface (a “Thick” server), while preserving — and in several cases improving — answer quality.

75% fewer tokens with a Thin MCP server vs. a Thick one

~2× the tokens used by Claude Haiku compared with GPT-5-mini on Thick servers

33% clean first-answer accuracy for raw Direct API access

Three takeaways for anyone building an MCP server:

Expose few, task-relevant tools. Every tool definition is loaded into the model’s context on every request. Thick servers paid that cost in full on each turn; Thin servers paid only a fraction of it.
Don’t assume “direct API” is cheaper. Removing the typed tool schema forced models to discover parameters by trial and error, producing retry loops, higher-than-expected token bills, and the worst task-completion rate of the three methods.
Design endpoints around real questions. The one question every configuration failed (NetSuite “base price”) failed because the value lived in an unexposed sub-resource — a connector design gap, not a model limitation.

2. Project Background and Setting

2.1 Objective

The goal of this test was to identify the most token-efficient way to let an AI assistant answer real business questions against SaaS systems of record, and to translate the result into concrete design guidance for teams building MCP servers. Token efficiency was the primary metric because input tokens are billed on every turn, consume a finite context window, and are the single largest lever on the running cost of an AI agent.

2.2 The test matrix

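Each test was defined by three parameters plus a question set. Crossing the parameters (three applications, two models, three connection methods) with the question sets (two each for HubSpot and NetSuite, one for QuickBooks) produced the 30 configurations analysed in this report.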

Parameter	Options tested
Application	HubSpot (CRM); Oracle NetSuite (ERP); QuickBooks (Accounting)
Model	Claude Haiku 4.5; GPT-5-mini
Connection method	Thick MCP (large API surface, ~24–38 tools); Thin MCP (task-scoped, ~4–7 tools); Direct API (CYCLR “Data on Demand”, no MCP tool layer)

“Thick” and “Thin” describe the same protocol with a different surface area. A Thick server advertises every endpoint of the connector (for example, 34 NetSuite operations); a Thin server advertises only the handful of read/update endpoints the question set requires.

Direct API removes the MCP tool layer entirely: the model calls a generic CYCLR method and must work out the right method ID and parameters itself.
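
As a rough sketch of how the matrix expands to 30 rows, the Python below crosses the three parameters with the question sets listed in Appendix A (two sets each for HubSpot and NetSuite, one for QuickBooks). The variable names are illustrative and are not part of the test harness.

```python
from itertools import product

# Question sets per application, as listed in Appendix A.
QUESTION_SETS = {"HubSpot": 2, "NetSuite": 2, "QuickBooks": 1}
MODELS = ["Claude Haiku 4.5", "GPT-5-mini"]
METHODS = ["Thick MCP", "Thin MCP", "Direct API"]

# One configuration per (application, question set, model, connection method).
configurations = [
    (app, question_set, model, method)
    for app, set_count in QUESTION_SETS.items()
    for question_set in range(1, set_count + 1)
    for model, method in product(MODELS, METHODS)
]

print(len(configurations))  # 30
```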

2.3 Question sets and ground truth

Each application was probed with two-question sets that mix a read (“how many…” / “what is…”) with a write or ranking task. Known-correct answers let us score answer quality independently of token cost.

Application	Representative question	Correct answer
HubSpot	How many companies / contacts in CRM? + update a contact’s email	5 companies / 9 contacts; update succeeds
NetSuite	Total amount of Sales Order 26744? Email for ABC Corp? Base price of item 128? PO 139 date?	£52.80; abc@acme.com; 10; 8 May 2026
QuickBooks	How many customers? Top five vendors by balance?	39 customers; Hall Properties, Diego’s, Robertson, Norton, Brosnahan

2.4 How tokens were measured

Every session report captured input tokens, output tokens, and the total for the full two-question exchange, along with the exact tool list exposed to the model. We analyse the total per task and, where relevant, the input/output split. All 30 raw session logs underpin the figures in this report; the complete per-test table appears in Appendix A.

3. Headline Results

Averaged across all three applications and both models, the connection method alone moved the token bill by more than 4×. Thin MCP was the cheapest at 21,780 tokens per task; Thick MCP was the most expensive at 87,997 tokens; Direct API landed in between at 34,426 tokens — higher than many would expect, for reasons explored in Section 4.3.

Figure 1. Average tokens per task by connection method (28 measured runs).

Output tokens were almost irrelevant to the cost story: across every run, model output accounted for just 2.6% of all tokens consumed. The cost of an AI agent in these workflows is overwhelmingly the cost of context — what you put in front of the model — not what it writes back. That is precisely the cost an MCP server’s design controls.

4. Detailed Token-Usage Analysis

4.1 Where the tokens go

Two distinct costs drive context size in an MCP workflow. The first is schema bloat: every tool a server exposes must be described to the model — a name, a natural-language description, and a full JSON schema for its parameters — and most MCP clients load all of those definitions into context on every single request.

The second is response bloat: when a tool returns raw JSON (a full “list all customers” payload, say), that data also flows back through the context window.

Our data shows both at work. Thick servers paid schema bloat up front (24–38 tool definitions on every turn) and then frequently triggered response bloat as well, because exposing a “list all” endpoint invited the model to call it and pull large payloads into context. Thin servers suppressed both: fewer definitions to load, and a constrained tool set that steered the model toward narrow, targeted calls.
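
To make schema bloat concrete, the sketch below estimates the context cost of a tool list. It rests on two assumptions that are not from the study: a crude heuristic of about four characters per token, and a made-up tool definition shape (name, description, JSON schema) applied uniformly to every tool. Real definitions vary in size, so treat the output as an illustration of the scaling, not a measurement.

```python
import json

def estimate_tokens(text: str) -> int:
    # Crude heuristic (assumption): roughly 4 characters per token.
    return max(1, len(text) // 4)

def make_tool(name: str) -> dict:
    # Hypothetical tool definition: a name, a description and a JSON schema.
    return {
        "name": name,
        "description": f"Retrieve or update {name} records in the connected system.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "id": {"type": "string", "description": "Record identifier"},
                "fields": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["id"],
        },
    }

def schema_cost(tools: list[dict]) -> int:
    # Every definition is serialised into context on each request.
    return sum(estimate_tokens(json.dumps(tool)) for tool in tools)

thick = [make_tool(f"operation_{i}") for i in range(34)]  # 34 tools, as on the NetSuite Thick server
thin = thick[:7]                                          # 7 tools, as on the NetSuite Thin server

print(schema_cost(thick), schema_cost(thin))
```

Under these assumptions the cost grows linearly with the number of definitions, which is the fixed part of the bill that a Thin server reduces.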

4.2 Thick vs. Thin MCP: the 75% saving

The clearest result in the study is the cost of surface area. Scoping the server down from the full API to a handful of relevant endpoints reduced tokens by about three-quarters for both models.


Figure 2. Average tokens by model and server design (MCP runs).

Model	Thick MCP	Thin MCP	Reduction
Claude Haiku 4.5	117,563	27,954	76%
GPT-5-mini	58,430	15,605	73%
Blended average	87,997	21,780	75%
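
The reduction column can be reproduced from the averages in the table. The short Python check below uses only the figures above; the function name is illustrative.

```python
# Average tokens per task (Thick MCP, Thin MCP), taken from the table above.
averages = {
    "Claude Haiku 4.5": (117_563, 27_954),
    "GPT-5-mini": (58_430, 15_605),
    "Blended average": (87_997, 21_780),
}

def reduction_pct(thick: int, thin: int) -> int:
    # Share of Thick-server tokens saved by moving to a Thin server.
    return round((1 - thin / thick) * 100)

for label, (thick, thin) in averages.items():
    print(f"{label}: {reduction_pct(thick, thin)}% fewer tokens")
# Claude Haiku 4.5: 76% fewer tokens
# GPT-5-mini: 73% fewer tokens
# Blended average: 75% fewer tokens
```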

Token cost rises directly with the number of exposed tools: the more an MCP server advertises, the more it costs to use, regardless of which model is on the other end.

Figure 3. Token cost rises with the number of tools the server exposes.

4.3 The Direct API surprise: a retry tax

Direct API access starts with the smallest possible context — there are no tool definitions to load at all — so it ought to be the cheapest option. It was not. At 34,426 tokens per task it cost 58% more than Thin MCP. The logs explain why: without a typed schema, the models had to guess method IDs and parameter names, hit errors such as “the method requires a PurchaseOrderId parameter,” and then loop through corrections and clarifying questions. Several NetSuite Direct runs burned 40,000–60,000 tokens this way and still did not complete the task. We call this the retry tax: the tokens you save on schema, you spend several times over on trial-and-error — and you still get a worse answer.
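
The Direct API average and the 58% premium can be checked against the eight measured Direct runs in Appendix A. The snippet below is a verification sketch only; the variable names are illustrative.

```python
from statistics import mean

# Total tokens for the eight measured Direct API runs (Appendix A).
direct_runs = {
    "HP-5": 22_128, "HP-12": 13_435,
    "NS-5": 42_871, "NS-6": 33_133, "NS-11": 60_403, "NS-12": 13_755,
    "QS-5": 34_732, "QS-6": 54_954,
}
THIN_MCP_AVERAGE = 21_780  # blended Thin MCP average from Section 4.2

direct_average = round(mean(direct_runs.values()))
premium_pct = round((direct_average / THIN_MCP_AVERAGE - 1) * 100)

print(direct_average, premium_pct)  # 34426 58
```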

4.4 Model differences: Claude Haiku vs. GPT-5-mini

On identical Thick servers, Claude Haiku 4.5 consumed roughly twice the tokens of GPT-5-mini (117,563 vs. 58,430 on average). Two factors plausibly contribute: differences in how each provider tokenises and represents the same tool schemas, and behavioural differences in retrieval — Claude tended to pull and enumerate full record lists (returning detailed, itemised answers), whereas GPT-5-mini more often returned a terse figure. The behavioural difference cuts both ways on quality (Section 5). The practical implication is that a wasteful server design is more punishing on some models than others, so the saving from a Thin design is largest exactly where the bills are highest.

4.5 Per-test breakdown

Ranking every measured run from cheapest to most expensive shows the three methods cleanly stratified: the Thin-MCP runs dominate the low end, the Thick-MCP runs occupy the top, and Direct-API runs scatter through the middle depending on how badly the model struggled with parameter discovery.

(For example, NS-4 is test 4 for Oracle NetSuite.)

Figure 4. Per-test token consumption, sorted (NS = NetSuite, HP = HubSpot, QS = QuickBooks).

5. Accuracy and Reliability

Cheaper is only better if the answer is still right. We scored each task’s first factual question (Q1) as cleanly correct only when the model returned the exact ground-truth answer without hedging or failing to complete.

Figure 5. Clean first-answer accuracy by connection method.

Thick and Thin MCP tied at 70% clean-correct — confirming that trimming the tool set did not cost accuracy. Direct API trailed badly at 33%, dragged down by the same parameter-discovery failures that inflated its token cost. The combination is decisive: Direct API was simultaneously more expensive than Thin MCP and far less reliable.
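
These rates follow from the Q1 column of Appendix A, counting only clean-correct marks and excluding HP-11, for which no answer was recorded. The sketch below transcribes that column in table order (HubSpot, NetSuite, QuickBooks); the labels are spelled out in words for readability.

```python
# Q1 outcomes per connection method, transcribed from Appendix A.
q1_results = {
    "Thick MCP": ["clean", "clean", "clean", "wrong", "clean", "clean",
                  "partial", "partial", "clean", "clean"],
    "Thin MCP": ["clean", "clean", "clean", "clean", "clean", "clean",
                 "partial", "partial", "clean", "wrong"],
    "Direct API": ["clean", "clean", "clean", "incomplete", "incomplete",
                   "partial", "incomplete", "wrong", "incomplete"],
}

def clean_rate(outcomes: list[str]) -> int:
    # Percentage of runs whose first answer was cleanly correct.
    return round(100 * outcomes.count("clean") / len(outcomes))

for method, outcomes in q1_results.items():
    print(method, clean_rate(outcomes))
# Thick MCP 70
# Thin MCP 70
# Direct API 33
```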

5.1 The NetSuite “base price” case: a design lesson, not a model failure

Every single configuration — thick, thin, direct, both models — failed to return the base price of item 128. In each case the model correctly retrieved the item, reported its cost (£10), and then explained that the sales/base price lived in a separate price sub-resource it could not reach. This is the most instructive failure in the study: the models behaved sensibly; the data simply was not exposed where a natural-language question would look for it. The fix is on the server side — surface the fields and endpoints that map to the questions users actually ask.

5.2 Quality quirks worth noting

GPT-5-mini miscounted twice — reporting 10 HubSpot contacts (correct: 9) and 40 QuickBooks customers (correct: 39), typically by counting a paginated subset.
GPT-5-mini sometimes hedged instead of answering — on QuickBooks it asked the user to clarify what “top vendors” meant rather than committing, which is safer but lowers first-answer completion.
Claude Haiku’s verbose, enumerated answers were consistently on the ground-truth value, at the cost of more tokens.
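
One way a count can go wrong is by stopping at the first page of a paginated result. The sketch below illustrates the difference using a hypothetical in-memory endpoint: the 39 records match the QuickBooks ground truth, while the page size, the `fetch_page` function and the `next_offset` cursor are assumptions made for illustration, not the connector's actual API.

```python
ALL_CUSTOMERS = [f"customer-{n}" for n in range(1, 40)]  # 39 records, as in the ground truth
PAGE_SIZE = 25  # hypothetical page size

def fetch_page(offset: int) -> dict:
    # Hypothetical paginated endpoint: one page of items plus a cursor.
    end = offset + PAGE_SIZE
    next_offset = end if end < len(ALL_CUSTOMERS) else None
    return {"items": ALL_CUSTOMERS[offset:end], "next_offset": next_offset}

def count_first_page_only() -> int:
    # The mistake: treating the first page as the whole data set.
    return len(fetch_page(0)["items"])

def count_all_pages() -> int:
    # Follow the cursor until the endpoint reports no further pages.
    total, offset = 0, 0
    while offset is not None:
        page = fetch_page(offset)
        total += len(page["items"])
        offset = page["next_offset"]
    return total

print(count_first_page_only(), count_all_pages())  # 25 39
```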

6. Why These Differences Occur: Research-Backed Insights

Our results line up closely with a growing body of public engineering analysis and academic work on tool-augmented LLMs. The mechanisms below explain the patterns we observed.

6.1 Tool definitions are a fixed, per-request tax

Because most MCP clients inject every tool definition into the system prompt on each request, the schema cost is paid whether or not a tool is used. Public measurements put a single tool definition at roughly 200–1,400 tokens, so a 50-tool server commonly consumes 10,000–25,000+ tokens before the user has said anything.

GitHub’s official MCP server alone has been measured at about 17,600 tokens of definitions per request, and stacking several servers pushes past 30,000 [2]. This is exactly the fixed overhead our Thick servers carried and our Thin servers avoided.
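
Because the definitions are re-sent on each request, the overhead compounds over a conversation. The sketch below multiplies the 17,600-token figure cited above by a few hypothetical request counts; it ignores any prompt caching a client or provider may apply, so it is an upper-bound illustration rather than a billing estimate.

```python
DEFINITION_TOKENS = 17_600  # per-request figure reported for GitHub's MCP server [2]

def cumulative_schema_tokens(definition_tokens: int, requests: int) -> int:
    # Definitions are injected on every request, so the cost scales linearly.
    return definition_tokens * requests

for requests in (1, 5, 10):  # hypothetical conversation lengths
    print(requests, cumulative_schema_tokens(DEFINITION_TOKENS, requests))
# 1 17600
# 5 88000
# 10 176000
```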

6.2 Controlled benchmarks find large MCP-vs-direct gaps

A controlled benchmark cited across the industry (Scalekit, 75 matched comparisons) found MCP using up to 32× more tokens than a leaner calling style for identical operations, with one simple task measured at about 1,365 tokens the lean way versus 44,026 via MCP; the overhead was “almost entirely schema” from dozens of injected definitions of which only one or two were used [5]. One team reported three connected servers eating 143,000 of a 200,000-token window (72% of capacity) before any real work began. Our Thick-vs-Thin gap is the same phenomenon at smaller scale.

6.3 Too many tools also degrades accuracy

Crucially, large tool sets don’t only cost tokens; they also make models choose worse. A long-context tool-calling study (LongFuncEval) observed up to an 85% drop in performance as the number of available tools grew, and further degradation as conversations lengthened [7]. Retrieval-based approaches that present only the few relevant tools (RAG-MCP) more than tripled tool-selection accuracy (from about 14% to 43%) while cutting prompt tokens by more than half [6]; related work reports a 99.6% token reduction by narrowing 50–100+ tools down to the 3–5 that matter. This is the academic basis for our recommendation to keep servers thin: it is good for the bill and for correctness.

6.4 Why Direct API was costly and unreliable

Stripping the schema removes the structured contract that tells a model exactly how to call an operation. The model is left to infer method IDs and parameter shapes, which is brittle — and each failed attempt and clarifying turn adds more context. This matches the retry loops visible in our Direct-API logs and is why a typed, narrowly-scoped MCP layer beats raw API access on both cost and reliability.
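
A minimal sketch of what the structured contract buys: with a typed schema, a wrong guess can be caught against the declared required parameters instead of surfacing as a runtime error and another retry turn. The tool name `get_purchase_order` and the guessed argument `orderId` are hypothetical; `PurchaseOrderId` is the parameter named in the error message quoted in Section 4.3.

```python
# Hypothetical typed tool definition with one required parameter.
TOOL_SCHEMA = {
    "name": "get_purchase_order",
    "inputSchema": {
        "type": "object",
        "properties": {"PurchaseOrderId": {"type": "string"}},
        "required": ["PurchaseOrderId"],
    },
}

def missing_parameters(schema: dict, arguments: dict) -> list[str]:
    # Compare a proposed call against the schema's required parameters.
    required = schema["inputSchema"].get("required", [])
    return [name for name in required if name not in arguments]

guessed_call = {"orderId": "139"}        # plausible guess without a schema
typed_call = {"PurchaseOrderId": "139"}  # call shaped by the schema

print(missing_parameters(TOOL_SCHEMA, guessed_call))  # ['PurchaseOrderId']
print(missing_parameters(TOOL_SCHEMA, typed_call))    # []
```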

7. Recommendations for Designing an MCP Server

The following recommendations translate the findings into design rules for a production MCP server. They are ordered by impact.

Expose the fewest tools that cover real tasks. A task-scoped server cut tokens ~75% with no accuracy loss. Treat every exposed tool as a recurring tax on every request.

Offer task- or role-scoped tool sets, not one giant surface. Ship multiple thin servers (or filtered views) per workflow rather than a single thick connector. Where breadth is unavoidable, add on-demand tool discovery so definitions load only when needed.

Prefer a typed MCP layer over raw API access. The schema is not overhead to be eliminated; it is what prevents the costly, unreliable retry loops we saw in Direct mode.

Design endpoints around the questions users ask. The universal “base price” failure was a missing field, not a model fault. Surface the values that natural-language questions target; fold key sub-resource fields into the primary response.

Control response bloat. Return compact, field-filtered payloads and paginate sensibly. A “list all” that dumps full records will flood context as surely as too many tool definitions.

Write tight tool descriptions and schemas. Token cost scales with description length and parameter count; trim verbose descriptions and remove unused optional parameters.

Make pagination and counts explicit. GPT-5-mini miscounted twice, typically by counting a paginated subset (Section 5.2). Provide a true total-count field or a tool whose contract returns the full count.

Measure tokens per tool and per task. Instrument the server, track schema and response token share, and set alerts; you cannot optimise what you do not measure.

Benchmark across models. The same server cost ~2× more on one model than another. Test your real workflows on each target model and size your context budget for the most expensive one.
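
Two of the rules above, scoped tool sets and compact responses, can be sketched in a few lines. Everything in this example is hypothetical: the tool names, the workflow labels and the record fields are invented for illustration (only the sales order number and total echo the question set), and a production server would apply the same idea through its own tool-listing and response-shaping logic.

```python
# Hypothetical catalogue: tool name -> workflows that need it.
TOOL_CATALOGUE = {
    "get_sales_order": {"order_lookup"},
    "get_purchase_order": {"order_lookup"},
    "get_customer": {"order_lookup", "customer_admin"},
    "update_customer": {"customer_admin"},
    "list_all_customers": {"bulk_export"},
    "delete_customer": {"bulk_export"},
}

def tools_for(workflow: str) -> list[str]:
    # Advertise only the tools a given workflow needs (a "thin" view).
    return sorted(name for name, workflows in TOOL_CATALOGUE.items()
                  if workflow in workflows)

def compact(record: dict, fields: list[str]) -> dict:
    # Return only the requested fields instead of the full record.
    return {key: record[key] for key in fields if key in record}

print(tools_for("order_lookup"))
# ['get_customer', 'get_purchase_order', 'get_sales_order']

record = {"id": "26744", "total": "52.80", "currency": "GBP",
          "audit_log": ["..."], "custom_fields": {}}
print(compact(record, ["id", "total", "currency"]))
# {'id': '26744', 'total': '52.80', 'currency': 'GBP'}
```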

8. Limitations and Data-Quality Notes

In the interest of an honest read, the following caveats apply to the dataset.

Two of the 30 logs were not fully usable: HP-6 was missing its token-usage section, and HP-11 was a duplicate export of HP-5 rather than the intended Claude/Direct/Set-2 run. Token averages are computed over the 28 measured runs; accuracy uses every run for which an answer was recorded.
Several Direct-API logs end on a proposed tool call, so their second-question outcome is recorded as “incomplete.” Where a run clearly looped or failed, that is itself a finding rather than a gap.
Each configuration was run once, not repeated; figures describe this dataset rather than statistically smoothed estimates. A small set of files carried minor labelling inconsistencies (e.g., QuickBooks set numbering), which we reconciled to the master test matrix.
Token counts reflect the CYCLR test harness, the specific connectors, and these two model versions as of the test date; absolute numbers will shift with other clients, connectors, and model releases, but the directional findings are consistent with independent benchmarks (Section 6).

9. Conclusion

For software teams building MCP servers, the message from 30 controlled tests is simple and actionable: design for scarcity. Expose only the tools a task needs, surface the fields users actually ask about, keep responses compact, and keep the typed layer in place. Doing so cut token consumption by about three-quarters in this study with no loss of accuracy — and avoided the hidden retry costs that made raw API access both pricier and less reliable than expected. A well-scoped MCP server is not merely a connectivity convenience; it is the single most effective cost-and-quality control available to an AI product team.

Appendix A. Full Test Dataset (30 configurations)

Test	App	Model	Connection	Tools	Question set	Total tokens	Q1 result	Q2 result
HP-1	HubSpot	Haiku	Thick	24	Q1	79,647	✓	✓
HP-2	HubSpot	Haiku	Thin	6	Q1	35,633	✓	✓
HP-3	HubSpot	GPT-5m	Thick	24	Q1	30,190	✓	✓
HP-4	HubSpot	GPT-5m	Thin	6	Q1	23,562	✓	✓
HP-5	HubSpot	Haiku	Direct	—	Q1	22,128	✓	inc.
HP-6	HubSpot	GPT-5m	Direct	—	Q1	—	✓	inc.
HP-7	HubSpot	Haiku	Thick	24	Q2	86,372	✓	✓
HP-8	HubSpot	Haiku	Thin	5	Q2	55,161	✓	✓
HP-9	HubSpot	GPT-5m	Thick	24	Q2	46,843	✗	✓
HP-10	HubSpot	GPT-5m	Thin	5	Q2	20,047	✓	✓
HP-11	HubSpot	Haiku	Direct	—	Q2	—	—	—
HP-12	HubSpot	GPT-5m	Direct	—	Q2	13,435	✓	inc.
NS-1	NetSuite	Haiku	Thick	34	Q1	141,138	✓	✓
NS-2	NetSuite	Haiku	Thin	7	Q1	11,135	✓	✓
NS-3	NetSuite	GPT-5m	Thick	34	Q1	81,353	✓	✓
NS-4	NetSuite	GPT-5m	Thin	7	Q1	6,632	✓	✓
NS-5	NetSuite	Haiku	Direct	—	Q1	42,871	inc.	✗
NS-6	NetSuite	GPT-5m	Direct	—	Q1	33,133	inc.	inc.
NS-7	NetSuite	Haiku	Thick	34	Q2	142,147	part.	✓
NS-8	NetSuite	Haiku	Thin	4	Q2	9,776	part.	✓
NS-9	NetSuite	GPT-5m	Thick	34	Q2	83,036	part.	✓
NS-10	NetSuite	GPT-5m	Thin	4	Q2	10,949	part.	✓
NS-11	NetSuite	Haiku	Direct	—	Q2	60,403	part.	inc.
NS-12	NetSuite	GPT-5m	Direct	—	Q2	13,755	inc.	inc.
QS-1	QuickBooks	Haiku	Thick	38	Q1	138,513	✓	✓
QS-2	QuickBooks	Haiku	Thin	4	Q1	28,066	✓	✓
QS-3	QuickBooks	GPT-5m	Thick	38	Q1	50,727	✓	part.
QS-4	QuickBooks	GPT-5m	Thin	4	Q1	16,836	✗	✓
QS-5	QuickBooks	Haiku	Direct	—	Q1	34,732	✗	✓
QS-6	QuickBooks	GPT-5m	Direct	—	Q1	54,954	inc.	✓

Legend: ✓ clean correct · ✗ wrong · part. = partial (e.g., returned cost instead of base price) · inc. = incomplete log/run · — not available (HP-6 missing tokens, HP-11 duplicate export).

Appendix B. Sources

[1] Anthropic Engineering — Code execution with MCP: building more efficient AI agents. https://www.anthropic.com/engineering/code-execution-with-mcp

[2] StackOne — MCP Token Optimization: 4 Approaches Compared. https://www.stackone.com/blog/mcp-token-optimization/

[3] MindStudio — How to Optimize MCP Server Token Usage. https://www.mindstudio.ai/blog/optimize-mcp-server-token-usage

[4] DeployStack — MCP Context Window Explained: Where Tokens Actually Go. https://deploystack.io/blog/how-mcp-servers-use-your-context-window

[5] Apideck / DEV — Your MCP Server Is Eating Your Context Window (Scalekit benchmark). https://www.apideck.com/blog/mcp-server-eating-context-window-cli-alternative

[6] Gan & Sun — RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection (arXiv:2505.03275). https://arxiv.org/abs/2505.03275

[7] LongFuncEval: Measuring long-context models for function calling (arXiv:2505.10570). https://arxiv.org/abs/2505.10570

[8] Semantic Tool Discovery for LLMs: A Vector-Based Approach to MCP Tool Selection (arXiv). https://arxiv.org/abs/2603.20313

[9] MCP Playground — MCP Token Counter: Why Your Tools Are Silently Eating Your Context Window. https://mcpplaygroundonline.com/blog/mcp-token-counter-optimize-context-window

Sources are paraphrased throughout; figures attributed to external benchmarks are reproduced as reported by those sources and were not independently re-measured in this study.