
Raft Analyzed Where MCP and Thin MCP Add Latency to AI Agents


AI-processed from Habr AI; edited by Hamidun News

MCP is often presented as a universal way to neatly connect tools to LLM applications, but in practice, that modularity comes with a latency cost. In a new analysis, Raft compared several architectures and showed that the problem usually lies not in one specific component, but in how the request path from agent to tool and back is structured overall.

Where latency is born

The author started with a basic question: how much latency does MCP itself add once the network is removed and everything runs in a single process? To find out, they compared a monolith without MCP against a monolith with in-process MCP. The pattern itself adds relatively modest overhead: around 10–11 ms on average, sometimes up to 35 ms. This is an important baseline: if an agent slows down by hundreds of milliseconds, the culprit is usually not MCP itself but the outer layer around it.
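The idea of such a baseline can be sketched in a few lines of Python. This is a minimal illustration, not Raft's benchmark: `search_tool`, `call_direct`, and `call_via_dispatch` are hypothetical names, and the JSON encode/route/decode step only stands in for an in-process protocol layer rather than using a real MCP SDK.

```python
import json
import statistics
import time


def search_tool(query: str) -> dict:
    """Hypothetical tool that stands in for real agent tool logic."""
    return {"query": query, "hits": [query.upper()]}


def call_direct(query: str) -> dict:
    # Monolith path: the agent calls the tool function directly.
    return search_tool(query)


def call_via_dispatch(query: str) -> dict:
    # Assumed stand-in for an in-process protocol layer: the call is encoded
    # as a JSON message, routed by tool name, and the result is decoded again.
    request = json.dumps({"tool": "search", "arguments": {"query": query}})
    message = json.loads(request)
    handlers = {"search": search_tool}
    result = handlers[message["tool"]](**message["arguments"])
    return json.loads(json.dumps(result))


def mean_latency_ms(fn, runs: int = 200) -> float:
    """Average wall-clock time of fn over several runs, in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn("raft")
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)


direct_ms = mean_latency_ms(call_direct)
dispatch_ms = mean_latency_ms(call_via_dispatch)
print(f"direct={direct_ms:.4f} ms, dispatch={dispatch_ms:.4f} ms, "
      f"overhead={dispatch_ms - direct_ms:.4f} ms")
```

The absolute numbers from a toy like this will be far smaller than the figures in the article; the point is only the method of timing the same call with and without the extra layer.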

Next, the comparison moved to a more realistic architecture, with MCP servers deployed in separate Docker containers. Here the picture changes noticeably: the average additional latency for the main tools grew to about 169 ms per call. Even so, traces showed that this overhead is not the main consumer of time. The heaviest steps are computing embeddings and running the reranker, while the vector database search takes relatively little. In other words, MCP adds a cost, but it is not always the main bottleneck in the chain.
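A per-stage breakdown like the one seen in the traces can be approximated with a small timing helper. The sketch below is an assumption-laden illustration: `span`, `embed`, `vector_search`, and `rerank` are hypothetical placeholders, not the tracing setup or the components Raft used.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock time per stage, in milliseconds.
stage_ms = defaultdict(float)


@contextmanager
def span(name: str):
    """Record how long the wrapped block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_ms[name] += (time.perf_counter() - start) * 1000.0


# Hypothetical placeholders for the real pipeline stages.
def embed(text: str) -> list:
    return [float(len(text))]


def vector_search(vector: list) -> list:
    return ["doc-b", "doc-a"]


def rerank(query: str, docs: list) -> list:
    return sorted(docs)


def handle_tool_call(query: str) -> list:
    with span("embeddings"):
        vector = embed(query)
    with span("vector_search"):
        candidates = vector_search(vector)
    with span("rerank"):
        ranked = rerank(query, candidates)
    return ranked


result = handle_tool_call("where is latency born")
for name, ms in sorted(stage_ms.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14} {ms:.4f} ms")
```

With real components plugged in, sorting stages by accumulated time shows whether the protocol layer or a heavy step such as embeddings dominates the call.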

What the tests revealed

The article breaks down several scenarios to separate the effects of transport, serialization, and the runtime itself.

  • S1, MCP in-process: around 10–11 ms of additional latency, meaning the runtime itself is relatively lightweight.
  • S2, separate MCP over Docker network: around 169 ms of overhead per call on average due to network, serialization, and inter-process communication.
  • S3a, Thin MCP over HTTP + JSON: in one measurement series, overhead dropped to around 128 ms, but the result turned out to be unstable and could be noticeably worse in repeated runs.
  • S3b, Thin MCP over HTTP + YAML: latency increased to around 274 ms, indicating an additional cost of serialization and deserialization.
  • S4 and S5: ZeroMQ gave around 200 ms, but with more predictable behavior, while a C++ runtime reduced overhead to around 130–145 ms without changing the order of magnitude.
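To make the scenarios easier to compare, the reported figures can be put side by side against the S2 baseline. The values below are the approximate numbers quoted in the list above; where the article gives a range, the midpoint is used, which is an assumption made here purely for the arithmetic.

```python
# Approximate per-call overhead in ms, as quoted in the list above.
overhead_ms = {
    "S1": 10.5,           # midpoint of the 10–11 ms range (assumption)
    "S2": 169.0,
    "S3a": 128.0,         # one measurement series; reported as unstable
    "S3b": 274.0,
    "ZeroMQ": 200.0,
    "C++ runtime": 137.5,  # midpoint of the 130–145 ms range (assumption)
}

baseline = overhead_ms["S2"]

# Rank scenarios from lowest to highest overhead.
ranking = sorted(overhead_ms, key=overhead_ms.get)

for name in ranking:
    ms = overhead_ms[name]
    print(f"{name:<12} {ms:>6.1f} ms  {ms - baseline:+7.1f} ms vs S2")

# Cost of switching the Thin MCP payload from JSON to YAML.
yaml_penalty_ms = overhead_ms["S3b"] - overhead_ms["S3a"]
print(f"YAML vs JSON: +{yaml_penalty_ms:.0f} ms")
```

Read this way, every out-of-process variant stays within the same order of magnitude, and the JSON-to-YAML switch alone costs roughly as much as the whole S3a overhead.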

The main takeaway from these numbers is that intuitive optimizations don't always work as expected. Replacing JSON with YAML did not speed up the system; it made it slower. Switching from HTTP to IPC did not yield automatic gains either: the iceoryx2 implementation did not show the expected improvement, and only the ZeroMQ variant proved more beneficial in practice, thanks to its asynchronous model. Even C++ helped only moderately, not dramatically.
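Why an asynchronous model can matter more than the transport itself is easy to show in isolation. The sketch below does not use ZeroMQ; it relies on Python's `asyncio`, with `asyncio.sleep` standing in for waiting on a remote tool, and the tool names and delays are hypothetical.

```python
import asyncio
import time


async def call_tool(name: str, delay_s: float) -> str:
    # asyncio.sleep stands in for waiting on a remote tool (assumption).
    await asyncio.sleep(delay_s)
    return name


async def run_sequential(calls):
    # Each call waits for the previous one to finish.
    return [await call_tool(name, delay) for name, delay in calls]


async def run_concurrent(calls):
    # Independent calls wait at the same time.
    results = await asyncio.gather(*(call_tool(n, d) for n, d in calls))
    return list(results)


def timed(coro_fn, calls):
    start = time.perf_counter()
    result = asyncio.run(coro_fn(calls))
    return result, time.perf_counter() - start


# Three independent, hypothetical tool calls of 50 ms each.
CALLS = [("docs_search", 0.05), ("code_search", 0.05), ("ticket_search", 0.05)]

seq_result, seq_s = timed(run_sequential, CALLS)
con_result, con_s = timed(run_concurrent, CALLS)
print(f"sequential: {seq_s * 1000:.0f} ms, concurrent: {con_s * 1000:.0f} ms")
```

The gain only applies to calls that do not depend on each other; a chain where each step needs the previous result still pays for every hop in sequence.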

Why thin doesn't save the day

In the article, Thin MCP comes across not as a magic speed button but as a way to simplify the architecture. In this scheme, the proxy layer stays minimal and only translates calls, while the business logic moves to separate HTTP services. This approach provides language independence, simplifies scaling, and lets you write executors in Go, Rust, or C++, even if a full MCP SDK doesn't exist for them yet.
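The "only translates calls" part of this scheme might look roughly like the sketch below. It is an illustration under stated assumptions: the executor URLs, the `EXECUTORS` table, the `translate_call` helper, and the JSON payload shape are all hypothetical and are not taken from Raft's implementation.

```python
import json
import urllib.request

# Hypothetical routing table: tool name -> HTTP executor endpoint.
EXECUTORS = {
    "search_docs": "http://search-executor:8080/run",
    "rerank": "http://rerank-executor:8080/run",
}


def translate_call(tool: str, arguments: dict) -> urllib.request.Request:
    """Turn a tool call into an HTTP request without adding business logic."""
    url = EXECUTORS[tool]
    body = json.dumps({"arguments": arguments}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = translate_call("search_docs", {"query": "mcp latency"})
print(request.get_method(), request.full_url)
```

The request is only built here, not sent; in a running proxy it would be passed to something like `urllib.request.urlopen`. Because the proxy holds nothing but the routing table, the executor behind each URL can be written in any language that speaks HTTP.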

Thin MCP is more of an architectural tool than a latency optimization method.

The practical nuance is that the thin approach can look faster in one run and fail to reproduce that result in the next. For a production system, this is critical: predictable behavior under repeated load sometimes matters more than the lowest p95 from a single run. That's why Raft draws a rather strict but useful conclusion: if you want to truly speed up an AI agent, it is not enough to change the language or protocol; you need to rebuild the interaction scheme between the proxy, backend components, and heavy computational steps.
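One way to check for this kind of instability is to compute p95 per run and then look at how much it moves between runs. In the sketch below, the variant names and all latency samples are invented for illustration and are not Raft's measurements; `p95` uses the nearest-rank definition, which is one of several common conventions.

```python
def p95(samples: list) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


# Made-up latencies in ms: three repeated runs of 20 calls per variant.
runs = {
    "steady": [
        [200] * 18 + [210, 230],
        [200] * 18 + [212, 228],
        [198] * 18 + [208, 240],
    ],
    "fast_but_unstable": [
        [120] * 18 + [130, 150],
        [120] * 18 + [290, 310],
        [125] * 18 + [180, 200],
    ],
}

summary = {}
for variant, variant_runs in runs.items():
    per_run = [p95(run) for run in variant_runs]
    summary[variant] = {
        "best_p95": min(per_run),
        "spread": max(per_run) - min(per_run),  # run-to-run variation
    }
    print(variant, per_run, summary[variant])
```

In this invented data, the second variant wins on its best run but its p95 swings by 160 ms between runs, while the first stays within 4 ms, which is the trade-off the paragraph above describes.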

What this means

For teams building AI agents, this is a good antidote to superficial optimization. If the system is slow, first look at the number of hops between components, blocking operations, the concurrency model, and heavy stages such as embeddings and reranking. Thin MCP can make a system cleaner and more flexible, and C++ or IPC can provide additional gains, but the decisive effect appears only when the architecture itself stops routing requests through unnecessary layers.

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.