
Anthropic, OpenAI and LangChain explained why AI agents need a harness

Major AI companies are competing increasingly not just on models, but on agent harness quality. Orchestration, memory, context control and reliable tool…

AI-processed from Habr AI; edited by Hamidun News
Source: Habr AI. Collage: Hamidun News.

The main problem with modern AI agents is not the quality of the base model but the layer around it: orchestration, memory, context management, and reliable tool operation. This layer, increasingly called the agent harness, turns a stateless LLM from an impressive demo into a system that can reliably execute long chains of actions, survive errors, and deliver results. Early on, many teams limit themselves to a chat interface, a few tool calls, and a simple ReAct loop.

For a prototype, this is enough: the model reasons, selects a tool, gets a result, and continues the dialogue. In production scenarios, however, systemic failures emerge quickly. The agent forgets what it did two or three steps earlier, repeats the same calls, loses intermediate results, and fills its context window with noise.
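One of the failure modes above, repeating the same call, can be caught with very little machinery. The following is a minimal sketch, not something taken from the article: the class name `RepeatCallGuard` and the `max_repeats` threshold are hypothetical, and it assumes tool arguments are JSON-serializable.

```python
import json
from collections import Counter


class RepeatCallGuard:
    """Counts identical tool calls and flags those repeated too often.

    Hypothetical helper for illustration; not part of any real framework.
    """

    def __init__(self, max_repeats: int = 2) -> None:
        self.max_repeats = max_repeats
        self._seen = Counter()

    def _key(self, tool: str, args: dict) -> str:
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} compare equal.
        return tool + ":" + json.dumps(args, sort_keys=True)

    def allow(self, tool: str, args: dict) -> bool:
        """Return False once the same call exceeds max_repeats."""
        key = self._key(tool, args)
        self._seen[key] += 1
        return self._seen[key] <= self.max_repeats


guard = RepeatCallGuard(max_repeats=2)
for _ in range(3):
    if not guard.allow("search", {"query": "agent harness"}):
        print("blocked: the agent is repeating itself")
```

A harness could respond to a blocked call by feeding the model a note that the result is already in memory, rather than silently dropping the request.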

Tools are a separate issue: they can return unexpected formats, respond slowly, or fail for no clear reason. Without a layer for control, logging, and recovery on top of them, system quality is determined not by the model's intelligence but by the fragility of the wrapper. This is why major players like Anthropic, OpenAI, Perplexity, and LangChain are building not just new models but full agent infrastructure.
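A recovery layer of this kind might look like the sketch below. It is an illustration under stated assumptions, not any vendor's implementation: `call_tool`, its parameters, and the shape of the returned dictionary are all hypothetical, and the timeout only stops waiting for a slow tool, since a Python thread cannot be forcibly killed.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_tool(tool, args, *, timeout_s=5.0, retries=2, backoff_s=0.0, validate=None):
    """Run tool(**args) with a timeout, output validation, and retries.

    Returns a dict instead of raising, so the orchestration loop can
    show the failure to the model as an ordinary observation.
    """
    last_error = "not attempted"
    for attempt in range(retries + 1):
        pool = ThreadPoolExecutor(max_workers=1)
        try:
            result = pool.submit(tool, **args).result(timeout=timeout_s)
            if validate is not None and not validate(result):
                raise ValueError(f"unexpected result format: {result!r}")
            return {"ok": True, "result": result, "attempts": attempt + 1}
        except FutureTimeout:
            last_error = f"timed out after {timeout_s}s"
        except Exception as exc:  # tool crashed or returned a bad format
            last_error = f"{type(exc).__name__}: {exc}"
        finally:
            # Stop waiting; a hung worker thread keeps running in the background.
            pool.shutdown(wait=False)
        if attempt < retries:
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    return {"ok": False, "error": last_error, "attempts": retries + 1}
```

Returning a structured failure rather than raising is a design choice: it lets the agent see "the tool timed out" and pick another route instead of crashing the whole session.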

At the center of this infrastructure is the orchestration loop: it decides when the model needs to think again, when to call a tool, what to save to memory, what to return to the user, and when to stop. In effect, the harness acts as an operating system for the agent. It sets execution rules, tracks session state, routes actions between the model and external services, and reduces the probability that the agent will fall into an infinite loop or lose sight of the task goal.
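The loop described here can be sketched in a few lines. This is a simplified illustration: the model is assumed to be any callable that takes the session state and returns a decision dictionary, and the decision format (`"action"`, `"tool"`, `"args"`, `"answer"`) and the name `run_agent` are hypothetical, not a real framework's API.

```python
from typing import Any, Callable

Decision = dict[str, Any]


def run_agent(
    task: str,
    model: Callable[[dict], Decision],
    tools: dict[str, Callable[..., Any]],
    max_steps: int = 8,
) -> dict:
    """Alternate between model decisions and tool calls until done or out of budget."""
    state: dict[str, Any] = {"task": task, "history": [], "status": "running"}
    for step in range(max_steps):
        decision = model(state)  # the model sees the task and all observations
        if decision["action"] == "final":
            state["status"] = "done"
            state["answer"] = decision["answer"]
            return state
        name = decision.get("tool")
        tool = tools.get(name)
        if tool is None:
            observation = f"error: unknown tool {name!r}"
        else:
            try:
                observation = tool(**decision.get("args", {}))
            except Exception as exc:  # surface the failure instead of crashing
                observation = f"error: {exc}"
        state["history"].append({"step": step, "tool": name, "observation": observation})
    # The step budget is what prevents an infinite loop.
    state["status"] = "stopped: step budget exhausted"
    return state


def scripted_model(state: dict) -> Decision:
    """Stand-in for an LLM: call one tool, then answer with its result."""
    if not state["history"]:
        return {"action": "tool", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"action": "final", "answer": state["history"][-1]["observation"]}


result = run_agent("add two numbers", scripted_model, {"add": lambda a, b: a + b})
print(result["status"], result["answer"])
```

Swapping `scripted_model` for a real LLM call changes nothing in the loop itself, which is the point: the stopping rule, error routing, and state tracking live in the harness, not in the model.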

Certain components of this approach can already be considered mandatory. First, tool management: interface descriptions, input validation, retries, timeouts, and error handling. Second, multi-layer memory: short-term for the current task, working memory for intermediate results, and longer-term memory for preferences, rules, and accumulated experience. Third, context control: selecting the fragments that truly matter, compressing history, removing irrelevant material, and passing the model only what affects the next step. When these mechanisms are absent, even a strong LLM degrades as task length grows. When they are present, the same model works noticeably more reliably.
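The memory layers and context control listed above can be combined in one small sketch. Everything here is an assumption made for illustration: the names `AgentMemory` and `build_context`, the use of a character budget as a stand-in for a token budget, and the placeholder summary line where a real harness would likely use a model-written summary.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    long_term: list[str] = field(default_factory=list)     # rules, preferences
    working: dict[str, str] = field(default_factory=dict)  # intermediate results
    short_term: list[str] = field(default_factory=list)    # steps of the current task


def build_context(memory: AgentMemory, budget_chars: int = 400, keep_recent: int = 3) -> str:
    """Assemble the next prompt context within a size budget.

    Rules and intermediate results are always kept; older steps are
    collapsed into a one-line placeholder; the most recent steps are
    dropped oldest-first until the text fits. If the rules and results
    alone exceed the budget, the result is still returned as-is.
    """
    fixed = [f"[rule] {r}" for r in memory.long_term]
    fixed += [f"[result] {k}: {v}" for k, v in memory.working.items()]
    # Guard keep_recent == 0, because list[-0:] would return the whole list.
    recent = list(memory.short_term[-keep_recent:]) if keep_recent > 0 else []
    omitted = len(memory.short_term) - len(recent)
    while True:
        lines = list(fixed)
        if omitted:
            lines.append(f"[summary] {omitted} earlier steps omitted")
        lines += [f"[recent] {m}" for m in recent]
        text = "\n".join(lines)
        if len(text) <= budget_chars or not recent:
            return text
        recent.pop(0)  # drop the oldest remaining step and try again
        omitted += 1


memory = AgentMemory(
    long_term=["answer in English"],
    working={"total": "42"},
    short_term=[f"step {i}" for i in range(10)],
)
print(build_context(memory))
```

The ordering encodes a priority: rules and intermediate results survive every trim, while raw step history is the first thing to go.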

Another important layer of the harness covers observability and quality assessment. It's not enough for the developer to know that an answer turned out poorly; they need to see the agent's entire path: what prompt was sent to the model, which tool was called, what it returned, where the error arose, and why the next step was chosen. Without this, it's impossible to debug agent behavior properly or improve the system iteratively. That's why mature stacks add tracing, metrics, sandboxed execution, manual checkpoints, and human-in-the-loop mechanisms for risky actions.
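Tracing and a human-in-the-loop gate can be illustrated together. The sketch below is hypothetical throughout: `Tracer`, `guarded_call`, the event names, and the `RISKY_TOOLS` set are invented for the example, and the `approve` callback stands in for whatever review step a real product would use.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical list of tools that need a human decision before running.
RISKY_TOOLS = {"delete_file", "send_payment"}


@dataclass
class Tracer:
    events: list[dict] = field(default_factory=list)

    def record(self, kind: str, **data: Any) -> None:
        """Append one timestamped event to the trace."""
        self.events.append({"ts": time.time(), "kind": kind, **data})

    def dump(self) -> str:
        """Serialize the trace as JSON Lines for later inspection."""
        return "\n".join(json.dumps(e, default=str) for e in self.events)


def guarded_call(
    tracer: Tracer,
    name: str,
    tool: Callable[..., Any],
    args: dict,
    approve: Callable[[str, dict], bool],
) -> Any:
    """Trace a tool call and require approval if the tool is risky."""
    tracer.record("tool_requested", tool=name, args=args)
    if name in RISKY_TOOLS and not approve(name, args):
        tracer.record("tool_denied", tool=name)
        return None
    try:
        result = tool(**args)
    except Exception as exc:
        tracer.record("tool_error", tool=name, error=repr(exc))
        raise
    tracer.record("tool_result", tool=name, result=result)
    return result


tracer = Tracer()
deny_all = lambda name, args: False  # stand-in for a human reviewer
guarded_call(tracer, "add", lambda a, b: a + b, {"a": 1, "b": 2}, deny_all)
guarded_call(tracer, "delete_file", lambda path: "deleted", {"path": "/tmp/x"}, deny_all)
print(tracer.dump())
```

With every request, denial, error, and result recorded in order, a developer can replay the agent's path step by step instead of guessing from the final answer.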

The practical effect is telling. The article cites an example from LangChain: the company improved not the model weights but the infrastructure around the model, and this was enough to climb sharply on TerminalBench 2.0, from outside the top thirty to fifth place. Another result is even more interesting: in a research project, an LLM was used to optimize its own agent infrastructure, and the system achieved a 76.4% pass rate, outperforming manually assembled solutions. This is an important signal for the market.

Competition is shifting from the question "which model is smarter" to "which execution environment better helps the model think, remember, plan, and correct itself." For developers and product teams, the conclusion is direct: if you want a working agent rather than a toy bot, you need to invest not just in model selection but in the harness. Winners will be those who best organize the execution cycle, memory, context, observability, and fault tolerance.

In the near future, the quality of this wrapper, not another jump in benchmarks, will be the main difference between a beautiful demo and a system you can trust with real work.

