LangGraph gets fault tolerance: retries, timeouts, and error handlers for production agents


Hamidun News Editorial

AI monitoring · LangChain Blog

Jun 30, 2026· 2 min

AI-processed from LangChain Blog; edited by Hamidun News


LangChain published a detailed guide to three resilience primitives built into LangGraph. Without them, a production agent tends to break in exactly the places where the prototype ran like clockwork.

Why prototypes break in production

In laboratory conditions, a LangGraph agent looks reliable: input data is fixed, external APIs respond quickly, and the user is always available. Production is different. External services hang or return 503 under load. LLM providers at peak demand can delay responses for minutes. In human-in-the-loop scenarios, a person may not respond for hours. Without special handling, each of these cases leaves the agent hung or crashed. The classic solution is to wrap each call in try/except, write timers, and add retry logic by hand. This works, but the defensive code grows, gets mixed with business logic, and becomes a source of errors itself. LangGraph offers a different approach: three primitives that are built into the engine and configured declaratively.

RetryPolicy: automatic retry attempts

RetryPolicy configures automatic retry attempts with exponential backoff. The configurable parameters are the maximum number of attempts, the initial delay, the maximum delay, and the factor by which the delay grows between attempts. The policy can be set narrowly, for a specific node that calls an unstable external API, or applied to the entire graph as a global default. The second option is convenient when all interaction with external services should follow the same recovery rules.
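The backoff schedule behind such a policy can be sketched in plain Python. This is an illustration of the technique, not the LangGraph API: `BackoffPolicy`, `run_with_retry`, and the field names are hypothetical and chosen only to mirror the four parameters listed above.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class BackoffPolicy:
    """Hypothetical stand-in for the four retry settings (not LangGraph's class)."""
    max_attempts: int = 3
    initial_delay: float = 0.5   # seconds to wait before the second attempt
    max_delay: float = 8.0       # upper bound on any single wait
    growth_factor: float = 2.0   # multiplier applied to the wait after each failure

    def delays(self):
        """Waits between attempts; there is one fewer wait than attempts."""
        return [
            min(self.initial_delay * self.growth_factor ** i, self.max_delay)
            for i in range(self.max_attempts - 1)
        ]


def run_with_retry(fn, policy, sleep=time.sleep):
    """Call fn until it succeeds or attempts run out, then re-raise the last error."""
    if policy.max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = policy.delays()
    for attempt in range(policy.max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == policy.max_attempts - 1:
                raise  # retries exhausted: surface the failure to the caller
            sleep(delays[attempt])
```

The `sleep` argument is injectable so the schedule can be checked without actually waiting. A production version would usually also restrict which exception types are retried and add random jitter to the delays.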

TimeoutPolicy: time limits

TimeoutPolicy solves a different task: it limits the time a node is allowed to take. Two types of limits are supported:

* `wall_clock_timeout`: the maximum elapsed time for node execution, from start to completion.
* `idle_timeout`: the maximum idle time. This matters most in human-in-the-loop scenarios: if the user hasn't responded in N minutes, the agent should continue along an alternative branch or finish with an error.
* Both limits can be combined in one policy; whichever is reached first triggers.
* When a limit is exceeded, the engine automatically raises an exception.
* The policy can be applied to a node, a subgraph, or the entire graph.

The key advantage of both primitives is that they live inside the engine and see the full context of the graph state. External decorators and wrappers cannot do this.
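How the two limits interact can be shown with a small plain-Python tracker. This is a sketch under assumptions, not LangGraph's implementation: `TimeoutTracker`, `NodeTimeout`, and the `touch`/`check` methods are hypothetical, and "activity" is assumed to mean any event that should reset the idle timer, such as a user reply.

```python
import time


class NodeTimeout(Exception):
    """Raised when either limit is exceeded (hypothetical name)."""


class TimeoutTracker:
    """Tracks a wall-clock limit and an idle limit; whichever is hit first fires."""

    def __init__(self, wall_clock_timeout=None, idle_timeout=None, clock=time.monotonic):
        self.wall_clock_timeout = wall_clock_timeout
        self.idle_timeout = idle_timeout
        self._clock = clock
        self._started = clock()
        self._last_activity = self._started

    def touch(self):
        """Record activity; this resets only the idle timer."""
        self._last_activity = self._clock()

    def check(self):
        """Raise NodeTimeout if a configured limit has been exceeded."""
        now = self._clock()
        if (self.wall_clock_timeout is not None
                and now - self._started > self.wall_clock_timeout):
            raise NodeTimeout("wall_clock_timeout exceeded")
        if (self.idle_timeout is not None
                and now - self._last_activity > self.idle_timeout):
            raise NodeTimeout("idle_timeout exceeded")
```

A caller would invoke `check()` periodically while the node runs and `touch()` whenever progress is observed. The idle limit can fire long before the wall-clock limit when nothing happens, while steady activity keeps the idle timer quiet until the overall budget runs out.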

error_handler and the SAGA pattern

error_handler is the third primitive; it triggers after all retries are exhausted. This is the final recovery point: undoing already completed actions, sending notifications, saving diagnostics, and moving the agent to a safe state. For multi-step agents with real side effects (reserving resources, charging funds, creating records in external systems), LangChain recommends the SAGA pattern.

The idea: each step of the agent is paired with a compensating operation that undoes its effect. If step N fails after steps 1 through N-1 have completed successfully, the compensating operations run in reverse order and the system returns to a consistent state. LangGraph lets you embed SAGA directly in the graph: compensations are stored next to nodes, and error_handler runs their chain on failure.

"Having resilience policies inside the engine, not outside it, is a fundamental difference: recovery logic gets the full context of the graph state," the LangChain blog notes.
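The reverse-order compensation at the heart of SAGA can be sketched independently of any framework. The `run_saga` function and the step names below are hypothetical; in LangGraph the equivalent logic would live in graph nodes and the error handler, whose exact signatures are not shown in the source.

```python
def run_saga(steps):
    """Run (name, action, compensate) steps in order.

    If an action raises, the compensations of all previously completed
    steps run in reverse order, and the original error is re-raised.
    The failed step itself is not compensated, since it never completed.
    """
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            for _done_name, undo in reversed(completed):
                undo()
            raise
        completed.append((name, compensate))


if __name__ == "__main__":
    log = []

    def decline():
        raise RuntimeError("payment declined")

    booking = [
        ("reserve_seat", lambda: log.append("reserve"), lambda: log.append("release seat")),
        ("charge_card", decline, lambda: log.append("refund")),
    ]
    try:
        run_saga(booking)
    except RuntimeError:
        pass
    print(log)
```

This sketch assumes compensations themselves do not fail. A real system would need to decide what happens when an undo raises, for example by logging it and continuing with the remaining compensations.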

What this means

LangGraph offers mature tools for moving AI agents from prototype to real-world operation. Retries, timeouts and compensating transactions are built into the engine and don't require manual wrapping of each node. For teams building agents under production loads, this reduces the volume of defensive code and makes failure behavior predictable and controllable.

