LangGraph gets fault tolerance: retries, timeouts, and error handlers for production agents
AI-processed from LangChain Blog; edited by Hamidun News
LangChain has published a detailed guide to three resilience primitives built into LangGraph. Without them, a production agent sooner or later breaks in exactly the places where the prototype ran like clockwork.
Why prototypes break in production
In laboratory conditions, a LangGraph agent looks reliable: the input data is fixed, external APIs respond quickly, and the user is always available. Real-world operation is different. External services hang or return 503 under load. LLM providers can delay responses for minutes at peak demand. In human-in-the-loop scenarios, a person may not respond for hours. Without special handling, each of these cases ends in a hung or crashed agent.
The classic solution is to wrap each call in try/except, write timers, and add retry logic by hand. This works, but the defensive code grows, gets tangled with business logic, and becomes a source of errors in its own right. LangGraph offers a different approach: three primitives that are built into the engine and configured declaratively.
RetryPolicy: automatic retry attempts
**RetryPolicy** configures automatic retries with exponential backoff. The configurable parameters are the maximum number of attempts, the initial delay, the maximum delay, and the factor by which the delay grows between attempts. The policy can be set narrowly, for a specific node that calls an unstable external API, or applied to the entire graph as a global default. The second option is convenient when all interaction with external services should follow the same recovery rules.
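
To make the four parameters concrete, here is a minimal standard-library sketch of a declarative backoff policy. It is not LangGraph's `RetryPolicy`: the class `BackoffPolicy`, its field names, and the helper `run_with_retry` are hypothetical and chosen only for illustration, so check the LangGraph reference for the real signature.

```python
import time
from dataclasses import dataclass
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class BackoffPolicy:
    """Hypothetical policy object; field names are assumptions, not LangGraph's API."""
    max_attempts: int = 3            # total tries, including the first one
    initial_interval: float = 0.5    # delay after the first failure, in seconds
    backoff_factor: float = 2.0      # growth coefficient between attempts
    max_interval: float = 30.0       # upper bound on any single delay
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError)

    def delay_before_retry(self, failed_attempts: int) -> float:
        """Delay to wait after `failed_attempts` failures (1-based), capped at max_interval."""
        raw = self.initial_interval * self.backoff_factor ** (failed_attempts - 1)
        return min(raw, self.max_interval)


def run_with_retry(fn: Callable[[], T], policy: BackoffPolicy,
                   sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn until it succeeds or the policy's attempts are used up."""
    if policy.max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return fn()
        except policy.retry_on:
            if attempt == policy.max_attempts:
                raise  # retries exhausted: let the caller (or an error handler) decide
            sleep(policy.delay_before_retry(attempt))
    raise AssertionError("unreachable")
```

The point of the declarative form is that the node function itself contains no retry code: the policy is a plain value that can be attached to one call site or shared as a default across many.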
TimeoutPolicy: time limits
**TimeoutPolicy** solves a different task: it limits the time a node is allowed to take. Two types of limits are supported:
* `wall_clock_timeout`: the maximum elapsed time for node execution, from start to completion
* `idle_timeout`: the maximum idle time. This is particularly important in human-in-the-loop scenarios: if the user hasn't responded in N minutes, the agent should continue along an alternative branch or finish with an error
* Both limits can be combined in one policy; whichever is reached first triggers
* When a limit is exceeded, the engine automatically raises an exception
* The policy can be applied to a node, a subgraph, or the entire graph

The key advantage of both primitives is that they live inside the engine and see the full context of the graph state. External decorators and wrappers cannot do this.
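
The difference between the two limits can be sketched with plain `asyncio`. This is an illustration of the semantics described above, not LangGraph's `TimeoutPolicy`: only the names `wall_clock_timeout` and `idle_timeout` come from the article, while `run_with_limits`, `NodeTimeout`, and the heartbeat callback are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")
Heartbeat = Callable[[], None]


class NodeTimeout(Exception):
    """Raised when a limit is exceeded; args[0] names the limit that fired."""


async def run_with_limits(node: Callable[[Heartbeat], Awaitable[T]], *,
                          wall_clock_timeout: float, idle_timeout: float) -> T:
    """Run `node`, enforcing a total time limit and a limit on time without progress."""
    loop = asyncio.get_running_loop()
    started = loop.time()
    last_activity = started

    def heartbeat() -> None:
        # The node calls this whenever it makes progress (e.g. the user replied).
        nonlocal last_activity
        last_activity = loop.time()

    task = asyncio.ensure_future(node(heartbeat))
    try:
        while True:
            now = loop.time()
            wall_left = started + wall_clock_timeout - now
            idle_left = last_activity + idle_timeout - now
            if wall_left <= 0:
                raise NodeTimeout("wall_clock_timeout")
            if idle_left <= 0:
                raise NodeTimeout("idle_timeout")
            # Sleep until the node finishes or the nearer of the two deadlines.
            done, _ = await asyncio.wait({task}, timeout=min(wall_left, idle_left))
            if done:
                return task.result()
    finally:
        if not task.done():
            task.cancel()
            await asyncio.gather(task, return_exceptions=True)
```

A node that keeps reporting progress can still hit the wall-clock limit, while a node that goes silent hits the idle limit first. Whichever deadline is nearer determines the exception, which matches the "whichever is reached first" rule.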
error_handler and the SAGA pattern
**error_handler** is the third primitive, which triggers after all retries are exhausted. This is the final recovery point: undoing actions that have already completed, sending notifications, saving diagnostics, and moving the agent to a safe state. For multi-step agents with real side effects, such as resource reservation, fund deduction, or the creation of records in external systems, LangChain recommends the SAGA pattern.
The idea: each step of the agent is paired with a compensating operation that undoes its effect. If step N fails after steps 1 through N-1 have completed successfully, the compensating operations for those completed steps run in reverse order, and the system returns to a consistent state. LangGraph allows you to embed SAGA directly in the graph: compensations are stored next to nodes, and `error_handler` runs their chain on failure.
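
The compensation order described above can be sketched in a few lines of plain Python. This is a generic illustration of the SAGA pattern, not LangGraph's `error_handler` API: the names `run_saga`, `SagaFailed`, and the step tuples are hypothetical.

```python
from typing import Callable, List, Tuple

# (step name, action, compensating operation that undoes the action)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


class SagaFailed(Exception):
    """Raised after compensations have run; args[0] is the name of the failed step."""


def run_saga(steps: List[Step], log: List[str]) -> None:
    """Run steps in order; on failure, undo the completed ones in reverse order."""
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            # Step N failed: compensate steps N-1 .. 1, newest first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            raise SagaFailed(name) from exc
        completed.append((name, compensate))
        log.append(f"done:{name}")
```

The failed step itself is not compensated here, on the assumption that it left no side effect behind. A real system would also need to decide what happens when a compensation fails, which this sketch does not address.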
"Having resilience policies inside the engine, not outside it, is a fundamental difference: recovery logic gets the full context of the graph state," according to the LangChain blog.
What this means
LangGraph offers mature tools for moving AI agents from prototype to real-world operation. Retries, timeouts and compensating transactions are built into the engine and don't require manual wrapping of each node. For teams building agents under production loads, this reduces the volume of defensive code and makes failure behavior predictable and controllable.