Habr AI Explained How to Build a Production Agent With Durable State, Steps, and Events

Q: What is the source?

Originally published on Habr AI. Hamidun News processes and adapts the material with AI.

Q: When was it published?

Apr 27, 2026. Reading time: 3 min.


Hamidun News Editorial




For AI agents to work reliably in production, keeping everything in process memory is not enough: after a restart, the agent must still know what the user requested, what plan was already built, which tools have been executed, and where execution stopped. In the first part of a new practical breakdown on Habr AI, the author proposes building this durable state around three basic entities (turn, step, and event) and explains why, without them, long agent scenarios quickly turn into an opaque and fragile pipeline. The starting point is simple: if an agent keeps state only in memory, any service failure wipes out the task.

For a demo this is acceptable, but a production agent may analyze documents, wait for user confirmation, and execute a long chain of actions, so after a failure it needs to recover from a database, not from fragments of a prompt. As a minimal set of persistent entities, the author lists AgentTurn, AgentPlanItem, and AgentEvent, and notes that ApprovalGrant, SessionContext, and BackgroundJob almost inevitably appear alongside them. The idea is that durable state describes not just the final answer but the entire path to it: the original request, the normalized command, confirmation flags, execution statuses, and any errors.
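The difference between in-memory and database-backed state can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3 module, not the article's implementation: the `agent_turn` table layout and the helper names `open_store`, `save_turn`, and `find_unfinished_turns` are assumptions, while the status values are the ones the article gives for AgentTurn.

```python
import sqlite3

# Statuses after which a turn needs no further work (assumed split).
TERMINAL_STATUSES = ("completed", "failed")


def open_store(path: str) -> sqlite3.Connection:
    """Open the store and create the hypothetical agent_turn table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS agent_turn (
               turn_id      TEXT PRIMARY KEY,
               session_id   TEXT NOT NULL,
               message_text TEXT NOT NULL,
               status       TEXT NOT NULL
           )"""
    )
    return conn


def save_turn(conn, turn_id, session_id, message_text, status):
    """Write the turn's latest state and commit so it survives a crash."""
    conn.execute(
        "INSERT OR REPLACE INTO agent_turn "
        "(turn_id, session_id, message_text, status) VALUES (?, ?, ?, ?)",
        (turn_id, session_id, message_text, status),
    )
    conn.commit()


def find_unfinished_turns(conn):
    """Return (turn_id, status) pairs for turns that still need work."""
    rows = conn.execute(
        "SELECT turn_id, status FROM agent_turn "
        "WHERE status NOT IN (?, ?) ORDER BY turn_id",
        TERMINAL_STATUSES,
    )
    return rows.fetchall()
```

With a file path instead of an in-memory database, a restarted process could call `open_store` on the same file and then `find_unfinished_turns` to learn which turns were interrupted and in which status, without consulting the prompt history.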

AgentTurn in this scheme is a full record of one user turn. It stores a session identifier and turn_id, the message text, the normalized command, and a processing status such as created, planned, awaiting_approval, running, completed, or failed. Importantly, the turn also records the final output_text, or the error if execution failed.
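A turn record with these fields might look like the sketch below. The names turn_id, output_text, and error and the six status values come from the article; the other field names, the transition table `ALLOWED`, and the `advance` helper are assumptions added for illustration, since the article lists the statuses but not the allowed moves between them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TurnStatus(str, Enum):
    CREATED = "created"
    PLANNED = "planned"
    AWAITING_APPROVAL = "awaiting_approval"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transition rules (not specified in the article).
ALLOWED = {
    TurnStatus.CREATED: {TurnStatus.PLANNED, TurnStatus.FAILED},
    TurnStatus.PLANNED: {
        TurnStatus.AWAITING_APPROVAL, TurnStatus.RUNNING, TurnStatus.FAILED,
    },
    TurnStatus.AWAITING_APPROVAL: {TurnStatus.RUNNING, TurnStatus.FAILED},
    TurnStatus.RUNNING: {
        TurnStatus.AWAITING_APPROVAL, TurnStatus.COMPLETED, TurnStatus.FAILED,
    },
    TurnStatus.COMPLETED: set(),
    TurnStatus.FAILED: set(),
}


@dataclass
class AgentTurn:
    session_id: str
    turn_id: str
    message_text: str
    normalized_command: str
    status: TurnStatus = TurnStatus.CREATED
    output_text: Optional[str] = None
    error: Optional[str] = None

    def advance(self, new_status, *, output_text=None, error=None):
        """Move to new_status, rejecting moves the table does not allow."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(
                f"illegal transition {self.status.value} -> {new_status.value}"
            )
        self.status = new_status
        if new_status is TurnStatus.COMPLETED:
            self.output_text = output_text
        elif new_status is TurnStatus.FAILED:
            self.error = error
```

Routing every status change through one method is a design choice, not something the article prescribes: it gives a single place to persist the record and to reject impossible moves such as restarting a completed turn.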

This removes the critical dependence on the model's "memory": the backend can determine at any moment exactly what was happening with the request, even if the process was restarted. For long tasks this is especially important because a single request rarely comes down to a single model call; more often it involves a chain of file reads, tool invocations, verification, and result preparation. The next layer is AgentPlanItem, an individual step within a turn.

If a user asks to analyze a project and prepare a report, the agent can decompose the task into several actions: find documents, read relevant files, verify data, and assemble the final answer. For each step, the author proposes storing its own item_id, ordinal number, tool name, arguments, confirmation mode, and status. The article stresses that the modes safe_readonly, confirm_once, and mutating are not decorative: they make it possible to separate, in advance, safe read operations, actions that need a single approval, and potentially dangerous mutations.
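One way to read the three modes is sketched below. The field list and mode names follow the article; everything else is an assumption made for illustration: the step status values (pending, completed, failed), the `needs_approval` and `next_action` helpers, and the policy that confirm_once is satisfied by one approval per tool while mutating needs an approval for each individual step.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class AgentPlanItem:
    item_id: str
    ordinal: int
    tool_name: str
    arguments: Dict[str, Any] = field(default_factory=dict)
    mode: str = "safe_readonly"  # safe_readonly | confirm_once | mutating
    status: str = "pending"      # assumed values: pending | completed | failed


def needs_approval(item, approved_tools, approved_items):
    """Decide whether the step must wait for the user (assumed policy)."""
    if item.mode == "safe_readonly":
        return False
    if item.mode == "confirm_once":
        return item.tool_name not in approved_tools
    if item.mode == "mutating":
        return item.item_id not in approved_items
    raise ValueError(f"unknown mode: {item.mode}")


def next_action(plan, approved_tools=frozenset(), approved_items=frozenset()):
    """Return ("run" | "ask_approval", item) for the first unfinished step.

    Returns None when every step is completed. A failed step is offered
    again, which is where a retry policy would hook in.
    """
    for item in sorted(plan, key=lambda i: i.ordinal):
        if item.status == "completed":
            continue
        if needs_approval(item, approved_tools, approved_items):
            return ("ask_approval", item)
        return ("run", item)
    return None
```

For a plan with a read-only `find_documents` step followed by a mutating `write_report` step (both tool names are hypothetical), `next_action` would return the read step first and, once that is completed, ask for approval before the write.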

As a result, the system knows not just that "the agent is doing something," but exactly which tool should start next, what has already completed, what can be retried, and where execution got stuck. The third mandatory entity is AgentEvent, a timeline of what is happening. Events are what turn a turn from a black box into an observable system.

Instead of one vague state, the frontend can read turn_started, tool_started, tool_progress, tool_completed, approval_requested, tool_failed, and turn_completed, and then display clear progress. The example in the article is concrete and therefore useful: the agent runs collect_documents, finds 12 documents, then reports progress of 40 out of 100 on the analyze_documents step, and on failure records an external service timeout and marks the error as retryable. For the user this means a normal UX instead of an endless "the agent is thinking"; for the development team it means the ability to debug the pipeline, conduct audits, analyze incidents, and recover tasks after a restart without manually reconstructing the history.
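The timeline from this example can be sketched as an append-only list of events that a frontend turns into a status line. The event type names and the tool names collect_documents and analyze_documents come from the article; the `seq` counter, the payload keys (tool, current, total, error, retryable), and the `describe_progress` function are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class AgentEvent:
    turn_id: str
    seq: int   # assumed: monotonically increasing position in the timeline
    type: str  # e.g. turn_started, tool_progress, tool_failed
    payload: Dict[str, Any] = field(default_factory=dict)


def describe_progress(events: List[AgentEvent]) -> str:
    """Render the most recent event as a human-readable status line."""
    if not events:
        return "No activity yet"
    last = max(events, key=lambda e: e.seq)
    p = last.payload
    if last.type == "tool_started":
        return f"Running {p['tool']}"
    if last.type == "tool_progress":
        return f"{p['tool']}: {p['current']} of {p['total']}"
    if last.type == "tool_completed":
        return f"Finished {p['tool']}"
    if last.type == "tool_failed":
        hint = "can be retried" if p.get("retryable") else "needs attention"
        return f"{p['tool']} failed: {p['error']} ({hint})"
    if last.type == "approval_requested":
        return f"Waiting for approval: {p['tool']}"
    if last.type == "turn_completed":
        return "Done"
    return "Started"
```

Because events are only appended and never overwritten, the same list that drives the progress display can also serve as the audit trail and as the input for incident analysis.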

The main conclusion from the Habr AI material is that a production agent in 2026 is not a lucky prompt with a wrapper but a stateful system with an execution journal. An agent without durable state at the level of turn, step, and event survives failures poorly, is opaque to the interface, and is almost impossible to maintain. This means the next stage in the evolution of agent applications lies not so much in new models as in backend architecture discipline: in how we capture state, control approvals, and turn the model's work into a reproducible operational process.
