Z.AI showed how to build production-ready agentic systems on GLM-5 with tool calling

Q: What is the source?

Originally published on MarkTechPost. Hamidun News processes and adapts the material with AI.

Q: When was it published?

Apr 28, 2026. Reading time: 3 min.

Z.AI showed how to build not just a chatbot but a production-ready agentic stack from GLM-5. The tutorial covers the essentials: SDK and OpenAI-compatible…

Hamidun News Editorial

AI monitoring · MarkTechPost

Apr 28, 2026· 3 min

AI-processed from MarkTechPost; edited by Hamidun News

Z.AI showed how to build production-ready agentic systems on GLM-5 with tool calling — Source: MarkTechPost. Collage: Hamidun News.

◐ Listen to article

Z.AI released a technical breakdown that is rare in its usefulness, in which GLM-5 is presented not as yet another chat interface, but as the foundation for production-ready agentic systems. The material consistently follows the path from the first request to the model to a full-fledged multi-step agent with tool calling, streaming output, thinking mode, and support for multi-turn dialogue.

For developers, this is an important signal: the bet is being placed not only on the quality of answers, but also on the maturity of integration into a real product stack. At the beginning, the authors set up a basic environment through zai-sdk, openai, and rich, obtain an API key from environment variables or through hidden terminal input, and launch the ZaiClient for initial model calls. Next, a minimal chat completion scenario is shown: GLM-5 answers a simple technical question, after which the same interface is used in streaming mode, where tokens arrive as they are generated.

This is not a cosmetic feature. For interfaces, assistants, and agent panels, streaming output directly affects perceived speed, and therefore the suitability of the model for work scenarios where the user does not want to wait for a long answer to complete entirely. The next section is devoted to thinking mode and multi-turn context.

In the example for GLM-5, thinking is explicitly enabled with the enabled parameter, and in the streaming response, reasoning_content is read separately, followed by the model's final answer. After this, the authors build a chain of several messages: first they ask about the difference between list and tuple in Python, then clarify when NamedTuple is appropriate, and finally request a practical example with type hints. The point of this section is not the questions themselves, but the demonstration that the model retains context between turns, and the developer can track the growth of message history and token consumption.

For agentic systems, this is a basic requirement: without stable dialogue memory, complex chains quickly fall apart. The most practical part begins where GLM-5 is connected to external functions. The tutorial describes two tools: weather lookup and a calculator for safe expression evaluation.

The model receives a natural language request, itself decides which tool to call, returns arguments, local code executes the function, and then the result is passed back into the model's context for a final answer. Immediately after this, structured output is shown: GLM-5 is asked to extract financial data from text and return clean JSON without explanation. This is already very close to a typical production pattern where the model must not only write beautifully, but also consistently output machine-readable results for pipelines, CRM, analytics, or internal backend services.

The final technical section brings it all together in a GLM5Agent class. It adds several tools at once: weather, calculator, current time, and unit conversion. The agent works iteratively, itself calls the necessary functions as it solves a task, and continues the cycle until it gets a final answer or hits a step limit.

On a separate example, the authors compare how a tricky logic problem performs with thinking mode enabled and disabled, measuring response time and the volume of generated tokens. And in conclusion, they show that GLM-5 can also be used through the standard OpenAI Python SDK: it is enough to change the base_url, and the familiar chat.completions interface continues to work.

According to Z.AI's official documentation, GLM-5 has a context of up to 200K tokens and a maximum of 128K output tokens, which makes such a scenario particularly interesting for long multi-step tasks. What does this mean in practice?

Z.AI is trying to lower the migration bar for teams that already have OpenAI-compatible code but need a more pronounced agentic workflow: tools, JSON, streaming, dialogue memory, and managed execution cycles. It is also important that the tutorial does not go into abstractions, but shows the minimal working loop around the model.

However, there should be no illusions: examples with weather and calculator remain educational, and for production you will still need authorization, logging, retries, tool restrictions, and protection against unsafe calls. But as a map of GLM-5's capabilities, this material is useful: it shows that Z.AI's model is already packaged not just as an LLM for chat, but as a building block for applied AI agents.

Hamidun News

AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.

Telegram channel RSS hamidun.com

Want to stop reading about AI and start using it?

AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.

🎓 Academy — 7 days free Free consultation