
Habr Explained How to Force LLMs to Calculate Without Errors Through Python Code Generation


AI-processed from Habr AI; edited by Hamidun News
Source: Habr AI. Collage: Hamidun News.

Habr published a practical breakdown of why LLMs regularly make arithmetic mistakes and how to work around this in a real product. Instead of asking the model to do the arithmetic itself, the author suggests giving it a different role: write a Python script and hand the computation off to an ordinary program.

Why LLMs Make Mistakes

The problem isn't that some particular chatbot "broke down." A transformer predicts the next token by probability rather than invoking a calculator. So when multiplying, rescaling a recipe, or computing utility bills, the model can output an answer that looks convincing but differs from the correct one by several percent or even tens of percent. To a user this looks like the model getting worse, but it is a fundamental architectural limitation: an LLM reproduces the pattern of a computation well without actually performing the operation.

"The model doesn't calculate. The model programs. And the program calculates."

This is especially dangerous in tasks where the error isn't immediately obvious. If a person can verify the result on paper anyway, they don't need an LLM. But when the model is used precisely to avoid manual calculation, a plausible number easily slips through unchecked. The article gives an example with utility bills: the model can recall an outdated tariff, multiply it "in its head," and format the answer nicely, even though the calculation inside is wrong.
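To make the contrast concrete, here is a minimal sketch of the kind of script a model could be asked to produce for a utility bill instead of a finished number. The tariff values, meter readings, and the `bill_line` helper are invented for illustration and do not come from the Habr article.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical tariffs per unit; illustrative numbers, not real rates.
TARIFFS = {
    "electricity_kwh": Decimal("6.73"),
    "cold_water_m3": Decimal("59.80"),
}

def bill_line(previous: int, current: int, rate: Decimal) -> Decimal:
    """Charge for one meter: consumption times rate, rounded to kopecks."""
    if current < previous:
        raise ValueError("current reading is lower than the previous one")
    consumption = Decimal(current - previous)
    return (consumption * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Made-up meter readings (previous, current).
electricity = bill_line(12450, 12637, TARIFFS["electricity_kwh"])
water = bill_line(318, 325, TARIFFS["cold_water_m3"])
total = electricity + water

print(f"Electricity: {electricity}")
print(f"Cold water:  {water}")
print(f"Total:       {total}")
```

Using `Decimal` rather than `float` is a choice made for this sketch: it keeps money amounts exact to the kopeck, and every intermediate value can be printed and checked.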

How the Scheme Works

The working scheme is built around a division of roles. A user sends a task through a messenger, the LLM receives a system prompt with context and the necessary data, and then generates Python code. This code runs in an isolated Docker sandbox, and the service returns not only formatted text but also a ready-made Excel file. In this scenario the model handles understanding the request and structuring the program, while arithmetic precision is delegated entirely to the Python interpreter.

  • Input can be meter readings, a table, or an estimate
  • Tariffs and reference data are supplied to the prompt from a config file
  • The model must return Python code, not a ready answer
  • The script executes in an isolated container with a timeout
  • The user receives a text calculation and an Excel file
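The execution step in the list above could look roughly like the sketch below. The article only says that the script runs in an isolated Docker container with a timeout; the image name, resource limits, mount path, and function names here are assumptions chosen for illustration.

```python
import subprocess
from pathlib import Path

def docker_command(workdir: Path, script_name: str = "task.py",
                   image: str = "python:3.12-slim") -> list[str]:
    """Build a `docker run` argv for one generated script (all settings assumed)."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                  # generated code gets no network
        "--memory", "256m",                   # cap RAM
        "--cpus", "1",                        # cap CPU
        "-v", f"{workdir.resolve()}:/work",   # script goes in, result files come out
        "-w", "/work",
        image, "python", script_name,
    ]

def run_in_sandbox(workdir: Path, timeout_s: int = 30) -> subprocess.CompletedProcess:
    """Run the script and capture stdout and stderr.

    Raises subprocess.TimeoutExpired if the run exceeds timeout_s seconds.
    """
    return subprocess.run(
        docker_command(workdir), capture_output=True, text=True, timeout=timeout_s
    )
```

One caveat about this sketch: the `timeout` argument stops the local `docker run` client process, so a production service would probably also need to stop the container itself when the limit is hit.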

The author writes that he uses Qwen and DeepSeek for these tasks rather than expensive top-tier models. The logic is pragmatic: for a script of 20–200 lines, the difference in code quality between premium and cheaper models is small, while the difference in price is noticeable. He also stresses that tariffs and reference data should reach the prompt from a config, not from the model's "memory." If a rate changes, updating one line of data is enough, and the model itself is untouched.
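A rough sketch of feeding reference data into the prompt from a config is shown below. The config layout, the rates, the prompt wording, and the `build_system_prompt` name are all hypothetical; the article does not publish its actual prompt or config format.

```python
import json

# Hypothetical config contents. In a real service this would live in a file
# that a person updates from official sources.
CONFIG_JSON = """
{
  "currency": "RUB",
  "tariffs": {
    "electricity_kwh": "6.73",
    "cold_water_m3": "59.80"
  }
}
"""

def build_system_prompt(config: dict) -> str:
    """Embed reference data in the prompt so the model never recalls rates itself."""
    data = json.dumps(config, ensure_ascii=False, indent=2)
    return (
        "You write Python scripts for calculations.\n"
        "Reply with one Python code block and nothing else.\n"
        "Do not compute results yourself; the script must compute them.\n"
        "Use only the reference data below and never substitute your own rates.\n"
        f"Reference data (JSON):\n{data}\n"
    )

prompt = build_system_prompt(json.loads(CONFIG_JSON))
print(prompt)
```

With this layout, a tariff change is a one-line edit to the config, and the next request picks it up automatically.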

Where Things Went Wrong

The most common early mistake was asking the model to look up the tariffs itself. In that mode it confidently substitutes outdated figures or rates that apply to someone else, and the error looks plausible. So the author moved all sensitive numbers into a config and updates them separately from official sources.

A second problem: some models still try to "calculate in their head" and return a finished answer despite the instructions. The fix is simple: check that the reply contains Python code and, if it doesn't, send a follow-up request with stricter wording. More mundane technical issues also surfaced in practice: Cyrillic text in Excel broke without explicit UTF-8, the model pulled in unnecessary libraries such as pandas, and without the full stderr it could not fix its own errors after a script failed.
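The validation step might be sketched as follows: pull the first code block out of the reply, confirm that it parses as Python, and list any imports outside an allowlist. The allowlist, the follow-up wording, and the function names are assumptions made for this example, not the author's actual implementation.

```python
import ast
import re

FENCE = "`" * 3  # three backticks, the usual code-fence delimiter
CODE_BLOCK = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)

# Assumed allowlist; a real service would pick its own.
ALLOWED_IMPORTS = {"decimal", "math", "json", "csv", "datetime", "openpyxl"}

def extract_script(reply: str):
    """Return the first fenced block that parses as Python, or None."""
    for match in CODE_BLOCK.finditer(reply):
        source = match.group(1)
        try:
            ast.parse(source)
        except SyntaxError:
            continue
        return source
    return None

def disallowed_imports(source: str) -> set:
    """Top-level module names imported by the script that are not allowlisted."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found - ALLOWED_IMPORTS

# Sent back to the model when extract_script() returns None.
STRICT_FOLLOW_UP = (
    "Your reply contained no runnable Python. Reply with a single Python "
    "code block only. Do not state the result yourself."
)
```

A reply that is only prose yields `None` and triggers the follow-up, while a script that imports pandas can be rejected before it ever reaches the sandbox.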

Once the service started returning the full traceback to the model, the number of useless iterations dropped fivefold, according to the author. The same approach has already been applied to a more complex task, the analysis of repair estimates, where one test revealed an overcharge of 54,168 rubles and eight line items priced more than 50% above market.
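A minimal sketch of such a repair loop is given below. To stay self-contained it runs the script in a child Python interpreter rather than in Docker, and `ask_model` is a hypothetical callable standing in for the LLM call; it is assumed to return corrected source code. The attempt limit and message wording are also assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional

def run_script(source: str, timeout_s: int = 30) -> subprocess.CompletedProcess:
    """Run a script in a child interpreter (a stand-in for the Docker sandbox)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "task.py"
        path.write_text(source, encoding="utf-8")
        return subprocess.run(
            [sys.executable, str(path)],
            capture_output=True, text=True, timeout=timeout_s, cwd=tmp,
        )

def solve_with_repairs(source: str, ask_model: Callable[[str], str],
                       max_attempts: int = 3) -> Optional[str]:
    """Return stdout of the first successful run, or None if every attempt fails."""
    for _ in range(max_attempts):
        result = run_script(source)
        if result.returncode == 0:
            return result.stdout
        # Pass the complete stderr, traceback included, not a one-line summary.
        source = ask_model(
            "The script failed. Fix it and return the full corrected script.\n"
            f"--- script ---\n{source}\n--- stderr ---\n{result.stderr}"
        )
    return None
```

The point of the design is that the model sees the exact exception type and line number, so it can make a targeted fix instead of guessing what went wrong.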

What This Means

The "LLM writes code, not an answer" approach looks like one of the most practical ways to use models where accuracy matters. For accounting, estimates, taxes, and other calculation-heavy scenarios, it removes the main risk: text, structure, and automation stay with the model, while the numbers are computed by ordinary software and can be verified.

