hh.ru explained how to design production prompts for AI services without surprises
hh.ru explained why a production prompt looks more like code than a conversation with ChatGPT. The team advises writing instructions in English, handling…
AI-processed from Habr AI; edited by Hamidun News
hh.ru shared a practice on how to write prompts for their AI services in production. The main idea is simple: a prompt in a product is not a conversation with a chatbot, but an engineering system with constraints, tests, and constant debugging.
Production is Not Chat
In typical LLM usage, everything is quite flexible: a user asks a question, gets an answer, refines the phrasing, restarts the dialog, and moves on. In a product, there's no such luxury. Here, one failed response can reach thousands of users, break a scenario, create a reputational risk, or simply worsen conversion.
Therefore, a production prompt is not a single phrase like "make it pretty," but a set of interconnected instructions, data, rules, and tool calls, sometimes spanning hundreds of lines. The article author calls this an engineer's battle with a "stochastic parrot." The model doesn't understand meaning the way humans do; it predicts the next token based on probabilities.
The team's task is to maximize the reduction of the randomness space: give the model a clear role, context, constraints, and expected answer format. The better this loop is designed, the higher the chance of getting a predictable, safe, and useful result for real business. This is why working with prompts increasingly resembles regular development rather than a creative experiment.
The Framework of a Good Prompt
At hh.ru, they recommend writing the instructions themselves in English, while leaving examples of user messages in the product language—in this case, Russian. The reason is not just that English-language instructions are often interpreted more accurately by the model. English also saves tokens, and in systems with thousands and millions of calls, this already affects cost and latency. Templates and markup help additionally: markdown or XML make long instructions more structured and reduce ambiguity. A typical framework usually includes the model's role, goal, context, problem-solving steps, and answer format.
- model role
- goal and specific task
- context of input data
- action algorithm or verification steps
- constraints and answer format
Few-shot examples are particularly risky. They do help the model better understand the task, but they just as easily turn into a template that it starts mechanically transferring to new situations. The model often clings to phrasings literally and reproduces them out of context. The article provides an illustrative case: they added an example of a clarifying question for a candidate to the system prompt, after which the agent started asking it even where it was completely inappropriate.
"Are you ready for business trips to Ryazan?"
After that, the assistant periodically asked about trips even in job postings where travel was not involved.
The team's conclusion is harsh: everything risky should be explicitly forbidden. If a bot should not discuss other companies, express its opinion, go off-topic, or perform unrelated tasks, this needs to be spelled out directly. Another practical tip is not to fear long prompts if they are logically assembled and do not contradict themselves. It's also important to explicitly pass the current date, carefully adjust temperature, and remember that prompts almost always need to be rewritten for different models.
How They Test It
Even a good prompt cannot be considered ready after a couple of successful runs. LLM behavior is not fully deterministic: with identical requests and identical parameters, responses can still vary slightly. Therefore, quality assurance is more like an engineering evaluation of a system than manual text proofreading. You need large sets of test cases, multiple runs, and coverage of different user scenarios—almost like in classic testing, but with adjustments for the probabilistic nature of the model.
The most valuable source of new tests is real user logs. That's where unexpected questions surface, attempts to divert the bot, and edge cases that the team didn't anticipate. As such cases accumulate, the evaluation dataset needs to be constantly replenished. Another important finding: prompts should be tested in an environment as close as possible to production. LLMs are sensitive even to minor changes in input format, so a "nearly identical" environment easily gives false confidence in stability.
What This Means
The hh.ru article well demonstrates that prompt engineering is rapidly turning into regular product engineering. Here, the victory goes not to the most creative request, but to a combination of structure, constraints, evals, logs, and iterative refinement. For teams building AI features in production, this is a signal: prompts now need to be versioned, tested, tracked by metrics, linked to real user scenarios, and adapted to specific models as seriously as code.
Want to stop reading about AI and start using it?
AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.