How a product manager can assess AI product quality: a guide to evals
AI-processed from Habr AI; edited by Hamidun News
Evals, the practice of evaluating the quality of an LLM product, have suddenly become the most discussed skill among product managers at AI companies. Top executives at Anthropic and OpenAI openly call the ability to build evaluation systems a key competency for any product manager working with language models. On Lenny Rachitsky's podcast, researchers Hamel Husain and Shreya Shankar broke down how PMs should approach evaluating an AI product, and why intuition doesn't work here.
What is an eval and why is it needed
An eval is a systematic check of how well a language model performs a specific task in the specific context of your product. Unlike classic software testing, where an answer is either right or wrong, in LLM products the answer almost always falls somewhere in the middle. The same query can produce dozens of different, but equally acceptable answers — and the PM's job is to understand which one is best for a specific user in a specific situation.
Most teams start out evaluating models subjectively: they look at a few examples and draw conclusions. This works for simple features but breaks down at scale. When a product handles a million requests a day, manual review is impossible; you need a system that runs automatically and reproducibly.
Three levels of evaluating an AI product
Experts recommend building evals in three consecutive layers.
The first is defining success criteria. Before measuring anything, a PM must answer the question: what does a "good answer" mean for our product? This can be factual accuracy, brand tone alignment, length, structure, absence of toxicity, or safety. Without this step, any metrics are meaningless — you'll be measuring something that doesn't matter to the user.
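One way to make success criteria concrete is to write each one down as a named, checkable rule. The sketch below is illustrative only: the criteria, the word limit, and the banned-word list are hypothetical placeholders, not recommendations from the podcast, and real products will usually need richer checks than simple string tests.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    """One success criterion: a name plus a pass/fail check on an answer."""
    name: str
    check: Callable[[str], bool]

# Hypothetical brand rule: words the product should never use.
BANNED_WORDS = {"guarantee", "cheap"}

# Hypothetical criteria; replace with what "good" means for your product.
CRITERIA = [
    Criterion("not_empty", lambda a: bool(a.strip())),
    Criterion("max_120_words", lambda a: len(a.split()) <= 120),
    Criterion(
        "no_banned_words",
        lambda a: not any(word in a.lower() for word in BANNED_WORDS),
    ),
]

def score_answer(answer: str) -> dict[str, bool]:
    """Return a pass/fail result for every criterion."""
    return {c.name: c.check(answer) for c in CRITERIA}

if __name__ == "__main__":
    print(score_answer("We guarantee a refund within 30 days."))
```

Writing the criteria as a list keeps the product decision (what counts as good) separate from the mechanics of running the checks.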
The second level is assembling a "golden set." This is a collection of example queries with ideal answers, either created manually or selected from real data. The model is tested against this set with each update. The quality of the golden set directly determines the quality of the entire evaluation system — this is both the main challenge and the main responsibility of the PM.
The third level is automating evaluation. At this stage, the team builds a pipeline: a new version of the model or prompt is run through the golden set, and the results are compared with the reference answers, either with automatic checks or with a judge model, i.e., another LLM that evaluates the answers. A regression is immediately visible in the numbers rather than discovered in user feedback a week after release.
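A minimal sketch of such a pipeline might look like the following. Everything here is assumed for illustration: the golden set entries, the two stub "models", the required-phrase grader (a simple stand-in for comparing against ideal answers or calling a judge model), and the 2-point regression tolerance.

```python
from typing import Callable

# Hypothetical golden set: each case lists phrases a good answer must contain.
GOLDEN_SET = [
    {"query": "How do I reset my password?", "must_include": ["settings", "reset"]},
    {"query": "What is your refund window?", "must_include": ["30 days"]},
]

def passes(answer: str, must_include: list[str]) -> bool:
    """Simple grader: the answer passes if it contains every required phrase."""
    text = answer.lower()
    return all(phrase.lower() in text for phrase in must_include)

def pass_rate(model: Callable[[str], str], golden_set: list[dict] = GOLDEN_SET) -> float:
    """Run the model over the golden set and return the share of passing cases."""
    results = [passes(model(case["query"]), case["must_include"]) for case in golden_set]
    return sum(results) / len(results)

def is_regression(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """Flag the candidate if it scores worse than the baseline beyond the tolerance."""
    return candidate < baseline - tolerance

# Stub models standing in for two versions of a prompt or model.
def model_v1(query: str) -> str:
    answers = {
        "How do I reset my password?": "Open Settings and choose Reset password.",
        "What is your refund window?": "You can request a refund within 30 days.",
    }
    return answers[query]

def model_v2(query: str) -> str:
    answers = {
        "How do I reset my password?": "Open Settings and choose Reset password.",
        "What is your refund window?": "Refunds are available for a limited time.",
    }
    return answers[query]

if __name__ == "__main__":
    baseline, candidate = pass_rate(model_v1), pass_rate(model_v2)
    print(f"baseline={baseline:.2f} candidate={candidate:.2f}")
    print("regression detected:", is_regression(baseline, candidate))
```

In practice the stub functions would be replaced by calls to the actual model, and the grader by whatever comparison the team trusts for its own criteria.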
Why a PM can't delegate this to engineers
The temptation to hand evals over to the technical team is great, but it's a mistake. Evals are product decisions: what matters to the user, what they consider a good answer, what trade-offs we're willing to make for speed or cost. An engineer doesn't know why a user prefers a brief answer to a lengthy one, or why a "friendly but professional" tone is three percent more important than a slightly more accurate answer.
It's the PM who builds the connection between eval metrics and real business results. If the model became five percent more accurate but user satisfaction didn't change, something is wrong with the evaluation criteria themselves. Finding and fixing this mismatch is a product task, not an engineering one.
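One rough way to spot such a mismatch is to check whether the eval score and a business metric move together across releases. The sketch below computes a Pearson correlation by hand; the 0.5 threshold and the numbers in the usage example are made-up placeholders, and with only a handful of releases the result is a prompt for investigation rather than proof.

```python
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; returns 0.0 if either series does not vary."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def metric_tracks_outcome(
    eval_scores: list[float], satisfaction: list[float], min_corr: float = 0.5
) -> bool:
    """True if the eval metric moves with user satisfaction (assumed threshold)."""
    return pearson(eval_scores, satisfaction) >= min_corr

if __name__ == "__main__":
    # Placeholder values: accuracy rises release over release, satisfaction stays flat.
    eval_scores = [0.80, 0.85, 0.90]
    satisfaction = [4.1, 4.1, 4.1]
    print("metric tracks outcome:", metric_tracks_outcome(eval_scores, satisfaction))
```

If the check comes back false, the useful next step is to revisit the success criteria, not to tune the model further.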
What this means for the market and career
Two years ago, the word "evals" appeared mainly in academic papers. Today it's a standard part of the roadmap for any serious AI product. Companies that have learned to systematically measure the quality of their LLM solutions gain a sustainable competitive advantage: they detect regressions faster, compare models more accurately, and make update decisions based on data, not on subjective team feelings.
For a product manager's career, the conclusion is straightforward: if you work with AI products and don't know how to build evals, you're losing to colleagues who do. This skill has become as essential as knowing how to work with a sales funnel or run A/B tests.