
Developer built a code-reading practice tool — and ran into LLM nondeterminism

AI-processed from Habr AI; edited by Hamidun News

A developer from Habr launched an unusual project: a trainer where you don't write code, but read a finished working fragment and explain it in your own words — and a language model evaluates the quality of your explanation. The idea is simple, but the implementation turned out to be unexpectedly complex.

Where the idea came from

Two irritations accumulated gradually. The first was his own code from two months ago: everything is clear in principle, but reading through it takes longer than he would like. The second was trying to explain a project's architecture to friends: he knows it perfectly well, yet speaks in fragments, stumbles over words, and can't connect his thoughts into a coherent whole. The trainer grew out of these two problems.

The mechanics are simple: you're shown a real working code fragment, you explain it in your own words — either orally or in writing — and you get a rating from an LLM. No code writing at all, just reading and explaining. Something between code review and a conversation with a mentor.

The idea itself isn't new: explaining code aloud is common in pair programming and technical interviews. What makes this version interesting is that it is automated and always available.

Where it broke: LLM nondeterminism

Writing the trainer itself turned out to be straightforward. The difficulties began when configuring the evaluation. The task looks trivial: give the model two pieces of text — the code and the user's explanation — and ask it to assess how well one describes the other. In practice, the model behaved unpredictably:

  • For the same explanation, it gave different scores on repeated requests
  • It weighted some aspects of the code too heavily and ignored others for no apparent reason
  • The "strictness" of the evaluator changed from request to request without visible patterns
  • It was unclear what should count as a good explanation — comprehensive or concise?

This is classic LLM nondeterminism: a property everyone knows about in theory, but one that is felt acutely when you need a reproducible evaluation function from the model rather than text generation.
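One way to make the drift visible is to send the same input several times and look at the spread of scores. The sketch below is only an illustration: `make_noisy_evaluator` is a hypothetical stand-in for a real model call (a fixed base score plus random noise), and the 0 to 10 scale is an assumption, not something the original project specifies.

```python
import random
import statistics
from typing import Callable, Dict

# An evaluator takes (code, explanation) and returns a score on an assumed 0-10 scale.
Evaluator = Callable[[str, str], float]


def make_noisy_evaluator(seed: int) -> Evaluator:
    """Hypothetical stand-in for an LLM call: base score 7 plus uniform noise."""
    rng = random.Random(seed)

    def evaluate(code: str, explanation: str) -> float:
        noisy = 7.0 + rng.uniform(-2.0, 2.0)
        return max(0.0, min(10.0, noisy))  # clamp to the assumed scale

    return evaluate


def score_spread(evaluate: Evaluator, code: str, explanation: str,
                 runs: int = 10) -> Dict[str, float]:
    """Score identical input `runs` times and summarize how much the result varies."""
    if runs < 2:
        raise ValueError("need at least two runs to measure spread")
    scores = [evaluate(code, explanation) for _ in range(runs)]
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
    }


if __name__ == "__main__":
    spread = score_spread(make_noisy_evaluator(seed=0),
                          "def add(a, b): return a + b",
                          "Adds two numbers.", runs=20)
    print(spread)
```

A large gap between `min` and `max` on identical input is a signal that a single request cannot be trusted as a grade.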

What exactly to evaluate

The developer discovered that the main problem isn't technical, but conceptual: what counts as a good code explanation? Should it be comprehensive — covering all branches, edge cases, side effects? Or is it enough to accurately convey the main idea of the algorithm? Should you mention potential bugs? Is terminology important? Should the explainer demonstrate understanding of why this code was written, not just what it does?

Without clear answers, any evaluation criteria for an LLM become vague — and the model fills the uncertainty arbitrarily. This is precisely why prompt engineering for evaluation tasks is significantly more complex than for generative tasks.
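To show what "clear answers" could look like in practice, here is a minimal sketch that writes the criteria down explicitly and builds the prompt from them. The criterion names, questions, and weights are assumptions made up for illustration, loosely following the questions raised above; the original project does not publish its rubric, and the prompt wording is hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class Criterion:
    name: str
    question: str
    weight: float  # relative importance; values below are illustrative guesses


# Hypothetical rubric; a real one depends on what the trainer is meant to measure.
RUBRIC = (
    Criterion("main_idea", "Does the explanation state what the code does overall?", 0.4),
    Criterion("edge_cases", "Does it cover branches, edge cases, or side effects?", 0.3),
    Criterion("intent", "Does it explain why the code is written this way?", 0.2),
    Criterion("terminology", "Is the terminology used correctly?", 0.1),
)


def build_prompt(code: str, explanation: str,
                 rubric: Sequence[Criterion] = RUBRIC) -> str:
    """Assemble an evaluation prompt that names every criterion explicitly."""
    lines = [
        "Score the explanation of the code on each criterion from 0 to 10.",
        "Answer with one line per criterion in the form name: score.",
        "",
        "Criteria:",
    ]
    lines += [f"- {c.name}: {c.question}" for c in rubric]
    lines += ["", "Code:", code, "", "Explanation:", explanation]
    return "\n".join(lines)


def weighted_score(scores: Mapping[str, float],
                   rubric: Sequence[Criterion] = RUBRIC) -> float:
    """Combine per-criterion scores into one number using the rubric weights."""
    missing = [c.name for c in rubric if c.name not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    total_weight = sum(c.weight for c in rubric)
    return sum(scores[c.name] * c.weight for c in rubric) / total_weight
```

Keeping the weights in code rather than in the prompt means the decision about what matters is made once by the author, not re-made by the model on every request.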

Possible technical approaches: strict prompts with specific rubrics, voting across multiple independent requests to the model, reference explanations as a benchmark. But first you need to determine what exactly you're measuring.
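The voting approach can be sketched as follows, assuming the evaluator returns a numeric score: ask several times independently and take the median, which is less sensitive to a single unusually harsh or generous answer than the mean. `scripted_evaluator` is a hypothetical stub that replays fixed scores so the example runs without a model; the number of votes is an arbitrary choice.

```python
import statistics
from typing import Callable, Iterable

Evaluator = Callable[[str, str], float]


def scripted_evaluator(scores: Iterable[float]) -> Evaluator:
    """Hypothetical stub: returns the given scores one by one, ignoring its input."""
    remaining = iter(scores)

    def evaluate(code: str, explanation: str) -> float:
        return next(remaining)

    return evaluate


def vote_score(evaluate: Evaluator, code: str, explanation: str,
               votes: int = 5) -> float:
    """Query the evaluator `votes` times and return the median score."""
    if votes < 1:
        raise ValueError("votes must be at least 1")
    scores = [evaluate(code, explanation) for _ in range(votes)]
    return statistics.median(scores)


if __name__ == "__main__":
    # Two outliers (9 and 2) do not move the result away from the typical score.
    evaluator = scripted_evaluator([6, 9, 7, 7, 2])
    print(vote_score(evaluator, "def add(a, b): return a + b", "Adds two numbers."))  # 7
```

Voting reduces variance at the cost of extra requests, and it does not fix a vague rubric: if the criteria are unclear, the median is simply a more stable version of an arbitrary judgment.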

"The hardest part turned out not to be writing the trainer, but making the neural network evaluate honestly and consistently, and understanding what exactly needs to be evaluated"

What this means

The story illustrates a typical LLM project trap: the simplest-looking part, "the model will evaluate", actually requires more engineering effort than the entire rest of the product. Practicing the skill of reading and explaining code meets a real need, especially in teams with large amounts of legacy code. But building a reliable automated evaluator for such practical skills remains an open engineering challenge.

