Target Encoding Without Data Leakage: LOO and K-Fold vs. the Illusion of Quality

Target encoding seems like a simple way to handle categorical features — but a naive implementation quietly leaks the target into the training set…

AI-processed from Habr AI; edited by Hamidun News

Target encoding is a popular method for encoding categorical features, but its naive implementation systematically inflates metrics and creates the illusion of a good model that falls apart in production.

What is target encoding

Target encoding replaces each unique value of a categorical feature with the average value of the target variable across all objects of that category. For the "city" feature, each city is assigned the average sales across all customers from that city. For the "browser" feature — the average conversion across all sessions with that browser. One numerical column instead of hundreds of binary ones.
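
As a minimal sketch of the definition above, the following computes the per-category mean over all rows. The function name `naive_target_encode` and the sample city and sales values are hypothetical, chosen only for illustration. This is the naive variant: each row's own target is part of its category mean.

```python
from collections import defaultdict

def naive_target_encode(categories, targets):
    """Replace each category with the mean target over ALL rows of that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [means[cat] for cat in categories]

# Hypothetical toy data: one row per customer.
cities = ["Kazan", "Kazan", "Omsk", "Omsk", "Omsk", "Tula"]
sales = [10.0, 20.0, 3.0, 6.0, 9.0, 42.0]

encoded = naive_target_encode(cities, sales)
print(encoded)  # [15.0, 15.0, 6.0, 6.0, 6.0, 42.0]
```

Note the last row: "Tula" occurs once, so its encoded value is exactly its own target.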

The method is particularly attractive for high-cardinality features: where One-Hot Encoding would produce hundreds of binary columns, target encoding leaves a single compact numerical feature that directly carries information about the relationship between category and target. This is why it is widely used in Kaggle competitions and industrial ML pipelines: the feature is informative, cheap to train on, and easy to interpret.

Where does leakage come from

The problem arises at the moment of calculating the average. A naive implementation computes the encoding over the entire training sample — including the current object itself. As a result, the target of this object participates in calculating the feature that is then fed to the model as input during training. The model essentially sees the target variable in hidden form — not directly, but through this feature.

The consequences of such leakage are predictable:

  • Metrics on train, and on cross-validation when the encoding is computed before the folds are split, look excellent: the model "knows" the answer through the feature
  • The model memorizes noise and outliers of specific objects, not real patterns
  • On test or in production, quality drops sharply: there the encoding comes from the training data, which does not contain the new object's target
  • The fewer objects a category has, the stronger the leakage: with one object, the encoding simply equals the target
  • The effect is invisible under standard metric checks, but shows up in A/B testing in production
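
The effect can be reproduced on synthetic data. The sketch below is an illustration under assumed settings (1000 rows, 500 categories, a Gaussian target that is pure noise), not an experiment from the original article; the helper names `pearson` and `category_means` are hypothetical. Since the category carries no signal, an honest feature should be uncorrelated with the target.

```python
import random
from collections import defaultdict

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def category_means(categories, targets):
    """Mean target per category, computed over all given rows."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}

rng = random.Random(0)
n_rows, n_cats = 1000, 500            # two rows per category: tiny groups
cats = [i % n_cats for i in range(n_rows)]

# The target is pure noise: the category has no real predictive power.
y_train = [rng.gauss(0.0, 1.0) for _ in range(n_rows)]
y_new = [rng.gauss(0.0, 1.0) for _ in range(n_rows)]   # unseen rows, same categories

means = category_means(cats, y_train)
feature = [means[c] for c in cats]    # naive encoding fitted on the training rows

corr_train = pearson(feature, y_train)  # each row's own target is inside its mean
corr_new = pearson(feature, y_new)      # same feature values, targets never seen
print(f"train corr: {corr_train:.2f}, new-data corr: {corr_new:.2f}")
```

With two rows per category, the expected correlation on the training rows is about 0.7, while on new rows it should be close to zero: the feature looks predictive only where it has already seen the answer.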

This is a classic trap: everything looks perfect until deployment, after which the model turns out to be useless. Kaggle solutions with brilliant cross-validation scores have failed the final evaluation for precisely this reason.

LOO and K-Fold: how to calculate correctly

Both approaches solve the same problem: when computing the encoding for an object, do not use that object's own target value.

Leave-One-Out (LOO) encodes each object with the average target of all other objects in the same category, excluding the object itself. The direct dependency on the object's own target is broken, while information about the category's target distribution is preserved. The implementation is straightforward and deterministic.
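
A minimal LOO sketch, using the category sum and count so that each row's own value can be subtracted in constant time. The function name and toy data are hypothetical. The fallback for categories with a single row (the mean of all other rows) is an assumption made here, since the article does not specify one.

```python
from collections import defaultdict

def loo_target_encode(categories, targets):
    """Leave-one-out mean: category mean computed without the current row.

    Assumes at least two rows in total.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    total, n = sum(targets), len(targets)

    encoded = []
    for cat, y in zip(categories, targets):
        if counts[cat] > 1:
            encoded.append((sums[cat] - y) / (counts[cat] - 1))
        else:
            # Assumption: a singleton category falls back to the mean of all other rows.
            encoded.append((total - y) / (n - 1))
    return encoded

# Hypothetical toy data.
cities = ["Kazan", "Kazan", "Omsk", "Omsk", "Omsk", "Tula"]
sales = [10.0, 20.0, 3.0, 6.0, 9.0, 42.0]

encoded = loo_target_encode(cities, sales)
print(encoded)  # [20.0, 10.0, 7.5, 6.0, 4.5, 9.6]
```

In the two-row category "Kazan", each row is encoded with the other row's target, which shows how little averaging LOO does on small categories.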

K-Fold encoding works differently. The training sample is divided into K folds. For each fold, encoding is computed only from the remaining K-1 folds, then applied to the "held-out" fold. The scheme is analogous to cross-validation: no object participates in calculating its own encoding.
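
The K-Fold scheme can be sketched as follows. The function name, the default `n_splits=5`, the `seed` parameter, and the fallback for categories absent from the other folds are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def kfold_target_encode(categories, targets, n_splits=5, seed=0):
    """Encode each row with category means computed on the other folds only.

    Assumes n_splits >= 2 so that every fold has other rows to learn from.
    """
    n = len(targets)
    indices = list(range(n))
    random.Random(seed).shuffle(indices)                 # random fold assignment
    folds = [indices[k::n_splits] for k in range(n_splits)]

    encoded = [0.0] * n
    for held_out in folds:
        held = set(held_out)
        sums, counts = defaultdict(float), defaultdict(int)
        total, seen = 0.0, 0
        for i in range(n):
            if i in held:
                continue                                  # skip the held-out fold
            sums[categories[i]] += targets[i]
            counts[categories[i]] += 1
            total += targets[i]
            seen += 1
        fallback = total / seen   # assumption: mean of the other folds for unseen categories
        for i in held_out:
            cat = categories[i]
            encoded[i] = sums[cat] / counts[cat] if counts[cat] else fallback
    return encoded

# Hypothetical toy data.
cities = ["Kazan", "Kazan", "Omsk", "Omsk", "Omsk", "Tula"]
sales = [10.0, 20.0, 3.0, 6.0, 9.0, 42.0]

encoded = kfold_target_encode(cities, sales, n_splits=3)
print(encoded)
```

Changing `seed` changes the fold assignment and therefore the encoded values, which is the source of the noise K-Fold adds. When `n_splits` equals the number of rows, every fold holds one row and the result coincides with leave-one-out.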

"An honest feature is one that is calculated during training exactly as it will be calculated in production."

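Each method has its nuances. LOO is deterministic and adds minimal noise, but it is fragile on small categories: with a single object there is nothing left to average, so a fallback value is needed, and with two objects each encoding is simply the other object's target. Some residual leakage also remains, because within a category the LOO value is a deterministic function of the object's own target. K-Fold introduces regularization noise due to random splitting, which is a useful feature, not a bug. For both methods, one rule is important: encoding for the test sample is always calculated from the entire training sample as a whole, without LOO or K-Fold, because this is exactly how it will work in production.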
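
The rule for test data can be sketched as a separate fit and apply step. The names `fit_target_means` and `encode_new_rows` are hypothetical, and using the global training mean for categories never seen in training is an assumption, not something the article prescribes.

```python
from collections import defaultdict

def fit_target_means(categories, targets):
    """Learn per-category means and the global mean from the FULL training set."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean

def encode_new_rows(categories, means, global_mean):
    """Apply the learned table to test or production rows; no targets are needed."""
    # Assumption: categories unseen in training get the global training mean.
    return [means.get(cat, global_mean) for cat in categories]

# Hypothetical toy data.
train_cities = ["Kazan", "Kazan", "Omsk", "Omsk", "Omsk", "Tula"]
train_sales = [10.0, 20.0, 3.0, 6.0, 9.0, 42.0]
means, global_mean = fit_target_means(train_cities, train_sales)

test_cities = ["Omsk", "Tula", "Perm"]   # "Perm" never appears in training
test_encoded = encode_new_rows(test_cities, means, global_mean)
print(test_encoded)  # [6.0, 42.0, 15.0]
```

LOO or K-Fold values are used only for the training rows the model is fitted on; the plain lookup table above is what new rows receive.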

What does this mean

Target encoding remains a powerful tool for working with categorical features, but requires careful implementation. The naive approach creates an illusion of quality — beautiful metrics that won't survive production. LOO and K-Fold provide honest features: validation numbers reflect the real generalization ability of the model, not an artifact of data leakage. If metrics seem too good — encoding should be checked first.
