KDnuggets: five outlier detection methods agreed on only 32 of 816 wine samples

Q: What is the source?

Originally published on KDnuggets. Hamidun News processes and adapts the material with AI.

Q: When was it published?

May 3, 2026. Reading time: 3 min.

Hamidun News Editorial

KDnuggets tested five popular outlier detection methods on a real wine dataset and got a result that runs against textbook intuition. Out of 816 samples flagged by at least one algorithm, only 32 were flagged by all four main methods.

Why Methods Disagree

For the experiment, the authors used the open Wine Quality Dataset from UCI: 6497 Portuguese Vinho Verde wines, including 1599 red and 4898 white, with 11 physico-chemical features and taster ratings. This detail matters because the data turned out not to be "textbook-like": six of the eleven features had noticeably skewed distributions, so classical normality assumptions work poorly here.
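A quick way to check this kind of asymmetry is the sample skewness of each feature. The sketch below assumes the plain moment-based definition (the article does not say which estimator the authors used), and the two lists are toy data, not values from the Wine Quality Dataset.

```python
from statistics import fmean

def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Toy columns (hypothetical): one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 2, 3, 20]

print("symmetric:", skewness(symmetric))
print("right-tailed:", skewness(right_tailed))
```

A value near zero suggests a roughly symmetric column, while a large positive value points to a long right tail, which is the situation where normality-based thresholds become unreliable.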

The first problem appeared even before the algorithms were compared. If any sample with at least one extreme value among its 11 features counts as an outlier, there are too many hits. In that mode, IQR flagged approximately 23% of wines and Z-Score about 26%. The authors attribute this to the multiple testing effect: even if each individual feature rarely produces a random extreme, checking 11 columns sharply raises the chance of catching an "anomaly" somewhere. The analysis therefore used a stricter rule: a sample is considered suspicious only if at least two features are extreme at once.
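The multiple testing effect can be made concrete. Under the simplifying assumption that the 11 features are flagged independently, each with probability p, the chance of at least one hit is 1 - (1 - p)^11. The sketch below computes that figure and then applies the "at least two extreme features" rule using Tukey's 1.5 × IQR fences. The 1.5 multiplier, the function names, and the toy rows are assumptions for illustration, not details taken from the KDnuggets analysis.

```python
from statistics import quantiles

def family_wise_rate(p, k):
    """Chance of at least one flag among k independent features."""
    return 1 - (1 - p) ** k

def iqr_fences(column, k=1.5):
    """Lower and upper Tukey fences for one feature column."""
    q1, _, q3 = quantiles(column, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def extreme_feature_counts(rows):
    """For each row, count how many features fall outside their fences."""
    fences = [iqr_fences(col) for col in zip(*rows)]
    return [
        sum(not (lo <= value <= hi) for value, (lo, hi) in zip(row, fences))
        for row in rows
    ]

# Toy data (hypothetical): 9 samples, 3 features.
rows = [
    [1.0, 10.0, 5.0], [1.1, 10.2, 5.1], [0.9, 9.8, 4.9],
    [1.2, 10.1, 5.2], [1.0, 9.9, 5.0], [0.8, 10.3, 4.9],
    [1.1, 9.7, 5.1],
    [9.0, 10.0, 5.0],   # extreme in one feature
    [8.5, 30.0, 5.1],   # extreme in two features
]

counts = extreme_feature_counts(rows)
flagged_any = [i for i, c in enumerate(counts) if c >= 1]
flagged_strict = [i for i, c in enumerate(counts) if c >= 2]

print("P(at least one flag), p=0.02, k=11:", family_wise_rate(0.02, 11))
print("flagged with the 'any feature' rule:", flagged_any)
print("flagged with the 'two or more' rule:", flagged_strict)
```

Raising the bar from one extreme feature to two keeps only the rows that are unusual in several dimensions at once, which is the effect the stricter rule is after.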

What the Test Showed

After this adjustment, the authors compared five approaches: Robust Z-Score, IQR, Isolation Forest, Local Outlier Factor, and Elliptic Envelope. Agreement between the results was weak: the Jaccard coefficient for pairs of methods ranged from 0.10 to 0.30. In other words, different tools looked at the same dataset and saw different "oddities". Out of 816 wines that at least one method considered outliers, only 32 samples appeared in the consensus list of all four main methods (the five above minus Elliptic Envelope, which was left out of the vote). Another 143 wines were flagged by at least three approaches. Everything else was a disputed zone: samples that looked unusual to only one or two algorithms.
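The Jaccard coefficient of two flag sets is the size of their intersection divided by the size of their union, so 0.10 to 0.30 means most flags are not shared. A minimal sketch of the pairwise comparison and the consensus vote follows; the method keys and sample IDs are hypothetical and far smaller than the real result.

```python
from collections import Counter
from itertools import combinations

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical sample IDs flagged by each of the four voting methods.
flags = {
    "robust_z": {1, 2, 3, 4},
    "iqr": {2, 3, 5},
    "isolation_forest": {2, 3, 4, 6, 7},
    "lof": {3, 8},
}

for (name_a, set_a), (name_b, set_b) in combinations(flags.items(), 2):
    print(f"{name_a} vs {name_b}: {jaccard(set_a, set_b):.2f}")

# Count how many methods voted for each sample.
votes = Counter(sample for flagged in flags.values() for sample in flagged)
flagged_by_any = set(votes)
at_least_three = {s for s, v in votes.items() if v >= 3}
consensus = {s for s, v in votes.items() if v == len(flags)}

print("any method:", sorted(flagged_by_any))
print("three or more:", sorted(at_least_three))
print("all methods:", sorted(consensus))
```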

"The question is not which method is best, but which type of unusualness you're looking for."

Robust Z-Score seeks strong deviations in individual features.
IQR catches extreme values well without assuming normal distribution.
Isolation Forest evaluates an object across the entire feature set.
LOF looks at how much a point stands out from its local neighborhood.
Elliptic Envelope relies on multivariate normality and turned out weaker here.

The authors also point out a trap in ML methods. Both Isolation Forest and LOF in their test used contamination=0.05, meaning the model was forced to flag 5% of objects as outliers. This is not "discovered truth," but a hard-coded quota. Therefore, identical hit rates between algorithms of this class don't mean identical quality.
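In scikit-learn, `contamination` sets the score threshold so that roughly that share of the training samples is labelled as outliers. The sketch below imitates only that thresholding step in plain Python; it is not the library's implementation, and `flag_by_quota` and the synthetic scores are hypothetical. It assumes that higher scores mean more anomalous points.

```python
import random

def flag_by_quota(scores, contamination=0.05):
    """Flag the top `contamination` share of scores, whatever the data looks like."""
    n_flag = round(len(scores) * contamination)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:n_flag])

rng = random.Random(0)

# Synthetic anomaly scores: one set with no planted anomalies at all,
# one set where the last 10 points are clearly abnormal.
clean = [abs(rng.gauss(0, 1)) for _ in range(1000)]
with_anomalies = clean[:990] + [15.0 + i for i in range(10)]

flagged_clean = flag_by_quota(clean)
flagged_mixed = flag_by_quota(with_anomalies)

print("flagged in clean data:", len(flagged_clean))
print("flagged in data with 10 planted anomalies:", len(flagged_mixed))
```

Both runs flag the same number of points, although only one dataset contains anything abnormal. That is why the share of flagged rows says nothing about how many real anomalies exist.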

Which Decisions Helped

Three engineering decisions strongly influenced the outcome. First, instead of the standard Z-Score the authors used a robust version based on the median and the median absolute deviation. The standard version is too sensitive to the outliers themselves, which inflate the mean and standard deviation it relies on, and in this dataset it flagged only 0.8% of rows versus 3.5% for the more robust variant. Second, red and white wines were scaled separately because they have different baseline chemical levels, and combining them without adjustment creates false anomalies. Third, Elliptic Envelope was excluded from the final "consensus vote".

Elliptic Envelope assumes a multivariate normal distribution, but the Wine Quality Dataset does not meet this condition: one feature had a skewness of 5.4, and several others exceeded 1. Excluding the method from the consensus is not cosmetic here but an example of normal analytical discipline: if a tool's assumptions are violated, it shouldn't determine the final conclusions.
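The first of these decisions, the robust Z-Score, can be sketched as follows. The 0.6745 scaling constant and the 3.5 cutoff follow the common "modified Z-score" convention and are assumptions here, as are the cutoff of 3 for the standard version and the toy values; the article does not state the thresholds the authors used.

```python
from statistics import fmean, median, pstdev

def standard_z_scores(values):
    """Classic Z-Score: distance from the mean in standard deviations."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]

def robust_z_scores(values):
    """Modified Z-Score built on the median and the median absolute deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0 for _ in values]  # degenerate column: nothing to scale by
    return [0.6745 * (v - med) / mad for v in values]

# Toy feature column (hypothetical) with two obvious extremes.
values = [9, 10, 10, 10, 11, 11, 12, 50, 55]

standard_flagged = [v for v, z in zip(values, standard_z_scores(values)) if abs(z) > 3]
robust_flagged = [v for v, z in zip(values, robust_z_scores(values)) if abs(z) > 3.5]

print("standard Z-Score flags:", standard_flagged)
print("robust Z-Score flags:", robust_flagged)
```

On this toy column the two extremes inflate the mean and standard deviation enough to hide themselves from the standard score, while the median-based version still flags them. To mirror the separate scaling of red and white wines, the same function would be applied to each group's column on its own rather than to the pooled data.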

The authors also checked the outliers against tasting scores, which range from 3 to 9 points. Samples of extreme quality, whether very good or very bad, were about twice as likely to appear in the consensus anomaly list. This doesn't prove the algorithms "understood taste," but it provides a useful sanity check: chemical deviations really do occur more often where a wine also stands out by expert rating.
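A sanity check of this kind comes down to comparing two rates. The sketch below assumes that "extreme quality" means a score of 4 or lower or 8 or higher, a cutoff the article does not specify, and the labelled samples are toy values chosen only to exercise the code, not results from the study.

```python
def outlier_rate(samples):
    """Share of samples that are marked as consensus outliers."""
    return sum(is_outlier for _, is_outlier in samples) / len(samples)

def is_extreme_quality(score, low=4, high=8):
    """Hypothetical cutoff for 'very bad' or 'very good' wines."""
    return score <= low or score >= high

# Toy (quality score, in consensus list) pairs; purely illustrative.
samples = [
    (3, True), (3, False), (4, False), (8, True), (9, False), (8, False),
    (5, True), (5, False), (5, False), (6, False), (6, True), (6, False),
    (6, False), (7, False), (7, False), (5, False), (6, False), (7, False),
]

extreme = [s for s in samples if is_extreme_quality(s[0])]
typical = [s for s in samples if not is_extreme_quality(s[0])]

ratio = outlier_rate(extreme) / outlier_rate(typical)
print("outlier rate, extreme quality:", outlier_rate(extreme))
print("outlier rate, typical quality:", outlier_rate(typical))
print("ratio:", ratio)
```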

What This Means

The main takeaway for data science and ML practice is simple: an outlier is not an objective entity but the result of a chosen definition. In a workflow without labeled ground truth, it is more reasonable not to trust a single algorithm but to gather a consensus from several methods, and then decide with domain experts what to remove and what to keep as a rare but valuable signal.
