Pingouin: how to build a statistical EDA pipeline
Pingouin is a Python library for statistical analysis. It can be used to build a comprehensive EDA pipeline that checks key data properties, such as distribution normality.

Most analysts and data scientists start exploring data with visualization: they build charts, look at distributions, calculate basic statistics. It's a good start, but often not enough. Pingouin is a Python library that turns exploratory data analysis (EDA) into a systematic statistical process.
Why Statistical EDA is Critical
Visualization answers the questions 'what do we see?' and 'how does it look?'. But a reliable model needs more rigorous answers:
- Are variables normally distributed?
- Are there significant correlations between features?
- Which variables make sense to select for the model?
- Where are outliers and anomalies hidden?
- Which statistical assumptions are violated?
Without these answers, your model will be fragile. At the EDA stage, it's easier to rebuild features or filter data than to retrain the model later.
What Pingouin Can Do
The library contains ready-made functions for basic statistical tests. Instead of remembering formulas or writing long blocks of pandas and scipy code, you call a single function. Key capabilities:
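- Normality tests: `pg.normality()` runs Shapiro-Wilk by default and also offers the D'Agostino-Pearson omnibus and Jarque-Bera tests; Anderson-Darling is available through `pg.anderson()`. Kolmogorov-Smirnov does not appear to be built in and can be taken from scipy (`scipy.stats.kstest`)
- Correlation analysis: compute Pearson, Spearman, and Kendall coefficients with p-values
- Homogeneity of variance tests (Levene, Bartlett): compare the spread of values across groups
- Outlier detection: the MAD-median rule (`pg.madmedianrule()`) and outlier-robust correlation methods; IQR and Z-score checks are short pandas expressions that fit alongside them in the pipeline
- ANOVA and post-hoc tests: analysis of differences between groups
- Effect size (Cohen's d, eta-squared): practical significance of results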
Typical EDA Pipeline with Pingouin
The pipeline consists of sequential verification steps:
Loading and basic cleaning. Read the data, remove duplicates, and handle missing values (for example, by imputing or dropping them).
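A minimal sketch of this step in plain pandas might look like the following. The inline CSV, the column names, and the choice of median imputation are all assumptions made for the example; real data may call for a different strategy.

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for a real file.
raw = io.StringIO(
    "age,income,city\n"
    "34,52000,Berlin\n"
    "34,52000,Berlin\n"
    "29,,Paris\n"
    "41,61000,\n"
    "52,75000,Rome\n"
)
df = pd.read_csv(raw)

# Drop exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# Inspect how many values are missing per column before deciding what to do.
print(df.isna().sum())

# One possible policy: fill numeric gaps with the column median,
# and drop rows where the categorical column is missing.
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
df = df.dropna(subset=["city"]).reset_index(drop=True)
print(df)
```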
Distribution checking. For each numeric variable, call `pg.normality()`. If the p-value is above 0.05, the test found no evidence against normality (which is not proof that the variable is normal). If it is below, consider a transformation (log, square root, or Box-Cox).
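Correlation analysis. Compute pairwise correlations: `pg.corr()` tests a single pair of variables, and `pg.pairwise_corr()` covers every pair of columns in a DataFrame. Identify significant relationships (p < 0.05). Very high correlations between features (|r| > 0.9) indicate multicollinearity.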
Outlier detection. Apply several methods (IQR, Z-score) and compare the results. Outliers can be removed, analyzed separately, or handled with transformations.
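The comparison of IQR and Z-score rules can be sketched in plain pandas, as below. The sample values and the thresholds (1.5 × IQR and |z| > 3) are conventional choices assumed for the example, not settings taken from the article.

```python
import pandas as pd

# Small made-up sample with one mild (20) and one extreme (95) outlier.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 20, 95], name="x")

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (values - values.mean()) / values.std(ddof=0)
z_mask = z.abs() > 3

# Side-by-side comparison of the two rules.
comparison = pd.DataFrame({"value": values, "iqr": iqr_mask, "zscore": z_mask})
print(comparison[comparison["iqr"] | comparison["zscore"]])
```

Here the IQR rule flags both 20 and 95, while the Z-score rule flags only 95: the extreme value inflates the standard deviation and masks the milder one. Disagreements like this are the reason to compare methods instead of trusting a single rule.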
Model assumptions check. If you plan to use linear regression, check homoscedasticity (constant error variance), absence of multicollinearity, and linearity of relationships.
Documentation. Record which variables violate assumptions, which ones you excluded and why. This will be useful when interpreting results.
"Good EDA is a dialogue with data, not a monologue of beautiful charts."
What This Means
Tools like Pingouin democratize access to statistical analysis. You no longer need to remember test names or search for the right one in scipy documentation — there's a ready-made solution in just a few lines. It's especially useful at early project stages, when you need to quickly understand what data you're dealing with and what preparatory steps will be needed.