Discussion about this post

Calvin McCarter:

Normalizing data (i.e., min-max scaling) has certain uses. Back when people used SVMs for things like text classification on n-gram counts, it was almost always best to do min-max scaling, either to [0, 1] or [-1, 1] (cf. https://neerajkumar.org/writings/svm/). Another way to think about it is that min-max scaling to [0, 1] can be used in place of quantile normalization.
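
For concreteness, here is a minimal scikit-learn sketch of that kind of preprocessing for an SVM (the counts and labels below are made up; the point is just that the scaler is fit on the training data and reused at prediction time):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Toy per-document n-gram counts (illustrative values only).
X_train = np.array([[0., 3., 10.],
                    [2., 1.,  4.],
                    [1., 0.,  7.]])
y_train = np.array([0, 1, 0])

scaler = MinMaxScaler(feature_range=(0, 1))    # use (-1, 1) for the other convention
X_scaled = scaler.fit_transform(X_train)       # each column mapped to [0, 1]

clf = LinearSVC().fit(X_scaled, y_train)
# At prediction time, reuse the scaler fitted on the training data:
# clf.predict(scaler.transform(X_test))
```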

For example, you may want highly skewed features to be transformed to have lower variance. Another example is discretization (i.e., quantization / binning / 1-d clustering): you might want uniform constant-width bins, which follow from min-max scaling, instead of equally-populated bins, which follow from quantile normalization. Of course, discretization is usually not a good idea, but it sometimes is, particularly for dependent variables (e.g., whenever quantile regression is valuable).
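
A quick NumPy illustration of the two binning choices (my own sketch, using synthetic right-skewed data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)        # a right-skewed feature

n_bins = 4
# Constant-width bins: split [min, max] evenly (the min-max view).
width_edges = np.linspace(x.min(), x.max(), n_bins + 1)
# Equally-populated bins: split at empirical quantiles (the quantile-normalization view).
quant_edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))

width_bins = np.digitize(x, width_edges[1:-1])   # bin index per sample, 0..n_bins-1
quant_bins = np.digitize(x, quant_edges[1:-1])

print(np.bincount(width_bins))   # very uneven counts for skewed data
print(np.bincount(quant_bins))   # roughly 250 per bin
```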

Also -- apologies for a bit of self-promotion -- you can often obtain even better results by using a tunable "interpolation" between min-max normalization and quantile normalization, as I showed in "The Kernel Density Integral Transformation", C. McCarter, TMLR 2023. It acts as a nonparametric alternative to variance-stabilizing transforms, and it works even for left-skewed data, for example. It also produces far more intuitive discretization results than various clustering techniques, without requiring the user to specify the number of clusters.
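
To make the idea concrete, here is a rough NumPy/SciPy sketch of a Gaussian-KDE CDF transform (my own simplification, not the paper's reference implementation): the bandwidth acts as the interpolation knob, with a tiny bandwidth approaching the empirical-quantile transform and a very large bandwidth approaching an affine map, i.e. min-max scaling.

```python
import numpy as np
from scipy.stats import norm

def kde_integral_transform(x, bandwidth):
    """Map x -> Gaussian-KDE CDF evaluated at x, rescaled to [0, 1]."""
    x = np.asarray(x, dtype=float)
    # KDE CDF: average of kernel CDFs centered at each data point.
    cdf = norm.cdf((x[:, None] - x[None, :]) / bandwidth).mean(axis=1)
    # Rescale so the observed min and max map to 0 and 1.
    return (cdf - cdf.min()) / (cdf.max() - cdf.min())

rng = np.random.default_rng(0)
x = rng.lognormal(size=500)                                   # right-skewed data

z_quantile_like = kde_integral_transform(x, bandwidth=1e-3)   # ~ quantile normalization
z_minmax_like = kde_integral_transform(x, bandwidth=1e3)      # ~ min-max scaling
z_between = kde_integral_transform(x, bandwidth=0.5)          # the tunable middle ground
```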
