3 Comments

Normalizing data (ie min-max scaling) has certain uses. Back when people used SVMs for things like text classification on n-gram counts, it was almost always best to do min-max scaling, either to [0, 1] or to [-1, 1] (cf https://neerajkumar.org/writings/svm/). Another way to think about it is that min-max scaling to [0, 1] can be used in place of quantile normalization.
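For concreteness, here's a minimal scikit-learn sketch of those two preprocessing choices on a toy count-like feature matrix (the data and parameter values are just placeholders):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer

rng = np.random.default_rng(0)
# Toy stand-in for right-skewed, count-like features (e.g. n-gram counts).
X = rng.poisson(lam=1.5, size=(1000, 5)).astype(float)

# Min-max scaling to [0, 1] or to [-1, 1].
X_01 = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
X_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

# Quantile normalization maps each feature to an (approximately) uniform distribution on [0, 1].
X_q = QuantileTransformer(output_distribution="uniform", n_quantiles=100).fit_transform(X)
```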

One reason to make that substitution: you may want highly-skewed features to be transformed to have lower variance. Another example is discretization (ie quantization / binning / 1d clustering): you might want uniform constant-width bins, implemented with min-max scaling, instead of equally-populated bins, implemented with quantile normalization. Of course, discretization is usually not a good idea, but sometimes it is, particularly for dependent variables (eg whenever quantile regression is valuable).
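For the binning case, scikit-learn's KBinsDiscretizer exposes both choices directly; a small illustrative sketch (the bin count and toy data are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=(1000, 1))  # one right-skewed feature

# Constant-width bins: equivalent to min-max scaling followed by uniform cuts.
uniform_bins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform").fit_transform(x)

# Equally-populated bins: equivalent to quantile normalization followed by uniform cuts.
quantile_bins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile").fit_transform(x)
```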

Also -- apologies for a bit of self-promotion -- you can often obtain even better results by using a tunable "interpolation" between min-max normalization and quantile normalization, as I showed in "The Kernel Density Integral Transformation", C. McCarter, TMLR 2023. It acts as a nonparametric alternative to variance-stabilizing transforms, and it works even for left-skewed data, for example. It also produces far more intuitive discretization results than various clustering techniques, without requiring the user to specify the number of clusters.
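For anyone curious, here is a rough sketch of the core idea (not the implementation from the paper, which parameterizes the bandwidth differently): push each value through the CDF of a Gaussian KDE fit to that feature, then rescale the result to [0, 1]. A tiny bandwidth approximates the quantile transform; a huge bandwidth approximates min-max scaling.

```python
import numpy as np
from scipy.stats import norm

def kdi_transform(x, bandwidth):
    """Map x through the CDF of a Gaussian KDE fit to x, then rescale to [0, 1].

    Small bandwidth -> approaches the empirical CDF (quantile normalization);
    large bandwidth -> approaches min-max scaling.
    """
    x = np.asarray(x, dtype=float)
    # KDE CDF at each data point: average of Gaussian CDFs centered at the data.
    cdf_vals = norm.cdf((x[:, None] - x[None, :]) / bandwidth).mean(axis=1)
    # Rescale so the transformed feature spans [0, 1].
    return (cdf_vals - cdf_vals.min()) / (cdf_vals.max() - cdf_vals.min())

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)

near_quantile = kdi_transform(x, bandwidth=1e-3)  # ~ quantile normalization
near_minmax = kdi_transform(x, bandwidth=1e3)     # ~ min-max scaling
```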

author

Thanks for this info, and no problem with self-promotion. It sounds like normalization has advantages for some ML algorithms, but do you think it is fair to say it has no advantage as a pre-processing step for OLS?


Yes, I've never come across a circumstance where it seemed statistically appropriate for OLS. Having said that, it is computationally convenient when one has a massive sparse dataset with non-negative features (ie data.min() is 0), because it preserves sparsity. And if one is doing Lasso regression afterwards, one can then apply a Lasso screening rule to the preprocessed features (which will tend to remove the sparse-and-right-skewed features), and thus easily avoid ever materializing the data as a dense matrix during Lasso fitting.
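Roughly, that workflow looks like the sketch below (toy sizes throughout, and the screening step is a simplified strong-rule-style correlation threshold rather than a production screening rule):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
# "Massive" sparse non-negative design matrix (toy-sized here).
X = sp.random(10_000, 2_000, density=0.01, random_state=0, format="csc")
beta_true = np.zeros(2_000)
beta_true[:10] = 3.0                               # a few truly relevant features
y = X @ beta_true + 0.1 * rng.standard_normal(10_000)

# When every column's min is 0, min-max scaling reduces to dividing by the
# column max, which keeps zeros as zeros and so preserves sparsity.
col_max = X.max(axis=0).toarray().ravel()
col_max[col_max == 0] = 1.0                        # avoid dividing empty columns by zero
X_scaled = X @ sp.diags(1.0 / col_max)

# Per-feature correlation with the centered response, in scikit-learn's Lasso scaling.
corr = np.abs(X_scaled.T @ (y - y.mean())) / len(y)
alpha_max = corr.max()                             # smallest alpha that zeroes all coefficients
alpha = 0.7 * alpha_max

# Basic "strong rule"-style screen: discard feature j if its correlation falls
# below 2*alpha - alpha_max. Only the survivors are ever densified.
keep = np.flatnonzero(corr >= 2 * alpha - alpha_max)
X_small = X_scaled[:, keep].toarray()

model = Lasso(alpha=alpha).fit(X_small, y)
```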
