Cyan's Blog


Part.7_Feature_Scaling(ML_Andrew.Ng.)

Last updated Aug 6, 2021

# Feature Scaling

2021-08-06

Tags: #MachineLearning #FeatureEngineering

Further reading[^1]: https://sebastianraschka.com/Articles/2014_about_feature_scaling.html

# When to Use

If an algorithm uses gradient descent, differences in the ranges of the features lead to very different effective step sizes per feature. To make gradient descent move smoothly toward the minimum, with all parameters updated at a comparable rate, we scale the data before feeding it to the model. Having the features on a similar scale helps gradient descent converge to the minimum more quickly.
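As an illustration, here is a minimal sketch, assuming NumPy (the toy data, loss, and learning rate are made up for this example), of running a plain gradient-descent loop on standardized features:

```python
import numpy as np

# Toy data: two features on very different scales (e.g. age vs. salary).
X = np.array([[25.0,  50_000.0],
              [32.0,  64_000.0],
              [47.0, 120_000.0],
              [51.0,  98_000.0]])
y = np.array([0.0, 1.0, 3.0, 2.5])

# Standardize each column so both features produce gradients of a similar size.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

def gradient_descent(X, y, lr=0.1, steps=500):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
        w -= lr * grad
    return w

# With the raw X this learning rate diverges because the salary column
# dominates the gradient; with X_scaled it converges smoothly.
print(gradient_descent(X_scaled, y))
```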

Specifically, in the case of neural networks, feature scaling benefits optimization in several ways (see the sketch after this list):

  • It makes training faster
  • It helps prevent the optimization from getting stuck in poor local optima
  • It gives the error surface a better shape
  • Weight decay and Bayesian optimization can be applied more conveniently
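A hedged sketch of the practical takeaway, assuming scikit-learn (the dataset and hyperparameters are placeholders, not from the original post): scaling can be bundled into a pipeline so the network always trains on standardized inputs.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is fitted on the training split only and applied automatically,
# so the network always sees inputs on a comparable scale.
model = make_pipeline(StandardScaler(),
                      MLPClassifier(max_iter=1000, random_state=0))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```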

Distance-based algorithms like KNN, K-means, and SVM are the most affected by the range of the features, because behind the scenes they use distances between data points to measure similarity and hence perform the task at hand. We therefore scale the data before applying a distance-based algorithm so that all features contribute equally to the result.
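A small sketch, assuming NumPy (the age/salary numbers are invented for illustration), of how an unscaled feature dominates the Euclidean distance:

```python
import numpy as np

# Three hypothetical samples: (age in years, salary in euros).
a = np.array([30.0, 50_000.0])
b = np.array([60.0, 51_000.0])
c = np.array([31.0, 80_000.0])

# Without scaling, salary dominates the Euclidean distance completely.
print(np.linalg.norm(a - b), np.linalg.norm(a - c))   # ~1000 vs ~30000

# After min-max scaling, both features contribute to the distance.
X = np.vstack([a, b, c])
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(np.linalg.norm(X_scaled[0] - X_scaled[1]),
      np.linalg.norm(X_scaled[0] - X_scaled[2]))
```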

In PCA we are interested in the components that maximize the variance. If one feature (e.g. age) varies less than another (e.g. salary) only because of their respective scales, PCA will, on the unscaled data, decide that the direction of maximal variance corresponds almost entirely to the ‘salary’ axis. Since a change of one year in age is arguably far more significant than a change of one euro in salary, this conclusion is clearly wrong, so the features should be scaled first.
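A sketch of this effect, assuming scikit-learn and NumPy (the synthetic age/salary data is made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
age = rng.normal(40, 10, size=200)            # small numeric range
salary = rng.normal(50_000, 8_000, size=200)  # large numeric range
X = np.column_stack([age, salary])

# Without scaling, the first principal component points almost entirely
# along the salary axis, simply because salary has the larger variance.
print(PCA(n_components=1).fit(X).components_)

# After standardization, both features can influence the principal directions.
X_std = StandardScaler().fit_transform(X)
print(PCA(n_components=1).fit(X_std).components_)
```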

# Normalization (Min-Max Scaling)

$$x^\prime= \frac{x-x_{min}}{x_{max}-x_{min}}$$
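A direct implementation of this formula might look like the following sketch (assuming NumPy):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each column of X into the range [0, 1] using the formula above."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
print(min_max_normalize(X))   # every column now lies in [0, 1]
```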

# Standardization (Z-score)

(Figure: probability density function of the normal distribution)

$$x^\prime= \frac{x-\mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation (the square root of the variance).
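And the corresponding sketch for standardization, again assuming NumPy (scikit-learn's StandardScaler does the same thing column-wise):

```python
import numpy as np

def standardize(X):
    """Shift each column to zero mean and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
Z = standardize(X)
print(Z.mean(axis=0), Z.std(axis=0))   # ~0 and 1 for every column
```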

# How to Choose[^2]

Normalization is more suitable when the data does not follow a normal distribution; models such as KNN that make no assumptions about the data distribution work better with normalization.

Neural networks often require inputs in the range 0-1, in which case normalization is indispensable. Another example is image processing, where pixel values are confined to a fixed range (e.g. 0-255); there too, normalization is the better fit.

Standardization is more suitable when the data follows a normal distribution, and it places no bound on the scaled values (unlike normalization, which can map the data into an explicit range $[a,b]$).

In clustering, standardization works well when comparing similarity across features measured on different scales (why? #todo). Another example is PCA, where standardization is typically used to bring out differences in the spread of the data, rather than normalization, which simply maps the maximum value to one. ^375f2a

# Feature Scaling & Regression

In polynomial regression, feature scaling is important because the higher-order terms grow very quickly.
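A hedged sketch, assuming scikit-learn (the cubic toy data is invented), of standardizing the expanded polynomial terms inside a pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 0.5 * x.ravel() ** 3 - x.ravel() + rng.normal(scale=5.0, size=100)

# x ranges over 0..10 but x**3 ranges over 0..1000, so the expanded
# polynomial terms are standardized before fitting the linear model.
model = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                      StandardScaler(),
                      LinearRegression())
model.fit(x, y)
print(model.score(x, y))
```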

# Don’t Confuse Regularization, Normalization & Standardization


[^1]: https://sebastianraschka.com/Articles/2014_about_feature_scaling.html

[^2]: https://www.atoti.io/when-to-perform-a-feature-scaling/