Part 7: Feature Scaling (ML, Andrew Ng)
# Feature Scaling
Tags: #MachineLearning #FeatureEngineering
Further reading: https://sebastianraschka.com/Articles/2014_about_feature_scaling.html
# When to Use
- For gradient descent, scaling the data makes the gradient updates smoother
If an algorithm uses gradient descent, differences in feature ranges lead to very different step sizes along each feature. To ensure that gradient descent moves smoothly towards the minimum and that the weights for all features are updated at comparable rates, we scale the data before feeding it to the model. Having features on a similar scale helps gradient descent converge more quickly towards the minimum (a short numerical sketch follows the list below).
Specifically, in the case of neural networks, feature scaling benefits optimization in several ways:
- It makes the training faster
- It prevents the optimization from getting stuck in local optima
- It gives a better error surface shape
- Weight decay and Bayesian optimization can be done more conveniently
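To make the step-size problem concrete, here is a minimal NumPy sketch (the synthetic "age"/"salary" features, the learning rates, and the step counts are all illustrative assumptions, not from the course): on the raw features the learning rate has to stay tiny so that the large-magnitude salary dimension does not diverge, which leaves the other weights almost frozen, while on standardized features one much larger learning rate works for every dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic features on very different scales: "age" (~10^1) vs "salary" (~10^4).
age = rng.uniform(20, 60, 200)
salary = rng.uniform(20_000, 80_000, 200)
X = np.column_stack([age, salary])
y = 3.0 * age + 0.001 * salary + rng.normal(0, 1, 200)

def gd_mse(X, y, lr, steps=500):
    """Plain batch gradient descent for linear regression (with a bias column); returns the final MSE."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        grad = 2 * Xb.T @ (Xb @ w - y) / len(y)
        w -= lr * grad
    return np.mean((Xb @ w - y) ** 2)

# Raw features: the learning rate must stay tiny or the salary weight diverges,
# so the bias and age weights barely move and the error stays large.
print("unscaled MSE:", gd_mse(X, y, lr=1e-10))

# Standardized features: every dimension tolerates the same, much larger learning
# rate, and gradient descent reaches a far lower error in the same number of steps.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print("scaled   MSE:", gd_mse(X_std, y, lr=0.1))
```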
- For distance-based algorithms, scaling makes the features contribute to the distances more evenly
Distance-based algorithms like KNN, K-means, and SVM are most affected by the range of features. This is because behind the scenes they are using distances between data points to determine their similarity and hence perform the task at hand. Therefore, we scale our data before employing a distance-based algorithm so that all the features contribute equally to the result.
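A minimal sketch of this effect, assuming scikit-learn is available (the synthetic data and the label rule are made up for illustration): the label depends only on "age", but the raw Euclidean distance is dominated by "salary", so KNN performs near chance until the features are put on a common scale.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400
age = rng.uniform(20, 60, n)
salary = rng.uniform(20_000, 80_000, n)     # irrelevant to the label, but huge in magnitude
X = np.column_stack([age, salary])
y = (age > 40).astype(int)                  # the label depends only on age

raw_knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Without scaling, the Euclidean distance is dominated by salary, so accuracy is near chance.
print("raw    accuracy:", cross_val_score(raw_knn, X, y, cv=5).mean())
# With both features on the same scale, age can actually drive the neighbour search.
print("scaled accuracy:", cross_val_score(scaled_knn, X, y, cv=5).mean())
```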
- In PCA, scaling brings out the genuine "variation" in the data (otherwise a tiny relative change in a large-magnitude feature can outweigh a several-fold change in a small-magnitude one)
In PCA we are interested in the components that maximize the variance. If one feature (e.g. age) varies less than another (e.g. salary) purely because of its units, PCA may conclude that the direction of maximal variance lies almost entirely along the 'salary' axis when the features are left unscaled. Since a change of one year in age is arguably far more significant than a change of one euro in salary, that conclusion is clearly wrong.
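The same contrast can be checked numerically. A small sketch, assuming scikit-learn (the uniform synthetic age/salary data is an illustrative assumption): on raw data essentially all of the explained variance is assigned to the salary axis, while after standardization the two independent features share it roughly equally.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
age = rng.uniform(20, 60, 500)
salary = rng.uniform(20_000, 80_000, 500)
X = np.column_stack([age, salary])

pca_raw = PCA(n_components=2).fit(X)
pca_std = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Raw units: virtually all variance is attributed to the salary axis.
print("raw explained variance ratio:", pca_raw.explained_variance_ratio_)
# Standardized units: the two independent features share the variance roughly evenly.
print("std explained variance ratio:", pca_std.explained_variance_ratio_)
```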
# Normalization
$$x^\prime= \frac{x-x_{min}}{x_{max}-x_{min}}$$
You can control the target range yourself; for example, to map the data into $[a,b]$ the formula becomes: $$x^{\prime}=a+\left(\frac{x-\min (x)}{\max (x)-\min (x)}\right)(b-a)$$
Normalization is very sensitive to outliers
It shrinks very large values and changes the order of magnitude of the data
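A minimal NumPy sketch of both formulas above; the sample vector and the target range $[a, b] = [-1, 1]$ are arbitrary illustrative choices. The outlier at 1000 also demonstrates the sensitivity just mentioned: it squeezes every other value toward the bottom of the range.

```python
import numpy as np

x = np.array([1.0, 5.0, 10.0, 1_000.0])   # note the outlier at 1000

# Min-max normalization into [0, 1]
x_01 = (x - x.min()) / (x.max() - x.min())

# Generalized version into an arbitrary range [a, b]
a, b = -1.0, 1.0
x_ab = a + (x - x.min()) / (x.max() - x.min()) * (b - a)

print(x_01)   # ≈ [0, 0.004, 0.009, 1] -- the outlier crushes the small values toward 0
print(x_ab)
```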
# Standardization
- Taking normally distributed data as an example: standardization is like collapsing all the differently shaped Gaussian curves onto the single standard normal curve
$$x^\prime= \frac{x-\mu}{\sigma}$$ where $\mu$ is the mean and $\sigma$ is the standard deviation (the square root of the variance)
Not as sensitive to outliers
Standardization also changes the order of magnitude; it subtracts the mean, so the result is zero-centered
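A minimal NumPy sketch of the standardization formula, reusing the same illustrative vector as in the normalization example: the result is zero-centered with unit variance and, unlike min-max scaling, is not confined to any fixed range.

```python
import numpy as np

x = np.array([1.0, 5.0, 10.0, 1_000.0])

# Standardization: subtract the mean, divide by the standard deviation.
x_std = (x - x.mean()) / x.std()

print(x_std)                          # zero-centered, not confined to any fixed range
print(x_std.mean(), x_std.std())      # ~0.0 and 1.0
```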
# How to Choose Between the Two
Normalization is more appropriate when the data does not follow a normal distribution; models like KNN that make no assumptions about the data distribution pair well with normalization.
Neural networks often expect inputs in the 0-1 range, in which case normalization is essential; another example is image processing, where pixel intensities (e.g. 0-255) are rescaled into a fixed range, and normalization is again the natural choice.
Standardization is more appropriate when the data follows a normal distribution, and it places no bounds on the scaled values (unlike normalization, which can pin the data to an explicit range $[a,b]$).
In clustering, standardization works well when comparing similarity across different features (why? #todo); another example is PCA, where standardization is usually preferred in order to bring out the variance structure of the data, rather than normalization, which merely maps the maximum to one. ^375f2a
- In short:
    - Standardization suits normally distributed data
    - Normalization suits non-normally distributed data
    - With normalization, outliers have a pronounced effect on the result
    - When in doubt, try both and compare which performs better (a small pipeline sketch follows below)
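One way to "try both and compare", sketched with scikit-learn (the wine dataset, the KNN model, and k = 5 are arbitrary illustrative choices): put each scaler into a pipeline and let cross-validation decide.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_wine(return_X_y=True)

# Same model, same data, only the scaler changes -- keep whichever scores higher.
for scaler in (MinMaxScaler(), StandardScaler()):
    pipe = make_pipeline(scaler, KNeighborsClassifier(n_neighbors=5))
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(type(scaler).__name__, round(score, 3))
```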
# Feature Scaling & Regression
In polynomial regression, feature scaling is important because the higher-order terms grow very quickly in magnitude.
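A small sketch, assuming scikit-learn (the synthetic quadratic data, the degree, and the SGD settings are illustrative assumptions): expanding a feature in the hundreds up to degree 3 produces terms around $10^6$, so a StandardScaler is placed after PolynomialFeatures and before the gradient-based regressor, which would otherwise struggle to converge.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=(300, 1))
y = 0.02 * x[:, 0] ** 2 + rng.normal(0, 5, 300)

model = make_pipeline(
    PolynomialFeatures(degree=3),   # generates 1, x, x^2, x^3 -- magnitudes up to ~10^6
    StandardScaler(),               # brings all polynomial terms back to a comparable scale
    SGDRegressor(max_iter=2000, random_state=0),
)
model.fit(x, y)
print("R^2 on the training data:", round(model.score(x, y), 3))
```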
# Don’t Confuse Regularization, Normalization & Standardization
- Regularization: a penalty added to the loss function to discourage overly complex models (a training technique, not a data transform)
- Normalization: rescaling features into a fixed range such as $[0,1]$ (min-max scaling)
- Standardization: rescaling features to zero mean and unit variance (z-score scaling)