Cyan's Blog


Last updated Jul 31, 2021

# Why do cost functions use the square error?

2021-07-31

Tags: #MachineLearning #CostFunction #MeanSquareError

Reference: StackExchange: why-do-cost-functions-use-the-square-error?

The following is a very good StackExchange explanation of why the mean squared error is used:

# Question:

I’m just getting started with some machine learning, and until now I have been dealing with linear regression over one variable. I have learnt that there is a hypothesis: $h_{\theta}(x)=\theta_{0}+\theta_{1} x$.

To find good values for the parameters $\theta_{0}$ and $\theta_{1}$, we want to minimize the difference between the calculated result and the actual result of our test data. So we take the difference $h_{\theta}\left(x^{(i)}\right)-y^{(i)}$ for every $i$ from $1$ to $m$, sum these differences, and average by multiplying the sum by $\frac{1}{m}$. So far, so good. This would result in: $\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)$

But this is not what has been suggested. Instead, the course suggests taking the square of the difference and multiplying by $\frac{1}{2 m}$, so the formula is: $\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}$

Why is that? Why do we use the square function here, and why do we multiply by $\frac{1}{2 m}$ instead of $\frac{1}{m}$?


# Answer:

Your loss function would not work because it incentivizes setting $\theta_{1}$ to any finite value and $\theta_{0}$ to $-\infty$.

Let’s call $r(x, y)=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)$ the residual for $h$.

Your goal is to make $r$ as close to zero as possible, not just to minimize it. A large negative value is just as bad as a large positive value.

EDIT: You can counter this by artificially limiting the parameter space $\Theta$ (e.g. requiring $\left|\theta_{0}\right| < 10$). In this case, the optimal parameters would lie on certain points on the boundary of the parameter space. See https://math.stackexchange.com/q/896388/12467. This is not what you want.
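As a quick numeric sketch of this failure mode (the data and parameter values below are made up purely for illustration), compare the two cost functions from the question as $\theta_{0}$ is pushed toward $-\infty$:

```python
import numpy as np

# Toy data, roughly following y = 2 + 3x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.9, 8.2, 10.8, 14.1])

def mean_residual_cost(theta0, theta1):
    # The cost proposed in the question: (1/m) * sum(h(x) - y), no square.
    return np.mean(theta0 + theta1 * x - y)

def mean_squared_cost(theta0, theta1):
    # The cost suggested by the course: (1/2m) * sum((h(x) - y)**2).
    return np.mean((theta0 + theta1 * x - y) ** 2) / 2

for theta0 in (2.0, -100.0, -1e6):
    print(f"theta0={theta0:>10}: "
          f"mean residual cost = {mean_residual_cost(theta0, 3.0):.2f}, "
          f"squared cost = {mean_squared_cost(theta0, 3.0):.2f}")
# The un-squared cost keeps decreasing as theta0 -> -inf;
# the squared cost is smallest near the sensible fit and then blows up.
```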

# Why do we use the square loss

The squared error $(u-v)^{2}$, with $u=h(x)$ and $v=y$, forces $h(x)$ and $y$ to match. It is minimized at $u=v$, if possible, and is always $\geq 0$, because it is the square of the real number $u-v$.

$|u-v|$ would also work for the above purpose, as would $(u-v)^{2 n}$, with $n$ some positive integer. The first of these is actually used (it’s called the $\ell_{1}$ loss; you might also come across the $\ell_{2}$ loss, which is another name for the squared error[^4]).
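A tiny sketch of these alternatives (the function names are just for this illustration): each one is zero exactly when the prediction matches the target and positive otherwise.

```python
def l1_loss(u, v):
    return abs(u - v)                # absolute error, the "l1" loss

def l2_loss(u, v):
    return (u - v) ** 2              # squared error, the "l2" loss

def even_power_loss(u, v, n=2):
    return (u - v) ** (2 * n)        # any even power behaves the same way

# All three are minimized (at zero) when u == v.
for loss in (l1_loss, l2_loss, even_power_loss):
    print(loss.__name__, loss(3.0, 5.0), loss(5.0, 5.0))
```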

So, why is the squared loss better than these? This is a deep question related to the link between Frequentist[^1] and Bayesian[^2] inference. In short, the squared error relates to Gaussian noise[^3].

If your data does not fit all points exactly, i.e. $h(x)-y$ is not zero for some point no matter what $\theta$ you choose (as will always happen in practice), that might be because of noise. In any complex system there will be many small independent causes for the difference between your model $h$ and reality $y$: measurement error, environmental factors, etc. By the Central Limit Theorem (CLT), the total noise would be distributed Normally, i.e. according to the Gaussian distribution. We want to pick the best fit $\theta$ taking this noise distribution into account. Assume $R=h(X)-Y$, the part of $Y$ that your model cannot explain, follows the Gaussian distribution $\mathcal{N}(\mu, \sigma)$. We’re using capitals because we’re talking about random variables now.
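To spell out that link, here is the standard maximum-likelihood step, assuming as above that the noise is Gaussian with fixed variance $\sigma^{2}$ and with its mean absorbed into the model. The likelihood of one observation is

$$
p\left(y^{(i)} \mid x^{(i)} ; \theta\right)=\frac{1}{\sqrt{2 \pi}\, \sigma} \exp \left(-\frac{\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}}{2 \sigma^{2}}\right)
$$

so the log-likelihood of the whole dataset is

$$
\log \prod_{i=1}^{m} p\left(y^{(i)} \mid x^{(i)} ; \theta\right)=-\frac{1}{2 \sigma^{2}} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}+\text{const.}
$$

Maximizing this over $\theta$ is exactly the same as minimizing $\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}$.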

The Gaussian distribution has two parameters, the mean $\mu=\mathbb{E}[R]=\frac{1}{m} \sum_{i}\left(h_{\theta}\left(X^{(i)}\right)-Y^{(i)}\right)$ and the variance $\sigma^{2}=\mathbb{E}\left[R^{2}\right]=\frac{1}{m} \sum_{i}\left(h_{\theta}\left(X^{(i)}\right)-Y^{(i)}\right)^{2}$ (treating the mean as already absorbed, which is exactly what the bias term below takes care of).

To take both the mean and the variance into account simultaneously, we include a bias term in our model (to handle the systematic error $\mu$), and then minimize the squared loss.
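A quick numeric check of this point (toy data, made up for illustration): with an intercept $\theta_{0}$ in the model, the least-squares fit drives the mean residual to essentially zero, so what remains in the squared loss is the noise variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2 + 3x plus Gaussian noise with standard deviation 0.5.
x = rng.uniform(0, 10, size=200)
y = 2 + 3 * x + rng.normal(0, 0.5, size=x.size)

# Least-squares fit of h(x) = theta0 + theta1 * x (intercept included).
A = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

residuals = A @ theta - y
print("mean residual:", residuals.mean())                 # ~0: the bias soaks up mu
print("mean squared residual:", (residuals ** 2).mean())  # ~0.25 = sigma^2
```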

Follow-up questions:

# Regarding the $\frac{1}{2}$ term

The $\frac{1}{2}$ does not matter, and actually neither does the $m$: they are both positive constants, so the optimal value of $\theta$ is the same in both cases. The $\frac{1}{2}$ is kept purely for convenience, because it cancels the factor of $2$ that appears when you differentiate the square.
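Concretely, differentiating the suggested cost with respect to $\theta_{1}$ shows the cancellation:

$$
\frac{\partial}{\partial \theta_{1}} \frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x^{(i)}
$$

Multiplying the whole cost by any positive constant only rescales this gradient (equivalently, the learning rate); it does not move the minimizer.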




[^1]: https://en.wikipedia.org/wiki/Frequentist_inference

[^2]: https://en.wikipedia.org/wiki/Bayesian_inference

[^3]: https://en.wikipedia.org/wiki/Gaussian_noise

[^4]: When I learned this in high school, the meaning behind the name of the least squares method (最小二乘法) was never explained; in hindsight, it can be understood as "the method that takes the least squared error".