[C3] Regularization

Regularization: Solving the Problem of Overfitting

Underfitting (High Bias) vs. Overfitting (High Variance)

Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data.

It is usually caused by a function that is too simple or uses too few features.

In short, underfitting (high bias) means the hypothesis does not fit the training set well.

At the other extreme, overfitting, or high variance, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data.

It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.

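In short, overfitting (high variance) means the hypothesis fits the training set well, but the function is too complex and has too many parameters, and there is not enough data to constrain the model (m < n), so it fails to generalize to new examples.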
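The contrast can be sketched with a small Octave experiment. The six data points and the polynomial degrees below are made up for illustration and are not from the course; the sketch only assumes the standard `polyfit` and `polyval` functions.

```octave
% Illustrative data (hypothetical): a roughly linear trend with small noise.
x = [0 1 2 3 4 5]';
y = [0.1 0.9 2.2 2.8 4.1 5.0]';

% Degree-1 fit: two parameters, follows the overall trend.
p1 = polyfit(x, y, 1);
err1 = mean((polyval(p1, x) - y).^2);

% Degree-5 fit: six parameters for six points, so the curve passes
% through every training point and the training error is numerically ~0.
p5 = polyfit(x, y, 5);
err5 = mean((polyval(p5, x) - y).^2);

fprintf('training MSE, degree 1: %g\n', err1);
fprintf('training MSE, degree 5: %g\n', err5);

% Predictions at an unseen input x = 6: the line stays close to the trend,
% while the degree-5 curve swings far away from it.
fprintf('prediction at x = 6, degree 1: %g\n', polyval(p1, 6));
fprintf('prediction at x = 6, degree 5: %g\n', polyval(p5, 6));
```

The degree-5 model wins on training error but loses on the new input, which is the high-variance behavior described above.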

This terminology applies to both linear and logistic regression. There are two main options to address overfitting:

1) Reduce the number of features
   a) Manually select which features to keep
   b) Use a model selection algorithm (studied later in the course)

2) Regularization
   a) Keep all the features, but reduce the magnitude of parameters \(\theta_j\).
   b) Regularization works well when we have a lot of slightly useful features.

Regularization - Linear Regression Cost Function

The regularization term never includes \(\theta_0\).

\(J(\theta)=\frac{1}{2m} \Bigg[ \sum\limits_{i=1}^m \Big( h_\theta(x^{(i)}) - y^{(i)} \Big)^2 + \lambda \sum\limits_{j=1}^n \theta_j^2 \Bigg]\)

A vectorized implementation is:

\(\overrightarrow{h}=X \overrightarrow{\theta}\)

\(J(\theta)=\frac{1}{2m} \cdot \Bigg[ (\overrightarrow{h}-\overrightarrow{y})^T \cdot (\overrightarrow{h}-\overrightarrow{y}) + \lambda \cdot (\overrightarrow{l} \cdot \overrightarrow{\theta}^{.2}) \Bigg]\)

\(\overrightarrow{l} = [0, 1, 1, ...1]\)

Code implementation:

m = length(y);
l = ones(1, length(theta)); l(:,1) = 0;  % mask: exclude theta_0 from the penalty
J = 1/(2*m) * ((X * theta - y)' * (X * theta - y) + lambda * (l * (theta.^2)));

or

J = 1/(2*m) * ((X * theta - y)' * (X * theta - y) + lambda * (theta'*theta - theta(1,:).^2));
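As a quick sanity check, the two forms can be evaluated on a toy problem. The data, `theta`, and `lambda` below are hypothetical values chosen so the result is easy to verify by hand: the fit is perfect, so only the penalty term contributes.

```octave
% Hypothetical toy data: y = x exactly, with an intercept column in X.
X = [1 1; 1 2; 1 3];
y = [1; 2; 3];
theta = [0; 1];      % fits the data perfectly, so the squared-error term is 0
lambda = 3;

m = length(y);
l = ones(1, length(theta)); l(1) = 0;   % mask that excludes theta_0

err = X * theta - y;
J_mask = 1/(2*m) * (err' * err + lambda * (l * (theta.^2)));
J_alt  = 1/(2*m) * (err' * err + lambda * (theta' * theta - theta(1)^2));

% By hand: J = 1/(2*3) * (0 + 3 * 1^2) = 0.5 for both forms.
fprintf('J (mask form) = %g\n', J_mask);
fprintf('J (alt form)  = %g\n', J_alt);
```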

Regularization - Logistic Regression Cost Function

The regularization term never includes \(\theta_0\).

\(J(\theta)=-\frac{1}{m} \sum\limits_{i=1}^m \Bigg[ y^{(i)} \cdot log \bigg(h_\theta(x^{(i)}) \bigg) + (1-y^{(i)}) \cdot log \bigg(1-h_\theta(x^{(i)}) \bigg) \Bigg] + \frac{\lambda}{2m} \sum\limits_{j=1}^n \theta_j^2\)

A vectorized implementation is:

\(\overrightarrow{h}=g(X \overrightarrow{\theta})\)

\(J(\theta)=\frac{1}{m} \cdot \Big( -\overrightarrow{y}^T \cdot log(\overrightarrow{h}) - (1- \overrightarrow{y})^T \cdot log(1- \overrightarrow{h}) \Big) + \frac{\lambda}{2m} (\overrightarrow{l} \cdot \overrightarrow{\theta}^{.2})\)

\(\overrightarrow{l} = [0, 1, 1, ...1]\)

Code implementation:

m = length(y);
l = ones(1, length(theta)); l(:,1) = 0;  % mask: exclude theta_0 from the penalty
J = (1/m)*(-y'*log(sigmoid(X*theta)) - (1 - y)'*log(1 - sigmoid(X*theta))) + ...
    (lambda/(2*m))*(l*(theta.^2));

or

J = (1/m)*(-y'*log(sigmoid(X*theta)) - (1 - y)'*log(1 - sigmoid(X*theta))) + ...
    (lambda/(2*m))*(theta'*theta - theta(1,:).^2);
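The following sketch checks two properties of this cost on made-up data: at \(\theta = 0\) every prediction is 0.5, so the cost is \(\log 2\), and changing \(\lambda\) has no effect when only \(\theta_0\) is non-zero. The data and the helper name `costFn` are assumptions for illustration, and `sigmoid` is defined inline so the snippet is self-contained.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical toy data; the first column of X is x_0 = 1.
X = [1 0.5; 1 -1.2; 1 2.0; 1 0.3];
y = [1; 0; 1; 0];
m = length(y);

% Regularized logistic cost as a function of theta and lambda.
% The penalty sums theta(2:end).^2, i.e. it skips theta_0.
costFn = @(theta, lambda) ...
    (1/m) * (-y' * log(sigmoid(X*theta)) - (1 - y)' * log(1 - sigmoid(X*theta))) ...
    + (lambda/(2*m)) * sum(theta(2:end).^2);

J_zero = costFn([0; 0], 1);          % every prediction is 0.5, so J = log(2)
J_bias_a = costFn([3; 0], 0);        % only theta_0 is non-zero
J_bias_b = costFn([3; 0], 10);       % same value: theta_0 is not penalized
J_slope_a = costFn([0; 2], 0);
J_slope_b = costFn([0; 2], 10);      % larger by (10/(2*4)) * 2^2 = 5

fprintf('J at theta = 0: %g (log(2) = %g)\n', J_zero, log(2));
fprintf('bias-only cost, lambda 0 vs 10: %g vs %g\n', J_bias_a, J_bias_b);
fprintf('penalty added for theta_1 = 2: %g\n', J_slope_b - J_slope_a);
```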

Gradient Descent for Regularized Linear and Logistic Regression

The regularization term never includes \(\theta_0\).

\(\begin{cases} \theta_0:=\theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^m \Big( h_\theta(x^{(i)}) - y^{(i)} \Big) \cdot x_0^{(i)} \\ \\ \theta_j:=\theta_j - \alpha \Bigg[ \frac{1}{m} \sum\limits_{i=1}^m \Big( h_\theta(x^{(i)}) - y^{(i)} \Big) \cdot x_j^{(i)} + \frac{\lambda}{m} \cdot \theta_j \Bigg] \end{cases}\)

A vectorized implementation of the gradient is:

\(\nabla J(\theta) = \frac{1}{m} \cdot \Big( X^T \cdot (\overrightarrow{h} - \overrightarrow{y}) \Big) + \frac{\lambda}{m} \cdot \theta^{'}\)

\(\theta^{'} = \begin{bmatrix} 0\\[0.3em]\theta_1\\[0.3em]\theta_2\\[0.3em].\\[0.3em].\\[0.3em].\\[0.3em]\theta_n \end{bmatrix}\)

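Code implementation (logistic regression; for linear regression, replace sigmoid(X*theta) with X*theta):

reg_theta = theta; reg_theta(1, :) = 0;  % zero out theta_0 so it is not regularized
grad = (1/m)*(X'*(sigmoid(X*theta) - y)) + (lambda/m)*reg_theta;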
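One way to gain confidence in the gradient formula is to compare it with a finite-difference approximation of the regularized logistic cost. This is a minimal sketch on made-up data; `costFn`, `epsilon`, and the toy values are assumptions for illustration.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical toy problem; the first column of X is x_0 = 1.
X = [1 0.5 1.5; 1 -1.2 0.3; 1 2.0 -0.7; 1 0.3 0.9];
y = [1; 0; 1; 0];
theta = [0.1; -0.2; 0.3];
lambda = 1;
m = length(y);

% Regularized logistic cost (penalty skips theta_0).
costFn = @(t) ...
    (1/m) * (-y' * log(sigmoid(X*t)) - (1 - y)' * log(1 - sigmoid(X*t))) ...
    + (lambda/(2*m)) * sum(t(2:end).^2);

% Analytic gradient, following the vectorized formula.
reg_theta = theta; reg_theta(1) = 0;
grad = (1/m) * (X' * (sigmoid(X*theta) - y)) + (lambda/m) * reg_theta;

% Central-difference approximation, one coordinate at a time.
epsilon = 1e-5;
num_grad = zeros(size(theta));
for j = 1:length(theta)
  e = zeros(size(theta));
  e(j) = epsilon;
  num_grad(j) = (costFn(theta + e) - costFn(theta - e)) / (2 * epsilon);
end

fprintf('max abs difference: %g\n', max(abs(grad - num_grad)));
```

A difference close to zero suggests the analytic gradient and the cost function agree, including the handling of \(\theta_0\).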

Final form: with some manipulation, the update rule for \(\theta_j\) can also be represented as:

\(\begin{cases} \theta_0:=\theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^m \Big( h_\theta(x^{(i)}) - y^{(i)} \Big) \cdot x_0^{(i)} \\ \\ \theta_j:=\theta_j (1- \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum\limits_{i=1}^m \Big( h_\theta(x^{(i)}) - y^{(i)} \Big) \cdot x_j^{(i)} \end{cases}\)
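The manipulation is a single regrouping step. A sketch of it, using only the symbols already defined: distribute \(\alpha\) over the bracket in the update for \(\theta_j\), then collect the two \(\theta_j\) terms.

\(\theta_j := \theta_j - \alpha \frac{\lambda}{m} \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^m \Big( h_\theta(x^{(i)}) - y^{(i)} \Big) \cdot x_j^{(i)} = \theta_j (1- \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum\limits_{i=1}^m \Big( h_\theta(x^{(i)}) - y^{(i)} \Big) \cdot x_j^{(i)}\)

As an illustration with assumed values \(\alpha = 0.01\), \(\lambda = 1\), \(m = 100\): \(1 - \alpha \frac{\lambda}{m} = 1 - 0.0001 = 0.9999\), so each iteration first scales \(\theta_j\) down by 0.01% and then applies the usual unregularized gradient step.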

\(1 - \alpha\frac{\lambda}{m}\) is always less than 1 when \(\alpha\), \(\lambda\), and \(m\) are positive. Intuitively, it shrinks \(\theta_j\) by a small amount on every update. Notice that the second term is exactly the same as in the unregularized update.

Regularizing the Normal Equation for Linear Regression

The regularization term never includes \(\theta_0\).

Now let's approach regularization using the alternate method of the non-iterative normal equation.

To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:

Original: \(\overrightarrow{\theta} = (X^TX)^{-1}X^T \overrightarrow{y}\)

Regularized: \(\overrightarrow{\theta} = (X^TX + \lambda L)^{-1}X^T \overrightarrow{y}\)

\(L = \begin{bmatrix} 0 & & & & \\[0.3em] & 1 & & & \\[0.3em] & & 1 & & \\[0.3em] & & & \ddots & \\[0.3em] & & & & 1 \end{bmatrix}\)

L is a matrix with 0 at the top left and 1's down the diagonal, with 0's everywhere else. It should have dimension (n+1)×(n+1).

Intuitively, \(\lambda L\) is the identity matrix multiplied by a single real number \(\lambda\), except that the top-left entry is 0 so that the term for \(x_0\) is not included.

Recall that if m < n, then \(X^TX\) is non-invertible. However, when we add the term \(\lambda \cdot L\) with \(\lambda > 0\), then \(X^TX + \lambda \cdot L\) becomes invertible.
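The regularized normal equation can be sketched in Octave as follows. The data are made up, with deliberately fewer examples than features (m = 2, n = 3), and the system is solved with the backslash operator rather than an explicit inverse, which is a design choice of this sketch and not something the notes prescribe.

```octave
% Hypothetical data with fewer examples than features: m = 2, n = 3.
X = [1 2 3 4;
     1 5 6 8];            % first column is x_0 = 1
y = [1; 2];
lambda = 1;

k = size(X, 2);           % k = n + 1
L = eye(k);
L(1, 1) = 0;              % do not regularize theta_0

A = X' * X + lambda * L;
fprintf('rank of X^T X:            %d of %d\n', rank(X' * X), k);
fprintf('rank of X^T X + lambda L: %d of %d\n', rank(A), k);

% Solve A * theta = X^T y instead of forming inv(A).
theta = A \ (X' * y);

% At the minimizer, the gradient of the regularized cost
% (up to the 1/m factor) should vanish.
g = X' * (X * theta - y) + lambda * L * theta;
fprintf('gradient norm at theta:   %g\n', norm(g));
```

Here \(X^TX\) has rank 2 and cannot be inverted, while the regularized matrix has full rank, so the solve succeeds.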

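Program Code

Regularization has already been added to the code for the other exercises, such as linear regression, logistic regression, and neural networks, and can be found there. To run without regularization, simply set lambda = 0.

To get the source code and other files, click "Fork me on GitHub" in the upper-right corner and clone the repository yourself.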