[Machine Learning] Study Notes: Linear Regression

Model

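Suppose there are $m$ input-output pairs. The input of the $i$-th pair is written $x^i$ and its output $y^i$. A pair $\{x^i,y^i\}$ is called a training example, and the collection of all pairs is called the training set.
If each input in $X$ has $n$ features, the $i$-th input can be written as the set $\{x^i_1,x^i_2,\dots ,x^i_n\}$.
To describe the model, we define a hypothesis function $h:X\to Y$: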
$h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n$
It can also be written in matrix form:
$\begin{align*}h_\theta(x) =\begin{bmatrix}\theta_0 \hspace{2em} \theta_1 \hspace{2em} ... \hspace{2em} \theta_n\end{bmatrix}\begin{bmatrix}x_0 \newline x_1 \newline \vdots \newline x_n\end{bmatrix}= \theta^T x\end{align*}$
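(Note: one-dimensional vectors are conventionally written as column vectors. Here $x_0 = 1$, so that $\theta_0$ acts as the intercept term.)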
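As a rough illustration of the matrix form, the sketch below evaluates $\theta^T x$ for several samples at once with NumPy. The helper names `add_intercept` and `hypothesis` and the sample numbers are made up for this example; the only convention assumed is that a column of ones is prepended so that $x_0 = 1$.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones (x_0 = 1) to an (m, n) feature matrix."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def hypothesis(theta, X):
    """Evaluate h_theta(x) = theta^T x for every row of X (intercept column included)."""
    return X @ theta

# Two hypothetical samples with two features each.
X = add_intercept([[1.0, 2.0], [3.0, 4.0]])
theta = np.array([0.5, 1.0, -1.0])  # theta_0, theta_1, theta_2
predictions = hypothesis(theta, X)
print(predictions)
```

Stacking the samples as rows of one matrix lets a single product `X @ theta` replace a loop over samples.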
The accuracy of the hypothesis function can be evaluated with a cost function.

Cost Function

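The cost function goes over every sample and takes the mean of the squared residuals between the predicted and actual values (with an extra factor of $\frac{1}{2}$ that simplifies the derivative later).
$J(\theta) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}^{i}- y^{i} \right)^2 = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x^{i}) - y^{i} \right)^2$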

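Clearly, the smaller the value of the cost function, the more accurate the hypothesis function.
This leads to two methods for adjusting the parameters $\theta$ so that $J$ is minimized: gradient descent and the normal equation.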
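The cost function can be sketched in a few lines of NumPy. The function name `cost` and the tiny dataset are assumptions for illustration; `X` is assumed to already contain the $x_0 = 1$ column.

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared residuals; X includes the x_0 = 1 column."""
    m = len(y)
    residuals = X @ theta - y          # h_theta(x^i) - y^i for every sample
    return residuals @ residuals / (2 * m)

# Hypothetical data generated by y = x, so theta = (0, 1) fits exactly.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

j_perfect = cost(np.array([0.0, 1.0]), X, y)  # exact fit
j_zero = cost(np.array([0.0, 0.0]), X, y)     # predicts 0 everywhere
print(j_perfect, j_zero)
```

The exact fit gives a cost of zero, while the all-zero parameters give $(1 + 4 + 9) / 6$, which matches the intuition that a smaller $J$ means a better hypothesis.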

Gradient Descent

The gradient descent algorithm is:

repeat until convergence:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$

Taking the partial derivative for a single sample (dropping the sum over the $m$ samples):
\[\begin{equation*}
\begin{split}
\frac{\partial}{\partial \theta_j} J(\theta) & = \frac{\partial}{\partial \theta_j}\frac{1}{2m}( h_\theta(\boldsymbol{x})-y)^2 \\
& =2\cdot\frac{1}{2m}\cdot( h_\theta(\boldsymbol{x})-y)\cdot\frac{\partial}{\partial \theta_j}( h_\theta(\boldsymbol{x})-y) \\
& = \frac{1}{m}(h_\theta(\boldsymbol{x})-y)\cdot \frac{\partial}{\partial \theta_j}(\sum_{k=0}^{n}\theta_k x_k-y) \\
& =\frac{1}{m} (h_\theta(\boldsymbol{x})-y)x_j \\
\end{split}
\end{equation*}\]
$\alpha$ is the learning rate, which corresponds to the step size in the figure above.
For a single sample, this gives:
$\theta_j := \theta_j - \alpha \frac{1}{m} (h_\theta(x^i)-y^i)x_{j}^{i}$

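This is the well-known LMS (least mean squares) update rule, also called the Widrow-Hoff learning rule: the magnitude of each update to $\theta$ depends on the size of the error term. From the single-sample case we have derived how to update $\theta$ so that the function converges. When there are multiple training samples, there are two ways to update $\theta$: batch mode and stochastic mode.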
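The single-sample update rule can be sketched as follows. This is only an illustration: the name `lms_update` and the numbers are invented, and the $\frac{1}{m}$ factor is taken with $m = 1$ because only one sample is involved.

```python
import numpy as np

def lms_update(theta, x_i, y_i, alpha):
    """One LMS (Widrow-Hoff) step on a single sample; x_i includes x_0 = 1.

    Assumes m = 1, so the 1/m factor in the update rule disappears.
    """
    error = x_i @ theta - y_i            # h_theta(x^i) - y^i
    return theta - alpha * error * x_i   # updates every theta_j simultaneously

theta = np.zeros(2)
theta_new = lms_update(theta, np.array([1.0, 2.0]), 3.0, alpha=0.1)
print(theta_new)
```

Because the step is scaled by `error`, a sample that is already predicted well barely moves the parameters, while a badly predicted one moves them a lot.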

(PS: this blog post explains the topic in great detail, but the signs in its last two formulas are wrong.)

batch mode:

Every update goes through all of the samples:
\[
\begin{align*} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)}\newline \; & \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_1^{(i)} \newline \; & \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_2^{(i)} \newline & \cdots \newline \rbrace \end{align*}
\]
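A vectorized sketch of the batch update might look like this. The function name, the default `alpha`, and the fixed iteration count are assumptions made for the example; in particular, a fixed number of iterations stands in for "repeat until convergence".

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Batch gradient descent for linear regression; X includes the x_0 = 1 column.

    A fixed iteration count replaces a real convergence check (an assumption).
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        # (1/m) * sum_i (h_theta(x^i) - y^i) * x_j^i, for all j at once
        gradient = X.T @ (X @ theta - y) / m
        theta = theta - alpha * gradient   # simultaneous update of every theta_j
    return theta

# Hypothetical data generated by y = 1 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = batch_gradient_descent(X, y)
print(theta)
```

Computing the whole gradient before touching `theta` keeps the update simultaneous, which is what the braces in the formula above express.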

Feature Scaling

Before running gradient descent, it is best to normalize each feature.
Normalization formula:
\[x_j := \dfrac{x_j - \mu_j}{s_j}\]
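$\mu_j$: the sample mean of feature $j$
$s_j$: the sample standard deviation of feature $j$ (the range, max minus min, is also commonly used)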
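A minimal sketch of this normalization, assuming $s_j$ is taken to be the standard deviation of each feature column. The name `standardize` and the data are illustrative; note that `np.std` divides by $m$ by default rather than $m - 1$.

```python
import numpy as np

def standardize(X):
    """Scale each feature column to zero mean and unit standard deviation.

    Apply this to the raw features only, before adding the x_0 = 1 column,
    since a constant column has zero standard deviation.
    """
    mu = X.mean(axis=0)   # per-feature mean
    s = X.std(axis=0)     # per-feature standard deviation (divides by m)
    return (X - mu) / s, mu, s

# Two hypothetical features on very different scales.
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_scaled, mu, s = standardize(X)
print(X_scaled)
```

Returning `mu` and `s` matters in practice: new inputs have to be scaled with the same values that were computed on the training set.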

Normal Equation

Formula

$\theta = (X^T X)^{-1}X^T y$

Derivation
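One common way to arrive at this formula, sketched here under two assumptions: $X$ is the $m \times (n+1)$ matrix whose $i$-th row is $(x^i)^T$ with $x_0 = 1$, and $X^T X$ is invertible. Write the cost function in matrix form, set its gradient to zero, and solve for $\theta$:

\[
\begin{align*}
J(\theta) &= \frac{1}{2m}(X\theta - y)^T (X\theta - y) \\
\nabla_\theta J(\theta) &= \frac{1}{m} X^T (X\theta - y) = 0 \\
X^T X \theta &= X^T y \\
\theta &= (X^T X)^{-1} X^T y
\end{align*}
\]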

Comparison with Gradient Descent

| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose $\alpha$ | No need to choose $\alpha$ |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$, need to calculate the inverse of $X^TX$ |
| Works well when $n$ is large | Slow if $n$ is very large |
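To make the "no iterations, no $\alpha$" side of the comparison concrete, here is a small sketch of the normal equation in NumPy. The function name and data are assumptions; instead of forming the inverse explicitly, it calls `np.linalg.solve` on the system $X^T X \theta = X^T y$, which gives the same result when $X^T X$ is invertible.

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least-squares fit; X includes the x_0 = 1 column.

    Solves (X^T X) theta = X^T y instead of computing the inverse directly.
    Assumes X^T X is invertible.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical data generated by y = 1 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = normal_equation(X, y)
print(theta)
```

There is nothing to tune and no loop, but the cost of solving the $(n+1) \times (n+1)$ system grows roughly cubically with the number of features, which is why the table favors gradient descent when $n$ is very large.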