Ridge Regression and Kernel Ridge Regression

References:
1. scikit-learn linear_model documentation on ridge regression
2. Machine Learning for Quantum Mechanics in a Nutshell
3. scikit-learn sample ridge-path plotting code by Fabian Pedregosa

Ridge regression

Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of coefficients. The ridge coefficients minimize a penalized residual sum of squares:
\[\min_{w} \left\| Xw-y\right\|_2^{2} + \lambda\left\|{w}\right\|_2^2\]
Here, \(\lambda \ge 0\) is a complexity parameter that controls the amount of shrinkage: the larger the value of \(\lambda\), the greater the amount of shrinkage, and the more robust the coefficients become to collinearity. The figure below shows the relationship between \(\lambda\) and the oscillations of the weights.
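
To make the shrinkage effect concrete, here is a minimal sketch in the spirit of the scikit-learn ridge-path example by Fabian Pedregosa (reference 3): it fits ridge models over a range of \(\lambda\) values on an ill-conditioned Hilbert design matrix and plots how the coefficients behave. The particular matrix, the grid of \(\lambda\) values, and the plot labels are assumptions of this sketch, not part of the original figure.

```python
# Sketch: ridge coefficients as a function of the regularization strength.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# Ill-conditioned 10x10 Hilbert design matrix and a simple target vector
X = 1.0 / (np.arange(1, 11) + np.arange(0, 10)[:, np.newaxis])
y = np.ones(10)

lambdas = np.logspace(-10, -2, 200)   # range of regularization strengths
coefs = []
for lam in lambdas:
    ridge = linear_model.Ridge(alpha=lam, fit_intercept=False)
    ridge.fit(X, y)
    coefs.append(ridge.coef_)

plt.plot(lambdas, coefs)
plt.xscale("log")
plt.xlabel("lambda")
plt.ylabel("ridge coefficients")
plt.title("Larger lambda shrinks the coefficients")
plt.show()
```

Note that scikit-learn calls the regularization constant `alpha`, which corresponds to \(\lambda\) in the formula above.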

Kernel Ridge Regression Theory

In kernel ridge regression the predictor is expanded in kernel functions centered on the training inputs, where \(\tilde{x}\) denotes a test input and \(x_i\) denotes the \(i\)-th training input:
\[f(\tilde{x}) = \sum_{i=1}^n \alpha_i k(\tilde{x}, x_i)\]
Although the dimensionality of the Hilbert space can be high, the solution lives in the finite span of the projected training data, enabling a finite representation. The corresponding convex optimization problem is:
\[\underset{\alpha \in \mathbb{R}^{n}}{\mathrm{argmin}} \sum_{i=1}^n \big(f(x_i) - y_i\big)^2 + \lambda\left\| f \right\|_{H}^{2}\]
\[\Leftrightarrow \underset{\alpha \in \mathbb{R}^{n}}{\mathrm{argmin}} \left\langle K\alpha-y,\, K\alpha-y \right\rangle + \lambda \alpha^{T}K\alpha\]
where \(\left\| f \right\|_{H}^{2}\) is the squared norm of \(f\) in the Hilbert space, which measures the complexity of the linear ridge regression model in feature space, and \(K \in \mathbb{R}^{n\times n}\), \(K_{i, j}=k(x_i, x_j)\), is the kernel matrix between training samples. As before, setting the gradient to zero yields an analytic solution for the regression coefficients:
\[\nabla_{\alpha}\left(\alpha^{T}K^{2}\alpha - 2\alpha^{T}Ky + y^{T}y + \lambda\alpha^{T}K\alpha\right) = 0 \Leftrightarrow K^{2}\alpha + \lambda K\alpha = Ky\]
\[\Leftrightarrow \alpha=(K+\lambda I)^{-1}y \]
where \(\lambda\) is a hyperparameter determining the strength of regularization. A small norm \(\left\| f \right\|_{H}\) corresponds to a smoother, simpler model.
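
As a hedged sketch of this closed-form solution, the snippet below builds a Gaussian kernel matrix and solves \((K+\lambda I)\alpha = y\) directly; the helper names and the toy data are illustrative assumptions rather than code from the references.

```python
# Sketch: closed-form KRR training, alpha = (K + lambda*I)^{-1} y
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Kernel matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def krr_fit(X, y, lam, sigma):
    """Solve (K + lam*I) alpha = y for the regression coefficients."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

# toy 1-D example
X_train = np.linspace(0, 2 * np.pi, 20).reshape(-1, 1)
y_train = np.cos(X_train).ravel()
alpha = krr_fit(X_train, y_train, lam=1e-14, sigma=1.0)
```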

The figure below gives an example of a KRR model with a Gaussian kernel and demonstrates the role of the length-scale hyperparameter \(\sigma\). Although \(\sigma\) is not part of the regularization term itself, it controls the smoothness of the predictor and therefore effectively regularizes the model.

The figure above shows kernel ridge regression with a Gaussian kernel and different length scales. The target function is \(\cos(x)\), and the KRR models are drawn as dashed lines. With the regularization constant \(\lambda\) fixed at \(10^{-14}\), a very small \(\sigma\) fits the training set well but has large error between training points, while a too-large \(\sigma\) results in an almost linear model, with high error on both training and prediction points.
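
A hedged reconstruction of this experiment using scikit-learn's `KernelRidge` is sketched below; the training grid and the particular \(\sigma\) values are assumptions. Note that scikit-learn parameterizes the RBF kernel by \(\gamma = 1/(2\sigma^2)\) and names the regularization constant `alpha`.

```python
# Sketch: effect of the Gaussian length scale sigma when learning cos(x)
import numpy as np
from sklearn.kernel_ridge import KernelRidge

X_train = np.linspace(0, 2 * np.pi, 10).reshape(-1, 1)
y_train = np.cos(X_train).ravel()
X_test = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
y_test = np.cos(X_test).ravel()

for sigma in (0.1, 1.0, 10.0):          # too small / reasonable / too large
    model = KernelRidge(alpha=1e-14, kernel="rbf", gamma=1.0 / (2 * sigma**2))
    model.fit(X_train, y_train)
    train_mse = np.mean((model.predict(X_train) - y_train) ** 2)
    test_mse = np.mean((model.predict(X_test) - y_test) ** 2)
    print(f"sigma={sigma}: train MSE={train_mse:.2e}, test MSE={test_mse:.2e}")
```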

From the description above we can see that all information about a KRR model is contained in the matrix \(\mathbf K\) of kernel evaluations between training inputs. Similarly, all information required to predict new inputs \(\tilde{x}\) is contained in the kernel matrix between training and prediction inputs.

Solving Kernel Ridge Regression

The regression coefficients \(\alpha\) are obtained by solving the linear system of equations \((\mathbf K + \lambda \mathbf I)\alpha=\mathbf y\), where \(\mathbf K + \lambda \mathbf I\) is symmetric and strictly positive definite. To solve this system we can use the Cholesky decomposition \(\mathbf K + \lambda \mathbf I = \mathbf{U}^{T}\mathbf{U}\), where \(\mathbf U\) is upper triangular. We then break the equation \(\mathbf{U}^{T}\mathbf{U}\alpha=\mathbf y\) into two triangular systems, first \(\mathbf{U}^{T}\beta=\mathbf y\) and then \(\mathbf{U}\alpha=\beta\). Since \(\mathbf{U}^{T}\) is lower triangular and \(\mathbf{U}\) is upper triangular, each system requires only a single straightforward pass over the data, called forward and backward substitution, respectively. For \(\mathbf{U}^{T}\beta=\mathbf y\), forward substitution proceeds as follows:
\[U_{1,1}^{T}\,\beta_1 = y_1 \;\Leftrightarrow\; \beta_1 = y_1/u_{1,1}\]
\[U_{2,1}^{T}\,\beta_1 + U_{2,2}^{T}\,\beta_2 = y_2 \;\Leftrightarrow\; \beta_2 = (y_2 - u_{1,2}\beta_1)/u_{2,2}\]
\[\vdots\]
\[\sum_{j=1}^{i} U_{i,j}^{T}\,\beta_j = y_i \;\Leftrightarrow\; \beta_i = \Big(y_i - \sum_{j=1}^{i-1} u_{j,i}\beta_j\Big)\Big/u_{i,i}\]
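A hedged sketch of this two-step triangular solve with SciPy is shown below; the function name is an illustrative assumption.

```python
# Sketch: solve (K + lambda*I) alpha = y via Cholesky factorization,
# then forward and backward substitution on the triangular factors.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def krr_fit_cholesky(K, y, lam):
    A = K + lam * np.eye(K.shape[0])
    U = cholesky(A, lower=False)                    # A = U^T U, U upper triangular
    beta = solve_triangular(U.T, y, lower=True)     # forward substitution:  U^T beta = y
    alpha = solve_triangular(U, beta, lower=False)  # backward substitution: U alpha = beta
    return alpha
```
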
Once the model is trained, predictions can be made: the prediction for a new input \(\mathbf{\tilde{x}}\) is the inner product between the vector of coefficients and the vector of corresponding kernel evaluations.

For a test dataset \(\mathbf{\tilde{X}} \in \mathbb{R}^{m \times d}\) with rows \(\tilde{x}_1, \ldots, \tilde{x}_m\), the kernel matrix of training versus prediction inputs is \(\mathbf{L} \in \mathbb{R}^{n \times m}\) with \(L_{i,j}=k(x_i, \tilde{x}_j)\), and the vector of predictions is \(\mathbf{L}^{T}\alpha\).
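A minimal prediction sketch along these lines follows; it reuses a Gaussian kernel helper, which is an assumption of this example.

```python
# Sketch: KRR prediction, y~_j = sum_i alpha_i k(x_i, x~_j)
import numpy as np

def gaussian_kernel(A, B, sigma):
    # entries exp(-||a_i - b_j||^2 / (2 sigma^2))
    d2 = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-d2 / (2.0 * sigma**2))

def krr_predict(X_train, X_test, alpha, sigma):
    L = gaussian_kernel(X_train, X_test, sigma)   # L[i, j] = k(x_i, x~_j)
    return L.T @ alpha
```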
This method has the same order of complexity as ordinary least squares. In the next note I will go into the details of implementing KRR.
