Regularized Linear Regression with scikit-learn

Earlier we covered Ordinary Least Squares regression. In this posting we will build upon that foundation and introduce an important extension to linear regression, regularization, which makes it applicable to ill-posed problems (e.g. number of predictors >> number of samples) and helps to prevent overfitting.

This is part of a series of blog posts showing how to do common statistical learning techniques with Python. We provide only a small amount of background on the concepts and techniques we cover, so if you’d like a more thorough explanation check out Introduction to Statistical Learning or sign up for the free online course run by the book’s authors here.

Regularized Linear Regression

In a previous posting we introduced linear regression and polynomial regression. Polynomial regression fits an n-th order polynomial to our data using least squares. There’s a question that we didn’t answer: which order of the polynomial should we choose? Clearly, the higher the order of the polynomial, the higher the complexity of the model. This is true both computationally and conceptually, because in both cases we now have a larger number of adaptable parameters. The higher the complexity of a model, the more variance it can capture. Given that computation is cheap, should we always pick the most complex model? As we will show below, the answer is no: we have to strike a balance between variance and (inductive) bias. Our model needs to have sufficient complexity to model the relationship between the predictors and the response, but it must not fit the idiosyncrasies of our training data, idiosyncrasies which will limit its ability to generalize to new, unseen cases.

This is best illustrated using a simple curve fitting example, adapted from C. Bishop’s Pattern Recognition and Machine Learning (2007). Let’s create a synthetic dataset by adding some random Gaussian noise to a sinusoidal function.

In [1]:
%pylab inline

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.cross_validation import train_test_split

try:
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.pipeline import make_pipeline
except ImportError:
    # use backports for sklearn 0.14
    # available from https://s3.amazonaws.com/datarobotblog/notebooks/sklearn_backports.py
    print("use backports")
    from sklearn_backports import PolynomialFeatures
    from sklearn_backports import make_pipeline

# ignore DeprecationWarnings by sklearn
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

np.random.seed(9)

def f(x):
    return np.sin(2 * np.pi * x)

# generate points used to plot
x_plot = np.linspace(0, 1, 100)

# generate points and keep a subset of them
n_samples = 100
X = np.random.uniform(0, 1, size=n_samples)[:, np.newaxis]
y = f(X) + np.random.normal(scale=0.3, size=n_samples)[:, np.newaxis]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8)

ax = plt.gca()
ax.plot(x_plot, f(x_plot), color='green')
ax.scatter(X_train, y_train, s=10)
ax.set_ylim((-2, 2))
ax.set_xlim((0, 1))
ax.set_ylabel('y')
ax.set_xlabel('x')
 
Populating the interactive namespace from numpy and matplotlib
use backports
Out[1]:
<matplotlib.text.Text at 0x5008e90>
 

Now let’s see how different polynomials can approximate this curve.

In [2]:
def plot_approximation(est, ax, label=None):
    """Plot the approximation of ``est`` on axis ``ax``. """
    ax.plot(x_plot, f(x_plot), color='green')
    ax.scatter(X_train, y_train, s=10)
    ax.plot(x_plot, est.predict(x_plot[:, np.newaxis]), color='red', label=label)
    ax.set_ylim((-2, 2))
    ax.set_xlim((0, 1))
    ax.set_ylabel('y')
    ax.set_xlabel('x')
    ax.legend(loc='upper right')  #, fontsize='small')

fig, axes = plt.subplots(2, 2, figsize=(8, 5))

# fit different polynomials and plot approximations
for ax, degree in zip(axes.ravel(), [0, 1, 3, 9]):
    est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    est.fit(X_train, y_train)
    plot_approximation(est, ax, label='degree=%d' % degree)

plt.tight_layout()
 

In the plot above we see that the polynomial of degree zero is just a constant approximation, the polynomial of degree one fits a straight line, the polynomial of degree three nicely approximates the ground truth, and finally, the polynomial of degree nine has nearly zero training error but does a poor job of approximating the ground truth because it already fits the variance induced by the random Gaussian noise that we added to our data.

If we plot the training and testing error as a function of the degree of the polynomial we can see what’s happening: the higher the degree of the polynomial (our proxy for model complexity), the lower the training error. The testing error decreases too, but it eventually reaches its minimum at a degree of three and then starts increasing at a degree of seven.

This phenomenon is called overfitting: the model is already so complex that it fits the idiosyncrasies of our training data, idiosyncrasies which limit the model’s ability to generalize (as measured by the testing error).

In [3]:
from sklearn.metrics import mean_squared_error

train_error = np.empty(10)
test_error = np.empty(10)
for degree in range(10):
    est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    est.fit(X_train, y_train)
    train_error[degree] = mean_squared_error(y_train, est.predict(X_train))
    test_error[degree] = mean_squared_error(y_test, est.predict(X_test))

plt.plot(np.arange(10), train_error, color='green', label='train')
plt.plot(np.arange(10), test_error, color='red', label='test')
plt.ylim((0.0, 1e0))
plt.ylabel('log(mean squared error)')
plt.xlabel('degree')
plt.legend(loc='lower left')
Out[3]:
<matplotlib.legend.Legend at 0x58d64d0>
 

In the above example, the optimal choice for the degree of the polynomial approximation would be between three and six. However, there is an alternative to manually selecting the degree of the polynomial: we can add a constraint to our linear regression model that constrains the magnitude of the coefficients in the regression model. This constraint is called the regularization term, and the technique is often called shrinkage in the statistical community because it shrinks the coefficients towards zero. In the context of polynomial regression, constraining the magnitude of the regression coefficients effectively acts as a smoothness assumption: by constraining the L2 norm of the regression coefficients we express our preference for smooth functions rather than wiggly ones.

A popular regularized linear regression model is Ridge Regression. This adds the L2 norm of the coefficients to the ordinary least squares objective:

$$J(\beta) = \frac{1}{n}\sum_{i=0}^{n}\left(y_i - \beta^T x_i'\right)^2 + \alpha \lVert \beta \rVert_2^2$$

where β is the vector of coefficients including the intercept term, and x′i is the vector of predictors of the i-th data point including a constant predictor for the intercept. The L2 norm term is weighted by a regularization parameter alpha: if alpha=0 then you recover the Ordinary Least Squares regression model; the larger alpha is, the stronger the smoothness constraint.
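To make the objective concrete, here is a minimal NumPy sketch, not from the original notebook, of the closed-form minimizer β = (XᵀX + αI)⁻¹Xᵀy that this kind of penalized least-squares objective admits (written for the un-normalized sum of squares; the 1/n factor only rescales alpha). It assumes the design matrix already contains a constant column for the intercept, as in the formula above, and that the intercept is penalized like any other coefficient (scikit-learn's Ridge leaves the intercept unpenalized).

import numpy as np

def ridge_coefficients(X, y, alpha):
    # Closed-form ridge solution for sum((y - X beta)**2) + alpha * ||beta||^2:
    #     beta = (X'X + alpha * I)^-1 X'y
    # X is assumed to already include a constant column for the intercept.
    n_features = X.shape[1]
    A = X.T.dot(X) + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T.dot(y))

# tiny usage example on made-up data (true coefficients chosen arbitrarily)
rng = np.random.RandomState(0)
X_demo = np.hstack([np.ones((20, 1)), rng.rand(20, 3)])
y_demo = X_demo.dot([0.5, 2.0, -1.0, 0.0]) + rng.normal(scale=0.1, size=20)
print(ridge_coefficients(X_demo, y_demo, alpha=1.0))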

Below you can see the approximation of a sklearn.linear_model.Ridge estimator fitting a polynomial of degree nine for various values of alpha (left) and the corresponding coefficient loadings (right). The smaller the value of alpha, the higher the magnitude of the coefficients, so the functions we can model become more and more wiggly.

In [4]:
fig, ax_rows = plt.subplots(4, 2, figsize=(8, 10))

def plot_coefficients(est, ax, label=None, yscale='log'):
    coef = est.steps[-1][1].coef_.ravel()
    if yscale == 'log':
        ax.semilogy(np.abs(coef), marker='o', label=label)
        ax.set_ylim((1e-1, 1e8))
    else:
        ax.plot(np.abs(coef), marker='o', label=label)
    ax.set_ylabel('abs(coefficient)')
    ax.set_xlabel('coefficients')
    ax.set_xlim((1, 9))

degree = 9
alphas = [0.0, 1e-8, 1e-5, 1e-1]
for alpha, ax_row in zip(alphas, ax_rows):
    ax_left, ax_right = ax_row
    est = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=alpha))
    est.fit(X_train, y_train)
    plot_approximation(est, ax_left, label='alpha=%r' % alpha)
    plot_coefficients(est, ax_right, label='Ridge(alpha=%r) coefficients' % alpha)

plt.tight_layout()
 

Regularization techniques

In the above example we used Ridge Regression, a regularized linear regression technique that puts an L2 norm penalty on the regression coefficients. Another popular regularization technique is the LASSO, which puts an L1 norm penalty on the coefficients instead. The difference between the two is that the LASSO leads to sparse solutions, driving most coefficients to zero, whereas Ridge Regression leads to dense solutions, in which most coefficients are non-zero. The intuition behind the sparseness property of the L1 norm penalty can be seen in the plot below. The plot shows the value of the penalty in the coefficient space, here a space with two coefficients w0 and w1. The L2 penalty appears as a cone in this space whereas the L1 penalty is a diamond. The objective function of a regularized linear model is just the ordinary least squares objective plus the (weighted) penalty term, and the point that minimizes the objective function is where those two error surfaces meet. In the case of the L1 penalty this is usually at a spike of the diamond, a sparse solution because some coefficients are zero. For the L2 penalty, on the other hand, the optimal point generally has non-zero coefficients. Another popular regularization technique is the Elastic Net, the convex combination of the L2 norm and the L1 norm. It too leads to a sparse solution.

L2 and L1 regularization differ in how they cope with correlated predictors: L2 will divide the coefficient loading equally among them whereas L1 will place all the loading on one of them while shrinking the others towards zero. Elastic Net combines the advantages of both: it tends to either select a group of correlated predictors in which case it puts equal loading on all of them, or it completely shrinks the group.
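To see this behaviour on a toy example, here is a small sketch that is not part of the original post: it fits Ridge, Lasso and Elastic Net on two almost perfectly correlated predictors. With settings like these you would typically see Ridge split the loading between the two columns, Lasso concentrate it on one, and Elastic Net fall somewhere in between (the alpha and l1_ratio values are illustrative only).

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.RandomState(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly identical to x1
X_corr = np.column_stack([x1, x2])
y_corr = 3.0 * x1 + rng.normal(scale=0.5, size=n)

# compare how each penalty distributes the loading across the correlated columns
for est in [Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)]:
    est.fit(X_corr, y_corr)
    print(est.__class__.__name__, est.coef_)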

Scikit-learn provides separate classes for the LASSO and the Elastic Net: sklearn.linear_model.Lasso and sklearn.linear_model.ElasticNet. In contrast to Ridge Regression, the solution for both the LASSO and the Elastic Net has to be computed numerically. The classes above use an optimization technique called coordinate descent. Alternatively, you can use the class sklearn.linear_model.SGDRegressor, which uses stochastic gradient descent instead and is often more efficient for large-scale, high-dimensional and sparse data.
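As a rough sketch of the SGDRegressor alternative mentioned above, fitted on the training data from this post (the penalty settings below are illustrative and not taken from the original notebook):

from sklearn.linear_model import SGDRegressor

# stochastic gradient descent with an Elastic Net penalty;
# alpha and l1_ratio are illustrative values, not tuned
sgd = SGDRegressor(penalty='elasticnet', alpha=1e-4, l1_ratio=0.15)
sgd.fit(X_train, y_train.ravel())
print(sgd.coef_, sgd.intercept_)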

In [5]:
from sklearn.linear_model import Lasso

fig, ax_rows = plt.subplots(2, 2, figsize=(8, 5))

degree = 9
alphas = [1e-3, 1e-2]
for alpha, ax_row in zip(alphas, ax_rows):
    ax_left, ax_right = ax_row
    est = make_pipeline(PolynomialFeatures(degree), Lasso(alpha=alpha))
    est.fit(X_train, y_train)
    plot_approximation(est, ax_left, label='alpha=%r' % alpha)
    plot_coefficients(est, ax_right, label='Lasso(alpha=%r) coefficients' % alpha,
                      yscale=None)

plt.tight_layout()
 
/home/pprett/workspace/scikit-learn/sklearn/linear_model/coordinate_descent.py:481: UserWarning: Objective did not converge. You might want to increase the number of iterations
' to increase the number of iterations')
 

Regularization Path Plots

Another handy diagnostic tool for regularized linear regression is the use of so-called regularization path plots. These show the coefficient loadings (y-axis) against the regularization parameter alpha (x-axis). Each (non-zero) coefficient is represented by a line in this space. The example below is taken from the scikit-learn documentation. You can see that the smaller the alpha (i.e. the higher the –log(alpha)), the higher the magnitude of the coefficients and the more predictors are selected. You can also see that the Elastic Net tends to select more predictors, distributing the loading evenly among them, whereas the L1 penalty tends to select fewer predictors.

Regularization path plots can be efficiently created using coordinate descent optimization methods, but they are harder to create with (stochastic) gradient descent optimization methods. Scikit-learn provides a number of convenience functions to create such plots for coordinate descent based regularized linear regression models: sklearn.linear_model.lasso_path and sklearn.linear_model.enet_path.
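Assuming a scikit-learn version in which lasso_path and enet_path return the alphas and coefficient paths directly (rather than a list of fitted models), a path plot for the polynomial features used in this post could be sketched roughly like this; the eps and l1_ratio values are illustrative, not taken from the scikit-learn example:

from sklearn.linear_model import lasso_path, enet_path

# expand the training data into the degree-9 polynomial features used above
X_poly = PolynomialFeatures(degree=9).fit_transform(X_train)

# coefficient paths over a decreasing grid of alpha values
alphas_lasso, coefs_lasso, _ = lasso_path(X_poly, y_train.ravel(), eps=1e-3)
alphas_enet, coefs_enet, _ = enet_path(X_poly, y_train.ravel(), eps=1e-3, l1_ratio=0.8)

plt.plot(-np.log10(alphas_lasso), coefs_lasso.T, linestyle='-')    # solid: Lasso
plt.plot(-np.log10(alphas_enet), coefs_enet.T, linestyle='--')     # dashed: Elastic Net
plt.xlabel('-log10(alpha)')
plt.ylabel('coefficients')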

Download Notebook View on NBViewer

This post was written by Peter Prettenhofer and Mark Steadman.  Please post any feedback, comments, or questions below or send us an email at <firstname>@datarobot.com.
