glmnetUtils: quality of life enhancements for elastic net regression with glmnet
The glmnetUtils package provides a collection of tools to streamline the process of fitting elastic net models with glmnet. I wrote the package after a couple of projects where I found myself writing the same boilerplate code to convert a data frame into a predictor matrix and a response vector. In addition to providing a formula interface, it also has a function (cvAlpha.glmnet) to do crossvalidation for both elastic net parameters α and λ, as well as some utility functions.
The formula interface
The interface that glmnetUtils provides is very much the same as for most modelling functions in R. To fit a model, you provide a formula and data frame. You can also provide any arguments that glmnet will accept. Here is a simple example:
mtcarsMod <- glmnet(mpg ~ cyl + disp + hp, data=mtcars) ## Call:
## glmnet.formula(formula = mpg ~ cyl + disp + hp, data = mtcars)
##
## Model fitting options:
## Sparse model matrix: FALSE
## Use model.frame: FALSE
## Alpha: 1
## Lambda summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03326 0.11690 0.41000 1.02800 1.44100 5.05500
Under the hood, glmnetUtils creates a model matrix and response vector, and passes them to the glmnet package to do the actual model fitting. Prediction also works as you'd expect: just pass a data frame containing the new observations, along with any arguments thatpredict.glmnet needs.
# least squares regression: get predictions for lambda=1
predict(mtcarsMod, newdata=mtcars, s=1)
Building the model matrix
You may have noticed the options "use model.frame" and "sparse model matrix" in the printed output above. glmnetUtils includes a couple of options to improve performance, especially on wide datasets and/or have many categorical (factor) variables.
The standard R method for creating a model matrix out of a data frame uses the model.framefunction, which has a major disadvantage when it comes to wide data. It generates a termsobject, which specifies how the original columns of data relate to the columns in the model matrix. This involves creating and storing a (roughly) square matrix of size p × p, where p is the number of variables in the model. When p > 10000, which isn't uncommon these days, the terms object can exceed a gigabyte in size. Even if there is enough memory to store the object, processing it can be very slow.
Another issue with the standard approach is the treatment of factors. Normally, model.matrixwill turn an N-level factor into an indicator matrix with N−1 columns, with one column being dropped. This is necessary for unregularised models as fit with lm and glm, since the full set of Ncolumns is linearly dependent. However, this may not be appropriate for a regularised model as fit with glmnet. The regularisation procedure shrinks the coefficients towards zero, which forces the estimated differences from the baseline to be smaller. But this only makes sense if the baseline level was chosen beforehand, or is otherwise meaningful as a default; otherwise it is effectively making the levels more similar to an arbitrarily chosen level.
To deal with these problems, glmnetUtils by default will avoid using model.frame, instead building up the model matrix term-by-term. This avoids the memory cost of creating a terms object, and can be much faster than the standard approach. It will also include one column in the model matrix for all levels in a factor; that is, no baseline level is assumed. In this situation, the coefficients represent differences from the overall mean response, and shrinking them to zero is meaningful (usually). Machine learners may also recognise this as one-hot encoding.
glmnetUtils can also generate a sparse model matrix, using the sparse.model.matrix function provided in the Matrix package. This works exactly the same as a regular model matrix, but takes up significantly less memory if many of its entries are zero. A scenario where this is the case would be where many of the predictors are factors, each with a large number of levels.
Crossvalidation for α
One piece missing from the standard glmnet package is a way of choosing α, the elastic net mixing parameter, similar to how cv.glmnet chooses λ, the shrinkage parameter. To fix this, glmnetUtils provides the cvAlpha.glmnet function, which uses crossvalidation to examine the impact on the model of changing α and λ. The interface is the same as for the other functions:
# Leukemia dataset from Trevor Hastie's website:
# http://web.stanford.edu/~hastie/glmnet/glmnetData/Leukemia.RData
load("~/Leukemia.rdata")
leuk <- do.call(data.frame, Leukemia) cvAlpha.glmnet(y ~ ., data=leuk, family="binomial") ## Call:
## cvAlpha.glmnet.formula(formula = y ~ ., data = leuk, family = "binomial")
##
## Model fitting options:
## Sparse model matrix: FALSE
## Use model.frame: FALSE
## Alpha values: 0 0.001 0.008 0.027 0.064 0.125 0.216 0.343 0.512 0.729 1
## Number of crossvalidation folds for lambda: 10
cvAlpha.glmnet uses the algorithm described in the help for cv.glmnet, which is to fix the distribution of observations across folds and then call cv.glmnet in a loop with different values of α. Optionally, you can parallelise this outer loop, by setting the outerParallel argument to a non-NULL value. Currently, glmnetUtils supports the following methods of parallelisation:
- Via
parLapplyin the parallel package. To use this, setouterParallelto a valid cluster object created bymakeCluster. - Via
rxExecas supplied by Microsoft R Server’s RevoScaleR package. To use this, setouterParallelto a valid compute context created byRxComputeContext, or a character string specifying such a context.
Conclusion
The glmnetUtils package is a way to improve quality of life for users of glmnet. As with many R packages, it’s always under development; you can get the latest version from my GitHub repo. The easiest way to install it is via devtools:
library(devtools)
install_github("hong-revo/glmnetUtils")
A more detailed version of this post can also be found at the package vignette. If you find a bug, or if you want to suggest improvements to the package, please feel free to contact me athongooi@microsoft.com.
转自:http://blog.revolutionanalytics.com/2016/11/glmnetutils.html
glmnetUtils: quality of life enhancements for elastic net regression with glmnet的更多相关文章
- wlan的QOS配置
WLAN QoS配置 1.1 WLAN QoS简介 802.11网络提供了基于竞争的无线接入服务,但是不同的应用需求对于网络的要求是不同的,而原始的网络不能为不同的应用提供不同质量的接入服务,所以已 ...
- L1和L2特征的适用场景
How to decide which regularization (L1 or L2) to use? Is there collinearity among some features? L2 ...
- Machine and Deep Learning with Python
Machine and Deep Learning with Python Education Tutorials and courses Supervised learning superstiti ...
- Kaggle实战之一回归问题
0. 前言 1.任务描述 2.数据概览 3. 数据准备 4. 模型训练 5. kaggle实战 0. 前言 "尽管新技术新算法层出不穷,但是掌握好基础算法就能解决手头 90% 的机器学习问题 ...
- Google云平台使用方法 | Hail | GWAS | 分布式回归 | LASSO
参考: Hail Hail - Tutorial windows也可以安装:Spark在Windows下的环境搭建 spark-2.2.0-bin-hadoop2.7 - Hail依赖的平台,并行处 ...
- Overfitting & Regularization
Overfitting & Regularization The Problem of overfitting A common issue in machine learning or ma ...
- stacking method house price in kaggle top10%
整合几部分代码的汇总 隐藏代码片段 导入python数据和可视化包 导入统计相关的工具 导入回归相关的算法 导入数据预处理相关的方法 导入模型调参相关的包 读取数据 特征工程 缺失值 类别特征处理-l ...
- Java Programming Language Enhancements
引用:Java Programming Language Enhancements Java Programming Language Enhancements Enhancements in Jav ...
- EasyMesh - A Two-Dimensional Quality Mesh Generator
EasyMesh - A Two-Dimensional Quality Mesh Generator eryar@163.com Abstract. EasyMesh is developed by ...
随机推荐
- CTF入门指南(0基础)
ctf入门指南 如何入门?如何组队? capture the flag 夺旗比赛 类型: Web 密码学 pwn 程序的逻辑分析,漏洞利用windows.linux.小型机等 misc 杂项,隐写,数 ...
- idea + mybatis generator + maven 插件使用
idea + mybatis generator + maven 插件使用 采用的是 generator 的 maven 插件的方式 ~ 1 pom.xml mybatis其它配置一样,下面是配置my ...
- MFC使用SQLite 学习系列 一: SQLITE_MISUSE错误
一 为什么要选择SQLite 由于使用文本文件来记录测试数据,速度越来越慢的问题,经过园友推荐,使用了SQLite来进行数据的存储,再次感谢园友@LightSmaile. 关于这个问题,可以参考一下上 ...
- 你说你精通CSS,真的吗?
以前做项目的时候,学习了HTML和CSS,感觉这两个比较简单,在W3school里学习了一下之后,就觉得自己已经没问题了.可是,真正要做一个好看的页面,我还是要写好久.其实,对于CSS,我并没有像我以 ...
- jQuery的发展史
jQuery的发展史,你知道吗? 每天多学一点知识,就少写一行代码2006年1月,jQuery的第一个版本面世,至今已经有6年多了(注:这个时间点是截止至出书时间).虽然过了这么久,但它依然以其简洁. ...
- Python中字符串拼接的三种方式
在Python中,我们经常会遇到字符串的拼接问题,在这里我总结了三种字符串的拼接方式: 1.使用加号(+)号进行拼接 加号(+)号拼接是我第一次学习Python常用的方法,我们只需要把我们要加 ...
- JavaScript 函数的定义-调用、注意事项
函数定义 函数语句定义 function(a,b){ return a+b; } 表达式定义 var add = function(a,b){return a+b}; //函数表达式可以包含名称,这在 ...
- IO流输入 输出流 字符字节流
一.流 1.流的概念 流是一组有顺序的,有起点和终点的字节集合,是对数据传输的总称或抽象.即数据在两设备间的传输称为流,流的本质是数据传输,根据数据传输特性将流抽象为各种类,方便更直观的进行数据操作. ...
- 基于 Haproxy 构建负载均衡集群
1.HAPROXY简介 HAProxy提供高可用性.负载均衡以及基于TCP和HTTP应用的代理,支持虚拟主机,它是免费.快速并且可靠的一种负载均衡解决方案.HAProxy特别适用于那些负载特大的web ...
- php超时任务处理
首先,不知道fastcgi_finish_request是啥的点这里. 一直知道php有个fastcgi_finish_request可以用来针对web应用处理耗时任务,但我一直以为直接fastcg ...