Targeted Learning R Packages for Causal Inference and Machine Learning (repost)
Targeted learning methods build machine-learning-based estimators of parameters defined as features of the probability distribution of the data, while also providing influence-curve or bootstrap-based confidence intervals. The theory offers a general template for creating targeted maximum likelihood estimators for a data structure, nonparametric or semiparametric statistical model, and parameter mapping. These estimators of causal inference parameters are double robust and have a variety of other desirable statistical properties.
Targeted maximum likelihood estimation builds on the loss-based "super learning" system so that lower-dimensional parameters can be targeted (e.g., a marginal causal effect); the remaining bias for the (low-dimensional) target feature of the probability distribution is removed. Targeted learning for effect estimation and causal inference allows for the complete integration of machine learning advances in prediction while providing statistical inference for the target parameter(s) of interest. Further details about these methods can be found in the many targeted learning papers as well as the 2011 targeted learning book.
Practical tools for the implementation of targeted learning methods for effect estimation and causal inference have developed alongside the theoretical and methodological advances. While some work has been done to develop computational tools for targeted learning in proprietary programming languages, such as SAS, the majority of the code has been built in R.
Of key importance are the two R packages SuperLearner and tmle. Ensembling with SuperLearner allows us to use many algorithms to generate an ideal prediction function that is a weighted average of all the algorithms considered. The SuperLearner package, authored by Eric Polley (NCI), is flexible, allowing for the integration of dozens of prespecified potential algorithms found in other packages as well as a system of wrappers that provide the user with the ability to design their own algorithms, or include newer algorithms not yet added to the package. The package returns multiple useful objects, including the cross-validated predicted values, final predicted values, vector of weights, and fitted objects for each of the included algorithms, among others.
Below is sample code with the ensembling prediction package SuperLearner using a small simulated data set.
library(SuperLearner) # the wrappers below also require the nnet and randomForest packages

## Generate simulated data ##
set.seed(27)
n <- 500
data <- data.frame(W1 = runif(n, min = 0.5, max = 1),
                   W2 = runif(n, min = 0, max = 1),
                   W3 = runif(n, min = 0.25, max = 0.75),
                   W4 = runif(n, min = 0, max = 1))
data <- transform(data, # add W5 dependent on W2, W3
                  W5 = rbinom(n, 1, 1 / (1 + exp(1.5 * W2 - W3))))
data <- transform(data, # add Y dependent on W1, W2, W4, W5
                  Y = rbinom(n, 1, 1 / (1 + exp(-(-0.2 * W5 - 2 * W1 + 4 * W5 * W1 - 1.5 * W2 + sin(W4))))))
summary(data)

## Specify a library of algorithms ##
SL.library <- c("SL.nnet", "SL.glm", "SL.randomForest")

## Run the super learner to obtain predicted values for the super learner as well as CV risk for algorithms in the library ##
fit.data.SL <- SuperLearner(Y = data[, 6], X = data[, 1:5], SL.library = SL.library,
                            family = binomial(), method = "method.NNLS", verbose = TRUE)

## Run the cross-validated super learner to obtain its CV risk ##
fitSL.data.CV <- CV.SuperLearner(Y = data[, 6], X = data[, 1:5], V = 10,
                                 SL.library = SL.library, verbose = TRUE,
                                 method = "method.NNLS", family = binomial())

## Cross-validated risks ##
mean((data[, 6] - fitSL.data.CV$SL.predict)^2) # CV risk for super learner
fit.data.SL # CV risks for algorithms in the library
The final lines of code return the cross-validated risks for the super learner as well as for each algorithm considered within the super learner. Though this is a trivial example with a small data set and few covariates, the results demonstrate that the super learner, which takes a weighted average of the algorithms in the library, has the smallest cross-validated risk and outperforms each individual algorithm.
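As a brief sketch (reusing the fit objects from the example above), the fitted super learner can be inspected further: summary() on the CV.SuperLearner object tabulates the cross-validated risk of each candidate learner, and the coef element of the SuperLearner object holds the ensemble weights chosen by the NNLS meta-learner.

```r
## Tabulate CV risks for the super learner, the discrete super learner,
## and each candidate algorithm in the library
summary(fitSL.data.CV)

## Ensemble weights assigned to each candidate algorithm
fit.data.SL$coef

## Final (in-sample) super learner predicted probabilities
head(fit.data.SL$SL.predict)
```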
The tmle package, authored by Susan Gruber (Reagan-Udall Foundation), allows for the estimation of both average treatment effects and parameters defined by a marginal structural model in cross-sectional data with a binary intervention. This package also includes the ability to incorporate missingness in the outcome and the intervention, use SuperLearner to estimate the relevant components of the likelihood, and use data with a mediating variable. Additionally, TMLE and collaborative TMLE R code specifically tailored to answer quantitative trait loci mapping questions, such as those discussed in Wang et al 2011, is available in the supplementary material of that paper.
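As a minimal sketch of the tmle package, the code below reuses the simulated data from the SuperLearner example above and, purely for illustration, treats the binary covariate W5 as the intervention A; the Q.SL.library and g.SL.library arguments pass candidate learners to SuperLearner for the outcome regression and treatment mechanism, respectively.

```r
library(tmle)

## Sketch only: W5 stands in for a binary intervention A here;
## in a real analysis A would be a genuine treatment variable.
result <- tmle(Y = data$Y, A = data$W5,
               W = data[, c("W1", "W2", "W3", "W4")],
               family = "binomial",
               Q.SL.library = c("SL.glm", "SL.randomForest"),
               g.SL.library = c("SL.glm", "SL.randomForest"))

result$estimates$ATE$psi # additive treatment effect estimate
result$estimates$ATE$CI  # influence-curve-based confidence interval
```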
The multiPIM package, authored by Stephan Ritter (Omicia, Inc.), is designed specifically for variable importance analysis, and estimates an attributable-risk-type parameter using TMLE. This package also allows the use of SuperLearner to estimate nuisance parameters and produces additional estimates using estimating-equation-based estimators and g-computation. The package includes its own internal bootstrapping function to calculate standard errors if this is preferred over the use of influence curves, or influence curves are not valid for the chosen estimator.
Four additional prediction-focused packages are casecontrolSL, cvAUC, subsemble, and h2oEnsemble, all primarily authored by Erin LeDell (Berkeley). The casecontrolSL package relies on SuperLearner and performs subsampling in a case-control design with inverse-probability-of-censoring-weighting, which may be particularly useful in settings with rare outcomes. The cvAUC package is a tool kit to evaluate area under the ROC curve estimators when using cross-validation. The subsemble package was developed based on a new approach to ensembling that fits each algorithm on a subset of the data and combines these fits using cross-validation. This technique can be used in data sets of all size, but has been demonstrated to be particularly useful in smaller data sets. A new implementation of super learner can be found in the Java-based h2oEnsemble package, which was designed for big data. The package uses the H2O R interface to run super learning in R with a selection of prespecified algorithms.
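For example, a sketch of cvAUC applied to the cross-validated super learner predictions from the example above: ci.cvAUC computes the cross-validated AUC together with an influence-curve-based confidence interval that accounts for the fold structure.

```r
library(cvAUC)

## Cross-validated AUC with an influence-curve-based CI, using the
## CV predictions and fold assignments stored in the CV.SuperLearner fit
out <- ci.cvAUC(predictions = fitSL.data.CV$SL.predict,
                labels = data$Y,
                folds = fitSL.data.CV$folds,
                confidence = 0.95)
out$cvAUC # cross-validated AUC estimate
out$ci    # 95% confidence interval
```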
Another TMLE package is ltmle, primarily authored by Joshua Schwab (Berkeley). This package mainly focuses on parameters in longitudinal data structures, including the treatment-specific mean outcome and parameters defined by a marginal structural model. The package returns estimates for TMLE, g-computation, and estimating-equation-based estimators.
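A minimal sketch of ltmle on hypothetical single-time-point data (the package's main strength is longitudinal data, but the simplest call has one treatment node and one outcome node): abar specifies the treatment regime of interest, and the fit reports the TMLE estimate of the treatment-specific mean with influence-curve-based inference.

```r
library(ltmle)

## Hypothetical data: baseline covariate W, binary treatment A, binary outcome Y
set.seed(1)
n <- 500
W <- rnorm(n)
A <- rbinom(n, 1, plogis(W))
Y <- rbinom(n, 1, plogis(W + A))
df <- data.frame(W, A, Y)

## Treatment-specific mean E[Y_1] under the static regime "always treat"
fit <- ltmle(df, Anodes = "A", Ynodes = "Y", abar = 1)
summary(fit)
```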
The text above is a modified excerpt from the chapter "Targeted Learning for Variable Importance" by Sherri Rose in the forthcoming Handbook of Big Data (2015), edited by Peter Buhlmann, Petros Drineas, Michael John Kane, and Mark van der Laan, to be published by CRC Press.