Targeted learning methods build machine-learning-based estimators of parameters defined as features of the probability distribution of the data, while also providing influence-curve-based or bootstrap-based confidence intervals. The theory offers a general template for creating targeted maximum likelihood estimators for a given data structure, nonparametric or semiparametric statistical model, and parameter mapping. These estimators of causal inference parameters are double robust and have a variety of other desirable statistical properties.

Targeted maximum likelihood estimation (TMLE) builds on the loss-based "super learning" system so that a lower-dimensional parameter, such as a marginal causal effect, can be targeted: the remaining bias for that (low-dimensional) target feature of the probability distribution is removed. Targeted learning for effect estimation and causal inference allows for the complete integration of machine learning advances in prediction while providing statistical inference for the target parameter(s) of interest. Further details about these methods can be found in the many targeted learning papers as well as the 2011 targeted learning book.

Practical tools for the implementation of targeted learning methods for effect estimation and causal inference have developed alongside the theoretical and methodological advances. While some work has been done to develop computational tools for targeted learning in proprietary programming languages, such as SAS, the majority of the code has been built in R.

Of key importance are the two R packages SuperLearner and tmle. Ensembling with SuperLearner allows us to use many algorithms to generate an ideal prediction function that is a weighted average of all the algorithms considered. The SuperLearner package, authored by Eric Polley (NCI), is flexible, allowing for the integration of dozens of prespecified potential algorithms found in other packages as well as a system of wrappers that provide the user with the ability to design their own algorithms, or include newer algorithms not yet added to the package. The package returns multiple useful objects, including the cross-validated predicted values, final predicted values, vector of weights, and fitted objects for each of the included algorithms, among others.

Below is sample code with the ensembling prediction package SuperLearner using a small simulated data set.

library(SuperLearner)

## Generate simulated data ##
set.seed(27)
n <- 500
data <- data.frame(W1 = runif(n, min = .5, max = 1),
                   W2 = runif(n, min = 0, max = 1),
                   W3 = runif(n, min = .25, max = .75),
                   W4 = runif(n, min = 0, max = 1))
data <- transform(data, # add W5 dependent on W2, W3
                  W5 = rbinom(n, 1, 1/(1 + exp(1.5*W2 - W3))))
data <- transform(data, # add Y dependent on W1, W2, W4, W5
                  Y = rbinom(n, 1, 1/(1 + exp(-(-.2*W5 - 2*W1 + 4*W5*W1 - 1.5*W2 + sin(W4))))))
summary(data)

## Specify a library of algorithms ##
## (SL.nnet and SL.randomForest need the nnet and randomForest packages installed)
SL.library <- c("SL.nnet", "SL.glm", "SL.randomForest")

## Run the super learner to obtain predicted values for the super learner
## as well as CV risk for algorithms in the library ##
## (column 6 is the outcome Y; columns 1 to 5 are the covariates W1 to W5)
fit.data.SL <- SuperLearner(Y = data[,6], X = data[,1:5], SL.library = SL.library,
                            family = binomial(), method = "method.NNLS", verbose = TRUE)

## Run the cross-validated super learner to obtain its CV risk ##
fitSL.data.CV <- CV.SuperLearner(Y = data[,6], X = data[,1:5], V = 10,
                                 SL.library = SL.library, verbose = TRUE,
                                 method = "method.NNLS", family = binomial())

## Cross-validated risks ##
mean((data[,6] - fitSL.data.CV$SL.predict)^2) # CV risk for super learner
fit.data.SL # CV risks for algorithms in the library

The final lines of code return the cross-validated risks for the super learner as well as for each algorithm considered within the super learner. Although this is a trivial example with a small data set and few covariates, the results demonstrate that the super learner, which takes a weighted average of the algorithms in the library, has the smallest cross-validated risk and outperforms each individual algorithm.
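To make the weighted-average idea concrete, here is a minimal base-R sketch of the steps a super learner goes through. It is an illustration under simplifying assumptions, not the SuperLearner package's implementation: the two candidate learners, the simulated data, and all object names are hypothetical, and a grid search over a single convex weight stands in for the non-negative least squares step.

```r
## Base-R sketch of the super learner idea (not the SuperLearner internals)
set.seed(27)
n  <- 500
W1 <- runif(n, 0.5, 1)
W2 <- runif(n)
W5 <- rbinom(n, 1, 0.5)
Y  <- rbinom(n, 1, plogis(-0.2 * W5 - 2 * W1 + 4 * W5 * W1 - 1.5 * W2))
dat <- data.frame(W1, W2, W5, Y)

## Two hypothetical candidate learners, both logistic regressions
candidates <- list(
  main_terms  = Y ~ W1 + W2 + W5,
  interaction = Y ~ W1 * W5 + W2
)

## Step 1: V-fold cross-validated predictions for every candidate
V    <- 10
fold <- sample(rep(seq_len(V), length.out = n))
Z    <- matrix(NA_real_, n, length(candidates),
               dimnames = list(NULL, names(candidates)))
for (v in seq_len(V)) {
  train <- dat[fold != v, ]
  valid <- dat[fold == v, ]
  for (k in seq_along(candidates)) {
    fit <- glm(candidates[[k]], data = train, family = binomial())
    Z[fold == v, k] <- predict(fit, newdata = valid, type = "response")
  }
}

## Step 2: cross-validated risk (mean squared error) of each candidate
cv_risk <- colMeans((dat$Y - Z)^2)

## Step 3: pick the convex weight that minimizes the CV risk of the
## weighted average of the two candidates
alpha_grid <- seq(0, 1, by = 0.01)
ens_risk <- sapply(alpha_grid, function(a) {
  mean((dat$Y - (a * Z[, 1] + (1 - a) * Z[, 2]))^2)
})
alpha   <- alpha_grid[which.min(ens_risk)]
weights <- c(main_terms = alpha, interaction = 1 - alpha)

## Step 4: refit each candidate on all data and combine with the weights
full_pred <- sapply(candidates, function(f) {
  predict(glm(f, data = dat, family = binomial()), type = "response")
})
sl_predict <- drop(full_pred %*% weights)

cv_risk   # CV risk of each candidate
weights   # weight given to each candidate
```

Because the weights are chosen on the same cross-validated predictions, `min(ens_risk)` is an optimistic estimate of the ensemble's risk. An honest estimate needs an additional outer layer of cross-validation, which is the role CV.SuperLearner plays in the sample code above.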

The tmle package, authored by Susan Gruber (Reagan-Udall Foundation), allows for the estimation of both average treatment effects and parameters defined by a marginal structural model in cross-sectional data with a binary intervention. The package can also account for missingness in the outcome and the intervention, use SuperLearner to estimate the relevant components of the likelihood, and handle data with a mediating variable. Additionally, TMLE and collaborative TMLE R code specifically tailored to answer quantitative trait loci mapping questions, such as those discussed in Wang et al. (2011), is available in the supplementary material of that paper.
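For readers who want to see what the targeting step does, the following base-R sketch walks through a TMLE of the average treatment effect for a binary treatment and binary outcome, with an influence-curve-based confidence interval. It is a hand-rolled illustration, not the tmle package's code: the simulated data and all object names are hypothetical, and plain logistic regressions are used for the initial fits where the package would allow SuperLearner.

```r
## Hand-rolled TMLE sketch for the average treatment effect (ATE)
set.seed(1)
n <- 2000
W <- rnorm(n)                                   # baseline covariate
A <- rbinom(n, 1, plogis(0.4 * W))              # binary treatment
Y <- rbinom(n, 1, plogis(-0.5 + A + 0.8 * W))   # binary outcome
dat <- data.frame(W, A, Y)

## Step 1: initial estimate of the outcome regression E[Y | A, W]
Q_fit <- glm(Y ~ A + W, data = dat, family = binomial())
Q_AW  <- predict(Q_fit, type = "response")
Q_1W  <- predict(Q_fit, newdata = transform(dat, A = 1), type = "response")
Q_0W  <- predict(Q_fit, newdata = transform(dat, A = 0), type = "response")

## Step 2: estimate of the treatment mechanism P(A = 1 | W)
g_fit <- glm(A ~ W, data = dat, family = binomial())
g_1W  <- predict(g_fit, type = "response")

## Step 3: clever covariate for the ATE
H_AW <- A / g_1W - (1 - A) / (1 - g_1W)

## Step 4: fluctuation (targeting) step, a logistic regression of Y on the
## clever covariate with the initial fit as offset
eps <- unname(coef(glm(Y ~ -1 + H_AW + offset(qlogis(Q_AW)),
                       family = binomial())))

## Step 5: updated (targeted) predictions under A = 1 and A = 0
Q_1W_star <- plogis(qlogis(Q_1W) + eps / g_1W)
Q_0W_star <- plogis(qlogis(Q_0W) - eps / (1 - g_1W))
Q_AW_star <- ifelse(A == 1, Q_1W_star, Q_0W_star)

## Step 6: plug-in estimate and influence-curve-based inference
psi <- mean(Q_1W_star - Q_0W_star)
IC  <- H_AW * (Y - Q_AW_star) + Q_1W_star - Q_0W_star - psi
se  <- sd(IC) / sqrt(n)
ci  <- psi + c(-1, 1) * qnorm(0.975) * se

c(estimate = psi, lower = ci[1], upper = ci[2])
```

The fluctuation in Step 4 is what removes the remaining bias for the target parameter: after the update, the empirical mean of the estimated influence curve is approximately zero, which is why its sample variance can be used for the standard error.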

The multiPIM package, authored by Stephan Ritter (Omicia, Inc.), is designed specifically for variable importance analysis, and estimates an attributable-risk-type parameter using TMLE. This package also allows the use of SuperLearner to estimate nuisance parameters and produces additional estimates using estimating-equation-based estimators and g-computation. The package includes its own internal bootstrapping function to calculate standard errors if this is preferred over the use of influence curves, or if influence curves are not valid for the chosen estimator.
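The bootstrap alternative to influence-curve-based standard errors can be sketched in a few lines of base R. The example below is an assumption-laden illustration rather than multiPIM's own bootstrapping function: it uses a g-computation estimate of a simple risk difference as a stand-in for the package's attributable-risk-type parameter, and the simulated data, the helper `gcomp()`, and the number of replicates `B` are all hypothetical.

```r
## Nonparametric bootstrap standard error for a g-computation estimate
set.seed(42)
n <- 1000
W <- rnorm(n)
A <- rbinom(n, 1, plogis(0.3 * W))
Y <- rbinom(n, 1, plogis(-1 + 0.7 * A + 0.5 * W))
dat <- data.frame(W, A, Y)

## g-computation: fit the outcome regression, then average the predicted
## difference between setting A = 1 and A = 0 for everyone
gcomp <- function(d) {
  fit <- glm(Y ~ A + W, data = d, family = binomial())
  mean(predict(fit, newdata = transform(d, A = 1), type = "response") -
       predict(fit, newdata = transform(d, A = 0), type = "response"))
}

estimate <- gcomp(dat)

## Resample rows with replacement and re-estimate B times
B <- 200
boot_est <- replicate(B, gcomp(dat[sample(n, replace = TRUE), ]))

boot_se <- sd(boot_est)                          # bootstrap standard error
boot_ci <- quantile(boot_est, c(0.025, 0.975))   # percentile interval

c(estimate = estimate, se = boot_se)
boot_ci
```

The whole estimation procedure is repeated inside each bootstrap replicate, so the cost grows linearly with `B`; with machine-learning-based nuisance fits this can become expensive, which is one practical reason influence-curve-based standard errors are often preferred when they are valid.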

Four additional prediction-focused packages are casecontrolSL, cvAUC, subsemble, and h2oEnsemble, all primarily authored by Erin LeDell (Berkeley). The casecontrolSL package relies on SuperLearner and performs subsampling in a case-control design with inverse-probability-of-censoring-weighting, which may be particularly useful in settings with rare outcomes. The cvAUC package is a tool kit to evaluate area under the ROC curve estimators when using cross-validation. The subsemble package was developed based on a new approach to ensembling that fits each algorithm on a subset of the data and combines these fits using cross-validation. This technique can be used in data sets of any size, but has been demonstrated to be particularly useful in smaller data sets. A new implementation of super learner can be found in the Java-based h2oEnsemble package, which was designed for big data. The package uses the H2O R interface to run super learning in R with a selection of prespecified algorithms.
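As an illustration of what a cross-validated AUC measures, the base-R sketch below computes the AUC on each held-out fold and averages the fold-specific values. It does not use the cvAUC package: the helper `auc()`, the simulated data, and the logistic regression learner are hypothetical choices, and the sketch assumes every fold contains both outcome classes.

```r
## Cross-validated AUC in base R: average of held-out fold AUCs

## AUC via the rank-sum (Mann-Whitney) formula
auc <- function(pred, label) {
  r  <- rank(pred)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(7)
n  <- 500
X1 <- rnorm(n)
X2 <- rnorm(n)
Y  <- rbinom(n, 1, plogis(X1 - 0.5 * X2))
dat <- data.frame(X1, X2, Y)

## Assign observations to V folds at random
V    <- 10
fold <- sample(rep(seq_len(V), length.out = n))

## Fit on V - 1 folds, score the held-out fold, repeat for every fold
fold_auc <- sapply(seq_len(V), function(v) {
  fit  <- glm(Y ~ X1 + X2, data = dat[fold != v, ], family = binomial())
  pred <- predict(fit, newdata = dat[fold == v, ], type = "response")
  auc(pred, dat$Y[fold == v])
})

cv_auc <- mean(fold_auc)
cv_auc
```

Scoring only held-out observations keeps the AUC from being inflated by predictions made on the same data used to fit the model. With rare outcomes, folds are usually stratified by the outcome so that no fold ends up without cases.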

Another TMLE package is ltmle, primarily authored by Joshua Schwab (Berkeley). This package mainly focuses on parameters in longitudinal data structures, including the treatment-specific mean outcome and parameters defined by a marginal structural model. The package returns estimates for TMLE, g-computation, and estimating-equation-based estimators.

The text above is a modified excerpt from the chapter "Targeted Learning for Variable Importance" by Sherri Rose in the forthcoming Handbook of Big Data (2015) edited by Peter Buhlmann, Petros Drineas, Michael John Kane, and Mark Van Der Laan to be published by CRC Press.

Reposted from: http://blog.revolutionanalytics.com/2015/03/targeted-learning-r-packages-for-causal-inference-and-machine-learning.html
