Venn Diagram Comparison of Boruta, FSelectorRcpp and GLMnet Algorithms

Feature selection is a process of extracting valuable features that have significant influence ondependent variable. This is still an active field of research and machine wandering. In this post I compare few feature selection algorithms: traditional GLM with regularization, computationally demanding Borutaand entropy based filter from FSelectorRcpp (free of Java/Weka) package. Check out the comparison onVenn Diagram carried out on data from the RTCGA factory of R data packages.
I would like to thank Magda Sobiczewska and pbiecek for inspiration for this comparison. I have a chance to use Boruta nad FSelectorRcpp in action. GLMnet is here only to improve Venn Diagram.
RTCGA data
Data used for this comparison come from RTCGA (http://rtcga.github.io/RTCGA/) and present genes’ expressions (RNASeq) from human sequenced genome. Datasets with RNASeq are available viaRTCGA.rnaseq data package and originally were provided by The Cancer Genome Atlas. It’s a great set of over 20 thousand of features (1 gene expression = 1 continuous feature) that might have influence on various aspects of human survival. Let’s use data for Breast Cancer (Breast invasive carcinoma / BRCA) where we will try to find valuable genes that have impact on dependent variable denoting whether a sample of the collected readings came from tumor or normal, healthy tissue.
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA.rnaseq")
library(RTCGA.rnaseq)
BRCA.rnaseq$bcr_patient_barcode <-
substr(BRCA.rnaseq$bcr_patient_barcode, 14, 14)
The dependent variable, bcr_patient_barcode, is the TCGA barcode from which we receive information whether a sample of the collected readings came from tumor or normal, healthy tissue (14th character in the code).
Check another RTCGA use case: TCGA and The Curse of BigData.
GLMnet
Logistic Regression, a model from generalized linear models (GLM) family, a first attempt model for class prediction, can be extended with regularization net to provide prediction and variables selection at the same time. We can assume that not valuable features will appear with equal to zero coefficient in the final model with best regularization parameter. Broader explanation can be found in the vignette of the glmnet package. Below is the code I use to extract valuable features with the extra help of cross-validation and parallel computing.
library(doMC)
registerDoMC(cores=6)
library(glmnet)
# fit the model
cv.glmnet(x = as.matrix(BRCA.rnaseq[, -1]),
y = factor(BRCA.rnaseq[, 1]),
family = "binomial",
type.measure = "class",
parallel = TRUE) -> cvfit
# extract feature names that have
# non zero coefficiant
names(which(
coef(cvfit, s = "lambda.min")[, 1] != 0)
)[-1] -> glmnet.features
# first name is intercept
Function coef extracts coefficients for fitted model. Argument s specifies for which regularization parameter we would like to extract them - lamba.min is the parameter for which miss-classification error is minimal. You may also try to use lambda.1se.
plot(cvfit)

Discussion about standardization for LASSO can be found here. I normally don’t do this, since I work with streaming data, for which checking assumptions, model diagnostics and standardization is problematic and is still a rapid field of research.
转自:http://r-addict.com/2016/06/19/Venn-Diagram-RTCGA-Feature-Selection.html
Venn Diagram Comparison of Boruta, FSelectorRcpp and GLMnet Algorithms的更多相关文章
- [R] venn.diagram保存pdf格式文件?
vennDiagram包中的主函数绘图时,好像不直接支持PDF格式文件: dat = list(a = group_out[[1]][,1],b = group_out[[2]][,1]) names ...
- VennDiagram 画文氏图/维恩图/Venn
install.packages("VennDiagram")library(VennDiagram) A = 1:150B = c(121:170,300:320)C = c(2 ...
- R绘制韦恩图 | Venn图
解决方案有好几种: 网页版,无脑绘图,就是麻烦,没有写代码方便 极简版,gplots::venn 文艺版,venneuler,不好安装rJava,参见Y叔 酷炫版,VennDiagram 特别注意: ...
- sql的各种join连接
SELECT * FROM TableA INNER JOIN TableB ON TableA.name = TableB.name id name id name -- ---- -- ---- ...
- .NET 框架(转自wiki)
.NET Framework (pronounced dot net) is a software framework developed by Microsoft that runs primari ...
- Python画图笔记
matplotlib的官方网址:http://matplotlib.org/ 问题 Python Matplotlib画图,在坐标轴.标题显示这五个字符 ⊥ + - ⊺ ⨁,并且保存后也能显示 h ...
- 哪些问题困扰着我们?DevOps 使用建议
[编者按]随着 DevOps 被欲来越多机构采用,一些共性的问题也暴露出来.近日,Joe Yankel在「Devops Q&A: Frequently Asked Questions」一文中总 ...
- Transparency Tutorial with C# - Part 1
Download demo project - 4 Kb Download source - 6 Kb Download demo project - 5 Kb Download source - 6 ...
- data mining,machine learning,AI,data science,data science,business analytics
数据挖掘(data mining),机器学习(machine learning),和人工智能(AI)的区别是什么? 数据科学(data science)和商业分析(business analytics ...
随机推荐
- php数组--2017-04-16
一.定义数组 (1)索引数组 $arr=array(1,2,3,3); (2)关联数组 类似于集合 $arr1=array("one"=>"111",& ...
- win8.1启用ahci后蓝屏
先简单介绍一下,本应该win7开始,系统安装的时候默认就启用了ahci硬盘模式.但是博主犯了傻,装了win8.1后安装win XP形成双系统.xp并不支持ahci模式,所以将硬盘模式改成了IDE模式, ...
- mpu6050参数获取
MPU6050其实就是一个 I2C 器件,里面有很多寄存器(但是我们用到的只有几个),我们通过读写寄存器来操作这个芯片.所以首要问题就是 STM32 和 MPU6050 的 I2C 通信.1.配置 S ...
- 仿QQ空间和微信朋友圈,高解耦高复用高灵活
先看看效果: 用极少的代码实现了 动态详情 及 二级评论 的 数据获取与处理 和 UI显示与交互,并且高解耦.高复用.高灵活. 动态列表界面MomentListFragment支持 下拉刷新与上拉加载 ...
- Let's Encrypt: 为CentOS/RHEL 7下的nginx安装https支持-具体案例
环境说明: centos 7 nginx 1.10.2 前期准备 软件安装 yum install -y epel-release yum install -y certbot 创建目录及链接 方法1 ...
- Display:table;妙用,使得左右元素高度相同
我们在设计网页的时候,为了左右能够分明一点,我们经常会在左边元素弄一个border-right,但是出现一个问题,如果左边高度比较小,这根线就短了,下面空了一部分,反正如果在右边的元素弄一个borde ...
- ms_celeb_1m数据提取(MsCelebV1-Faces-Aligned.tsv)python脚本
本文主要介绍了如何对MsCelebV1-Faces-Aligned.tsv文件进行提取 原创by南山南北秋悲 欢迎引用!请注明原地址 http://www.cnblogs.com/hwd9654/p/ ...
- Linux 服务器 U盘安装(避免U盘启动)
首先下载两个文件: · rhel-server-6.3-i386-boot.iso 启动镜像 · rhel-server-6.3-i386-dvd.iso ...
- CVE-2014-0038内核漏洞原理与本地提权利用代码实现分析 作者:seteuid0
关键字:CVE-2014-0038,内核漏洞,POC,利用代码,本地提权,提权,exploit,cve analysis, privilege escalation, cve, kernel vuln ...
- 区别client、offset、scroll系列以及event的几个距离属性
element元素结点属性 一. offset系列 1.offsetWidth 和offsetHeight element.offsetWidth是一个只读属性,它包括了: css width + p ...