Venn Diagram Comparison of Boruta, FSelectorRcpp and GLMnet Algorithms

Feature selection is a process of extracting valuable features that have significant influence ondependent variable. This is still an active field of research and machine wandering. In this post I compare few feature selection algorithms: traditional GLM with regularization, computationally demanding Borutaand entropy based filter from FSelectorRcpp (free of Java/Weka) package. Check out the comparison onVenn Diagram carried out on data from the RTCGA factory of R data packages.
I would like to thank Magda Sobiczewska and pbiecek for inspiration for this comparison. I have a chance to use Boruta nad FSelectorRcpp in action. GLMnet is here only to improve Venn Diagram.
RTCGA data
Data used for this comparison come from RTCGA (http://rtcga.github.io/RTCGA/) and present genes’ expressions (RNASeq) from human sequenced genome. Datasets with RNASeq are available viaRTCGA.rnaseq data package and originally were provided by The Cancer Genome Atlas. It’s a great set of over 20 thousand of features (1 gene expression = 1 continuous feature) that might have influence on various aspects of human survival. Let’s use data for Breast Cancer (Breast invasive carcinoma / BRCA) where we will try to find valuable genes that have impact on dependent variable denoting whether a sample of the collected readings came from tumor or normal, healthy tissue.
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA.rnaseq")
library(RTCGA.rnaseq)
BRCA.rnaseq$bcr_patient_barcode <-
substr(BRCA.rnaseq$bcr_patient_barcode, 14, 14)
The dependent variable, bcr_patient_barcode, is the TCGA barcode from which we receive information whether a sample of the collected readings came from tumor or normal, healthy tissue (14th character in the code).
Check another RTCGA use case: TCGA and The Curse of BigData.
GLMnet
Logistic Regression, a model from generalized linear models (GLM) family, a first attempt model for class prediction, can be extended with regularization net to provide prediction and variables selection at the same time. We can assume that not valuable features will appear with equal to zero coefficient in the final model with best regularization parameter. Broader explanation can be found in the vignette of the glmnet package. Below is the code I use to extract valuable features with the extra help of cross-validation and parallel computing.
library(doMC)
registerDoMC(cores=6)
library(glmnet)
# fit the model
cv.glmnet(x = as.matrix(BRCA.rnaseq[, -1]),
y = factor(BRCA.rnaseq[, 1]),
family = "binomial",
type.measure = "class",
parallel = TRUE) -> cvfit
# extract feature names that have
# non zero coefficiant
names(which(
coef(cvfit, s = "lambda.min")[, 1] != 0)
)[-1] -> glmnet.features
# first name is intercept
Function coef extracts coefficients for fitted model. Argument s specifies for which regularization parameter we would like to extract them - lamba.min is the parameter for which miss-classification error is minimal. You may also try to use lambda.1se.
plot(cvfit)

Discussion about standardization for LASSO can be found here. I normally don’t do this, since I work with streaming data, for which checking assumptions, model diagnostics and standardization is problematic and is still a rapid field of research.
转自:http://r-addict.com/2016/06/19/Venn-Diagram-RTCGA-Feature-Selection.html
Venn Diagram Comparison of Boruta, FSelectorRcpp and GLMnet Algorithms的更多相关文章
- [R] venn.diagram保存pdf格式文件?
vennDiagram包中的主函数绘图时,好像不直接支持PDF格式文件: dat = list(a = group_out[[1]][,1],b = group_out[[2]][,1]) names ...
- VennDiagram 画文氏图/维恩图/Venn
install.packages("VennDiagram")library(VennDiagram) A = 1:150B = c(121:170,300:320)C = c(2 ...
- R绘制韦恩图 | Venn图
解决方案有好几种: 网页版,无脑绘图,就是麻烦,没有写代码方便 极简版,gplots::venn 文艺版,venneuler,不好安装rJava,参见Y叔 酷炫版,VennDiagram 特别注意: ...
- sql的各种join连接
SELECT * FROM TableA INNER JOIN TableB ON TableA.name = TableB.name id name id name -- ---- -- ---- ...
- .NET 框架(转自wiki)
.NET Framework (pronounced dot net) is a software framework developed by Microsoft that runs primari ...
- Python画图笔记
matplotlib的官方网址:http://matplotlib.org/ 问题 Python Matplotlib画图,在坐标轴.标题显示这五个字符 ⊥ + - ⊺ ⨁,并且保存后也能显示 h ...
- 哪些问题困扰着我们?DevOps 使用建议
[编者按]随着 DevOps 被欲来越多机构采用,一些共性的问题也暴露出来.近日,Joe Yankel在「Devops Q&A: Frequently Asked Questions」一文中总 ...
- Transparency Tutorial with C# - Part 1
Download demo project - 4 Kb Download source - 6 Kb Download demo project - 5 Kb Download source - 6 ...
- data mining,machine learning,AI,data science,data science,business analytics
数据挖掘(data mining),机器学习(machine learning),和人工智能(AI)的区别是什么? 数据科学(data science)和商业分析(business analytics ...
随机推荐
- ios animation 动画效果实现
1.过渡动画 CATransition CATransition *animation = [CATransition animation]; [animation setDuration:1.0]; ...
- Java集合的区别和选择
Collection |--List 有序,可重复 |--ArrayList 底层数据结构是数组,查询快,增删慢. 线程不安全,效率高 |--Vector 底层数据结构 ...
- webapp 1px显示两倍的问题
公司最近换新首页,按照设计师的要求<大家都在逛>的分割线要1个像素. .span-3{ width:33.3333%; &:not(:first-child){ &:bef ...
- HYML / CSS和Javascript 部分
1 CSS实现垂直水平居中 HTML结构: <div class="wrapper"> <div class="content">&l ...
- CF #335 div1 A. Sorting Railway Cars
题目链接:http://codeforces.com/contest/605/problem/A 大意是对一个排列进行排序,每一次操作可以将一个数字从原来位置抽出放到开头或结尾,问最少需要操作多少次可 ...
- python——面向对象进阶
类的成员 类的成员可以分为三大类:字段.方法和属性 注:所有成员中,只有普通字段的内容保存对象中,即:根据此类创建了多少对象,在内存中就有多少个普通字段.而其他的成员,则都是保存在类中,即:无论对象的 ...
- C语言学习第二章
今天开始学习常量,变量,基本数据类型,printf()函数和scanf()函数,算术运算符. 首先常量:是在程序中保持不变的量 变量:编写程序时,常常需要将数据存储在内存中,方便后面使用这个数据或者修 ...
- API 管理工具
API 管理工具 你还苦于无法有效的管理大量的API吗?今天给大家介绍一款API的管理工具.这款工具可以免费使用,虽然中途可能会提示你购买,但并不影响我们的使用. 下载地址: Windows:http ...
- Linux下安装单机版zookeeper(和dubbo配合验证)和redis(用图形化界面连接验证)
上次写了篇zookeeper的集齐,并且用dubbo admin验证了集群结果.最近又特地装了个虚拟机,专门装各种单机版的,免得跟集群的机器混合了.安装的虚拟机IP为192.168.1.108 1.单 ...
- Linux Shell——函数的使用
文/一介书生,一枚码农. scripts are for lazy people. 函数是存在内存里的一组代码的命名的元素.函数创建于脚本运行环境之中,并且可以执行. 函数的语法结构为: functi ...