Venn Diagram Comparison of Boruta, FSelectorRcpp and GLMnet Algorithms

Feature selection is a process of extracting valuable features that have significant influence ondependent variable. This is still an active field of research and machine wandering. In this post I compare few feature selection algorithms: traditional GLM with regularization, computationally demanding Borutaand entropy based filter from FSelectorRcpp (free of Java/Weka) package. Check out the comparison onVenn Diagram carried out on data from the RTCGA factory of R data packages.
I would like to thank Magda Sobiczewska and pbiecek for inspiration for this comparison. I have a chance to use Boruta nad FSelectorRcpp in action. GLMnet is here only to improve Venn Diagram.
RTCGA data
Data used for this comparison come from RTCGA (http://rtcga.github.io/RTCGA/) and present genes’ expressions (RNASeq) from human sequenced genome. Datasets with RNASeq are available viaRTCGA.rnaseq data package and originally were provided by The Cancer Genome Atlas. It’s a great set of over 20 thousand of features (1 gene expression = 1 continuous feature) that might have influence on various aspects of human survival. Let’s use data for Breast Cancer (Breast invasive carcinoma / BRCA) where we will try to find valuable genes that have impact on dependent variable denoting whether a sample of the collected readings came from tumor or normal, healthy tissue.
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA.rnaseq")
library(RTCGA.rnaseq)
BRCA.rnaseq$bcr_patient_barcode <-
substr(BRCA.rnaseq$bcr_patient_barcode, 14, 14)
The dependent variable, bcr_patient_barcode, is the TCGA barcode from which we receive information whether a sample of the collected readings came from tumor or normal, healthy tissue (14th character in the code).
Check another RTCGA use case: TCGA and The Curse of BigData.
GLMnet
Logistic Regression, a model from generalized linear models (GLM) family, a first attempt model for class prediction, can be extended with regularization net to provide prediction and variables selection at the same time. We can assume that not valuable features will appear with equal to zero coefficient in the final model with best regularization parameter. Broader explanation can be found in the vignette of the glmnet package. Below is the code I use to extract valuable features with the extra help of cross-validation and parallel computing.
library(doMC)
registerDoMC(cores=6)
library(glmnet)
# fit the model
cv.glmnet(x = as.matrix(BRCA.rnaseq[, -1]),
y = factor(BRCA.rnaseq[, 1]),
family = "binomial",
type.measure = "class",
parallel = TRUE) -> cvfit
# extract feature names that have
# non zero coefficiant
names(which(
coef(cvfit, s = "lambda.min")[, 1] != 0)
)[-1] -> glmnet.features
# first name is intercept
Function coef extracts coefficients for fitted model. Argument s specifies for which regularization parameter we would like to extract them - lamba.min is the parameter for which miss-classification error is minimal. You may also try to use lambda.1se.
plot(cvfit)

Discussion about standardization for LASSO can be found here. I normally don’t do this, since I work with streaming data, for which checking assumptions, model diagnostics and standardization is problematic and is still a rapid field of research.
转自:http://r-addict.com/2016/06/19/Venn-Diagram-RTCGA-Feature-Selection.html
Venn Diagram Comparison of Boruta, FSelectorRcpp and GLMnet Algorithms的更多相关文章
- [R] venn.diagram保存pdf格式文件?
vennDiagram包中的主函数绘图时,好像不直接支持PDF格式文件: dat = list(a = group_out[[1]][,1],b = group_out[[2]][,1]) names ...
- VennDiagram 画文氏图/维恩图/Venn
install.packages("VennDiagram")library(VennDiagram) A = 1:150B = c(121:170,300:320)C = c(2 ...
- R绘制韦恩图 | Venn图
解决方案有好几种: 网页版,无脑绘图,就是麻烦,没有写代码方便 极简版,gplots::venn 文艺版,venneuler,不好安装rJava,参见Y叔 酷炫版,VennDiagram 特别注意: ...
- sql的各种join连接
SELECT * FROM TableA INNER JOIN TableB ON TableA.name = TableB.name id name id name -- ---- -- ---- ...
- .NET 框架(转自wiki)
.NET Framework (pronounced dot net) is a software framework developed by Microsoft that runs primari ...
- Python画图笔记
matplotlib的官方网址:http://matplotlib.org/ 问题 Python Matplotlib画图,在坐标轴.标题显示这五个字符 ⊥ + - ⊺ ⨁,并且保存后也能显示 h ...
- 哪些问题困扰着我们?DevOps 使用建议
[编者按]随着 DevOps 被欲来越多机构采用,一些共性的问题也暴露出来.近日,Joe Yankel在「Devops Q&A: Frequently Asked Questions」一文中总 ...
- Transparency Tutorial with C# - Part 1
Download demo project - 4 Kb Download source - 6 Kb Download demo project - 5 Kb Download source - 6 ...
- data mining,machine learning,AI,data science,data science,business analytics
数据挖掘(data mining),机器学习(machine learning),和人工智能(AI)的区别是什么? 数据科学(data science)和商业分析(business analytics ...
随机推荐
- 在my.ini文件中配置mysql统一字符集
测试的mysql版本为:5.7.14 查看mysql字符集命令: show variables like 'character_set_%'; 以下是在my.ini文件中配置mysql统一字符集参数: ...
- struts2自定义日期类型转换器
在java web表单中提交的数据难免会有日期类型,struts2支持的日期类型是yyyy-MM-dd,如果是其他格式,就需要自己进行转换.比如yy-MM-dd 要完成自己定义的转换需要完成. 主要的 ...
- 线段树(hdu 2795)
Billboard Time Limit: 20000/8000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/Others)Total ...
- Webpack模块加载器
一.介绍 Webpack是德国开发者 Tobias Koppers 开发的模块加载器,它能把所有的资源文件(JS.JSX.CSS.CoffeeScript.Less.Sass.Image等)都作为模块 ...
- virtual box ubuntu 主机和虚拟机实现互相复制粘贴
链接:http://jingyan.baidu.com/article/574c521917db806c8d9dc18c.html 常规高级里共享粘贴板已经选中双向,(我的已经可以了复制粘贴了),如果 ...
- 详谈 Unity3D AssetBundle 资源加载,结合实际项目开发实例
第一次搞资源更新方面,这里只说更新,加载,AssetBundle资源加载,谈谈自己的理解,以及自己在项目中遇到的那些神坑,现在回想一下,真的是自己跪着过来的,说多了,都是泪. 我这边是安卓AssetB ...
- 使用vue-cli构建多页面应用+vux(三)
上节中,我们成功的将vue-cli改造成了多入口,既然用了上简单的脚手架,那就希望用个合适的UI组件,去搜索了几个以后,最后选择了使用vux 贴上其vux的github地址 https://gith ...
- RabbitMQ基础
上一博客把RabbitMQ的安装配置介绍了下,今天主要是介绍下RabbitMQ的一些基础名词. 一.什么是RabbitMQ?用它能做什么? 1.简介 AMQP,即Advanced Message Qu ...
- git远程库代码版本回滚方法
最近使用git时, 造成了远程库代码需要回滚到之前版本的情况,为了解决这个问题查看了很多资料. 问题产生原因: 提交了错误的版本到远程库. 以下是解决的方法, 供大家参考: 1.对本地代码库进行回滚 ...
- Pycharm实用技巧汇总
Pycharm中输入 a = list 按住Command点鼠标左键,即可查看该类下的所有用法,如下图 获取类中有哪些成员