Feature selection is a process of extracting valuable features that have significant influence ondependent variable. This is still an active field of research and machine wandering. In this post I compare few feature selection algorithms: traditional GLM with regularization, computationally demanding Borutaand entropy based filter from FSelectorRcpp (free of Java/Weka) package. Check out the comparison onVenn Diagram carried out on data from the RTCGA factory of R data packages.

I would like to thank Magda Sobiczewska and pbiecek for inspiration for this comparison. I have a chance to use Boruta nad FSelectorRcpp in action. GLMnet is here only to improve Venn Diagram.

RTCGA data

Data used for this comparison come from RTCGA (http://rtcga.github.io/RTCGA/) and present genes’ expressions (RNASeq) from human sequenced genome. Datasets with RNASeq are available viaRTCGA.rnaseq data package and originally were provided by The Cancer Genome Atlas. It’s a great set of over 20 thousand of features (1 gene expression = 1 continuous feature) that might have influence on various aspects of human survival. Let’s use data for Breast Cancer (Breast invasive carcinoma / BRCA) where we will try to find valuable genes that have impact on dependent variable denoting whether a sample of the collected readings came from tumor or normal, healthy tissue.

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA.rnaseq")
library(RTCGA.rnaseq)
BRCA.rnaseq$bcr_patient_barcode <-
substr(BRCA.rnaseq$bcr_patient_barcode, 14, 14)

The dependent variable, bcr_patient_barcode, is the TCGA barcode from which we receive information whether a sample of the collected readings came from tumor or normal, healthy tissue (14th character in the code).

Check another RTCGA use case: TCGA and The Curse of BigData.

GLMnet

Logistic Regression, a model from generalized linear models (GLM) family, a first attempt model for class prediction, can be extended with regularization net to provide prediction and variables selection at the same time. We can assume that not valuable features will appear with equal to zero coefficient in the final model with best regularization parameter. Broader explanation can be found in the vignette of the glmnet package. Below is the code I use to extract valuable features with the extra help of cross-validation and parallel computing.

library(doMC)
registerDoMC(cores=6)
library(glmnet)
# fit the model
cv.glmnet(x = as.matrix(BRCA.rnaseq[, -1]),
y = factor(BRCA.rnaseq[, 1]),
family = "binomial",
type.measure = "class",
parallel = TRUE) -> cvfit
# extract feature names that have
# non zero coefficiant
names(which(
coef(cvfit, s = "lambda.min")[, 1] != 0)
)[-1] -> glmnet.features
# first name is intercept

Function coef extracts coefficients for fitted model. Argument s specifies for which regularization parameter we would like to extract them - lamba.min is the parameter for which miss-classification error is minimal. You may also try to use lambda.1se.

plot(cvfit)

Discussion about standardization for LASSO can be found here. I normally don’t do this, since I work with streaming data, for which checking assumptions, model diagnostics and standardization is problematic and is still a rapid field of research.

转自:http://r-addict.com/2016/06/19/Venn-Diagram-RTCGA-Feature-Selection.html

Venn Diagram Comparison of Boruta, FSelectorRcpp and GLMnet Algorithms的更多相关文章

  1. [R] venn.diagram保存pdf格式文件?

    vennDiagram包中的主函数绘图时,好像不直接支持PDF格式文件: dat = list(a = group_out[[1]][,1],b = group_out[[2]][,1]) names ...

  2. VennDiagram 画文氏图/维恩图/Venn

    install.packages("VennDiagram")library(VennDiagram) A = 1:150B = c(121:170,300:320)C = c(2 ...

  3. R绘制韦恩图 | Venn图

    解决方案有好几种: 网页版,无脑绘图,就是麻烦,没有写代码方便 极简版,gplots::venn 文艺版,venneuler,不好安装rJava,参见Y叔 酷炫版,VennDiagram 特别注意: ...

  4. sql的各种join连接

    SELECT * FROM TableA INNER JOIN TableB ON TableA.name = TableB.name id name id name -- ---- -- ---- ...

  5. .NET 框架(转自wiki)

    .NET Framework (pronounced dot net) is a software framework developed by Microsoft that runs primari ...

  6. Python画图笔记

    matplotlib的官方网址:http://matplotlib.org/ 问题 Python Matplotlib画图,在坐标轴.标题显示这五个字符 ⊥ + - ⊺ ⨁,并且保存后也能显示   h ...

  7. 哪些问题困扰着我们?DevOps 使用建议

    [编者按]随着 DevOps 被欲来越多机构采用,一些共性的问题也暴露出来.近日,Joe Yankel在「Devops Q&A: Frequently Asked Questions」一文中总 ...

  8. Transparency Tutorial with C# - Part 1

    Download demo project - 4 Kb Download source - 6 Kb Download demo project - 5 Kb Download source - 6 ...

  9. data mining,machine learning,AI,data science,data science,business analytics

    数据挖掘(data mining),机器学习(machine learning),和人工智能(AI)的区别是什么? 数据科学(data science)和商业分析(business analytics ...

随机推荐

  1. SQL SERVER 远程备份DB

    --检查sqlserver所在服务的运行账号是否有权限访问共享文件夹,没有的话右键添加写权限 --开启权限sp_configure 'show advanced options', 1;gorecon ...

  2. JAVA经典算法40题

    1: JAVA经典算法40题 2: [程序1] 题目:古典问题:有一对兔子,从出生后第3个月起每个月都生一对兔子,小兔子长到第四个月后每个月又生一对兔子,假如兔子都不死,问每个月的兔子总数为多少? 3 ...

  3. [编织消息框架][JAVA核心技术]动态代理应用9-扫描class

    之前介绍的annotationProcessor能在编译时生成自定义class,但有个缺点,只能每次添加/删除java文件才会执行,那天换了个人不清楚就坑大了 还记得之前介绍的编译时处理,懒处理,还有 ...

  4. CTR预估中的贝叶斯平滑方法(二)参数估计和代码实现

    1. 前言 前面博客介绍了CTR预估中的贝叶斯平滑方法的原理http://www.cnblogs.com/bentuwuying/p/6389222.html. 这篇博客主要是介绍如何对贝叶斯平滑的参 ...

  5. 借助case,实现更丰富的分组查询统计

    根据fileD6的前4位分组    分别统计该组  5种企业类型fileD31的数量 create or replace view jyjc_bycity as select substr(fileD ...

  6. jQuery的工作原理

    jQuery是为了改变javascript的编码方式而设计的. jQuery本身并不是UI组件库或其他的一般AJAX类库. 那么它是如何实现它的声明的呢? 先看一段简短的使用流程: (1).查找(创建 ...

  7. sphinx全文检索引擎

    今天刚刚学习了一下,就直接分享上去,有些还没有接触,如果有问题请指正,谢谢 sphinx是什么? Sphinx是一个全文检索引擎.主要为其他应用提供高速.低空间占用.高结果 相关度的全文搜索功能. S ...

  8. 【转载】GPIO模拟i2c通信

    I2C总线的通信过程(见图4-8)主要包含三个主要阶段:起始阶段.数据传输阶段和终止阶段. 1. 起始阶段 在I2C总线不工作的情况下,SDA(数据线)和SCL(时钟线)上的信号均为高电平.如果此时主 ...

  9. 修改linux系统时间和同步

    date 查看当前时间 date -s 15:14:13 修改时间 cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime 修改时区 yes cront ...

  10. 悬挂else引发的问题

    这个问题虽然已经为人熟知,而且也并非C语言所独有,但即使是有多年经验的C程序员也常常在此失误过. 考虑下面的程序片段: if (x == 0) if (y == 0) error(); else{ z ...