Feature selection is the process of extracting valuable features that have a significant influence on the dependent variable. This is still an active field of research and machine wandering. In this post I compare a few feature selection algorithms: traditional GLM with regularization, the computationally demanding Boruta, and the entropy-based filter from the FSelectorRcpp package (free of Java/Weka). Check out the comparison on a Venn diagram, carried out on data from the RTCGA factory of R data packages.

I would like to thank Magda Sobiczewska and pbiecek for the inspiration for this comparison. I had a chance to use Boruta and FSelectorRcpp in action. GLMnet is here only to improve the Venn diagram.

RTCGA data

The data used for this comparison come from RTCGA (http://rtcga.github.io/RTCGA/) and contain gene expression (RNASeq) measurements from sequenced human genomes. Datasets with RNASeq are available via the RTCGA.rnaseq data package and were originally provided by The Cancer Genome Atlas. It's a great set of over 20 thousand features (1 gene expression = 1 continuous feature) that might influence various aspects of human survival. Let's use the data for breast cancer (Breast invasive carcinoma / BRCA), where we will try to find valuable genes that have an impact on the dependent variable denoting whether a sample came from tumor or normal, healthy tissue.

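## biocLite() has been retired; current Bioconductor releases install via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("RTCGA.rnaseq")
library(RTCGA.rnaseq)
# keep only the 14th character of the barcode (tumor vs. normal tissue)
BRCA.rnaseq$bcr_patient_barcode <-
    substr(BRCA.rnaseq$bcr_patient_barcode, 14, 14)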

The dependent variable, bcr_patient_barcode, is derived from the TCGA barcode, which tells us whether a sample came from tumor or normal, healthy tissue (the 14th character of the barcode).
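As a small illustration of this encoding, the sketch below applies the same substr call to two made-up barcodes (they are hypothetical and only mimic the TCGA layout). In TCGA barcodes, characters 14-15 hold the two-digit sample type code, where 01-09 denote tumor samples and 10-19 denote normal samples, so the 14th character is "0" for tumor and "1" for normal tissue.

```r
# Hypothetical barcodes that only mimic the TCGA layout
barcodes <- c("TCGA-A1-A0SB-01A-11R-A144-07",
              "TCGA-A7-A0CE-11A-21R-A089-07")

# The 14th character is the first digit of the sample type code
sample_type <- substr(barcodes, 14, 14)

# "0" -> tumor (codes 01-09), "1" -> normal (codes 10-19)
label <- ifelse(sample_type == "0", "tumor", "normal")
table(label)
```

Keeping the raw "0"/"1" character, as the post does, is enough for a two-class model; the label step is only there to make the meaning explicit.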

Check another RTCGA use case: TCGA and The Curse of BigData.

GLMnet

Logistic regression, a model from the generalized linear models (GLM) family and a common first-attempt model for class prediction, can be extended with a regularization penalty (elastic net) to provide prediction and variable selection at the same time. We can assume that features that are not valuable will end up with a coefficient equal to zero in the final model fitted with the best regularization parameter. A broader explanation can be found in the vignette of the glmnet package. Below is the code I use to extract valuable features with the extra help of cross-validation and parallel computing.

library(doMC)
registerDoMC(cores=6)
library(glmnet)
# fit the model
cv.glmnet(x = as.matrix(BRCA.rnaseq[, -1]),
          y = factor(BRCA.rnaseq[, 1]),
          family = "binomial",
          type.measure = "class",
          parallel = TRUE) -> cvfit
# extract feature names that have
# a non-zero coefficient
names(which(
  coef(cvfit, s = "lambda.min")[, 1] != 0)
)[-1] -> glmnet.features
# the first name is the intercept

The coef function extracts coefficients from the fitted model. The argument s specifies the regularization parameter for which we would like to extract them: lambda.min is the value for which the cross-validated misclassification error is minimal. You may also try lambda.1se.
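To compare the two choices of s without downloading the RTCGA data, here is a minimal sketch on simulated data. The data, the object names (x, y, cvfit_sim) and the helper nonzero_features are assumptions made for illustration; only cv.glmnet and coef come from glmnet. lambda.1se is the largest penalty whose cross-validated error is within one standard error of the minimum, so it is never smaller than lambda.min and usually keeps fewer features.

```r
library(glmnet)

set.seed(1)
# Simulated data (an assumption for illustration): 200 samples, 50 features,
# only the first three features drive the binary outcome
n <- 200
p <- 50
x <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("gene", seq_len(p))))
y <- factor(rbinom(n, 1, plogis(2 * x[, 1] - 2 * x[, 2] + 1.5 * x[, 3])))

cvfit_sim <- cv.glmnet(x, y, family = "binomial", type.measure = "class")

# Hypothetical helper: names of features with a non-zero coefficient at s
nonzero_features <- function(fit, s) {
  beta <- coef(fit, s = s)[, 1]
  setdiff(names(beta)[beta != 0], "(Intercept)")
}

min_features <- nonzero_features(cvfit_sim, "lambda.min")
se_features  <- nonzero_features(cvfit_sim, "lambda.1se")

# Number of selected features under each choice of the penalty
length(min_features)
length(se_features)
```

Dropping the intercept by name rather than by position is a small design choice here: it stays correct even if the intercept happens to be exactly zero.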

plot(cvfit)
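Why can regularization set coefficients to exactly zero rather than just shrinking them? With the lasso penalty (alpha = 1, the glmnet default), the coordinate-wise update is a soft-thresholding step. The sketch below shows that operator in isolation; the function name soft_threshold and the input values are hypothetical, and this is a simplified illustration, not glmnet's actual implementation.

```r
# Soft-thresholding operator: shrinks z towards zero by lambda
# and returns exactly zero whenever |z| <= lambda
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

# Hypothetical unpenalized coefficient estimates
z <- c(strong = 2.0, moderate = -0.8, weak = 0.1)

# The weak effect falls below the threshold and is set to zero
soft_threshold(z, lambda = 0.5)
```

A larger lambda widens the band of estimates that are zeroed out, which is why the number of selected features shrinks as the penalty grows along the cross-validation curve.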

A discussion about standardization for LASSO can be found here. I normally don't do this, since I work with streaming data, for which checking assumptions, model diagnostics and standardization are problematic and are still a rapidly developing field of research.

Source: http://r-addict.com/2016/06/19/Venn-Diagram-RTCGA-Feature-Selection.html