Feature selection is the process of extracting valuable features that have a significant influence on the dependent variable. This is still an active field of research and machine wandering. In this post I compare a few feature selection algorithms: a traditional GLM with regularization, the computationally demanding Boruta, and an entropy-based filter from the FSelectorRcpp package (free of Java/Weka). Check out the comparison on a Venn diagram, carried out on data from the RTCGA factory of R data packages.

I would like to thank Magda Sobiczewska and pbiecek for the inspiration for this comparison. It gave me a chance to see Boruta and FSelectorRcpp in action. GLMnet is here only to improve the Venn diagram.

RTCGA data

Data used for this comparison come from RTCGA (http://rtcga.github.io/RTCGA/) and present gene expression levels (RNASeq) from the sequenced human genome. Datasets with RNASeq are available via the RTCGA.rnaseq data package and were originally provided by The Cancer Genome Atlas. It's a great set of over 20 thousand features (1 gene expression = 1 continuous feature) that might influence various aspects of human survival. Let's use the data for breast cancer (Breast invasive carcinoma / BRCA), where we will try to find valuable genes that have an impact on the dependent variable denoting whether a sample of the collected readings came from tumor or normal, healthy tissue.

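## try http:// if https:// URLs are not supported
## (Bioconductor 3.8 and later replace biocLite with BiocManager::install("RTCGA.rnaseq"))
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA.rnaseq")
library(RTCGA.rnaseq)
# keep only the 14th character of the barcode; it becomes the class label
BRCA.rnaseq$bcr_patient_barcode <-
   substr(BRCA.rnaseq$bcr_patient_barcode, 14, 14)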

The dependent variable, bcr_patient_barcode, is derived from the TCGA barcode: its 14th character tells us whether a sample of the collected readings came from tumor (0) or normal, healthy tissue (1).
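To make the barcode step concrete, here is a minimal sketch in base R. The two barcodes are made-up strings that only follow the usual TCGA layout; they are not taken from BRCA.rnaseq, and the names barcodes and sample_type are introduced just for this illustration.

```r
# Two illustrative barcodes in the usual TCGA layout (made up for this sketch).
barcodes <- c("TCGA-A1-A0SB-01A-11R-A144-07",
              "TCGA-BH-A0B3-11A-23R-A089-07")

# Characters 14-15 hold the sample type code; its first digit (position 14)
# separates tumor samples (0) from normal samples (1).
sample_type <- substr(barcodes, 14, 14)

# Turn the digit into a readable two-level factor.
tissue <- factor(sample_type, levels = c("0", "1"),
                 labels = c("tumor", "normal"))
print(data.frame(barcode = barcodes, digit = sample_type, tissue = tissue))
```

The same substr call is what the snippet above applies to the whole bcr_patient_barcode column.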

Check another RTCGA use case: TCGA and The Curse of BigData.

GLMnet

Logistic regression, a model from the generalized linear models (GLM) family and a common first-attempt model for class prediction, can be extended with elastic-net regularization to provide prediction and variable selection at the same time. We can assume that features that are not valuable will end up with a coefficient equal to zero in the final model fitted with the best regularization parameter. A broader explanation can be found in the vignette of the glmnet package. Below is the code I use to extract valuable features with the extra help of cross-validation and parallel computing.
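For reference, the penalized problem behind this idea can be written down as a sketch. The notation below follows the glmnet vignette rather than this post, so the symbols (N observations, responses y_i in {0, 1}, feature vectors x_i, intercept \beta_0, coefficients \beta, penalty weight \lambda and mixing parameter \alpha) are assumptions introduced here.

```latex
\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\sum_{i=1}^{N}
     \Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
            - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
  \;+\; \lambda\Bigl[\, \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
            + \alpha\,\lVert\beta\rVert_1 \Bigr]
```

With glmnet's default \alpha = 1 the penalty is the pure lasso term \lambda\lVert\beta\rVert_1, which is what pushes the coefficients of uninformative features to exactly zero as \lambda grows.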

library(doMC)
registerDoMC(cores=6)   # parallel backend used by cv.glmnet
library(glmnet)
# fit the model
cv.glmnet(x = as.matrix(BRCA.rnaseq[, -1]),
          y = factor(BRCA.rnaseq[, 1]),
          family = "binomial",
          type.measure = "class",
          parallel = TRUE) -> cvfit
# extract feature names that have
# non-zero coefficient
# (the first name is the intercept, so it is dropped)
names(which(
   coef(cvfit, s = "lambda.min")[, 1] != 0)
   )[-1] -> glmnet.features

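The coef function extracts coefficients from the fitted model. The argument s specifies the regularization parameter for which we would like to extract them: lambda.min is the value for which the cross-validated misclassification error is minimal. You may also try lambda.1se, the largest value whose error is still within one standard error of that minimum, which usually gives a sparser model.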
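The difference between the two choices of s can be sketched in base R without fitting anything. The numbers below are made up for illustration, and the vectors lambda, cvm and cvsd only mimic the fields of the same names that a cv.glmnet result carries; this is a sketch of the selection rule, not a copy of the glmnet internals.

```r
# Made-up cross-validation summary for five candidate penalties.
lambda <- c(0.50, 0.20, 0.10, 0.05, 0.01)  # penalties, largest first
cvm    <- c(0.30, 0.12, 0.10, 0.11, 0.15)  # mean CV misclassification error
cvsd   <- c(0.03, 0.03, 0.03, 0.03, 0.03)  # standard error of each mean

# lambda.min: the penalty with the smallest mean CV error.
i_min      <- which.min(cvm)
lambda_min <- lambda[i_min]

# lambda.1se: the largest penalty whose error stays within one
# standard error of that minimum.
threshold  <- cvm[i_min] + cvsd[i_min]
lambda_1se <- max(lambda[cvm <= threshold])

cat("lambda.min:", lambda_min, " lambda.1se:", lambda_1se, "\n")
```

In this toy case lambda.1se is the larger of the two penalties, so a model fitted with it would keep fewer non-zero coefficients.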

plot(cvfit)

A discussion about standardization for LASSO can be found here. I normally don't do this, since I work with streaming data, for which checking assumptions, model diagnostics and standardization are problematic and remain a rapidly developing field of research.

Reposted from: http://r-addict.com/2016/06/19/Venn-Diagram-RTCGA-Feature-Selection.html
