The steps that may be taken to solve a feature selection problem
Reference: the JMLR paper "An Introduction to Variable and Feature Selection" (Guyon and Elisseeff, 2003).
We summarize the steps that may be taken to solve a feature selection problem in a checklist:
1. Do you have domain knowledge?
If yes, construct a better set of “ad hoc” features.
2. Are your features commensurate (i.e., measurable on the same scale)? If no, consider normalizing them (see the normalization sketch after this checklist).
3. Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features (i.e., combine several variables into a single new feature, or form higher-order terms), as much as your computer resources allow (see the example of use in Section 4.4, and the feature-expansion sketch after this checklist).
4. Do you need to prune the input variables (e.g. for cost, speed or data understanding reasons)? If no, construct disjunctive features or weighted sums of features (i.e., merge several variables into one feature, e.g. by clustering or matrix factorization; see Section 5 and the feature-merging sketch after this checklist).
5. Do you need to assess features individually (e.g. to understand their influence on the system, or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method (Section 2 and Section 7.2; a ranking sketch follows this checklist); else, do it anyway to get baseline results.
6. Do you need a predictor? If no, stop.
7. Do you suspect your data is "dirty" (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top-ranking variables obtained in step 5 as a representation; check and/or discard them (note: "them" refers to the examples, not the features; an outlier-detection sketch follows this checklist).
8. Do you know what to try first? If no, use a linear predictor. Use a forward selection method (Section 4.2) with the "probe" method as a stopping criterion (Section 6), or use the L0-norm embedded method (Section 4.3). For comparison, following the ranking of step 5, construct a sequence of predictors of the same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset (a forward-selection sketch follows this checklist).
9. Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods (Section 4). Use linear and non-linear predictors. Select the best approach with model selection (Section 6; a comparison sketch follows this checklist).
10. Do you want a stable solution (to improve performance and/or understanding)? If yes, sub-sample your data and redo your analysis for several "bootstraps" (Section 7.1; a bootstrap sketch follows this checklist).
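The sketches below illustrate several checklist steps in code. All of them are hedged illustrations using scikit-learn and synthetic or toy data, not prescriptions from the paper. First, step 2 (normalization); the toy data and the choice of StandardScaler are assumptions:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: column 0 in meters, column 1 in grams, so the features
# are not commensurate until we put them on a common scale.
X = np.array([[1.70, 65000.0],
              [1.60, 54000.0],
              [1.85, 90000.0]])

X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```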
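Step 3 (feature expansion): one way to build products of features is PolynomialFeatures; degree 2 is an arbitrary choice, and get_feature_names_out assumes scikit-learn 1.0 or later:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# include_bias=False drops the constant column; the expansion adds
# squares and pairwise products of the original variables.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_expanded = poly.fit_transform(X)
print(poly.get_feature_names_out())  # x0, x1, x0^2, x0 x1, x1^2
print(X_expanded)
```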
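Step 4 (feature merging): a sketch of the "weighted sums of features" idea using FeatureAgglomeration, which clusters similar variables and replaces each cluster by its mean; the cluster count of 5 is an assumption:

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))  # 100 examples, 30 variables

# Merge the 30 variables into 5 pooled features, each the mean of one cluster.
agglo = FeatureAgglomeration(n_clusters=5)
X_reduced = agglo.fit_transform(X)
print(X_reduced.shape)  # (100, 5)
```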
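Step 5 (variable ranking): a sketch using mutual information as the criterion; the synthetic data and the choice of criterion are illustrative, since the paper covers several ranking criteria in Section 2:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Score each variable independently against the target, then sort.
scores = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(scores)[::-1]
print("variables ranked best to worst:", ranking)
```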
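Step 7 (outlier detection): the paper does not prescribe a particular detector, so as one illustrative possibility, an IsolationForest run on the top-ranked variables from step 5:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Represent each example by its 5 top-ranked variables only.
scores = mutual_info_classif(X, y, random_state=0)
top = np.argsort(scores)[::-1][:5]

labels = IsolationForest(random_state=0).fit_predict(X[:, top])
print("examples to check or discard:", np.where(labels == -1)[0])  # -1 = outlier
```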
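Step 8 (forward selection with a linear predictor): a sketch via SequentialFeatureSelector, available in scikit-learn 0.24 and later; the fixed n_features_to_select stands in for the paper's "probe" stopping criterion, which scikit-learn does not implement:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Greedily add one variable at a time, keeping whichever addition
# improves the cross-validated score the most.
selector = SequentialFeatureSelector(LinearSVC(dual=False),
                                     n_features_to_select=5,
                                     direction="forward", cv=5)
selector.fit(X, y)
print("selected variables:", selector.get_support(indices=True))
```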
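Step 9 (comparing approaches): candidate pipelines compared by cross-validated score, in the spirit of the model selection of Section 6; the two pipelines below are arbitrary examples:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

candidates = {
    "MI filter + linear SVM": make_pipeline(
        SelectKBest(mutual_info_classif, k=5), LinearSVC(dual=False)),
    "MI filter + RBF SVM": make_pipeline(
        SelectKBest(mutual_info_classif, k=5), SVC(kernel="rbf")),
}

# Keep whichever approach scores best under cross-validation.
for name, model in candidates.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```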
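Step 10 (bootstrap stability): re-run the ranking on resampled data and keep the variables that are selected consistently; the 50 bootstraps and top-5 cutoff are arbitrary assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

rng = np.random.default_rng(0)
counts = np.zeros(X.shape[1])
for _ in range(50):
    idx = rng.integers(0, len(y), size=len(y))  # resample with replacement
    scores = mutual_info_classif(X[idx], y[idx], random_state=0)
    counts[np.argsort(scores)[::-1][:5]] += 1   # top 5 of this bootstrap

# Variables chosen in most bootstraps are the stable ones.
print("selection frequency per variable:", counts / 50)
```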
The paper's sections are organized as follows.
Section 2: describes filters that select variables by ranking them with correlation coefficients (commonly used criteria include the Pearson correlation coefficient and mutual information).
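As a sketch of the Section 2 idea, the Pearson correlation of each variable with the target can be computed directly in numpy; the synthetic regression data is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 3] + rng.normal(size=100)  # only variable 3 is informative

# R(i) = cov(X_i, y) / sqrt(var(X_i) * var(y)); rank variables by |R(i)|.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
R = (Xc * yc[:, None]).sum(axis=0) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
print("ranking:", np.argsort(np.abs(R))[::-1])
```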
Section 3: the limitations of such filter approaches are illustrated by a set of constructed examples (picking the single "best" variable at each pass by such criteria has limits, because combinations of variables often outperform any individual variable; even a seemingly useless variable can provide a significant performance improvement when combined with useful ones, and so can several useless variables combined).
Section 4: subset selection methods are then introduced. These include wrapper methods that assess subsets of variables according to their usefulness to a given predictor (essentially simple stepwise forward selection or backward elimination; see http://blog.csdn.net/mmc2015/article/details/47426437). We show how some embedded methods implement the same idea, but proceed more efficiently by directly optimizing a two-part objective function with a goodness-of-fit term and a penalty for a large number of variables (the so-called L0-norm, L1-norm, and similar penalties).
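A sketch of the embedded idea with an L1 penalty; the Lasso here stands in for the family of penalized objectives the paper discusses, and alpha=0.1 is an arbitrary assumption:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 5] + rng.normal(scale=0.1, size=100)

# Goodness-of-fit term plus an L1 penalty: the penalty drives most
# coefficients to exactly zero, so selection happens during training.
model = Lasso(alpha=0.1).fit(X, y)
print("kept variables:", np.flatnonzero(model.coef_))
```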
Section 5: we then turn to the problem of feature construction, whose goals include increasing the predictor performance and building more compact feature subsets. All of the previous steps benefit from reliably assessing the statistical significance of the relevance of features. (Common methods include: clustering, whose core idea is to replace several similar variables by their cluster center, with k-means and hierarchical clustering the most widely used; matrix factorization, whose core idea is a linear transformation of the input variables, e.g. PCA/SVD/LDA; and non-linear transformations such as kernel methods.)
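A sketch of matrix-factorization-based feature construction with PCA; the component count of 5 is an assumption:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))

# Linear transformation of the inputs; keep the 5 directions of largest variance.
X_new = PCA(n_components=5).fit_transform(X)
print(X_new.shape)  # (100, 5)
```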
Section 6: we briefly review model selection methods and statistical tests used to that effect.
Section 7: finally, we conclude the paper with a discussion section in which we go over more advanced issues.
We recommend using a linear predictor of your choice (e.g. a linear SVM) and selecting variables in two alternate ways: (1) with a variable ranking method using a correlation coefficient or mutual information; (2) with a nested subset selection method performing forward or backward selection, or with multiplicative updates.