Feature Engineering versus Feature Extraction: Game On!

"Feature engineering" is a fancy term for making sure that your predictors are encoded in the model in a manner that makes it as easy as possible for the model to achieve good performance. For example, if your have a date field as a predictor and there are larger differences in response for the weekends versus the weekdays, then encoding the date in this way makes it easier to achieve good results.

However, this depends on a lot of things.

First, it is model-dependent. For example, trees might have trouble with a classification data set if the class boundary is a diagonal line since their class boundaries are made using orthogonal slices of the data (oblique trees excepted).
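As a quick illustration (a toy simulation, not the data from this post), a CART tree fit to a diagonal class boundary can only approximate it with a staircase of axis-parallel splits:

library(rpart)
set.seed(1)
dat <- data.frame(x1 = runif(500), x2 = runif(500))
dat$Class <- factor(ifelse(dat$x1 > dat$x2, "One", "Two"))
## every split in the printed tree is of the form "x1 < c" or "x2 < c"
rpart(Class ~ x1 + x2, data = dat)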

Second, the process of predictor encoding benefits the most from subject-specific knowledge of the problem. In my example above, you need to know the patterns of your data to improve the format of the predictor. Feature engineering is very different in image processing, information retrieval, RNA expression profiling, etc. You need to know something about the problem and your particular data set to do it well.

Here is some training set data where two predictors are used to model a two-class system (I'll unblind the data at the end):

There is also a corresponding test set that we will use below.

There are some observations that we can make:

  • The data are highly correlated (correlation = 0.85)
  • Each predictor appears to be fairly right-skewed
  • They appear to be informative in the sense that you might be able to draw a diagonal line to differentiate the classes

Depending on which model we might choose to use, the between-predictor correlation might bother us. Also, we should look to see if the individual predictors are important. To measure this, we'll use the area under the ROC curve on the predictor data directly.
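The calculation isn't shown in this post, but one way to get such per-predictor AUCs is caret's filterVarImp(), which scores each predictor by the area under its ROC curve; a minimal sketch:

library(caret)
filterVarImp(x = example_train[, c("PredictorA", "PredictorB")],
             y = example_train$Class)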

Here are univariate box-plots of each predictor (on the log scale):

There is some mild differentiation between the classes but a significant amount of overlap in the boxes. The areas under the ROC curve for predictors A and B are 0.61 and 0.59, respectively. Not so fantastic.

What can we do? Principal component analysis (PCA) is a pre-processing method that does a rotation of the predictor data in a manner that creates new synthetic predictors (i.e. the principal components or PC's). This is conducted in a way where the first component accounts for the majority of the (linear) variation or information in the predictor data. The second component does the same for any information in the data that remains after extracting the first component and so on. For these data, there are two possible components (since there are only two predictors). Using PCA in this manner is typically called feature extraction.

Let's compute the components:

> library(caret)
> head(example_train)
   PredictorA PredictorB Class
2    3278.726  154.89876   One
3    1727.410   84.56460   Two
4    1194.932  101.09107   One
12   1027.222   68.71062   Two
15   1035.608   73.40559   One
16   1433.918   79.47569   One
> pca_pp <- preProcess(example_train[, 1:2],
+                      method = c("center", "scale", "pca"))
> pca_pp

Call:
preProcess.default(x = example_train[, 1:2], method = c("center",
 "scale", "pca"))

Created from 1009 samples and 2 variables
Pre-processing: centered, scaled, principal component signal extraction
PCA needed 2 components to capture 95 percent of the variance
> train_pc <- predict(pca_pp, example_train[, 1:2])
> test_pc <- predict(pca_pp, example_test[, 1:2])
> head(test_pc, 4)
        PC1         PC2
1 0.8420447  0.07284802
5 0.2189168  0.04568417
6 1.2074404 -0.21040558
7 1.1794578 -0.20980371

Note that we computed all the necessary information from the training set and then applied these calculations to the test set. What do the test set data look like?

These are the test set predictors simply rotated.

PCA is unsupervised, meaning that the outcome classes are not considered when the calculations are done. Here, the area under the ROC curve is 0.5 for the first component and 0.81 for the second component. These results jibe with the plot above; the first component shows a random mixture of the classes while the second seems to separate the classes well. Box plots of the two components reflect the same thing:

There is much more separation in the second component.

This is interesting. First, despite PCA being unsupervised, it managed to find a new predictor that differentiates the classes. Second, the component that is most important for separating the classes is the one that explains the least variation in the predictors. It is often said that PCA doesn't guarantee that any of the components will be predictive, and this is true. Here, we get lucky and it does produce something good.

However, imagine that there are hundreds of predictors. We may only need to use the first X components to capture the majority of the information in the predictors and, in doing so, discard the later components. In this example, the first component accounts for 92.4% of the variation in the predictors; a strategy that keeps only the high-variance components would discard the second component, which here is the most effective predictor.
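preProcess() supports this kind of truncation through its thresh argument, which retains only as many components as are needed to reach a given fraction of the total variance. A sketch, where wide_predictors is a hypothetical data frame with hundreds of columns:

## wide_predictors is hypothetical; any wide numeric data frame would do
pca_keep <- preProcess(wide_predictors,
                       method = c("center", "scale", "pca"),
                       thresh = 0.95)   ## keep components for 95% of the variance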

How does the idea of feature engineering come into play here? Given these two predictors and seeing the first scatterplot shown above, one of the first things that occurs to me is "there are two correlated, positive, skewed predictors that appear to act in tandem to differentiate the classes". The second thing that occurs to me is "take the ratio". What do those data look like?

The corresponding area under the ROC curve is 0.8, which is nearly as good as the second component. A simple transformation based on visually exploring the data can do just as good a job as an unbiased empirical algorithm.
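A minimal sketch of that ratio feature, scored with the same filterVarImp() approach as before:

ratio_train <- data.frame(Ratio = example_train$PredictorA /
                                  example_train$PredictorB)
filterVarImp(x = ratio_train, y = example_train$Class)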

These data are from the cell segmentation experiment of Hill et al., and predictor A is the "surface of a sphere created by rotating the equivalent circle about its diameter" (labeled as EqSphereAreaCh1 in the data) and predictor B is the perimeter of the cell nucleus (PerimCh1). A specialist in high content screening might naturally take the ratio of these two features of cells because it makes good scientific sense (I am not that person). In the context of the problem, their intuition should drive the feature engineering process.

However, in defense of an algorithm such as PCA, the machine has some benefit. In total, there are almost sixty predictors in these data whose definitions are just as arcane as EqSphereAreaCh1. My personal favorite is the "Haralick texture measurement of the spatial arrangement of pixels based on the co-occurrence matrix". Look that one up some time. The point is that there are often too many features to engineer and they might be completely unintuitive from the start.

Another plus for feature extraction is related to correlation. The predictors in this particular data set tend to have high between-predictor correlations and for good reasons. For example, there are many different ways to quantify the eccentricity of a cell (i.e. how elongated it is). Also, the size of a cell's nucleus is probably correlated with the size of the overall cell and so on. PCA can mitigate the effect of these correlations in one fell swoop. An approach of manually taking ratios of many predictors seems less likely to be effective and would take more time.
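To get a sense of that redundancy, caret's findCorrelation() flags predictors whose pairwise correlations exceed a cutoff; a sketch, assuming cell_predictors holds the full predictor matrix from these data:

## cell_predictors is a hypothetical name for the full ~60-column predictor set
high_cor <- findCorrelation(cor(cell_predictors), cutoff = 0.75)
length(high_cor)   ## how many predictors are flagged as redundant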

Last year, in one of the R&D groups that I support, there was a bit of a war being waged between the scientists who focused on biased analysis (i.e. we model what we know) versus the unbiased crowd (i.e. just let the machine figure it out). I fit somewhere in-between and believe that there is a feedback loop between the two. The machine can flag potentially new and interesting features that, once explored, become part of the standard book of "known stuff".
