Seven Techniques for Data Dimensionality Reduction

Seven Techniques for Data Dimensionality Reduction

12 May, 2015 - 12:38 — rs

The recent explosion of data set size, in number of records and attributes, has triggered the development of a number of big data platforms as well as parallel data analytics algorithms. At the same time though, it has pushed for usage of data dimensionality reduction procedures. Indeed, more is not always better. Large amounts of data might sometimes produce worse performances in data analytics applications.

One of my most recent projects happened to be about churn prediction and to use the 2009 KDD Challenge large data set. The particularity of this data set consists of its very high dimensionality with 15K data columns. Most data mining algorithms are column-wise implemented, which makes them slower and slower on a growing number of data columns. The first milestone of the project was then to reduce the number of columns in the data set and lose the smallest amount of information possible at the same time.

Using the project as an excuse, we started exploring the state-of-the-art on dimensionality reduction techniques currently available and accepted in the data analytics landscape.

  • Missing Values Ratio. Data columns with too many missing values are unlikely to carry much useful information. Thus data columns with number of missing values greater than a given threshold can be removed. The higher the threshold, the more aggressive the reduction.
  • Low Variance Filter. Similarly to the previous technique, data columns with little changes in the data carry little information. Thus all data columns with variance lower than a given threshold are removed. A word of caution: variance is range dependent; therefore normalization is required before applying this technique.
  • High Correlation Filter. Data columns with very similar trends are also likely to carry very similar information. In this case, only one of them will suffice to feed the machine learning model. Here we calculate the correlation coefficient between numerical columns and between nominal columns as the Pearson’s Product Moment Coefficient and thePearson's chi square value respectively. Pairs of columns with correlation coefficient higher than a threshold are reduced to only one. A word of caution: correlation is scale sensitive; therefore column normalization is required for a meaningful correlation comparison.
  • Random Forests / Ensemble Trees. Decision Tree Ensembles, also referred to as random forests, are useful for feature selection in addition to being effective classifiers.  One approach to dimensionality reduction is to generate a large and carefully constructed set of trees against a target attribute and then use each attribute’s usage statistics to find the most informative subset of features.  Specifically, we can generate a large set (2000) of very shallow trees (2 levels), with each tree being trained on a small fraction (3) of the total number of attributes. If an attribute is often selected as best split, it is most likely an informative feature to retain. A score calculated on the attribute usage statistics in the random forest tells us ‒ relative to the other attributes ‒ which are the most predictive attributes.
  • Principal Component Analysis (PCA)Principal Component Analysis (PCA) is a statistical procedure that orthogonally transforms the original n coordinates of a data set into a new set of n coordinates called principal components. As a result of the transformation, the first principal component has the largest possible variance; each succeeding component has the highest possible variance under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components. Keeping only the first m < ncomponents reduces the data dimensionality while retaining most of the data information, i.e. the variation in the data. Notice that the PCA transformation is sensitive to the relative scaling of the original variables. Data column ranges need to be normalized before applying PCA. Also notice that the new coordinates (PCs) are not real system-produced variables anymore. Applying PCA to your data set loses its interpretability. If interpretability of the results is important for your analysis, PCA is not the transformation for your project.
  • Backward Feature Elimination. In this technique, at a given iteration, the selected classification algorithm is trained on n input features. Then we remove one input feature at a time and train the same model on n-1 input features n times. The input feature whose removal has produced the smallest increase in the error rate is removed, leaving us with n-1 input features. The classification is then repeated using n-2 features, and so on. Each iteration k produces a model trained on n-k features and an error rate e(k). Selecting the maximum tolerable error rate, we define the smallest number of features necessary to reach that classification performance with the selected machine learning algorithm.
  • Forward Feature Construction. This is the inverse process to the Backward Feature Elimination. We start with 1 feature only, progressively adding 1 feature at a time, i.e. the feature that produces the highest increase in performance. Both algorithms, Backward Feature Elimination and Forward Feature Construction, are quite time and computationally expensive. They are practically only applicable to a data set with an already relatively low number of input columns.

We picked this chance to compare those techniques on the smaller data set of the 2009 KDD challenge in terms of reduction ratio, degrading accuracy, and speed. The final accuracy and its degradation depend, of course, on the model selected for the analysis. Thus, the compromise between reduction ratio and final accuracy is optimized against a bag of three specific models: decision tree, neural networks, and naïve Bayes.

Running the optimization loop, the best cutoffs, in terms of lowest number of columns and best accuracy, were determined for each one of the seven dimensionality reduction techniques and for the best performing model. The final best model performance, as accuracy and Area under the ROC Curve, was compared with the performance of the baseline algorithm using all input features. Results of this comparison are reported in the table below.

Dimensionality Reduction Reduction Rate Accuracy on validation set Best Threshold AuC Notes
Baseline 0% 73% - 81% Baseline models are using all input features
Missing Values Ratio 71% 76% 0.4 82% -
Low Variance Filter 73% 82% 0.03 82% Only for numerical columns
High Correlation Filter 74% 79% 0.2 82% No correlation available between numerical and nominal columns
PCA 62% 74% - 72% Only for numerical columns
Random Forrest / Ensemble Trees 86% 76% - 82% -
Backward Feature Elimination + missing values ratio 99% 94% - 78% Backward Feature Elimination and Forward Feature Construction are prohibitively slow on high dimensional data sets. It becomes practical to use them, only if following other dimensionality reduction techniques, like here the one based on the number of missing values.
Forward Feature Construction + missing values ratio 91% 83% - 63%

Notice that the highest reduction ratio without performance degradation is obtained by analyzing the decision cuts in many random forests (Random Forests/Ensemble Trees). However, even just counting the number of missing values, measuring the column variance, and measuring the correlation of pairs of columns can lead to a satisfactory reduction rate while keeping performance unaltered with respect to the baseline models.

What we have learned from this little review exercise, is that dimensionality reduction is not only useful to speed up algorithm execution, but also to improve model performance. The Area under the Curve (AuC) in the table shows a slight increase on the test data, when the missing value ratio, the low variance filter, the high correlation filter criteria, or the random forests are applied.

Indeed, in the era of big data, when more is axiomatically better, we have re-discovered that too many noisy or even faulty input data columns often lead to a less than desirable algorithm performance. Removing un-informative or even worse dis-informative input attributes might help build a model on more extensive data regions, with more general classification rules, and overall with better performances on new unseen data.

Recently, we asked data analysts on a LinkedIn group (https://www.linkedin.com/grp/post/35222-5998794653007171586) for the most used dimensionality reduction techniques, besides the seven described in this blog post. The answers involved Random Projections, NMF, (Stacked) Auto-encoders, Chi-square or Information Gain, Multidimensional Scaling, Correspondence Analysis, Factor Analysis, Clustering, and Bayesian Models. Thanks to Asterios StergioudisRaoul Savos, and Michael Will who provided the suggestions on the LinkedIn group.

The workflows described in this blog post are available on the KNIME EXAMPLES server under 003_Preprocessing/003005_dimensionality_reduction.

Both small and large data sets from the 2009 KDD Challenge can be downloaded from http://www.sigkdd.org/kdd-cup-2009-customer-relationship-prediction.

This is just a brief summary of the whole project. If you are interested in all the tiny details, you can always read the related whitepaper, in the Whitepapers section on the KNIME web site:https://www.knime.org/files/knime_seventechniquesdatadimreduction.pdf

Below are the ROC curves for all the evaluated dimensionality reduction techniques and the best performing machine learning algorithm. The value of the area under the curve is shown in the legend.

Further Reading:

Seven Techniques for Data Dimensionality Reduction的更多相关文章

  1. dimensionality reduction动机---data compression(使算法提速)

    data compression可以使数据占用更少的空间,并且能使算法提速 什么是dimensionality reduction(维数约简)    例1:比如说我们有一些数据,它有很多很多的feat ...

  2. 可视化MNIST之降维探索Visualizing MNIST: An Exploration of Dimensionality Reduction

    At some fundamental level, no one understands machine learning. It isn’t a matter of things being to ...

  3. Stanford机器学习笔记-10. 降维(Dimensionality Reduction)

    10. Dimensionality Reduction Content  10. Dimensionality Reduction 10.1 Motivation 10.1.1 Motivation ...

  4. [Scikit-learn] 4.4 Dimensionality reduction - PCA

    2.5. Decomposing signals in components (matrix factorization problems) 2.5.1. Principal component an ...

  5. 第八章——降维(Dimensionality Reduction)

    机器学习问题可能包含成百上千的特征.特征数量过多,不仅使得训练很耗时,而且难以找到解决方案.这一问题被称为维数灾难(curse of dimensionality).为简化问题,加速训练,就需要降维了 ...

  6. 壁虎书8 Dimensionality Reduction

    many Machine Learning problems involve thousands or even millions of features for each training inst ...

  7. [UFLDL] Dimensionality Reduction

    博客内容取材于:http://www.cnblogs.com/tornadomeet/archive/2012/06/24/2560261.html Deep learning:三十五(用NN实现数据 ...

  8. 单细胞数据高级分析之初步降维和聚类 | Dimensionality reduction | Clustering

    个人的一些碎碎念: 聚类,直觉就能想到kmeans聚类,另外还有一个hierarchical clustering,但是单细胞里面都用得不多,为什么?印象中只有一个scoring model是用kme ...

  9. 【原】Coursera—Andrew Ng机器学习—课程笔记 Lecture 14—Dimensionality Reduction 降维

    Lecture 14 Dimensionality Reduction 降维 14.1 降维的动机一:数据压缩 Data Compression 现在讨论第二种无监督学习问题:降维. 降维的一个作用是 ...

随机推荐

  1. asp.net 错误24

    错误 24 “xxx.Web.xxx.xxx”不包含“xxName”的定义,并且找不到可接受类型为“xxx.Web.xxxr.xxx”的第一个参数的扩展方法“xxxName”(是否缺少 using 指 ...

  2. 复利计算测试(C语言)

    对我们和复利计算程序,写单元测试. 有哪些场景? 期待的返回值 写测试程序. 运行测试. 测试模块 测试输入 预期结果 运行结果 bug跟踪 计算终值 (本金,年限,利率) 终值     1 (100 ...

  3. selenium异常问题汇总(持续更新版)

    webdriver启动firefox时如果遇到以下错误,则说明selenium的版本和firefox不兼容了,升级selenium版本就好 org.openqa.selenium.firefox.No ...

  4. TP 等框架在配置虚拟主机伪静态注意事项

    在配置虚拟主机的伪静态 .htaccess 文件时需要注意 ? 符号的添加与否处理 例如TP 框架来讲 无 ? 符号,在有的虚拟主机中必须这样配置,否则无法解析(试了多个公司的虚拟主机只要加了?符号就 ...

  5. windows多线程(九) PV原语分析同步问题

    一.PV原语介绍 PV原语通过操作信号量来处理进程间的同步与互斥的问题.其核心就是一段不可分割不可中断的程序. 信号量的概念1965年由著名的荷兰计算机科学家Dijkstra提出,其基本思路是用一种新 ...

  6. 初入码田--ASP.NET MVC4 Web应用之创建一个空白的MVC应用程序

    初入码田--ASP.NET MVC4 Web应用开发之一  实现简单的登录 初入码田--ASP.NET MVC4 Web应用开发之二 实现简单的增删改查 2016-07-29 在此之前,需要一台电脑( ...

  7. 深入理解JAVA虚拟机阅读笔记5——Java内存模型与线程

    Java内存模型是定义线程共享的变量的访问规则(实例字段.静态字段和构成数组对象的元素),但不包括线程私有的局部变量和方法参数. 1.主内存与工作内存 Java内存模型规定,所有的变量都必须存储在主内 ...

  8. python3+selenium 牛刀小试

    # coding:utf-8 # __author__ = 'Carry' import unittest from selenium import webdriver import time cla ...

  9. jquery 半透明遮罩效果 小结

    最近偏离学术的道路越来越远了!! 今天要小结的是实现一个半透明遮罩效果.点击页面上的一个按钮,立即在屏幕的正中央显示某个部件,并且在这个部件之外的区域像是蒙上了一层半透明的遮罩.点击遮罩区域,该正中央 ...

  10. 【刷题】BZOJ 4530 [Bjoi2014]大融合

    Description 小强要在N个孤立的星球上建立起一套通信系统.这套通信系统就是连接N个点的一个树. 这个树的边是一条一条添加上去的.在某个时刻,一条边的负载就是它所在的当前能够 联通的树上路过它 ...