0. Principal component analysis (PCA)

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (or sometimes, principal modes of variation). The number of principal components is less than or equal to the smaller of the number of original variables or the number of observations. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
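As a quick illustration, the sketch below applies scikit-learn's PCA class to a small made-up dataset of two correlated variables (the data values here are arbitrary and purely for demonstration):

# PCA sketch on a tiny made-up dataset
from sklearn.decomposition import PCA
import numpy
# two correlated variables, six observations (illustrative values only)
X = numpy.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
                 [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
pca = PCA(n_components=2)
scores = pca.fit_transform(X)  # project the observations onto the principal components
# the first component should capture most of the variance
print(pca.explained_variance_ratio_)

Because PCA is sensitive to the scaling of the inputs, it is common to rescale or standardize the data first; the recipes below cover those preprocessing steps.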

1. Rescale Data

When your data comprises attributes with varying scales, many machine learning algorithms can benefit from rescaling all of the attributes to the same scale.

Often this is referred to as normalization, and attributes are typically rescaled into the range between 0 and 1. This is useful for optimization algorithms used at the core of machine learning algorithms, like gradient descent. It is also useful for algorithms that weight inputs, like regression and neural networks, and algorithms that use distance measures, like K-Nearest Neighbors.

You can rescale your data with scikit-learn using the MinMaxScaler class.

# Rescale data (between 0 and 1)
import pandas
import numpy
from sklearn.preprocessing import MinMaxScaler
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

After rescaling you can see that all of the values are in the range between 0 and 1.

[[ 0.353  0.744  0.59   0.354  0.     0.501  0.234  0.483]
 [ 0.059  0.427  0.541  0.293  0.     0.396  0.117  0.167]
 [ 0.471  0.92   0.525  0.     0.     0.347  0.254  0.183]
 [ 0.059  0.447  0.541  0.232  0.111  0.419  0.038  0.   ]
 [ 0.     0.688  0.328  0.354  0.199  0.642  0.944  0.2  ]]
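Under the hood, MinMaxScaler with feature_range=(0, 1) applies the column-wise formula x' = (x - min) / (max - min). As a sanity check, you can reproduce the result directly in NumPy (a sketch reusing the X and rescaledX arrays from the recipe above):

# reproduce MinMaxScaler by hand, column by column
import numpy
mins = X.min(axis=0)
maxs = X.max(axis=0)
manualX = (X - mins) / (maxs - mins)
print(numpy.allclose(manualX, rescaledX))  # expect: True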

2. Standardize Data

Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.

It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression and linear discriminant analysis.

You can standardize data using scikit-learn with the StandardScaler class.

# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

The values for each attribute now have a mean value of 0 and a standard deviation of 1.

[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]
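StandardScaler is the column-wise z-score transform x' = (x - mean) / std. You can reproduce the numbers directly (a sketch reusing the X and rescaledX arrays from the recipe above; NumPy's std() defaults to the population standard deviation, which matches scikit-learn):

# reproduce StandardScaler by hand, column by column
import numpy
manualX = (X - X.mean(axis=0)) / X.std(axis=0)
print(numpy.allclose(manualX, rescaledX))  # expect: True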

3. Normalize Data

Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm in linear algebra).

This preprocessing can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values such as neural networks and algorithms that use distance measures such as K-Nearest Neighbors.

You can normalize data in Python with scikit-learn using the Normalizer class.

# Normalize data (length of 1)
from sklearn.preprocessing import Normalizer
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(normalizedX[0:5,:])

The rows are normalized to length 1.

[[ 0.034  0.828  0.403  0.196  0.     0.188  0.004  0.28 ]
 [ 0.008  0.716  0.556  0.244  0.     0.224  0.003  0.261]
 [ 0.04   0.924  0.323  0.     0.     0.118  0.003  0.162]
 [ 0.007  0.588  0.436  0.152  0.622  0.186  0.001  0.139]
 [ 0.     0.596  0.174  0.152  0.731  0.188  0.01   0.144]]
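Normalizer defaults to the L2 (Euclidean) norm, so each row is simply divided by its Euclidean length. The equivalent NumPy computation looks like this (a sketch reusing the X and normalizedX arrays from the recipe above):

# reproduce Normalizer by hand, row by row
import numpy
lengths = numpy.linalg.norm(X, axis=1, keepdims=True)
manualX = X / lengths
print(numpy.allclose(manualX, normalizedX))  # expect: True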

4. Binarize Data (Make Binary)

You can transform your data using a binary threshold. All values above the threshold are marked 1 and all values equal to or below the threshold are marked 0.

This is called binarizing or thresholding your data. It can be useful when you have probabilities that you want to turn into crisp values. It is also useful in feature engineering, when you want to add new features that indicate something meaningful.

You can create new binary attributes in Python using scikit-learn with the Binarizer class.

# binarization
from sklearn.preprocessing import Binarizer
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(binaryX[0:5,:])

You can see that all values equal to or less than 0 are marked 0 and all of those above 0 are marked 1.

[[ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  0.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.]
 [ 0.  1.  1.  1.  1.  1.  1.  1.]]
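Binarizer is equivalent to an element-wise comparison against the threshold (a sketch reusing the X and binaryX arrays from the recipe above):

# reproduce Binarizer by hand with a strict greater-than comparison
import numpy
manualX = (X > 0.0).astype(float)
print(numpy.allclose(manualX, binaryX))  # expect: True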

Summary

In this post you discovered how you can prepare your data for machine learning in Python using scikit-learn.

You now have recipes to:

  • Rescale data.
  • Standardize data.
  • Normalize data.
  • Binarize data.

Your action step for this post is to type or copy-and-paste each recipe and get familiar with data preprocessing in scikit-learn.
