Data Preprocessing Methods in Neural Networks
0. Principal Component Analysis (PCA)
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (or sometimes, principal modes of variation). The number of principal components is less than or equal to the smaller of the number of original variables or the number of observations. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
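The original post includes no code for PCA, but scikit-learn provides it through the sklearn.decomposition.PCA class. Below is a minimal sketch on synthetic data; the random data, the 5-feature shape, and the choice of 2 components are illustrative assumptions rather than part of the original recipes.
# Reduce correlated data to its first two principal components
import numpy
from sklearn.decomposition import PCA
rng = numpy.random.RandomState(0)
X = rng.normal(size=(100, 5))  # 100 samples, 5 features
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)  # make two features strongly correlated
pca = PCA(n_components=2)
reducedX = pca.fit_transform(X)  # project each row onto the top 2 components
print(reducedX.shape)  # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component
Because PCA is sensitive to the relative scaling of the variables, it is common to standardize the data (see section 2 below) before applying it.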
1. Rescale Data
When your data comprises attributes with varying scales, many machine learning algorithms can benefit from rescaling all attributes to the same scale.
Often this is referred to as normalization, and attributes are typically rescaled into the range between 0 and 1. This is useful for the optimization algorithms used at the core of machine learning, such as gradient descent. It is also useful for algorithms that weight inputs, like regression and neural networks, and for algorithms that use distance measures, like K-Nearest Neighbors.
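Concretely, min-max rescaling maps each attribute value x to (x - min) / (max - min), where min and max are that attribute's minimum and maximum over the dataset, so the smallest value becomes 0 and the largest becomes 1.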
You can rescale your data with scikit-learn using the MinMaxScaler class.
# Rescale data (between 0 and 1)
import pandas
import numpy
from sklearn.preprocessing import MinMaxScaler
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])
After rescaling you can see that all of the values are in the range between 0 and 1.
[[ 0.353 0.744 0.59 0.354 0. 0.501 0.234 0.483]
[ 0.059 0.427 0.541 0.293 0. 0.396 0.117 0.167]
[ 0.471 0.92 0.525 0. 0. 0.347 0.254 0.183]
[ 0.059 0.447 0.541 0.232 0.111 0.419 0.038 0. ]
[ 0. 0.688 0.328 0.354 0.199 0.642 0.944 0.2 ]]
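A practical note not covered in the original recipe: to avoid leaking information from the test set, fit the scaler on the training split only and reuse the learned minimum and maximum on unseen data. A minimal sketch, assuming X_train and X_test come from your own train/test split:
scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)  # learn min/max from training data only
rescaled_train = scaler.transform(X_train)
rescaled_test = scaler.transform(X_test)  # reuse the same min/max on the test data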
2. Standardize Data
Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.
It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression and linear discriminant analysis.
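In other words, standardization replaces each attribute value x with its z-score, (x - mean) / std, computed from that attribute's mean and standard deviation.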
You can standardize data using scikit-learn with the StandardScaler class.
# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])
The values for each attribute now have a mean value of 0 and a standard deviation of 1.
[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426]
[-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191]
[ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106]
[-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042]
[-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]
3. Normalize Data
Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm in linear algebra).
This preprocessing can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values such as neural networks and algorithms that use distance measures such as K-Nearest Neighbors.
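By default the Normalizer divides each row by its L2 (Euclidean) norm, so a row x becomes x / sqrt(x1^2 + ... + xn^2); scikit-learn also supports 'l1' and 'max' via the class's norm parameter.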
You can normalize data in Python with scikit-learn using the Normalizer class.
# Normalize data (length of 1)
from sklearn.preprocessing import Normalizer
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(normalizedX[0:5,:])
The rows are normalized to length 1.
[[ 0.034 0.828 0.403 0.196 0. 0.188 0.004 0.28 ]
[ 0.008 0.716 0.556 0.244 0. 0.224 0.003 0.261]
[ 0.04 0.924 0.323 0. 0. 0.118 0.003 0.162]
[ 0.007 0.588 0.436 0.152 0.622 0.186 0.001 0.139]
[ 0. 0.596 0.174 0.152 0.731 0.188 0.01 0.144]]
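As a quick check (not part of the original recipe), you can confirm the unit length of each row with numpy:
print(numpy.linalg.norm(normalizedX, axis=1)[0:5])  # each value is approximately 1.0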
4. Binarize Data (Make Binary)
You can transform your data using a binary threshold: all values above the threshold are marked 1, and all values equal to or below it are marked 0.
This is called binarizing or thresholding your data. It can be useful when you have probabilities that you want to turn into crisp values. It is also useful in feature engineering when you want to add new features that indicate something meaningful.
You can create new binary attributes in Python using scikit-learn with the Binarizer class.
# binarization
from sklearn.preprocessing import Binarizer
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(binaryX[0:5,:])
You can see that all values equal to or less than 0 are marked 0 and all values above 0 are marked 1.
[[ 1. 1. 1. 1. 0. 1. 1. 1.]
[ 1. 1. 1. 1. 0. 1. 1. 1.]
[ 1. 1. 1. 0. 0. 1. 1. 1.]
[ 1. 1. 1. 1. 1. 1. 1. 1.]
[ 0. 1. 1. 1. 1. 1. 1. 1.]]
Summary
In this post you discovered how you can prepare your data for machine learning in Python using scikit-learn.
You now have recipes to:
- Rescale data.
- Standardize data.
- Normalize data.
- Binarize data.
Your action step for this post is to type or copy-and-paste each recipe and get familiar with data preprocessing in scikit-learn.