Python data analysis | Data preprocessing in Python (with scikit-learn)

Source: http://www.jianshu.com/p/94516a58314d

Dataset transformations

Combining estimators

Feature extraction

Preprocessing data

1 Dataset transformations

scikit-learn provides a library of transformers, which may clean (see Preprocessing data), reduce (see Unsupervised dimensionality reduction), expand (see Kernel Approximation) or generate (see Feature extraction) feature representations.

In short, scikit-learn provides transformers for data cleaning, dimensionality reduction, feature expansion, and feature extraction.

Like other estimators, these are represented by classes with a fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method, which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and transforming the training data simultaneously.

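scikit-learn transformers share four common methods: fit(X, y=None), transform(X), fit_transform(X), and inverse_transform(newX). fit learns the model parameters; transform applies the learned transformation (for example, dimensionality reduction); fit_transform fits on X and then returns the transformed X; inverse_transform maps transformed data back to the original space, for transformers that support it.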
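The four methods can be tried out on a small array. This is a minimal sketch that uses StandardScaler as the transformer; the toy data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): 4 training samples, 2 features.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_new = np.array([[2.5, 25.0]])

scaler = StandardScaler()
scaler.fit(X_train)                         # learn per-feature mean and std
X_train_scaled = scaler.transform(X_train)  # apply them to the training data
X_new_scaled = scaler.transform(X_new)      # reuse the training statistics

# fit_transform is equivalent to fit followed by transform on the same data.
X_train_scaled_2 = StandardScaler().fit_transform(X_train)

# inverse_transform maps scaled data back to the original space.
X_restored = scaler.inverse_transform(X_train_scaled)

print(scaler.mean_)     # per-feature means learned by fit
print(X_new_scaled)     # the new sample expressed in training units
```

The new sample sits exactly at the training mean, so it is mapped to zeros; the point is that transform never re-estimates the statistics from the data it is given.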

1.1 combining estimators

1.1.1 Pipeline:chaining estimators

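The Pipeline class chains a sequence of estimators. It is convenient for a fixed series of steps, for example feature selection, then normalization, then classification.

Usage
Code: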

from sklearn.pipeline import Pipeline

from sklearn.svm import SVC

from sklearn.decomposition import PCA

from sklearn.pipeline import make_pipeline

#define estimators

#the argument is a list of (key, value) pairs, where the key is the name you give the step and the value is an estimator object

estimators=[('reduce_dim',PCA()),('svm',SVC())]

#combine estimators

clf1=Pipeline(estimators)

clf2=make_pipeline(PCA(),SVC())  #make_pipeline() builds the same pipeline and names the steps automatically

print(clf1,'\n',clf2)

Output:

Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None, whiten=False)), ('svm',           SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,

decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',

max_iter=-1, probability=False, random_state=None, shrinking=True,

tol=0.001, verbose=False))])

Pipeline(steps=[('pca', PCA(copy=True, n_components=None, whiten=False)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,

decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',

max_iter=-1, probability=False, random_state=None, shrinking=True,

tol=0.001, verbose=False))])

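Estimator parameters can be set with set_params(), using names of the form <estimator>__<parameter> (step name, two underscores, parameter name):

clf1.set_params(svm__C=10)

This naming scheme is especially useful for grid search:

#sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives in sklearn.model_selection

from sklearn.model_selection import GridSearchCV

params = dict(reduce_dim__n_components=[2, 5, 10],svm__C=[0.1, 10, 100])

grid_search = GridSearchCV(clf1, param_grid=params)

In this example the pipeline is treated as an ordinary estimator, and its nested parameters are again addressed as <estimator>__<parameter>.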

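Note
1. Use dir() to list all attributes and methods of clf1. For example, the steps attribute holds the (name, estimator) pair of each step:
>>> clf1.steps[0]
('reduce_dim', PCA(copy=True, n_components=None, whiten=False))
2. Calling fit on a pipeline fits each step in turn: every intermediate step transforms its input and passes the result to the next step. The pipeline exposes the methods of its last estimator: if the last step is a classifier, the pipeline can be used as a classifier; if the last step is a transformer, the pipeline is a transformer; and so on.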
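The set_params and grid-search steps above can be run end to end as follows. This is a sketch under assumptions not in the original: it uses the iris dataset, 3-fold cross-validation, and a smaller n_components grid, because iris has only four features.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

pipe = Pipeline([('reduce_dim', PCA()), ('svm', SVC())])

# Nested parameters use <step name>__<parameter name> (two underscores).
pipe.set_params(svm__C=10)

# iris has only 4 features, so n_components must stay <= 4.
param_grid = {
    'reduce_dim__n_components': [2, 3],
    'svm__C': [0.1, 10, 100],
}
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid_search.fit(X, y)

print(grid_search.best_params_)   # best combination found on this data
print(grid_search.best_score_)    # mean cross-validated accuracy
```

Because the whole pipeline is refit inside each cross-validation fold, PCA is always fitted on the training part of the fold only, which avoids leaking information from the held-out part.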

1.1.2 FeatureUnion: composite feature spaces

Unlike Pipeline, FeatureUnion combines only transformers. Unions and pipelines can in turn be combined into more complex models.

FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. For transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated end-to-end into larger vectors.

Usage
Code:

from sklearn.pipeline import FeatureUnion

from sklearn.decomposition import PCA

from sklearn.decomposition import KernelPCA

from sklearn.pipeline import make_union

#define transformers

#the argument is a list of (key, value) pairs, where the key is the name you give the step and the value is a transformer object

estimators=[('linear_pca',PCA()),('kernel_pca',KernelPCA())]

#combine transformers

clf1=FeatureUnion(estimators)

clf2=make_union(PCA(),KernelPCA())

print(clf1,'\n',clf2)

print(dir(clf1))  #lists all attributes and methods (output omitted below)

Output:

FeatureUnion(n_jobs=1,

   transformer_list=[('linear_pca', PCA(copy=True, n_components=None, whiten=False)), ('kernel_pca', KernelPCA(alpha=1.0, coef0=1, degree=3, eigen_solver='auto',

 fit_inverse_transform=False, gamma=None, kernel='linear',

 kernel_params=None, max_iter=None, n_components=None,

 remove_zero_eig=False, tol=0))],

   transformer_weights=None)

FeatureUnion(n_jobs=1,

   transformer_list=[('pca', PCA(copy=True, n_components=None, whiten=False)), ('kernelpca', KernelPCA(alpha=1.0, coef0=1, degree=3, eigen_solver='auto',

 fit_inverse_transform=False, gamma=None, kernel='linear',

 kernel_params=None, max_iter=None, n_components=None,

 remove_zero_eig=False, tol=0))],

   transformer_weights=None)

As the output shows, FeatureUnion is used in the same way as Pipeline.

Note

(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)

Here is an example Python source file: feature_stacker.py
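The concatenation is easiest to see in the shape of the output. The sketch below stacks two PCA components next to one selected original feature; the iris dataset and the choice of SelectKBest are assumptions made for illustration and do not come from the code above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)  # shape (150, 4)

union = FeatureUnion([
    ('pca', PCA(n_components=2)),   # contributes 2 columns
    ('select', SelectKBest(k=1)),   # contributes 1 column
])

# Each transformer is fitted independently on the same X (and y);
# their outputs are then concatenated column-wise.
X_combined = union.fit(X, y).transform(X)
print(X_combined.shape)  # (150, 3)
```

The result has 2 + 1 = 3 columns, one block per transformer in the order they were listed.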

1.2 Feature extraction

The sklearn.feature_extraction module can be used to extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text and image.

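In other words, sklearn.feature_extraction turns raw data such as text and images into the numerical feature format that learning algorithms accept.
Note:
Feature extraction is different from feature selection. The former converts arbitrary, often non-numeric data into numerical features; the latter is a machine learning technique applied to those features, such as keeping only the most informative columns. (PCA, by contrast, is dimensionality reduction: it builds new features rather than selecting existing ones.)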

1.2.1 Loading features from dicts

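The class DictVectorizer can be used to convert feature arrays represented as lists of standard Python dict objects to the NumPy/SciPy representation used by scikit-learn estimators.
DictVectorizer converts Python's built-in dict type into a numerical array; string-valued entries such as city are one-hot encoded. A benefit of the dict representation is that sparse data can be stored without the useless (absent) values.

Code: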

measurements=[{'city': 'Dubai', 'temperature': 33.}

,{'city': 'London', 'temperature':12.}

,{'city':'San Fransisco','temperature':18.},]

from sklearn.feature_extraction import DictVectorizer

vec=DictVectorizer()

x=vec.fit_transform(measurements).toarray()

print(x)

print(vec.get_feature_names())  #in scikit-learn 1.0 and later, use get_feature_names_out() instead

Output:

[[  1.   0.   0.  33.]

[  0.   1.   0.  12.]

[  0.   0.   1.  18.]]

['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']


1.2.2 Feature hashing
1.2.3 Text feature extraction
1.2.4 Image feature extraction

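The three subsections above are not covered here (they involve natural-language and image processing); see the official documentation.

1.3 Preprocessing data

The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.

The sklearn.preprocessing module covers the common data transformations, such as standardization and normalization.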

1.3.1 Standardization, or mean removal and variance scaling

Standardization of datasets is a common requirement for many machine learning estimators implemented in the scikit; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

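Many learning algorithms expect standardized input. If the features do not look roughly like standard normal data (zero mean, unit variance), the estimator may perform poorly.
- Usage
Code: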
```
from sklearn import preprocessing
import numpy as np

X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# scale() standardizes each feature (column) by default (axis=0);
# with axis=1 it would standardize each sample (row) instead.
X_scaled = preprocessing.scale(X)

# After scaling, every feature has (approximately) zero mean and unit standard deviation.
x_mean = X_scaled.mean(axis=0)
x_std = X_scaled.std(axis=0)

print(X_scaled)

# The StandardScaler class does the same job and remembers the training statistics.
scaler = preprocessing.StandardScaler().fit(X)
print(scaler.transform(X))
```
Output:
```
[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
```
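- Note
  1. The scale function standardizes an array in a single call.
  2. The StandardScaler class learns the mean and standard deviation on a training set so that the same transformation can be reapplied later, for example to a test set.
  3. StandardScaler is a transformer, so it can be used as a step in a Pipeline. MinMaxScaler (min-max scaling to [0, 1]) and MaxAbsScaler (scaling to [-1, 1]) are two other scaling classes, used in the same way as StandardScaler.
  4. For sparse data, MaxAbsScaler is the natural choice: it only divides by the maximum absolute value, so zero entries stay zero. MinMaxScaler subtracts the minimum, which generally destroys sparsity.
  5. Robust scaling (for data with many outliers): the median and the interquartile range often give better results than the mean and variance. The data is centered on the median instead of the mean (so the median becomes 0) and divided by the interquartile range, the upper quartile minus the lower quartile (so the IQR becomes 1). RobustScaler implements this.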
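The scalers mentioned in the notes can be compared side by side. This is a minimal sketch; the data, including the deliberate outlier in the second column, is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, RobustScaler

# Illustrative data; the last value of the second column is an outlier.
X = np.array([[1.0, -1.0],
              [2.0, 0.0],
              [0.0, 1.0],
              [1.0, 100.0]])

X_minmax = MinMaxScaler().fit_transform(X)  # each column mapped to [0, 1]
X_maxabs = MaxAbsScaler().fit_transform(X)  # each column divided by max |x|
X_robust = RobustScaler().fit_transform(X)  # (x - median) / IQR per column

print(X_minmax.min(axis=0), X_minmax.max(axis=0))  # column minima and maxima
print(np.abs(X_maxabs).max(axis=0))                # largest magnitude per column
print(np.median(X_robust, axis=0))                 # column medians after scaling
```

With MinMaxScaler the single outlier squeezes the other values of the second column close to 0, whereas RobustScaler keeps them spread out because the median and IQR barely react to one extreme value.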

1.3.2 Imputation of missing values

Usage
Code:

import scipy.sparse as sp

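#Imputer was removed in scikit-learn 0.22 (replaced by sklearn.impute.SimpleImputer); this example targets the older API

from sklearn.preprocessing import Imputer

X=sp.csc_matrix([[1,2],[0,3],[7,6]])

#missing values are encoded as 0 and replaced with the column mean

imp=Imputer(missing_values=0,strategy='mean',axis=0)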

imp.fit(X)

X_test=sp.csc_matrix([[0, 2], [6, 0], [7, 6]])

print(X_test)

print(imp.transform(X_test))

Output:

(1, 0)    6

(2, 0)    7

(0, 1)    2

(2, 1)    6

[[ 4.          2.        ]

[ 6.          3.66666675]

[ 7.          6.        ]]


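Note
1. scipy.sparse stores sparse matrices.
2. Imputer accepts scipy.sparse matrices as input.
3. The imputed values are the column means learned from X with the zeros (the missing values) excluded: (1 + 7) / 2 = 4 for the first column and (2 + 3 + 6) / 3 ≈ 3.67 for the second.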
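On current scikit-learn versions the same imputation can be sketched with SimpleImputer. Two things differ from the code above and are assumptions of this sketch: the data is dense, and missing values are encoded as NaN rather than 0.

```python
import numpy as np
from sklearn.impute import SimpleImputer  # available since scikit-learn 0.20

# Same numbers as above, with NaN marking the missing entries.
X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
X_test = np.array([[np.nan, 2.0], [6.0, np.nan], [7.0, 6.0]])

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X_train)  # column means: (1 + 7) / 2 and (2 + 3 + 6) / 3

X_filled = imp.transform(X_test)

print(imp.statistics_)  # the learned fill value for each column
print(X_filled)
```

SimpleImputer always works column by column, so there is no counterpart to the old axis argument.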

1.3.3 Generating polynomial features

Usage
Code:

import numpy as np

from sklearn.preprocessing import PolynomialFeatures

X=np.arange(6).reshape(3,2)

print(X)

poly=PolynomialFeatures(2)

print(poly.fit_transform(X))

Output:

[[0 1]

[2 3]

[4 5]]

[[  1.   0.   1.   0.   0.   1.]

[  1.   2.   3.   4.   6.   9.]

[  1.   4.   5.  16.  20.  25.]]


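Note
Polynomial features are used in polynomial regression and in polynomial kernel methods. For an input (X1, X2) and degree 2, the output columns are 1, X1, X2, X1^2, X1*X2, X2^2, which matches the output above.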
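Polynomial regression can be sketched by chaining PolynomialFeatures with a linear model. The cubic target below is made up for illustration; because the data is generated without noise, the fitted coefficients should recover the polynomial.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noise-free samples of y = 3 - 2x + x^2 - x^3 (illustrative target).
x = np.arange(5)
y = 3 - 2 * x + x ** 2 - x ** 3

model = Pipeline([
    ('poly', PolynomialFeatures(degree=3)),             # columns 1, x, x^2, x^3
    ('linear', LinearRegression(fit_intercept=False)),  # the bias column acts as intercept
])
model.fit(x[:, np.newaxis], y)

coef = model.named_steps['linear'].coef_
print(np.round(coef, 6))  # coefficients for 1, x, x^2, x^3
```

fit_intercept=False is used because PolynomialFeatures already adds a constant column; keeping both would make the intercept redundant.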

1.3.4 Custom transformers

FunctionTransformer builds a transformer from an arbitrary function.

Usage:
Code:

import numpy as np

from sklearn.preprocessing import FunctionTransformer

transformer = FunctionTransformer(np.log1p)

x=np.array([[0,1],[2,3]])

print(transformer.transform(x))

Output:

[[ 0.          0.69314718]

[ 1.09861229  1.38629436]]


Note

For a full code example that demonstrates using a FunctionTransformer to do custom feature selection, see Using FunctionTransformer to select columns
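Two further uses of FunctionTransformer can be sketched briefly: supplying an inverse function so the transformation can be undone, and wrapping a small column selector. The helper first_column is hypothetical and written only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def first_column(X):
    """Hypothetical helper: keep only the first column of a 2-D array."""
    return X[:, :1]


# log1p and expm1 are exact inverses of each other.
log_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)
selector = FunctionTransformer(first_column)

X = np.array([[0.0, 1.0], [2.0, 3.0]])

X_log = log_transformer.fit_transform(X)
X_back = log_transformer.inverse_transform(X_log)  # recovers X
X_first = selector.fit_transform(X)

print(X_back)
print(X_first.shape)  # (2, 1)
```

Since both objects are ordinary transformers, they can be placed in a Pipeline or FeatureUnion like any other step.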

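By houhzize (Jianshu author)
Original link: http://www.jianshu.com/p/94516a58314d
Copyright belongs to the author. To reprint, contact the author for permission and credit the "Jianshu author".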
