The sklearn preprocessing
Recently, I was writing module of feature engineering, i found two excellently packages -- tsfresh and sklearn.
tsfresh has been specialized for data of time series, tsfresh mainly include two modules, feature extract, and feature select:
from tsfresh import feature_selection, feature_extraction
To limit the number of irrelevant features tsfresh deploys the fresh algorithms. The whole process consists of three steps.
Firstly. the algorithm characterizes time series with comprehensive and well-established feature mappings. the feature calculators used to derive the features are contained in tsfresh.feature_extraction.feature_calculators.
In a second step, each extracted feature vector is individually and evaluated with respect to its significance for predicting the target under investigation, those tests are contained in submodule tsfresh.feature_selection.significance_tests. the result of a significance test is a vector of p-value, quantifying the significance of each feature for predicting the target.
Finally, the vector of p-value is evaluated base on basis of the Benjamini-Yekutieli procedure in order to decide which feature could keep.
In summary, the tsfresh is a scalable and efficiency tool of feature engineering.
although the function of tsfresh was powerful, i choose sklearn.
I download data which is the heart disease data set. the data set target is binary and has a 13 dimension feature, I was just used MinMaxScaler to transform age,trestbps,chol three columns, the model had a choiced ensemble of AutoSklearnClassifer and ensemble of RandomForest. but bad performance for two models.
from sklearn.preprocessing import MinMaxScaler,StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from numpy import set_printoptions, inf
set_printoptions(threshold=inf)
import pandas as pd
data = pd.read_csv("../data_set/heart.csv")
X = data[data.columns[:data.shape[1] - 1]].values
y = data[data.columns[-1]].values data = MinMaxScaler().fit_transform(X[:, [0, 3, 4, 7]])
X[:, [0, 3, 4, 7]] = data
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0) from autosklearn.classification import AutoSklearnClassifier
model_auto = AutoSklearnClassifier(time_left_for_this_task=120, n_jobs=3, include_preprocessors=["no_preprocessing"], seed=3)
model_auto.fit(x_train, y_train) from sklearn.metrics import accuracy_score
y_pred = model_auto.predict(x_test)
accuracy_score(y_test, y_pred) >>> 0.8021978021978022 from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=500)
y_pred_rf = model.predict(x_test)
accuracy_score(y_test, y_pred_rf) >>> 0.8051648351648352
My personal web site which provides automl service, I upload this data set to my service, it gets a better score than my code: http://simple-code.cn/
The sklearn preprocessing的更多相关文章
- 数据规范化——sklearn.preprocessing
sklearn实现---归类为5大类 sklearn.preprocessing.scale()(最常用,易受异常值影响) sklearn.preprocessing.StandardScaler() ...
- 【sklearn】数据预处理 sklearn.preprocessing
数据预处理 标准化 (Standardization) 规范化(Normalization) 二值化 分类特征编码 推定缺失数据 生成多项式特征 定制转换器 1. 标准化Standardization ...
- sklearn.preprocessing.LabelBinarizer
sklearn.preprocessing.LabelBinarizer
- sklearn.preprocessing.LabelEncoder的使用
在训练模型之前,我们通常都要对训练数据进行一定的处理.将类别编号就是一种常用的处理方法,比如把类别"男","女"编号为0和1.可以使用sklearn.prepr ...
- sklearn preprocessing (预处理)
预处理的几种方法:标准化.数据最大最小缩放处理.正则化.特征二值化和数据缺失值处理. 知识回顾: p-范数:先算绝对值的p次方,再求和,再开p次方. 数据标准化:尽量将数据转化为均值为0,方差为1的数 ...
- 11.sklearn.preprocessing.LabelEncoder的作用
In [5]: from sklearn import preprocessing ...: le =preprocessing.LabelEncoder() ...: le.fit(["p ...
- sklearn学习笔记(一)——数据预处理 sklearn.preprocessing
https://blog.csdn.net/zhangyang10d/article/details/53418227 数据预处理 sklearn.preprocessing 标准化 (Standar ...
- sklearn.preprocessing.StandardScaler 离线使用 不使用pickle如何做
Having said that, you can query sklearn.preprocessing.StandardScaler for the fit parameters: scale_ ...
- sklearn.preprocessing OneHotEncoder——仅仅是数值型字段才可以,如果是字符类型字段则不能直接搞定
>>> from sklearn.preprocessing import OneHotEncoder >>> enc = OneHotEncoder() > ...
- pandas 下的 one hot encoder 及 pd.get_dummies() 与 sklearn.preprocessing 下的 OneHotEncoder 的区别
sklearn.preprocessing 下除了提供 OneHotEncoder 还提供 LabelEncoder(简单地将 categorical labels 转换为不同的数字): 1. 简单区 ...
随机推荐
- Spring(七)核心容器 - 钩子接口
目录 前言 1.Aware 系列接口 2.InitializingBean 3.BeanPostProcessor 4.BeanFactoryPostProcessor 5.ImportSelecto ...
- SAP 如何看某个TR是否传入了Q或者P系统?
SAP 如何看某个TR是否传入了Q或者P系统? 两种方式可以查询. 1)进入Q系统或者P系统.SE16,看表TPALOG, 输入请求号码, 执行,看记录里的字段TPSTAT_KEY是否为空,如果不为空 ...
- centos7使用fdisk:创建和维护MBR分区表
1.在VMware选择要添加硬盘的虚拟机,添加一块硬盘. 这样就有两块硬盘 2.重启虚拟机. cd /dev #dev是设备目录,下面有很多很多设备,其中就包含硬盘 ll grep | sd #过滤s ...
- linux中shell内置命令和外置命令
shell内置命令 无法通过which或者whereis去查找命令的位置 例如cd,cp这些命令是shell解释器内置的命令 当shell内置命令传入shell解释器,shell解释器通过内核获取相关 ...
- phpcms v9编辑器上传图片是否添加水印
第一步:给图片上传对话框里面添加是否添加水印的多选框,找到: satics/js/ckeditor/ckeditor.js 第17554行 (需要格式化,我用的NetBeans)修改为 functio ...
- day 16内置函数总结
reversed()l = [1,2,3,4,5]l.reverse()print(l) l = [1,2,3,4,5]l2 = reversed(l)reversed:更加节省内存资源print(l ...
- C# WPF简况(2/3)
微信公众号:Dotnet9,网站:Dotnet9,问题或建议:请网站留言, 如果对您有所帮助:欢迎赞赏. C# WPF简况(2/3) 阅读导航 本文背景 代码实现 本文参考 1.本文背景 承接上文(C ...
- Charles抓包工具的破解以及使用
一.破解 官网下载Charles 下载Charles.jar ,然后按照后在Charles→lib中替换掉Charles.jar 链接:https://pan.baidu.com/s/1XZ-aZI5 ...
- 关于Synchronized研伸扩展
代码1 synchronized方法 synchronized void method(){ .......... } 代码2 synchronized代码块 synchronized (obj){ ...
- Selenium实战(六)——数据驱动应用
一.数据驱动 由于大多数文章和资料都把“读取数据文件”看做数据驱动的标志,下面创建一个baidu_data.csv文件: 文件第一列为测试用例名称,第二列为搜索的关键字.接下来创建test_baidu ...