The sklearn preprocessing

Recently， I was writing module of feature engineering, i found two excellently packages -- tsfresh and sklearn.

tsfresh has been specialized for data of time series, tsfresh mainly include two modules, feature extract, and feature select:

 from tsfresh import feature_selection, feature_extraction

To limit the number of irrelevant features tsfresh deploys the fresh algorithms. The whole process consists of three steps.

Firstly. the algorithm characterizes time series with comprehensive and well-established feature mappings. the feature calculators used to derive the features are contained in tsfresh.feature_extraction.feature_calculators.

In a second step, each extracted feature vector is individually and evaluated with respect to its significance for predicting the target under investigation, those tests are contained in submodule tsfresh.feature_selection.significance_tests. the result of a significance test is a vector of p-value, quantifying the significance of each feature for predicting the target.

Finally, the vector of p-value is evaluated base on basis of the Benjamini-Yekutieli procedure in order to decide which feature could keep.

In summary, the tsfresh is a scalable and efficiency tool of feature engineering.

although the function of tsfresh was powerful, i choose sklearn.

I download data which is the heart disease data set. the data set target is binary and has a 13 dimension feature, I was just used MinMaxScaler to transform age,trestbps,chol three columns, the model had a choiced ensemble of AutoSklearnClassifer and ensemble of RandomForest. but bad performance for two models.

from sklearn.preprocessing import MinMaxScaler,StandardScaler

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split

from numpy import set_printoptions, inf

set_printoptions(threshold=inf)

import pandas as pd

data = pd.read_csv("../data_set/heart.csv")

X = data[data.columns[:data.shape[1] - 1]].values

y = data[data.columns[-1]].values

data = MinMaxScaler().fit_transform(X[:, [0, 3, 4, 7]])

X[:, [0, 3, 4, 7]] = data

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

from autosklearn.classification import AutoSklearnClassifier

model_auto = AutoSklearnClassifier(time_left_for_this_task=120, n_jobs=3, include_preprocessors=["no_preprocessing"], seed=3)

model_auto.fit(x_train, y_train)

from sklearn.metrics import accuracy_score

y_pred = model_auto.predict(x_test)

accuracy_score(y_test, y_pred)   >>> 0.8021978021978022

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=500)

y_pred_rf = model.predict(x_test)

accuracy_score(y_test, y_pred_rf) >>> 0.8051648351648352

My personal web site which provides automl service, I upload this data set to my service, it gets a better score than my code: http://simple-code.cn/

The sklearn preprocessing的更多相关文章

数据规范化——sklearn.preprocessing
sklearn实现---归类为5大类 sklearn.preprocessing.scale()(最常用,易受异常值影响) sklearn.preprocessing.StandardScaler() ...
【sklearn】数据预处理 sklearn.preprocessing
数据预处理标准化 (Standardization) 规范化(Normalization) 二值化分类特征编码推定缺失数据生成多项式特征定制转换器 1. 标准化Standardization ...
sklearn.preprocessing.LabelBinarizer
sklearn.preprocessing.LabelBinarizer
sklearn.preprocessing.LabelEncoder的使用
在训练模型之前,我们通常都要对训练数据进行一定的处理.将类别编号就是一种常用的处理方法,比如把类别"男","女"编号为0和1.可以使用sklearn.prepr ...
sklearn preprocessing （预处理）
预处理的几种方法:标准化.数据最大最小缩放处理.正则化.特征二值化和数据缺失值处理. 知识回顾: p-范数:先算绝对值的p次方,再求和,再开p次方. 数据标准化:尽量将数据转化为均值为0,方差为1的数 ...
11.sklearn.preprocessing.LabelEncoder的作用
In [5]: from sklearn import preprocessing ...: le =preprocessing.LabelEncoder() ...: le.fit(["p ...
sklearn学习笔记（一）——数据预处理 sklearn.preprocessing
https://blog.csdn.net/zhangyang10d/article/details/53418227 数据预处理 sklearn.preprocessing 标准化 (Standar ...
sklearn.preprocessing.StandardScaler 离线使用不使用pickle如何做
Having said that, you can query sklearn.preprocessing.StandardScaler for the fit parameters: scale_ ...
sklearn.preprocessing OneHotEncoder——仅仅是数值型字段才可以，如果是字符类型字段则不能直接搞定
>>> from sklearn.preprocessing import OneHotEncoder >>> enc = OneHotEncoder() > ...
pandas 下的 one hot encoder 及 pd.get_dummies() 与 sklearn.preprocessing 下的 OneHotEncoder 的区别
sklearn.preprocessing 下除了提供 OneHotEncoder 还提供 LabelEncoder(简单地将 categorical labels 转换为不同的数字): 1. 简单区 ...

随机推荐

java方法参数传递方式只有----值传递！
在通常的说法中,方法参数的传递分为两种,值传递和引用传递,值传递是指将实际参数复制一份传递到方法中, 在方法中的改动将不会影响到实际参数本身,而引用传递则是指传递的是实际参数本身,在方法中的改动将会影 ...
Python2-Django配置阿里大于的短信验证码接口
1.短信发送开发指南地址:https://help.aliyun.com/document_detail/55491.html?spm=a2c4g.11186623.6.568.l5zTwH 2.SD ...
Leetcode：110. 平衡二叉树
Leetcode:110. 平衡二叉树 Leetcode:110. 平衡二叉树点链接就能看到原题啦~ 关于AVL的判断函数写法,请跳转:平衡二叉树的判断废话不说直接上代码吧~主要的解析的都在上面的 ...
kms在线激活windows和office
本激活,只适用vol版本的windows系统和office 激活windows在windows中使用管理员方式打开cmd命令输入slmgr /skms chongking.com切换kms服务器地址为 ...
使用Nginx对.NetCore站点进行反向代理
前言之前的博客我已经在Linux上部署好了.NetCore站点且通过Supervisor对站点进行了进程守护,同时也安装好了Nginx.Nginx的用处非常大,还是简单说下,它最大的功能就是方便我们 ...
search（0）- 企业搜索，写在前面
计划研究一下搜索search,然后写个学习过程系列博客.开动之前先说说学习搜索的目的:不是想开发个什么搜索引擎,而是想用现成的搜索引擎在传统信息系统中引进搜索的概念和方法.对我来说,传统的管理系统le ...
伪造TGT黄金票据
通过上一篇文章我们初步了解了Kerberos协议的工作过程,解决的两个问题第一个问题:如何证明你本人是XXX用户的问题由Authentication Server负责第二个问题:提供服务的服 ...
「Flink」事件时间与水印
我们先来以滚动时间窗口为例,来看一下窗口的几个时间参数与Flink流处理系统时间特性的关系. 获取窗口开始时间Flink源代码获取窗口的开始时间为以下代码: org.apache.flink.str ...
SOA(Service-Oriented Architecture):面向服务的架构
SOA (Service-Oriented Architecture):面向服务的架构(SOA)是一个组件模型,它将应用程序的不同功能单元(称为服务)进行拆分,并通过这些服务之间定义良好的接口和协议联 ...
在视觉可视化中如何使用ScaleBreaks-比例中断
从lightningChart V8开始,这项图表控件产品开始支持X轴的Scale break功能. 这个功能的主要作用是排除选定的X轴范围,例如互动交易时间/日期或者机器停产时间等.如果有一部分的数 ...

The sklearn preprocessing

The sklearn preprocessing的更多相关文章

随机推荐

热门专题