Recently, I was writing module of feature engineering, i found two excellently packages -- tsfresh and sklearn.

tsfresh has been specialized for data of time series, tsfresh mainly include two modules, feature extract, and feature select:

 from tsfresh import feature_selection, feature_extraction

To limit the number of irrelevant features tsfresh deploys the fresh algorithms. The whole process consists of three steps.

Firstly.  the algorithm characterizes time series with comprehensive and well-established feature mappings. the feature calculators used to derive the features are contained in tsfresh.feature_extraction.feature_calculators.

In a second step, each extracted feature vector is individually and evaluated with respect to its significance for predicting the target under investigation, those tests are contained in submodule tsfresh.feature_selection.significance_tests. the result of a significance test is a vector of p-value, quantifying the significance of each feature for predicting the target.

Finally, the vector of p-value is evaluated base on basis of the Benjamini-Yekutieli procedure in order to decide which feature could keep.

In summary, the tsfresh is a scalable and efficiency tool of feature engineering.

although the function of tsfresh was powerful, i choose sklearn.

I download data which is the heart disease data set. the data set target is binary and has a 13 dimension feature, I was just used MinMaxScaler to transform age,trestbps,chol three columns, the model had a choiced ensemble of AutoSklearnClassifer and ensemble of RandomForest. but bad performance for two models.

from sklearn.preprocessing import MinMaxScaler,StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from numpy import set_printoptions, inf
set_printoptions(threshold=inf)
import pandas as pd
data = pd.read_csv("../data_set/heart.csv")
X = data[data.columns[:data.shape[1] - 1]].values
y = data[data.columns[-1]].values data = MinMaxScaler().fit_transform(X[:, [0, 3, 4, 7]])
X[:, [0, 3, 4, 7]] = data
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0) from autosklearn.classification import AutoSklearnClassifier
model_auto = AutoSklearnClassifier(time_left_for_this_task=120, n_jobs=3, include_preprocessors=["no_preprocessing"], seed=3)
model_auto.fit(x_train, y_train) from sklearn.metrics import accuracy_score
y_pred = model_auto.predict(x_test)
accuracy_score(y_test, y_pred) >>> 0.8021978021978022 from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=500)
y_pred_rf = model.predict(x_test)
accuracy_score(y_test, y_pred_rf) >>> 0.8051648351648352

My personal web site which provides automl service, I upload this data set to my service, it gets a better score than my code: http://simple-code.cn/

The sklearn preprocessing的更多相关文章

  1. 数据规范化——sklearn.preprocessing

    sklearn实现---归类为5大类 sklearn.preprocessing.scale()(最常用,易受异常值影响) sklearn.preprocessing.StandardScaler() ...

  2. 【sklearn】数据预处理 sklearn.preprocessing

    数据预处理 标准化 (Standardization) 规范化(Normalization) 二值化 分类特征编码 推定缺失数据 生成多项式特征 定制转换器 1. 标准化Standardization ...

  3. sklearn.preprocessing.LabelBinarizer

    sklearn.preprocessing.LabelBinarizer

  4. sklearn.preprocessing.LabelEncoder的使用

    在训练模型之前,我们通常都要对训练数据进行一定的处理.将类别编号就是一种常用的处理方法,比如把类别"男","女"编号为0和1.可以使用sklearn.prepr ...

  5. sklearn preprocessing (预处理)

    预处理的几种方法:标准化.数据最大最小缩放处理.正则化.特征二值化和数据缺失值处理. 知识回顾: p-范数:先算绝对值的p次方,再求和,再开p次方. 数据标准化:尽量将数据转化为均值为0,方差为1的数 ...

  6. 11.sklearn.preprocessing.LabelEncoder的作用

    In [5]: from sklearn import preprocessing ...: le =preprocessing.LabelEncoder() ...: le.fit(["p ...

  7. sklearn学习笔记(一)——数据预处理 sklearn.preprocessing

    https://blog.csdn.net/zhangyang10d/article/details/53418227 数据预处理 sklearn.preprocessing 标准化 (Standar ...

  8. sklearn.preprocessing.StandardScaler 离线使用 不使用pickle如何做

    Having said that, you can query sklearn.preprocessing.StandardScaler for the fit parameters: scale_ ...

  9. sklearn.preprocessing OneHotEncoder——仅仅是数值型字段才可以,如果是字符类型字段则不能直接搞定

    >>> from sklearn.preprocessing import OneHotEncoder >>> enc = OneHotEncoder() > ...

  10. pandas 下的 one hot encoder 及 pd.get_dummies() 与 sklearn.preprocessing 下的 OneHotEncoder 的区别

    sklearn.preprocessing 下除了提供 OneHotEncoder 还提供 LabelEncoder(简单地将 categorical labels 转换为不同的数字): 1. 简单区 ...

随机推荐

  1. Webpack之optimization.splitChunks代码分割插件的配置

    SplitChunkPlugin插件配置参数详解 对引入的库代码(例如:lodash.jQuery等)进行代码的分割进行优化 若配置时只写chunks:"all",其余则为默认配置 ...

  2. pikachu-越权漏洞(Over Permission)

    一.越权漏洞概述 1.1 概述     由于没有用户权限进行严格的判断,导致低权限的账户(例如普通用户)可以去完成高权限账户(例如管理员账户)范围内的操作. 1.2 越权漏洞的分类 (1)平行越权   ...

  3. Android埋点方案的简单实现-AOP之AspectJ

    个人博客 http://www.milovetingting.cn Android埋点方案的简单实现-AOP之AspectJ AOP的定义 AOP为Aspect Oriented Programmin ...

  4. AndroidStudio报错:Could not download gradle.jar:No cacahed version available for offline mode

    场景 在讲Android Studio 的.gradle目录从默认C盘修改为 别的目录后,新建app提示: Could not download gradle.jar:No cacahed versi ...

  5. docker jenkins 前端node项目 自动化部署异常 env: ‘node’: No such file or directory

    出现问题是docker jenkins 里面没有自动安装node导致找不到这个Node命令 解决方案:手动安装nodejs # 进入jenkins对应容器中 # docker exec -it [对应 ...

  6. Maven 父子工程的一些细节

    Project,项目,也叫做工程. 父子工程中,子模块会自动继承父工程的资源.依赖,但子模块之间是独立的,不能直接访问彼此中的资源.类. 就是说我们可以把多个子模块都要用的资源.依赖提出来,放到父工程 ...

  7. ORACLE ANALYZE使用小结

      ANALYZE的介绍     使用ANALYZE可以收集或删除对象的统计信息.验证对象的结构.标识表或cluster中的行迁移/行链接信息等.官方文档关于ANALYZE功能介绍如下: ·      ...

  8. 0x01 C语言-编写第一个hello world

    学习每一个编程语言都是从 "Hello world!" 开始的,这好像就是编程界一条不成文的规定一样. 在这篇文章中,我将教大家编写一个可以输出 "Hello world ...

  9. .NET CORE(C#) WPF 重新设计Instagram

    微信公众号:Dotnet9,网站:Dotnet9,问题或建议:请网站留言, 如果对您有所帮助:欢迎赞赏. .NET CORE(C#) WPF 重新设计Instagram 阅读导航 本文背景 代码实现 ...

  10. ubuntu18.04误删apt-get命令恢复总结

    1.背景 由于使用aptitude命令替换了apt-get命令后感到后悔,想要恢复apt-get命令,特此总结以下踩过的坑 aptitude和apt-get的区别:https://www.cnblog ...