In the former article "Data Preparation by Pandas and Scikit-Learn", we discussed about a series of steps in data preparation. Scikit-Learn provides the Pipeline class to help with such sequences of transformations.

The Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers (i.e., they must have a fit_transform() method). The names can be anything you like.

When you call the pipeline’s fit() method, it calls fit_transform() sequentially on all transformers, passing the output of each call as the parameter to the next call, until it reaches the final estimator, for which it just calls the fit() method.

The pipeline exposes the same methods as the final estimator. In this example, the last estimator is a StandardScaler, which is a transformer, so the pipeline has a transform() method that applies all the transforms to the data in sequence (it also has a fit_transform method that we could have used instead of calling fit() and then transform()).

 from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler num_pipeline = Pipeline([
('imputer', SimpleImputer(strategy="median")),
('attribs_adder', CombinedAttributesAdder()),
('std_scaler', StandardScaler()),
]) try:
from sklearn.compose import ColumnTransformer
except ImportError:
from future_encoders import ColumnTransformer # Scikit-Learn < 0.20 num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"] full_pipeline = ColumnTransformer([
("num", num_pipeline, num_attribs),
("cat", OneHotEncoder(), cat_attribs),
]) housing_prepared = full_pipeline.fit_transform(housing)

[Machine Learning with Python] Data Preparation through Transformation Pipeline的更多相关文章

  1. [Machine Learning with Python] Data Preparation by Pandas and Scikit-Learn

    In this article, we dicuss some main steps in data preparation. Drop Labels Firstly, we drop labels ...

  2. [Machine Learning with Python] Data Visualization by Matplotlib Library

    Before you can plot anything, you need to specify which backend Matplotlib should use. The simplest ...

  3. Python (1) - 7 Steps to Mastering Machine Learning With Python

    Step 1: Basic Python Skills install Anacondaincluding numpy, scikit-learn, and matplotlib Step 2: Fo ...

  4. Coursera, Big Data 4, Machine Learning With Big Data (week 1/2)

    Week 1 Machine Learning with Big Data KNime - GUI based Spark MLlib - inside Spark CRISP-DM Week 2, ...

  5. Getting started with machine learning in Python

    Getting started with machine learning in Python Machine learning is a field that uses algorithms to ...

  6. 《Learning scikit-learn Machine Learning in Python》chapter1

    前言 由于实验原因,准备入坑 python 机器学习,而 python 机器学习常用的包就是 scikit-learn ,准备先了解一下这个工具.在这里搜了有 scikit-learn 关键字的书,找 ...

  7. 【Machine Learning】Python开发工具:Anaconda+Sublime

    Python开发工具:Anaconda+Sublime 作者:白宁超 2016年12月23日21:24:51 摘要:随着机器学习和深度学习的热潮,各种图书层出不穷.然而多数是基础理论知识介绍,缺乏实现 ...

  8. In machine learning, is more data always better than better algorithms?

    In machine learning, is more data always better than better algorithms? No. There are times when mor ...

  9. Machine Learning的Python环境设置

    Machine Learning目前经常使用的语言有Python.R和MATLAB.如果采用Python,需要安装大量的数学相关和Machine Learning的包.一般安装Anaconda,可以把 ...

随机推荐

  1. 信号量和互斥量C语言示例理解线程同步

    Table of Contents 1. 线程同步 1.1. 用信号量进行同步 1.2. 用互斥量进行同步 2. 参考资料 线程同步 了解线程信号量的基础知识,对深入理解python的线程会大有帮助. ...

  2. postgreysql

    基础 syntax * \help 生成所有的pg命令 * abort 终止事务/work * alter aggregate 修改聚合函数的定义 ALTER AGGREGATE name ( typ ...

  3. 关于RelativeLayout设置垂直居中对齐不起作用的问题

    直接上代码 1.原有代码:(红色字体部分不起作用,无法让RelativeLayout中的textview居中) <RelativeLayout Android:id="@+id/aut ...

  4. loj2042 「CQOI2016」不同的最小割

    分治+最小割 看到题解的第一句话是这个就秒懂了,然后乱七八糟的错误.越界.RE-- #include <algorithm> #include <iostream> #incl ...

  5. ogre3D学习基础12 --- 让机器人动起来(移动模型动画)

    学了那么长时间,才学会跑起来.My Ogre,动起来. 第一,还是要把框架搭起来,这里我们用到双端队列deque,前面已经简单介绍过,头文件如下: #include "ExampleAppl ...

  6. [python][django学习篇][12]继续设计博客首页,点击博客标题能显示文章的详情

    回顾一下开发流程:配置url, 编写视图函数,编写对应模板 配置URL 首页视图匹配的 URL 去掉域名后,是一个空的字符串.每篇文章的详情有着不同的 URL,因此可以设计文章详情页面URl:< ...

  7. Python之threading多线程

    1.threading模块是Python里面常用的线程模块,多线程处理任务对于提升效率非常重要,先说一下线程和进程的各种区别,如图 概括起来就是 IO密集型(不用CPU) 多线程计算密集型(用CPU) ...

  8. [CQOI2016][bzoj4519] 不同的最小割 [最小割树]

    题面 传送门 思路 首先我们明确一点:这道题不是让你把$n^2$个最小割跑一遍[废话] 但是最小割过程是必要的,因为最小割并没有别的效率更高的算法(Stoer-Wagner之类的?) 那我们就要尽量找 ...

  9. 笔记 docker入门笔记

    安装sudo apt-get remove docker docker-engine docker-ce docker.iosudo apt-get updatesudo apt-get instal ...

  10. linux下头文件

    aio.h 异步I/Oassert.h 验证程序断言complex 复数类complex.h 复数处理cpio.h cpio归档值ctype.h 字符类型dirent.h 目录项,opendir(), ...