In the former article "Data Preparation by Pandas and Scikit-Learn", we discussed about a series of steps in data preparation. Scikit-Learn provides the Pipeline class to help with such sequences of transformations.

The Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers (i.e., they must have a fit_transform() method). The names can be anything you like.

When you call the pipeline’s fit() method, it calls fit_transform() sequentially on all transformers, passing the output of each call as the parameter to the next call, until it reaches the final estimator, for which it just calls the fit() method.

The pipeline exposes the same methods as the final estimator. In this example, the last estimator is a StandardScaler, which is a transformer, so the pipeline has a transform() method that applies all the transforms to the data in sequence (it also has a fit_transform method that we could have used instead of calling fit() and then transform()).

 from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler num_pipeline = Pipeline([
('imputer', SimpleImputer(strategy="median")),
('attribs_adder', CombinedAttributesAdder()),
('std_scaler', StandardScaler()),
]) try:
from sklearn.compose import ColumnTransformer
except ImportError:
from future_encoders import ColumnTransformer # Scikit-Learn < 0.20 num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"] full_pipeline = ColumnTransformer([
("num", num_pipeline, num_attribs),
("cat", OneHotEncoder(), cat_attribs),
]) housing_prepared = full_pipeline.fit_transform(housing)

[Machine Learning with Python] Data Preparation through Transformation Pipeline的更多相关文章

  1. [Machine Learning with Python] Data Preparation by Pandas and Scikit-Learn

    In this article, we dicuss some main steps in data preparation. Drop Labels Firstly, we drop labels ...

  2. [Machine Learning with Python] Data Visualization by Matplotlib Library

    Before you can plot anything, you need to specify which backend Matplotlib should use. The simplest ...

  3. Python (1) - 7 Steps to Mastering Machine Learning With Python

    Step 1: Basic Python Skills install Anacondaincluding numpy, scikit-learn, and matplotlib Step 2: Fo ...

  4. Coursera, Big Data 4, Machine Learning With Big Data (week 1/2)

    Week 1 Machine Learning with Big Data KNime - GUI based Spark MLlib - inside Spark CRISP-DM Week 2, ...

  5. Getting started with machine learning in Python

    Getting started with machine learning in Python Machine learning is a field that uses algorithms to ...

  6. 《Learning scikit-learn Machine Learning in Python》chapter1

    前言 由于实验原因,准备入坑 python 机器学习,而 python 机器学习常用的包就是 scikit-learn ,准备先了解一下这个工具.在这里搜了有 scikit-learn 关键字的书,找 ...

  7. 【Machine Learning】Python开发工具:Anaconda+Sublime

    Python开发工具:Anaconda+Sublime 作者:白宁超 2016年12月23日21:24:51 摘要:随着机器学习和深度学习的热潮,各种图书层出不穷.然而多数是基础理论知识介绍,缺乏实现 ...

  8. In machine learning, is more data always better than better algorithms?

    In machine learning, is more data always better than better algorithms? No. There are times when mor ...

  9. Machine Learning的Python环境设置

    Machine Learning目前经常使用的语言有Python.R和MATLAB.如果采用Python,需要安装大量的数学相关和Machine Learning的包.一般安装Anaconda,可以把 ...

随机推荐

  1. [BZOJ1588]营业额统计(Splay)

    Description 题意:给定 n个数,每给定一个数,在之前的数里找一个与当前数相差最小的数,求相差之和(第一个数为它本身) 如:5 1 2 5 4 6 Ans=5+|1-5|+|2-1|+|5- ...

  2. PAT Basic 1083

    1083 是否存在相等的差 给定 N 张卡片,正面分别写上 1.2.…….N,然后全部翻面,洗牌,在背面分别写上 1.2.…….N.将每张牌的正反两面数字相减(大减小),得到 N 个非负差值,其中是否 ...

  3. Diycode开源项目 LoginActivity分析

    1.首先看一下效果 1.1.预览一下真实页面 1.2.分析一下: 要求输入Email或者用户名,点击编辑框,弹出键盘,默认先进入输入Email或用户名编辑框. 点击密码后,密码字样网上浮动一段距离,E ...

  4. Python基础闯关失败总结

    对列表进行创建切片增删改查 对列表进行创建 L1 = []  # 定义L1 为一个空列表 List() #创建List 空列表 对列表进行查询 L2 = ['a','b','c','d','a','e ...

  5. UVa 1455 Kingdom 线段树 并查集

    题意: 平面上有\(n\)个点,有一种操作和一种查询: \(road \, A \, B\):在\(a\),\(b\)两点之间加一条边 \(line C\):询问直线\(y=C\)经过的连通分量的个数 ...

  6. win8 远程桌面时提示凭证不工作问题的终极解决办法

    环境说明 远程办公电脑(放置于公司.自用办公电脑.win8系统) 远程连接客户机(放置于家中.家庭日常所用.win8系统) 故障现象 最近在使用远程桌面连接公司的办公电脑时,突然发现win8系统总是无 ...

  7. Hyper-V:利用差异磁盘安装多个Win2008

    签于成本的原因,在学习了解一项新的技术或是产品时,在没有部署到生产环境之中前,大家都会选择在虚拟机来搭建一套实验环境.但如何快速搭建呢?如何节省磁盘空间呢? 说到此都不得不说下Hyper-V的差异磁盘 ...

  8. Python之基于socket和select模块实现IO多路复用

    '''IO指的是输入输出,一部分指的是文件操作,还有一部分网络传输操作,例如soekct就是其中之一:多路复用指的是利用一种机制,同时使用多个IO,例如同时监听多个文件句柄(socket对象一旦传送或 ...

  9. 学习成绩>=90分的同学用A表示,60-89分之间的用B表示,60分以下的用利用条件运算符的嵌套来完成此题:C表示。

    # -*- coding: utf8 -*- # Author:wxq #python 2.7 #题目:学习成绩>=90分的同学用A表示,60-89分之间的用B表示,60分以下的用利用条件运算符 ...

  10. 课堂笔记III