[Machine Learning with Python] My First Data Preprocessing Pipeline with Titanic Dataset

The dataset comes from the Kaggle Titanic competition: https://www.kaggle.com/c/titanic

For data preprocessing, I first defined three transformers:

DataFrameSelector: Selects the features to handle.
CombinedAttributesAdder: Adds a categorical feature Age_Cat, which divides all passengers into three categories (child, adult, old) according to their ages.
ImputeMostFrequent: Since the mean and median strategies of SimpleImputer only work for numerical variables, I wrote a transformer to impute missing string values with the mode. Here I was inspired by https://stackoverflow.com/questions/25239958/impute-categorical-missing-values-in-scikit-learn.
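
To make the last two transformers more concrete, here is a minimal sketch of the two operations they rely on, run on a made-up toy frame (the `toy` data is an assumption for illustration, not part of the Titanic files): binning ages with `pd.cut` and filling missing strings with the most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'Age': [4, 30, 71, np.nan],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

# Bins are right-inclusive: (0, 18] -> child, (18, 60] -> adult, (60, 100] -> old
age_cat = pd.cut(toy['Age'], [0, 18, 60, 100], labels=['child', 'adult', 'old'])
print(age_cat.tolist())  # a missing age stays missing after binning

# value_counts() ignores NaN and sorts by frequency, so index[0] is the mode
mode = toy['Embarked'].value_counts().index[0]
embarked_filled = toy['Embarked'].fillna(mode)
print(embarked_filled.tolist())
```

Because a missing age produces a missing category, the derived feature still needs its own imputation step afterwards.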

Then I wrote separate pipelines for the different kinds of features:

For numerical features, I applied DataFrameSelector, SimpleImputer and StandardScaler.
For categorical features, I applied DataFrameSelector, ImputeMostFrequent and OneHotEncoder.
For the newly created feature Age_Cat, which is categorical but derived from a numerical feature, I wrote a separate pipeline that imputes the missing values and encodes the categories.
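
As a rough illustration of what the numerical pipeline does, the sketch below runs the imputer and scaler on a small made-up array (the values in `toy_num` are assumptions, and the selector step is left out because the input is already a plain 2D array).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical values standing in for the Age and Fare columns
toy_num = np.array([
    [22.0, 7.25],
    [np.nan, 71.28],
    [35.0, 8.05],
    [58.0, 26.55],
])

toy_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # fill NaN with the column median
    ('std_scaler', StandardScaler()),               # rescale to zero mean, unit variance
])

toy_scaled = toy_pipeline.fit_transform(toy_num)
print(toy_pipeline.named_steps['imputer'].statistics_)  # medians learned per column
print(toy_scaled.mean(axis=0), toy_scaled.std(axis=0))
```

The missing age is replaced by the median of the observed ages before scaling, so the scaler never sees a NaN.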

Finally, we can build a full pipeline through FeatureUnion. Here is the code:

 # Read data
 import pandas as pd
 import numpy as np
 import os

 titanic_train = pd.read_csv('Dataset/Titanic/train.csv')
 titanic_test = pd.read_csv('Dataset/Titanic/test.csv')
 submission = pd.read_csv('Dataset/Titanic/gender_submission.csv')

 # Divide attributes and labels
 titanic_labels = titanic_train['Survived'].copy()
 titanic = titanic_train.drop(['Survived'], axis=1)

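 # Feature Selection
 from sklearn.base import BaseEstimator, TransformerMixin

 class DataFrameSelector(BaseEstimator, TransformerMixin):
     def __init__(self, attribute_name):
         self.attribute_name = attribute_name

     def fit(self, X, y=None):
         # y=None keeps fit() usable when labels are passed through the pipeline
         return self

     def transform(self, X, y=None):
         # Work on a copy so the caller's DataFrame is not modified
         X = X[self.attribute_name].copy()
         if 'Pclass' in self.attribute_name:
             # Pclass is a class code, so treat it as a category, not a number
             X['Pclass'] = X['Pclass'].astype(str)
         return X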

 # Feature Creation
 class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
     def fit(self, X, y=None):
         return self  # nothing else to do

     def transform(self, X, y=None):
         # Bins are right-inclusive: (0, 18] child, (18, 60] adult, (60, 100] old
         Age_cat = pd.cut(X['Age'], [0, 18, 60, 100], labels=['child', 'adult', 'old'])
         # Convert to a plain array; missing ages stay NaN and are imputed later
         Age_cat = np.array(Age_cat)
         return pd.DataFrame(Age_cat, columns=['Age_Cat'])

 # Impute Categorical variables
 class ImputeMostFrequent(BaseEstimator, TransformerMixin):
     def fit(self, X, y=None):
         # value_counts() ignores NaN and sorts by frequency, so index[0] is the mode
         self.fill = pd.Series([X[c].value_counts().index[0] for c in X], index=X.columns)
         return self

     def transform(self, X, y=None):
         # Fill each column with the mode learned in fit()
         return X.fillna(self.fill)

 # Pipeline
 from sklearn.impute import SimpleImputer  # Scikit-Learn 0.20+
 from sklearn.pipeline import Pipeline
 from sklearn.preprocessing import StandardScaler
 from sklearn.preprocessing import OneHotEncoder
 from sklearn.pipeline import FeatureUnion

 num_pipeline = Pipeline([
     ('selector', DataFrameSelector(['Age', 'SibSp', 'Parch', 'Fare'])),
     ('imputer', SimpleImputer(strategy="median")),
     ('std_scaler', StandardScaler()),
 ])

 cat_pipeline = Pipeline([
     ('selector', DataFrameSelector(['Pclass', 'Sex', 'Embarked'])),
     ('imputer', ImputeMostFrequent()),
     ('encoder', OneHotEncoder()),
 ])

 new_pipeline = Pipeline([
     ('selector', DataFrameSelector(['Age'])),
     #('imputer', SimpleImputer(strategy="median")),
     ('attr_adder', CombinedAttributesAdder()),
     ('imputer', ImputeMostFrequent()),
     ('encoder', OneHotEncoder()),
 ])

 full_pipeline = FeatureUnion([
     ("num", num_pipeline),
     ("cat", cat_pipeline),
     ("new", new_pipeline),
 ])

 titanic_prepared = full_pipeline.fit_transform(titanic)
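
For comparison only, and not the approach used in this post: scikit-learn 0.20+ also ships ColumnTransformer, which selects columns by name and so can stand in for the DataFrameSelector plus FeatureUnion combination. The sketch below is a minimal version on a made-up toy frame (the `toy` data and the choice of SimpleImputer(strategy='most_frequent') for the string column are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing value per kind of column
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, 58.0],
    'Fare': [7.25, 71.28, 8.05, 26.55],
    'Sex': ['male', 'female', 'female', np.nan],
})

num_steps = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])
cat_steps = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # mode imputation for strings
    ('encoder', OneHotEncoder()),
])

# Each entry is (name, transformer, columns to feed into it)
preprocessor = ColumnTransformer([
    ('num', num_steps, ['Age', 'Fare']),
    ('cat', cat_steps, ['Sex']),
])

toy_prepared = preprocessor.fit_transform(toy)
print(toy_prepared.shape)  # 2 scaled numeric columns + 2 one-hot columns
```

The outputs of the two branches are concatenated side by side, just as FeatureUnion does for the three pipelines above.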

Another thing I want to mention is that scikit-learn transformers expect a 2D array as input, so every step of a pipeline should pass on a 2D array rather than a 1D array. If you select only one feature, remember to turn the 1D array into a 2D one with the reshape() method. Otherwise, you will get an error like

ValueError: Expected 2D array, got 1D array instead

Specifically, apply reshape(-1,1) if the data is a single feature (one column) and reshape(1,-1) if it is a single sample (one row). More about the issue can be found at https://stackoverflow.com/questions/51150153/valueerror-expected-2d-array-got-1d-array-instead.
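
The following short sketch reproduces the situation with a made-up `ages` array (an assumption for illustration) and shows both reshape calls, plus the double-bracket selection that keeps a single pandas column 2D.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

ages = np.array([22.0, 38.0, 26.0, 35.0])  # 1D array, shape (4,)

# Passing the 1D array straight to a transformer raises the ValueError
raised = False
try:
    StandardScaler().fit_transform(ages)
except ValueError as err:
    raised = True
    print(err)

column = ages.reshape(-1, 1)  # one feature, four samples: shape (4, 1)
row = ages.reshape(1, -1)     # one sample, four features: shape (1, 4)
scaled = StandardScaler().fit_transform(column)

# With pandas, double brackets keep the selection two-dimensional
df = pd.DataFrame({'Age': ages})
print(df['Age'].shape)    # (4,)   -> 1D Series
print(df[['Age']].shape)  # (4, 1) -> 2D DataFrame
```

This is also why DataFrameSelector is given a list such as ['Age'] even for a single feature: indexing a DataFrame with a list returns a DataFrame, not a Series.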
