The dataset comes from https://www.kaggle.com/c/titanic

For data preprocessing, I first defined three transformers:

  • DataFrameSelector: Selects the features to handle.
  • CombinedAttributesAdder: Adds a categorical feature, Age_cat, which divides all passengers into three categories according to their ages.
  • ImputeMostFrequent: Since the mean and median strategies of SimpleImputer only apply to numerical variables, I wrote a transformer that imputes missing string values with the mode (the most frequent value). Here I was inspired by https://stackoverflow.com/questions/25239958/impute-categorical-missing-values-in-scikit-learn.
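
To make the mode-imputation idea concrete, here is a minimal sketch on a made-up DataFrame (the toy values are assumptions for illustration, not rows from the Titanic data). It relies on the same two pandas calls that ImputeMostFrequent uses: value_counts() to find the most frequent value in each column, and fillna() to fill the gaps.

```python
import numpy as np
import pandas as pd

# Toy data with missing strings (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "S"],
    "Sex": ["male", np.nan, "female", "female"],
})

# Most frequent value per column; value_counts() skips NaN and sorts by count
fill = pd.Series([toy[c].value_counts().index[0] for c in toy], index=toy.columns)

# fillna() with a Series fills each column with the value stored under its label
imputed = toy.fillna(fill)
print(imputed)
```

Because the fill values are computed in fit() and reused in transform(), the modes learned from the training set are also what would be applied to the test set.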

Then I wrote a separate pipeline for each group of features:

  • For numerical features, I applied DataFrameSelector, SimpleImputer and StandardScaler.
  • For categorical features, I applied DataFrameSelector, ImputeMostFrequent and OneHotEncoder.
  • For the newly created feature Age_cat, which is categorical but derived from a numerical feature, I wrote a separate pipeline that creates the feature, imputes its missing values and encodes the categories.
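
The binning behind Age_cat can be sketched on a few made-up ages (the sample values are assumptions). It uses the same pd.cut call and bin edges as CombinedAttributesAdder, and shows why this pipeline still needs an imputation step: a missing age produces a missing category.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.Series([5, 18, 30, 70, np.nan])

# Bins are right-inclusive: (0, 18] -> child, (18, 60] -> adult, (60, 100] -> old
age_cat = pd.cut(ages, [0, 18, 60, 100], labels=["child", "adult", "old"])

# Convert the categorical result to a plain array of labels
labels = np.array(age_cat)
print(labels)
```

The last entry stays missing after binning, which is what ImputeMostFrequent then fills in before the one-hot encoding.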

Finally, we can build a full pipeline through FeatureUnion. Here is the code:

# Read data
import numpy as np
import pandas as pd

titanic_train = pd.read_csv('Dataset/Titanic/train.csv')
titanic_test = pd.read_csv('Dataset/Titanic/test.csv')
submission = pd.read_csv('Dataset/Titanic/gender_submission.csv')

# Divide attributes and labels
titanic_labels = titanic_train['Survived'].copy()
titanic = titanic_train.drop(['Survived'], axis=1)

# Feature Selection
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_name):
        self.attribute_name = attribute_name
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # Work on a copy so the caller's DataFrame is not modified
        X = X[self.attribute_name].copy()
        if 'Pclass' in self.attribute_name:
            # Treat the passenger class as a category, not a number
            X['Pclass'] = X['Pclass'].astype(str)
        return X

# Feature Creation
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        # Bin ages into (0, 18], (18, 60] and (60, 100]; missing ages stay missing
        Age_cat = pd.cut(X['Age'], [0, 18, 60, 100], labels=['child', 'adult', 'old'])
        Age_cat = np.array(Age_cat)
        return pd.DataFrame(Age_cat, columns=['Age_cat'])

# Impute Categorical variables
class ImputeMostFrequent(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Remember the most frequent value of every column
        self.fill = pd.Series([X[c].value_counts().index[0] for c in X], index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.fill)

# Pipeline
from sklearn.impute import SimpleImputer  # Scikit-Learn 0.20+
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import FeatureUnion

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(['Age', 'SibSp', 'Parch', 'Fare'])),
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(['Pclass', 'Sex', 'Embarked'])),
    ('imputer', ImputeMostFrequent()),
    ('encoder', OneHotEncoder()),
])

new_pipeline = Pipeline([
    ('selector', DataFrameSelector(['Age'])),
    #('imputer', SimpleImputer(strategy="median")),
    ('attr_adder', CombinedAttributesAdder()),
    ('imputer', ImputeMostFrequent()),
    ('encoder', OneHotEncoder()),
])

full_pipeline = FeatureUnion([
    ("num", num_pipeline),
    ("cat", cat_pipeline),
    ("new", new_pipeline),
])

titanic_prepared = full_pipeline.fit_transform(titanic)
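
To see what FeatureUnion does with the sub-pipelines, here is a small self-contained sketch on made-up data. The toy DataFrame and the select() helper are assumptions for illustration; select() is only a stand-in for DataFrameSelector, built on scikit-learn's FunctionTransformer. Each sub-pipeline is fitted on the same input, and the outputs are stacked side by side.

```python
import pandas as pd
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05, 53.10],
    "Sex": ["male", "female", "male", "female"],
})

def select(columns):
    # Hypothetical helper: picks columns and keeps the result 2D
    return FunctionTransformer(lambda X: X[columns], validate=False)

num = Pipeline([("select", select(["Fare"])), ("scale", StandardScaler())])
cat = Pipeline([("select", select(["Sex"])), ("encode", OneHotEncoder())])

# FeatureUnion runs both pipelines on the same input and concatenates columns
union = FeatureUnion([("num", num), ("cat", cat)])
out = union.fit_transform(toy)

# One scaled column for Fare plus one column per Sex category
print(out.shape)
```

Since OneHotEncoder returns a sparse matrix by default, the combined result is sparse as well; calling toarray() on it gives a dense array if one is needed.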

Another thing I want to mention is that scikit-learn transformers expect a 2D array rather than a 1D array, so each step of a pipeline has to pass a 2D array on to the next one. If you select only one feature, don't forget to convert the 1D array with the reshape() method. Otherwise, you will receive an error like

ValueError: Expected 2D array, got 1D array instead

Specifically, apply reshape(-1, 1) if your data is a single feature (one column) and reshape(1, -1) if it is a single sample (one row). More about the issue can be found at https://stackoverflow.com/questions/51150153/valueerror-expected-2d-array-got-1d-array-instead.
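
The two reshape calls can be compared on a small made-up array (the values are assumptions for illustration):

```python
import numpy as np

# A single feature stored as a 1D array, shape (3,)
age = np.array([22.0, 38.0, 26.0])

# One column: three samples of a single feature, shape (3, 1)
as_column = age.reshape(-1, 1)

# One row: a single sample with three features, shape (1, 3)
as_row = age.reshape(1, -1)

print(age.shape, as_column.shape, as_row.shape)
```

With a pandas DataFrame, selecting with a list of column names, as DataFrameSelector(['Age']) does, already returns a 2D object, so no reshape is needed in that case.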


												
