[Machine Learning with Python] My First Data Preprocessing Pipeline with Titanic Dataset
The dataset comes from https://www.kaggle.com/c/titanic
For data preprocessing, I first defined three transformers:
- DataFrameSelector: Select features to handle.
- CombinedAttributesAdder: Add a categorical feature Age_cat, which divides all passengers into three categories according to their ages.
- ImputeMostFrequent: Since the mean and median strategies of SimpleImputer only work on numerical variables, I wrote a transformer that imputes missing string values with the mode (SimpleImputer's most_frequent strategy is a built-in alternative). Here I was inspired by https://stackoverflow.com/questions/25239958/impute-categorical-missing-values-in-scikit-learn.
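To make the mode imputation concrete, here is a minimal sketch of the pandas operations that ImputeMostFrequent relies on. The toy values are made up for illustration and are not rows from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not real Titanic rows)
df = pd.DataFrame({
    "Sex": ["male", "female", np.nan, "male"],
    "Embarked": ["S", np.nan, "C", "S"],
})

# Most frequent value per column: value_counts() skips NaN by default
# and sorts by descending frequency, so index[0] is the mode
fill = pd.Series([df[c].value_counts().index[0] for c in df], index=df.columns)

# fillna with a Series fills each column with the value stored under its name
filled = df.fillna(fill)
print(filled)
```

The missing Sex entry becomes "male" and the missing Embarked entry becomes "S", the most frequent value of each column.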
Then I wrote separate pipelines for different features:
- For numerical features, I applied DataFrameSelector, SimpleImputer and StandardScaler.
- For categorical features, I applied DataFrameSelector, ImputeMostFrequent and OneHotEncoder.
- For the newly created feature Age_cat, which is categorical but derived from a numerical feature, I wrote a separate pipeline to impute the missing values and encode the categories.
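As a small illustration of why Age_cat needs its own imputation step, the sketch below bins a few made-up ages with the same bin edges and labels used in the code of this post. A missing age stays missing after binning, so the category has to be imputed afterwards.

```python
import numpy as np
import pandas as pd

ages = pd.Series([4, 18, 35, 71, np.nan])  # hypothetical ages

# Bins are right-inclusive by default: (0, 18], (18, 60], (60, 100]
age_cat = pd.cut(ages, [0, 18, 60, 100], labels=["child", "adult", "old"])

# 4 and 18 fall in "child", 35 in "adult", 71 in "old"; the NaN age stays NaN
print(age_cat.tolist())
print(age_cat.isna().sum())  # number of passengers without a category
```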
Finally, we can build a full pipeline through FeatureUnion. Here is the code:
# Read data
import pandas as pd
import numpy as np
import os
titanic_train = pd.read_csv('Dataset/Titanic/train.csv')
titanic_test = pd.read_csv('Dataset/Titanic/test.csv')
submission = pd.read_csv('Dataset/Titanic/gender_submission.csv')

# Divide attributes and labels
titanic_labels = titanic_train['Survived'].copy()
titanic = titanic_train.drop(['Survived'], axis=1)

# Feature Selection
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_name):
        self.attribute_name = attribute_name
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # work on a copy so the caller's DataFrame is not modified
        X = X[self.attribute_name].copy()
        if 'Pclass' in self.attribute_name:
            # treat passenger class as a categorical (string) feature
            X['Pclass'] = X['Pclass'].astype(str)
        return X

# Feature Creation
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        Age_cat = pd.cut(X['Age'], [0, 18, 60, 100], labels=['child', 'adult', 'old'])
        Age_cat = np.array(Age_cat)
        return pd.DataFrame(Age_cat, columns=['Age_Cat'])

# Impute Categorical variables
class ImputeMostFrequent(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0] for c in X], index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.fill)

# Pipeline
from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import FeatureUnion

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(['Age', 'SibSp', 'Parch', 'Fare'])),
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(['Pclass', 'Sex', 'Embarked'])),
    ('imputer', ImputeMostFrequent()),
    ('encoder', OneHotEncoder()),
])

new_pipeline = Pipeline([
    ('selector', DataFrameSelector(['Age'])),
    #('imputer', SimpleImputer(strategy="median")),
    ('attr_adder', CombinedAttributesAdder()),
    ('imputer', ImputeMostFrequent()),
    ('encoder', OneHotEncoder()),
])

full_pipeline = FeatureUnion([
    ("num", num_pipeline),
    ("cat", cat_pipeline),
    ("new", new_pipeline),
])

titanic_prepared = full_pipeline.fit_transform(titanic)
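To show what FeatureUnion does with the outputs of its branches, here is a minimal sketch on a made-up array. The two branches (StandardScaler and a squaring FunctionTransformer) are stand-ins chosen for illustration, not the pipelines from this post.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy input: 3 samples, 2 features (hypothetical values)
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Each branch receives the full input; the results are stacked side by side
union = FeatureUnion([
    ("scaled", StandardScaler()),
    ("squared", FunctionTransformer(np.square)),
])
out = union.fit_transform(X)
print(out.shape)  # (3, 4): 2 scaled columns followed by 2 squared columns
```

In the Titanic pipeline the same horizontal stacking applies. Because OneHotEncoder returns a sparse matrix by default, titanic_prepared will probably be a SciPy sparse matrix; calling its toarray() method gives a dense array if one is needed.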
Another thing I want to mention is that the input and output of each pipeline step should be a 2D array rather than a 1D array. So if you want to select only one feature, don't forget to convert the 1D array with the reshape() method. Otherwise, you will receive an error like
ValueError: Expected 2D array, got 1D array instead
Specifically, apply reshape(-1,1) if the data is a single feature (a column) and reshape(1,-1) if it is a single sample (a row). More about the issue can be found at https://stackoverflow.com/questions/51150153/valueerror-expected-2d-array-got-1d-array-instead.
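The following sketch reproduces the error and the fix on a made-up array of fares. StandardScaler is used here only as an example of a scikit-learn transformer that expects 2D input.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

fare = np.array([7.25, 71.28, 8.05])  # 1D array, hypothetical values

# Passing the 1D array directly is rejected by scikit-learn
try:
    StandardScaler().fit_transform(fare)
    raised = False
except ValueError as err:
    raised = True
    print(err)

column = fare.reshape(-1, 1)  # one feature, three samples: shape (3, 1)
row = fare.reshape(1, -1)     # one sample, three features: shape (1, 3)

scaled = StandardScaler().fit_transform(column)  # works on the 2D column
print(column.shape, row.shape, scaled.shape)
```

With pandas, selecting with a list of column names, such as titanic[['Fare']], returns a 2D DataFrame and avoids the reshape step.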