In this article, we dicuss some main steps in data preparation.

Drop Labels

Firstly, we drop labels for train set. Here we use drop() method in Pandas library.

housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()

Here are some tips:

  • The drop funtion deletes rows by default. If you want to delete columns, don't forget to set the parameter axis=1.
  • The drop function doesn't change the DataFrame by default.  And instead, returns to you a copy of the DataFrame with the given rows/columns removed. Or you can set inplace = True.
  • Note the function copy() here. It creates a copy that will not affect the original DataFrame

Impute Missing Values

Firstly, let's check the missing values:

sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()

Here give three methods to impute missing values:

Option 1: drop the rows

sample_incomplete_rows.dropna(subset=["total_bedrooms"])

Option 2: drop the columns

sample_incomplete_rows.drop("total_bedrooms", axis=1) 

Option 3: impute with the median value

median = housing["total_bedrooms"].median()
sample_incomplete_rows["total_bedrooms"].fillna(median, inplace=True)

Alternatively, we can import sklearn.impute.SimpleImputer class in Scikit-Learn 0.20.

 try:
from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+
except ImportError:
from sklearn.preprocessing import Imputer as SimpleImputer imputer = SimpleImputer(strategy="median")
# Remove the text attribute because median can only be calculated on numerical attributes
housing_num = housing.drop('ocean_proximity', axis=1)
# alternatively: housing_num = housing.select_dtypes(include=[np.number])
imputer.fit(housing_num)

We can check the statistcs by imputer.statistics_ and the strategy by imputer.strategy

Finally, transform the train set:

 X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
index = list(housing.index.values))

Encode Categorical Attributes

We need to convert text labels to numbers. There are two methods.

Option 1: Label Encoding

Conver a categorical attribute into an interger attribute.

 try:
from sklearn.preprocessing import OrdinalEncoder
except ImportError:
from future_encoders import OrdinalEncoder # Scikit-Learn < 0.20 ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

Option2: One-Hot Encoding

Convert a categorical attribute into a series of binary intergers.

 try:
from sklearn.preprocessing import OrdinalEncoder # just to raise an ImportError if Scikit-Learn < 0.20
from sklearn.preprocessing import OneHotEncoder
except ImportError:
from future_encoders import OneHotEncoder # Scikit-Learn < 0.20 cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

By default, the OneHotEncoder class returns a sparse array, but we can convert it to a dense array if needed by calling the toarray()method:

housing_cat_1hot.toarray()

Alternatively, you can set sparse=False when creating the OneHotEncoder:

cat_encoder = OneHotEncoder(sparse=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

Feature Engineering

Sometimes, we need to add some features to better describe the variation of the target variable. Let's create a custom transformer to add extra attributes and implement three methods: fit()(returning self), transform(), and fit_transform(). You can get the last one for free by simply adding TransformerMixin as a base class. Also, if you add BaseEstima tor as a base class (and avoid *args and **kargs in your constructor) you will get two extra methods (get_params() and set_params()) that will be useful for auto‐ matic hyperparameter tuning.

 from sklearn.base import BaseEstimator, TransformerMixin

 # column index
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6 class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
self.add_bedrooms_per_room = add_bedrooms_per_room
def fit(self, X, y=None):
return self # nothing else to do
def transform(self, X, y=None):
rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
population_per_household = X[:, population_ix] / X[:, household_ix]
if self.add_bedrooms_per_room:
bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
return np.c_[X, rooms_per_household, population_per_household,
bedrooms_per_room]
else:
return np.c_[X, rooms_per_household, population_per_household] attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

[Machine Learning with Python] Data Preparation by Pandas and Scikit-Learn的更多相关文章

  1. [Machine Learning with Python] Data Preparation through Transformation Pipeline

    In the former article "Data Preparation by Pandas and Scikit-Learn", we discussed about a ...

  2. [Machine Learning with Python] Data Visualization by Matplotlib Library

    Before you can plot anything, you need to specify which backend Matplotlib should use. The simplest ...

  3. Python (1) - 7 Steps to Mastering Machine Learning With Python

    Step 1: Basic Python Skills install Anacondaincluding numpy, scikit-learn, and matplotlib Step 2: Fo ...

  4. Getting started with machine learning in Python

    Getting started with machine learning in Python Machine learning is a field that uses algorithms to ...

  5. 《Learning scikit-learn Machine Learning in Python》chapter1

    前言 由于实验原因,准备入坑 python 机器学习,而 python 机器学习常用的包就是 scikit-learn ,准备先了解一下这个工具.在这里搜了有 scikit-learn 关键字的书,找 ...

  6. 【Machine Learning】Python开发工具:Anaconda+Sublime

    Python开发工具:Anaconda+Sublime 作者:白宁超 2016年12月23日21:24:51 摘要:随着机器学习和深度学习的热潮,各种图书层出不穷.然而多数是基础理论知识介绍,缺乏实现 ...

  7. In machine learning, is more data always better than better algorithms?

    In machine learning, is more data always better than better algorithms? No. There are times when mor ...

  8. Coursera, Big Data 4, Machine Learning With Big Data (week 1/2)

    Week 1 Machine Learning with Big Data KNime - GUI based Spark MLlib - inside Spark CRISP-DM Week 2, ...

  9. Machine Learning的Python环境设置

    Machine Learning目前经常使用的语言有Python.R和MATLAB.如果采用Python,需要安装大量的数学相关和Machine Learning的包.一般安装Anaconda,可以把 ...

随机推荐

  1. python之返回状态commands模块

    需要得到命令执行的状态则需要判断$?的值, 在Python中有一个模块commands很容易做到以上的效果. commands.getstatusoutput(cmd)  返回一个元组(status, ...

  2. Kali 网络配置

    一.配置IP 编辑/etc/network/interfaces # This file describes the network interfaces available on your syst ...

  3. MFC DLL 可以封装MFC的窗体 供别的MFC程序使用

    MFC DLL  可以封装MFC的窗体 供别的MFC程序使用 在庞大程序分工里面 非常可取. 可以细分每个窗体就是单独的 模块. [后续不断完善]

  4. 基于web自动化测试框架的设计与开发(本科论文word)

  5. 基于web自动化测试框架的设计与开发(讲解演示PPT)

  6. Python-S9-Day114——Flask开始实战

    01 今日内容概要 02 课前分享 03 内容回顾 04 路飞学城表结构(一) 05 路飞学城表结构(二) 06 路飞学城立即支付思路 07 今日作业 08 初识Flask 09 werkzug 10 ...

  7. log4net实现多实例记录

    原文地址:实现多个LOG4NET日志记录器实例 本文内容为摘抄,请查看原文. 对于.NET Framework开发者来说,使用Log4Net进行日志记录是非常方便的,通常只要写好配置文件和简单的编码就 ...

  8. nyoj 题目19 擅长排列的小明

    擅长排列的小明 时间限制:1000 ms  |  内存限制:65535 KB 难度:4   描述 小明十分聪明,而且十分擅长排列计算.比如给小明一个数字5,他能立刻给出1-5按字典序的全排列,如果你想 ...

  9. 目前问题:plupload上传带参数到后台

    目前问题:plupload上传带参数到后台,迟迟没有解决!!! 昨晚到23点多终于完成了! 直接上代码! var uploader = new plupload.Uploader({ //实例化一个p ...

  10. Threadlocal_笔记

    参考:https://www.jianshu.com/p/377bb840802f https://www.cnblogs.com/dreamroute/p/5034726.html ThreadLo ...