In this article, we discuss the main steps in data preparation.

Drop Labels

First, we separate the labels from the training set, using the drop() method from the Pandas library.

housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set

housing_labels = strat_train_set["median_house_value"].copy()

Here are some tips:

The drop function deletes rows by default. To delete columns, set the parameter axis=1.
The drop function doesn't modify the DataFrame by default. Instead, it returns a copy of the DataFrame with the given rows/columns removed. To modify the DataFrame in place, set inplace=True.
Note the copy() call here. It creates an independent copy, so changing housing_labels will not affect the original DataFrame.
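
To make these tips concrete, here is a minimal sketch on a toy DataFrame. The name train and its values are made up for illustration and only stand in for strat_train_set.

```python
import pandas as pd

# Toy stand-in for strat_train_set (values are made up for illustration)
train = pd.DataFrame({
    "total_rooms": [880, 7099, 1467],
    "median_house_value": [452600.0, 358500.0, 352100.0],
})

features = train.drop("median_house_value", axis=1)  # axis=1: drop a column
labels = train["median_house_value"].copy()          # independent copy

fewer_rows = train.drop(0)  # default axis=0: drops the row labelled 0

# Changing the copy leaves the original DataFrame untouched
labels.iloc[0] = 0.0

print(list(features.columns))
print(list(train.columns))
print(fewer_rows.shape)
print(train["median_house_value"].iloc[0])
```

Since drop() returns a new DataFrame, train keeps both columns and all three rows after these calls.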

Impute Missing Values

First, let's look at a few rows that contain missing values:

sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()

Here are three ways to handle missing values:

Option 1: drop the rows

sample_incomplete_rows.dropna(subset=["total_bedrooms"])

Option 2: drop the columns

sample_incomplete_rows.drop("total_bedrooms", axis=1)

Option 3: impute with the median value

median = housing["total_bedrooms"].median()

sample_incomplete_rows = sample_incomplete_rows.fillna({"total_bedrooms": median}) # returns a new DataFrame; avoids chained in-place assignment
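
The three options can be compared side by side on a small example. This is only a sketch: the DataFrame df and its numbers are invented, and the column names simply mirror the housing data.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value (numbers are made up)
df = pd.DataFrame({
    "total_rooms": [100.0, 200.0, 300.0, 400.0],
    "total_bedrooms": [10.0, np.nan, 30.0, 50.0],
})

incomplete = df[df.isnull().any(axis=1)]      # rows with at least one NaN

opt1 = df.dropna(subset=["total_bedrooms"])   # Option 1: drop the rows
opt2 = df.drop("total_bedrooms", axis=1)      # Option 2: drop the column
median = df["total_bedrooms"].median()        # NaN values are skipped
opt3 = df.fillna({"total_bedrooms": median})  # Option 3: fill with the median

print(len(incomplete), opt1.shape, opt2.shape, median)
print(opt3)
```

If you choose Option 3, keep the median computed on the training set, because the same value is needed later to fill missing values in the test set and in new data.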

Alternatively, we can use the SimpleImputer class from sklearn.impute, available in Scikit-Learn 0.20 and later (older versions provide sklearn.preprocessing.Imputer instead).

 try:

     from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+

 except ImportError:

     from sklearn.preprocessing import Imputer as SimpleImputer

 imputer = SimpleImputer(strategy="median")

 # Remove the text attribute because median can only be calculated on numerical attributes

 housing_num = housing.drop('ocean_proximity', axis=1)

 # alternatively: housing_num = housing.select_dtypes(include=[np.number])

 imputer.fit(housing_num)

We can check the computed statistics with imputer.statistics_ and the strategy with imputer.strategy.

Finally, transform the training set:

 X = imputer.transform(housing_num)

 housing_tr = pd.DataFrame(X, columns=housing_num.columns,

                           index=housing.index)
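
As a self-contained illustration of the same fit/transform flow, here is a sketch on a toy numerical DataFrame. The name num and its values are assumptions made for this example and are not part of the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer  # Scikit-Learn 0.20+

# Toy numerical data (made up), with a non-default index
num = pd.DataFrame(
    {"total_rooms": [100.0, 200.0, 300.0, 400.0],
     "total_bedrooms": [10.0, np.nan, 30.0, 50.0]},
    index=[7, 3, 9, 1],
)

imputer = SimpleImputer(strategy="median")
X = imputer.fit_transform(num)  # learns the medians, then returns a NumPy array

# Rebuild a DataFrame, keeping the original column names and index
num_tr = pd.DataFrame(X, columns=num.columns, index=num.index)

print(imputer.statistics_)  # per-column medians learned during fitting
print(num_tr)
```

The transformer returns a plain NumPy array, which is why the column names and the index have to be passed back in when rebuilding the DataFrame.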

Encode Categorical Attributes

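We need to convert text categories to numbers. In the snippets below, housing_cat is assumed to hold the categorical column, e.g. housing[["ocean_proximity"]]. There are two methods.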

Option 1: Label Encoding

Convert a categorical attribute into an integer attribute.

 try:

     from sklearn.preprocessing import OrdinalEncoder

 except ImportError:

     from future_encoders import OrdinalEncoder # Scikit-Learn < 0.20

 ordinal_encoder = OrdinalEncoder()

 housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
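
To see what the encoder produces, here is a small sketch. The DataFrame cat is a made-up stand-in for housing_cat with just four rows.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder  # Scikit-Learn 0.20+

# Toy categorical column (stand-in for housing_cat)
cat = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]}
)

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(cat)  # 2-D float array, one column per attribute

print(encoder.categories_)  # categories found for each attribute, sorted
print(encoded.ravel())      # integer code of each row, stored as floats
```

Keep in mind that the integer codes imply an ordering between categories, which some algorithms will interpret as meaningful even when it is not.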

Option 2: One-Hot Encoding

Convert a categorical attribute into a set of binary attributes, one per category.

 try:

     from sklearn.preprocessing import OrdinalEncoder # just to raise an ImportError if Scikit-Learn < 0.20

     from sklearn.preprocessing import OneHotEncoder

 except ImportError:

     from future_encoders import OneHotEncoder # Scikit-Learn < 0.20

 cat_encoder = OneHotEncoder()

 housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

By default, the OneHotEncoder class returns a SciPy sparse matrix, but we can convert it to a dense array if needed by calling the toarray() method:

housing_cat_1hot.toarray()

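Alternatively, you can set sparse=False when creating the OneHotEncoder (in Scikit-Learn 1.2 and later this parameter is named sparse_output, and sparse was removed in 1.4):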

cat_encoder = OneHotEncoder(sparse=False)

housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
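
The following sketch shows the one-hot output on a toy column. The DataFrame cat is invented for illustration. It relies on the default sparse output plus toarray(), which sidesteps the sparse/sparse_output naming difference between versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (stand-in for housing_cat)
cat = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]}
)

encoder = OneHotEncoder()
onehot = encoder.fit_transform(cat)  # SciPy sparse matrix by default
dense = onehot.toarray()             # dense NumPy array: rows x categories

print(encoder.categories_[0])
print(dense)
```

Each row contains a single 1 in the column of its category and 0 everywhere else, so no artificial ordering is introduced.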

Feature Engineering

Sometimes we need to add features that better describe the variation of the target variable. Let's create a custom transformer that adds extra attributes. A Scikit-Learn transformer needs three methods: fit() (returning self), transform(), and fit_transform(). You get the last one for free by adding TransformerMixin as a base class. Also, if you add BaseEstimator as a base class (and avoid *args and **kargs in your constructor), you get two extra methods, get_params() and set_params(), that are useful for automatic hyperparameter tuning.

 import numpy as np

 from sklearn.base import BaseEstimator, TransformerMixin

 # column index

 rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

 class CombinedAttributesAdder(BaseEstimator, TransformerMixin):

     def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs

         self.add_bedrooms_per_room = add_bedrooms_per_room

     def fit(self, X, y=None):

         return self  # nothing else to do

     def transform(self, X, y=None):

         rooms_per_household = X[:, rooms_ix] / X[:, household_ix]

         population_per_household = X[:, population_ix] / X[:, household_ix]

         if self.add_bedrooms_per_room:

             bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]

             return np.c_[X, rooms_per_household, population_per_household,

                          bedrooms_per_room]

         else:

             return np.c_[X, rooms_per_household, population_per_household]

 attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)

 housing_extra_attribs = attr_adder.transform(housing.values)
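
To show what the two base classes contribute, here is a minimal sketch with a deliberately tiny transformer. RatioAdder, num_ix and den_ix are hypothetical names invented for this example. They are not part of Scikit-Learn or of the housing code above.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends column num_ix / column den_ix."""

    def __init__(self, num_ix=0, den_ix=1):  # explicit keyword arguments only
        self.num_ix = num_ix
        self.den_ix = den_ix

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        ratio = X[:, self.num_ix] / X[:, self.den_ix]
        return np.c_[X, ratio]  # original columns plus the new ratio column

X = np.array([[6.0, 2.0], [9.0, 3.0]])
adder = RatioAdder()

out = adder.fit_transform(X)          # provided by TransformerMixin
params = adder.get_params()           # provided by BaseEstimator
adder.set_params(num_ix=1, den_ix=0)  # also provided by BaseEstimator
swapped = adder.transform(X)

print(out)
print(params)
print(swapped)
```

Because get_params() and set_params() read the constructor signature, every hyperparameter has to be an explicit keyword argument stored under the same attribute name.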
