In this article, we discuss the main steps in data preparation.

Drop Labels

First, we separate the labels from the training set, using the drop() method from the Pandas library.

housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set

housing_labels = strat_train_set["median_house_value"].copy()

Here are some tips:

The drop function deletes rows by default. If you want to delete columns, don't forget to set the parameter axis=1.
The drop function doesn't change the DataFrame by default. Instead, it returns a copy of the DataFrame with the given rows/columns removed. To modify the DataFrame in place, set inplace=True.
Note the copy() method here. It creates a copy of the label column, so later changes to housing_labels will not affect the original DataFrame.
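The tips above can be checked on a small example. This is a minimal sketch on a toy DataFrame; the column names and values are made up for illustration and are not part of the housing data.

```python
import pandas as pd

# Toy data, invented for this example
df = pd.DataFrame({"rooms": [3, 4, 5], "price": [100, 150, 200]})

without_row = df.drop(0)                # default axis=0: removes the row labeled 0
without_col = df.drop("price", axis=1)  # axis=1: removes the column "price"

# drop() returned new objects, so df still has all rows and columns
print(df.shape)            # (3, 2)
print(without_row.shape)   # (2, 2)
print(without_col.shape)   # (3, 1)

labels = df["price"].copy()
labels.iloc[0] = -1        # changing the copy leaves df untouched
print(df.loc[0, "price"])  # 100
```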

Impute Missing Values

First, let's look at a few rows that contain missing values:

sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()

Here are three ways to handle missing values:

Option 1: drop the rows

sample_incomplete_rows.dropna(subset=["total_bedrooms"])

Option 2: drop the columns

sample_incomplete_rows.drop("total_bedrooms", axis=1)

Option 3: impute with the median value

median = housing["total_bedrooms"].median()

sample_incomplete_rows = sample_incomplete_rows.fillna({"total_bedrooms": median}) # returns a new DataFrame; avoids in-place fillna on a selected column
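To compare the three options side by side, here is a minimal sketch on a toy DataFrame. The values are invented for illustration; only the column name total_bedrooms is taken from the article.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in total_bedrooms
toy = pd.DataFrame({
    "total_rooms": [10.0, 20.0, 30.0, 40.0],
    "total_bedrooms": [2.0, np.nan, 6.0, 10.0],
})

opt1 = toy.dropna(subset=["total_bedrooms"])   # Option 1: drop the incomplete row
opt2 = toy.drop("total_bedrooms", axis=1)      # Option 2: drop the whole column
median = toy["total_bedrooms"].median()        # NaN is skipped: median of 2, 6, 10
opt3 = toy.fillna({"total_bedrooms": median})  # Option 3: fill with the median

print(opt1.shape)                       # (3, 2)
print(opt2.shape)                       # (4, 1)
print(median)                           # 6.0
print(opt3["total_bedrooms"].tolist())  # [2.0, 6.0, 6.0, 10.0]
```

Only Option 3 keeps every row and every column, which is why the article goes on to impute with the median.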

Alternatively, we can use the SimpleImputer class from sklearn.impute, available in Scikit-Learn 0.20 and later.

 try:

     from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+

 except ImportError:

     from sklearn.preprocessing import Imputer as SimpleImputer

 imputer = SimpleImputer(strategy="median")

 # Remove the text attribute because median can only be calculated on numerical attributes

 housing_num = housing.drop('ocean_proximity', axis=1)

 # alternatively: housing_num = housing.select_dtypes(include=[np.number])

 imputer.fit(housing_num)

We can check the learned medians with imputer.statistics_ and the strategy with imputer.strategy.

Finally, transform the training set:

 X = imputer.transform(housing_num)

 housing_tr = pd.DataFrame(X, columns=housing_num.columns,

                           index=housing_num.index)
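The fit, inspect, and transform steps can be tried end to end on a small example. This is a minimal sketch assuming Scikit-Learn 0.20 or later; the toy values and index labels are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numerical data with one missing value and a non-default index
toy_num = pd.DataFrame(
    {"total_rooms": [10.0, 20.0, 30.0], "total_bedrooms": [2.0, np.nan, 6.0]},
    index=[101, 102, 103],
)

imputer = SimpleImputer(strategy="median")
imputer.fit(toy_num)
print(imputer.statistics_)  # one learned median per column

X = imputer.transform(toy_num)  # plain NumPy array, column names and index are lost
toy_tr = pd.DataFrame(X, columns=toy_num.columns, index=toy_num.index)
print(toy_tr.loc[102, "total_bedrooms"])  # 4.0, the median of 2.0 and 6.0
```

Passing the original columns and index back to pd.DataFrame is what keeps the imputed data aligned with the labels.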

Encode Categorical Attributes

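We need to convert text labels to numbers. There are two methods. In the code below, housing_cat is assumed to be a DataFrame holding the categorical column, for example housing[["ocean_proximity"]].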

Option 1: Label Encoding

Convert a categorical attribute into an integer attribute.

 try:

     from sklearn.preprocessing import OrdinalEncoder

 except ImportError:

     from future_encoders import OrdinalEncoder # Scikit-Learn < 0.20

 ordinal_encoder = OrdinalEncoder()

 housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
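To see what the encoder produces, here is a minimal sketch on a toy categorical column, assuming Scikit-Learn 0.20 or later. The category values are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical column
toy_cat = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]})

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy_cat)  # 2D float array, one column per attribute

# categories_ lists the categories in sorted order; the integer code is the position
print(encoder.categories_[0].tolist())  # ['INLAND', 'ISLAND', 'NEAR BAY']
print(encoded.ravel().tolist())         # [0.0, 2.0, 0.0, 1.0]
```

The integer codes imply an ordering between categories that may not exist in the data, which is one reason to consider one-hot encoding instead.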

Option 2: One-Hot Encoding

Convert a categorical attribute into a set of binary attributes, one per category.

 try:

     from sklearn.preprocessing import OrdinalEncoder # just to raise an ImportError if Scikit-Learn < 0.20

     from sklearn.preprocessing import OneHotEncoder

 except ImportError:

     from future_encoders import OneHotEncoder # Scikit-Learn < 0.20

 cat_encoder = OneHotEncoder()

 housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

By default, the OneHotEncoder class returns a sparse matrix, but we can convert it to a dense array if needed by calling the toarray() method:

housing_cat_1hot.toarray()

Alternatively, you can set sparse=False when creating the OneHotEncoder (in Scikit-Learn 1.2 and later this parameter is named sparse_output):

cat_encoder = OneHotEncoder(sparse=False)

housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
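As a small illustration of the one-hot output, the sketch below encodes a toy column and converts the result to a dense array. The data is invented, and the default constructor is used so the code does not depend on the name of the sparse parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with two distinct categories
toy_cat = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"]})

cat_encoder = OneHotEncoder()
onehot = cat_encoder.fit_transform(toy_cat)  # sparse result by default
dense = onehot.toarray()                     # dense NumPy array

print(cat_encoder.categories_[0].tolist())  # ['INLAND', 'NEAR BAY']
print(dense.tolist())                       # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```

Each row contains a single 1 in the column of its category, so the sparse format saves memory when there are many categories.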

Feature Engineering

Sometimes we need to add features that better describe the variation of the target variable. Let's create a custom transformer to add extra attributes. It needs three methods: fit() (returning self), transform(), and fit_transform(). You can get the last one for free by simply adding TransformerMixin as a base class. Also, if you add BaseEstimator as a base class (and avoid *args and **kargs in your constructor), you will get two extra methods, get_params() and set_params(), that will be useful for automatic hyperparameter tuning.

 import numpy as np

 from sklearn.base import BaseEstimator, TransformerMixin

 # column index

 rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

 class CombinedAttributesAdder(BaseEstimator, TransformerMixin):

     def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs

         self.add_bedrooms_per_room = add_bedrooms_per_room

     def fit(self, X, y=None):

         return self  # nothing else to do

     def transform(self, X, y=None):

         rooms_per_household = X[:, rooms_ix] / X[:, household_ix]

         population_per_household = X[:, population_ix] / X[:, household_ix]

         if self.add_bedrooms_per_room:

             bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]

             return np.c_[X, rooms_per_household, population_per_household,

                          bedrooms_per_room]

         else:

             return np.c_[X, rooms_per_household, population_per_household]

 attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)

 housing_extra_attribs = attr_adder.transform(housing.values)
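The methods inherited from the two base classes can be seen on a smaller transformer. The class below, RatioAdder, is a hypothetical example written for this sketch and is not part of the article's code; it appends the ratio of the first two columns.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends column 0 / column 1 as a new feature."""

    def __init__(self, add_ratio=True):  # explicit keyword only, no *args or **kwargs
        self.add_ratio = add_ratio

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        if self.add_ratio:
            return np.c_[X, X[:, 0] / X[:, 1]]
        return X

X = np.array([[6.0, 2.0], [9.0, 3.0]])
adder = RatioAdder()

out = adder.fit_transform(X)       # fit_transform() comes from TransformerMixin
print(out.tolist())                # [[6.0, 2.0, 3.0], [9.0, 3.0, 3.0]]

print(adder.get_params())          # {'add_ratio': True}, from BaseEstimator
adder.set_params(add_ratio=False)  # also from BaseEstimator
print(adder.transform(X).shape)    # (2, 2)
```

Because the hyperparameter is exposed through get_params() and set_params(), a tool such as a grid search can switch add_ratio on and off without any extra code.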
