In this article, we dicuss some main steps in data preparation.

Drop Labels

Firstly, we drop labels for train set. Here we use drop() method in Pandas library.

housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()

Here are some tips:

  • The drop funtion deletes rows by default. If you want to delete columns, don't forget to set the parameter axis=1.
  • The drop function doesn't change the DataFrame by default.  And instead, returns to you a copy of the DataFrame with the given rows/columns removed. Or you can set inplace = True.
  • Note the function copy() here. It creates a copy that will not affect the original DataFrame

Impute Missing Values

Firstly, let's check the missing values:

sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()

Here give three methods to impute missing values:

Option 1: drop the rows

sample_incomplete_rows.dropna(subset=["total_bedrooms"])

Option 2: drop the columns

sample_incomplete_rows.drop("total_bedrooms", axis=1) 

Option 3: impute with the median value

median = housing["total_bedrooms"].median()
sample_incomplete_rows["total_bedrooms"].fillna(median, inplace=True)

Alternatively, we can import sklearn.impute.SimpleImputer class in Scikit-Learn 0.20.

 try:
from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+
except ImportError:
from sklearn.preprocessing import Imputer as SimpleImputer imputer = SimpleImputer(strategy="median")
# Remove the text attribute because median can only be calculated on numerical attributes
housing_num = housing.drop('ocean_proximity', axis=1)
# alternatively: housing_num = housing.select_dtypes(include=[np.number])
imputer.fit(housing_num)

We can check the statistcs by imputer.statistics_ and the strategy by imputer.strategy

Finally, transform the train set:

 X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
index = list(housing.index.values))

Encode Categorical Attributes

We need to convert text labels to numbers. There are two methods.

Option 1: Label Encoding

Conver a categorical attribute into an interger attribute.

 try:
from sklearn.preprocessing import OrdinalEncoder
except ImportError:
from future_encoders import OrdinalEncoder # Scikit-Learn < 0.20 ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

Option2: One-Hot Encoding

Convert a categorical attribute into a series of binary intergers.

 try:
from sklearn.preprocessing import OrdinalEncoder # just to raise an ImportError if Scikit-Learn < 0.20
from sklearn.preprocessing import OneHotEncoder
except ImportError:
from future_encoders import OneHotEncoder # Scikit-Learn < 0.20 cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

By default, the OneHotEncoder class returns a sparse array, but we can convert it to a dense array if needed by calling the toarray()method:

housing_cat_1hot.toarray()

Alternatively, you can set sparse=False when creating the OneHotEncoder:

cat_encoder = OneHotEncoder(sparse=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

Feature Engineering

Sometimes, we need to add some features to better describe the variation of the target variable. Let's create a custom transformer to add extra attributes and implement three methods: fit()(returning self), transform(), and fit_transform(). You can get the last one for free by simply adding TransformerMixin as a base class. Also, if you add BaseEstima tor as a base class (and avoid *args and **kargs in your constructor) you will get two extra methods (get_params() and set_params()) that will be useful for auto‐ matic hyperparameter tuning.

 from sklearn.base import BaseEstimator, TransformerMixin

 # column index
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6 class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
self.add_bedrooms_per_room = add_bedrooms_per_room
def fit(self, X, y=None):
return self # nothing else to do
def transform(self, X, y=None):
rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
population_per_household = X[:, population_ix] / X[:, household_ix]
if self.add_bedrooms_per_room:
bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
return np.c_[X, rooms_per_household, population_per_household,
bedrooms_per_room]
else:
return np.c_[X, rooms_per_household, population_per_household] attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

[Machine Learning with Python] Data Preparation by Pandas and Scikit-Learn的更多相关文章

  1. [Machine Learning with Python] Data Preparation through Transformation Pipeline

    In the former article "Data Preparation by Pandas and Scikit-Learn", we discussed about a ...

  2. [Machine Learning with Python] Data Visualization by Matplotlib Library

    Before you can plot anything, you need to specify which backend Matplotlib should use. The simplest ...

  3. Python (1) - 7 Steps to Mastering Machine Learning With Python

    Step 1: Basic Python Skills install Anacondaincluding numpy, scikit-learn, and matplotlib Step 2: Fo ...

  4. Getting started with machine learning in Python

    Getting started with machine learning in Python Machine learning is a field that uses algorithms to ...

  5. 《Learning scikit-learn Machine Learning in Python》chapter1

    前言 由于实验原因,准备入坑 python 机器学习,而 python 机器学习常用的包就是 scikit-learn ,准备先了解一下这个工具.在这里搜了有 scikit-learn 关键字的书,找 ...

  6. 【Machine Learning】Python开发工具:Anaconda+Sublime

    Python开发工具:Anaconda+Sublime 作者:白宁超 2016年12月23日21:24:51 摘要:随着机器学习和深度学习的热潮,各种图书层出不穷.然而多数是基础理论知识介绍,缺乏实现 ...

  7. In machine learning, is more data always better than better algorithms?

    In machine learning, is more data always better than better algorithms? No. There are times when mor ...

  8. Coursera, Big Data 4, Machine Learning With Big Data (week 1/2)

    Week 1 Machine Learning with Big Data KNime - GUI based Spark MLlib - inside Spark CRISP-DM Week 2, ...

  9. Machine Learning的Python环境设置

    Machine Learning目前经常使用的语言有Python.R和MATLAB.如果采用Python,需要安装大量的数学相关和Machine Learning的包.一般安装Anaconda,可以把 ...

随机推荐

  1. hdu1950Bridging signals(求最长上升自序列nlogn算法)

    Bridging signals Time Limit: 5000/1000 MS (Java/Others)    Memory Limit: 65536/32768 K (Java/Others) ...

  2. 创建 PSO

    TechNet 库 Windows Server Windows Server 2008 R2 und Windows Server 2008 浏览 Windows Server 技术 Active ...

  3. gulp基本入门

    gulp用于自动化和提高工作流,类似于grunt.gulp适用于nodejs平台.   gulp基础: gulp两个主要的功能是读取想要处理的文件,把处理好的文件放到指定的地方gulp.src()找出 ...

  4. Wordpress Uncaught TypeError: b(...).not(...).filter(...).mediaelementplayer is not a function

    Wordpress 插件页面报错如下图: 原因及解决方法: 引入了两次 jquery.js 或 jquery.js 定义的变量导致报错,删除在插件页面自己引入的 jquery,js 即可解决报错. 大 ...

  5. 13 Java内存模型

    数据竞争 int a=0, b=0; public void method1() { int r2 = a; b = 1; } public void method2() { int r1 = b; ...

  6. 在windows64位上安装Python3.0

    1.下载安装包 下载地址:https://www.python.org/downloads/ 如果要下载帮助文件:Download Windows help file 如果要下载基于网页的安装程序: ...

  7. Leetcode 517.超级洗衣机

    超级洗衣机 假设有 n 台超级洗衣机放在同一排上.开始的时候,每台洗衣机内可能有一定量的衣服,也可能是空的. 在每一步操作中,你可以选择任意 m (1 ≤ m ≤ n) 台洗衣机,与此同时将每台洗衣机 ...

  8. [oldboy-django][2深入django]django一个请求的生命周期 + WSGI + 中间件

    1 WSGI # WSGI(是一套协议,很多东西比如wsgiref, uwsgiref遵循这一套协议) - django系统本质 别人的socket(wsgiref或者uwsgiref) + djan ...

  9. C#知识点<2>

    1. ? : 运算符(真2假3) 我们已经在前面的章节中讲解了 条件运算符 ? :,可以用来替代 if...else 语句.它的一般形式如下: Exp1 ? Exp2 : Exp3; 其中,Exp1. ...

  10. hihoCoder #1661 数组区间

    题目大意 给出 $1$ 到 $n$ 的一个排列($n\le 10^5$),记做 $a_1, a_2, \dots, a_n$ .(注:原题面表述为:"给定 $n$ 个互不相同且不超过 $n$ ...