In this article, we dicuss some main steps in data preparation.

Drop Labels

Firstly, we drop labels for train set. Here we use drop() method in Pandas library.

housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()

Here are some tips:

  • The drop funtion deletes rows by default. If you want to delete columns, don't forget to set the parameter axis=1.
  • The drop function doesn't change the DataFrame by default.  And instead, returns to you a copy of the DataFrame with the given rows/columns removed. Or you can set inplace = True.
  • Note the function copy() here. It creates a copy that will not affect the original DataFrame

Impute Missing Values

Firstly, let's check the missing values:

sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()

Here give three methods to impute missing values:

Option 1: drop the rows

sample_incomplete_rows.dropna(subset=["total_bedrooms"])

Option 2: drop the columns

sample_incomplete_rows.drop("total_bedrooms", axis=1) 

Option 3: impute with the median value

median = housing["total_bedrooms"].median()
sample_incomplete_rows["total_bedrooms"].fillna(median, inplace=True)

Alternatively, we can import sklearn.impute.SimpleImputer class in Scikit-Learn 0.20.

 try:
from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+
except ImportError:
from sklearn.preprocessing import Imputer as SimpleImputer imputer = SimpleImputer(strategy="median")
# Remove the text attribute because median can only be calculated on numerical attributes
housing_num = housing.drop('ocean_proximity', axis=1)
# alternatively: housing_num = housing.select_dtypes(include=[np.number])
imputer.fit(housing_num)

We can check the statistcs by imputer.statistics_ and the strategy by imputer.strategy

Finally, transform the train set:

 X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
index = list(housing.index.values))

Encode Categorical Attributes

We need to convert text labels to numbers. There are two methods.

Option 1: Label Encoding

Conver a categorical attribute into an interger attribute.

 try:
from sklearn.preprocessing import OrdinalEncoder
except ImportError:
from future_encoders import OrdinalEncoder # Scikit-Learn < 0.20 ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

Option2: One-Hot Encoding

Convert a categorical attribute into a series of binary intergers.

 try:
from sklearn.preprocessing import OrdinalEncoder # just to raise an ImportError if Scikit-Learn < 0.20
from sklearn.preprocessing import OneHotEncoder
except ImportError:
from future_encoders import OneHotEncoder # Scikit-Learn < 0.20 cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

By default, the OneHotEncoder class returns a sparse array, but we can convert it to a dense array if needed by calling the toarray()method:

housing_cat_1hot.toarray()

Alternatively, you can set sparse=False when creating the OneHotEncoder:

cat_encoder = OneHotEncoder(sparse=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

Feature Engineering

Sometimes, we need to add some features to better describe the variation of the target variable. Let's create a custom transformer to add extra attributes and implement three methods: fit()(returning self), transform(), and fit_transform(). You can get the last one for free by simply adding TransformerMixin as a base class. Also, if you add BaseEstima tor as a base class (and avoid *args and **kargs in your constructor) you will get two extra methods (get_params() and set_params()) that will be useful for auto‐ matic hyperparameter tuning.

 from sklearn.base import BaseEstimator, TransformerMixin

 # column index
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6 class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
self.add_bedrooms_per_room = add_bedrooms_per_room
def fit(self, X, y=None):
return self # nothing else to do
def transform(self, X, y=None):
rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
population_per_household = X[:, population_ix] / X[:, household_ix]
if self.add_bedrooms_per_room:
bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
return np.c_[X, rooms_per_household, population_per_household,
bedrooms_per_room]
else:
return np.c_[X, rooms_per_household, population_per_household] attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

[Machine Learning with Python] Data Preparation by Pandas and Scikit-Learn的更多相关文章

  1. [Machine Learning with Python] Data Preparation through Transformation Pipeline

    In the former article "Data Preparation by Pandas and Scikit-Learn", we discussed about a ...

  2. [Machine Learning with Python] Data Visualization by Matplotlib Library

    Before you can plot anything, you need to specify which backend Matplotlib should use. The simplest ...

  3. Python (1) - 7 Steps to Mastering Machine Learning With Python

    Step 1: Basic Python Skills install Anacondaincluding numpy, scikit-learn, and matplotlib Step 2: Fo ...

  4. Getting started with machine learning in Python

    Getting started with machine learning in Python Machine learning is a field that uses algorithms to ...

  5. 《Learning scikit-learn Machine Learning in Python》chapter1

    前言 由于实验原因,准备入坑 python 机器学习,而 python 机器学习常用的包就是 scikit-learn ,准备先了解一下这个工具.在这里搜了有 scikit-learn 关键字的书,找 ...

  6. 【Machine Learning】Python开发工具:Anaconda+Sublime

    Python开发工具:Anaconda+Sublime 作者:白宁超 2016年12月23日21:24:51 摘要:随着机器学习和深度学习的热潮,各种图书层出不穷.然而多数是基础理论知识介绍,缺乏实现 ...

  7. In machine learning, is more data always better than better algorithms?

    In machine learning, is more data always better than better algorithms? No. There are times when mor ...

  8. Coursera, Big Data 4, Machine Learning With Big Data (week 1/2)

    Week 1 Machine Learning with Big Data KNime - GUI based Spark MLlib - inside Spark CRISP-DM Week 2, ...

  9. Machine Learning的Python环境设置

    Machine Learning目前经常使用的语言有Python.R和MATLAB.如果采用Python,需要安装大量的数学相关和Machine Learning的包.一般安装Anaconda,可以把 ...

随机推荐

  1. Altium Designer

    抗干扰设计原则: 1.电源线的设计 选择合适的电源 尽量加宽电源线 保证电源线.底线走向和数据传输方向一致 使用抗干扰元器件(磁珠.电源滤波器等) 电源入口添加去耦电容 2.底线的设计 模拟地和数字地 ...

  2. Python中str、list、numpy分片操作

    在Python里,像字符串(str).列表(list).元组(tupple)和这类序列类型都支持切片操作 对对象切片,s是一个字符串,可以通过类似数组索引的方式获取字符串中的字符,同时也可以用s[a: ...

  3. xgboost原理总结和代码展示

    关于xgboost的学习推荐两篇博客,每篇看2遍,我都能看懂,你肯定没问题 两篇方法互通,知识点互补!记录下来,方便以后查看 第一篇:作者:milter链接:https://www.jianshu.c ...

  4. 获取获取docker的文件

    1.docke实例内mysql 导出文件 mysql -h yourhost -P yourport -u user -p dbname -e "select * from employee ...

  5. Ntdsutil.exe

    Ntdsutil.exe 编辑 本词条缺少名片图,补充相关内容使词条更完整,还能快速升级,赶紧来编辑吧! Ntdsutil.exe 是一个为 Active Directory 提供管理设施的命令行工具 ...

  6. 【网易严选】iOS持续集成打包(Jenkins+fastlane+nginx)

    本文来自网易云社区 作者:孙娇 严选iOS客户端的现有打包方式是通过远程连接打包机执行脚本去打包,打完包会输出相应的ipa的二维码,扫一扫二维码可以安装,但是随着测试队伍的壮大,外包同学越来越多,在打 ...

  7. 【Remove Nth Node From End of List】cpp

    题目: Given a linked list, remove the nth node from the end of list and return its head. For example, ...

  8. python-day3-内置函数与字符字节之间的转换

    #三元运算 1 if True else 0 >>>1 1 if False else 0 >>>0 #内置函数lambda def f1(a1): return ...

  9. 微信小程序--问题汇总及详解之tab切换

    设置背景颜色就直接在page里设置    page {background-color: rgb(242, 242, 242);} tab切换: navigator 页面链接 传参的格式为url=&q ...

  10. sql cte的使用

    cte是可以连续使用的,多个cte用逗号隔开,但是只能有一个with 百度文章标题:Sql server中使用with as 提高性能+高效分页