[Machine Learning with Python] Data Preparation by Pandas and Scikit-Learn
In this article, we dicuss some main steps in data preparation.
Drop Labels
Firstly, we drop labels for train set. Here we use drop() method in Pandas library.
housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()
Here are some tips:
- The drop funtion deletes rows by default. If you want to delete columns, don't forget to set the parameter axis=1.
- The
dropfunction doesn't change the DataFrame by default. And instead, returns to you a copy of the DataFrame with the given rows/columns removed. Or you can set inplace = True. - Note the function copy() here. It creates a copy that will not affect the original DataFrame
Impute Missing Values
Firstly, let's check the missing values:
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
Here give three methods to impute missing values:
Option 1: drop the rows
sample_incomplete_rows.dropna(subset=["total_bedrooms"])
Option 2: drop the columns
sample_incomplete_rows.drop("total_bedrooms", axis=1)
Option 3: impute with the median value
median = housing["total_bedrooms"].median()
sample_incomplete_rows["total_bedrooms"].fillna(median, inplace=True)
Alternatively, we can import sklearn.impute.SimpleImputer class in Scikit-Learn 0.20.
try:
from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+
except ImportError:
from sklearn.preprocessing import Imputer as SimpleImputer imputer = SimpleImputer(strategy="median")
# Remove the text attribute because median can only be calculated on numerical attributes
housing_num = housing.drop('ocean_proximity', axis=1)
# alternatively: housing_num = housing.select_dtypes(include=[np.number])
imputer.fit(housing_num)
We can check the statistcs by imputer.statistics_ and the strategy by imputer.strategy
Finally, transform the train set:
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
index = list(housing.index.values))
Encode Categorical Attributes
We need to convert text labels to numbers. There are two methods.
Option 1: Label Encoding
Conver a categorical attribute into an interger attribute.
try:
from sklearn.preprocessing import OrdinalEncoder
except ImportError:
from future_encoders import OrdinalEncoder # Scikit-Learn < 0.20 ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
Option2: One-Hot Encoding
Convert a categorical attribute into a series of binary intergers.
try:
from sklearn.preprocessing import OrdinalEncoder # just to raise an ImportError if Scikit-Learn < 0.20
from sklearn.preprocessing import OneHotEncoder
except ImportError:
from future_encoders import OneHotEncoder # Scikit-Learn < 0.20 cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
By default, the OneHotEncoder class returns a sparse array, but we can convert it to a dense array if needed by calling the toarray()method:
housing_cat_1hot.toarray()
Alternatively, you can set sparse=False when creating the OneHotEncoder:
cat_encoder = OneHotEncoder(sparse=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
Feature Engineering
Sometimes, we need to add some features to better describe the variation of the target variable. Let's create a custom transformer to add extra attributes and implement three methods: fit()(returning self), transform(), and fit_transform(). You can get the last one for free by simply adding TransformerMixin as a base class. Also, if you add BaseEstima tor as a base class (and avoid *args and **kargs in your constructor) you will get two extra methods (get_params() and set_params()) that will be useful for auto‐ matic hyperparameter tuning.
from sklearn.base import BaseEstimator, TransformerMixin # column index
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6 class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
self.add_bedrooms_per_room = add_bedrooms_per_room
def fit(self, X, y=None):
return self # nothing else to do
def transform(self, X, y=None):
rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
population_per_household = X[:, population_ix] / X[:, household_ix]
if self.add_bedrooms_per_room:
bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
return np.c_[X, rooms_per_household, population_per_household,
bedrooms_per_room]
else:
return np.c_[X, rooms_per_household, population_per_household] attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
[Machine Learning with Python] Data Preparation by Pandas and Scikit-Learn的更多相关文章
- [Machine Learning with Python] Data Preparation through Transformation Pipeline
In the former article "Data Preparation by Pandas and Scikit-Learn", we discussed about a ...
- [Machine Learning with Python] Data Visualization by Matplotlib Library
Before you can plot anything, you need to specify which backend Matplotlib should use. The simplest ...
- Python (1) - 7 Steps to Mastering Machine Learning With Python
Step 1: Basic Python Skills install Anacondaincluding numpy, scikit-learn, and matplotlib Step 2: Fo ...
- Getting started with machine learning in Python
Getting started with machine learning in Python Machine learning is a field that uses algorithms to ...
- 《Learning scikit-learn Machine Learning in Python》chapter1
前言 由于实验原因,准备入坑 python 机器学习,而 python 机器学习常用的包就是 scikit-learn ,准备先了解一下这个工具.在这里搜了有 scikit-learn 关键字的书,找 ...
- 【Machine Learning】Python开发工具:Anaconda+Sublime
Python开发工具:Anaconda+Sublime 作者:白宁超 2016年12月23日21:24:51 摘要:随着机器学习和深度学习的热潮,各种图书层出不穷.然而多数是基础理论知识介绍,缺乏实现 ...
- In machine learning, is more data always better than better algorithms?
In machine learning, is more data always better than better algorithms? No. There are times when mor ...
- Coursera, Big Data 4, Machine Learning With Big Data (week 1/2)
Week 1 Machine Learning with Big Data KNime - GUI based Spark MLlib - inside Spark CRISP-DM Week 2, ...
- Machine Learning的Python环境设置
Machine Learning目前经常使用的语言有Python.R和MATLAB.如果采用Python,需要安装大量的数学相关和Machine Learning的包.一般安装Anaconda,可以把 ...
随机推荐
- python使用@property @x.setter @x.deleter
@property可以将python定义的函数“当做”属性访问,从而提供更加友好访问方式,但是有时候setter/deleter也是需要的. 1>只有@property表示只读. 2>同时 ...
- 【Python】函数参数类型及用法
一.函数的参数类型 def hs(a1,a2,a3,...): ****statements 其中a1,a2,a3是函数的参数,函数的参数类型可分为:必须参数.默认参数.可变参数(不定长参数).关键 ...
- wim
wim 编辑 WIM是英文Microsoft Windows Imaging Format(WIM)的简称,它是Windows基于文件的映像格式.WIM 映像格式并非现在相当常见的基于扇区的映像格式, ...
- 设计模式之第18章-观察者模式(Java实现)
设计模式之第18章-观察者模式(Java实现) 话说曾小贤,也就是陈赫这些天有些火,那么这些明星最怕的,同样最喜欢的是什么呢?没错,就是狗仔队.英文的名字比较有意思,是paparazzo,这一说法据说 ...
- Python+Selenium练习篇之14-获取当前页面的title
前面文章介绍了如何获取当前页面的URL的值,本文介绍如何获取当前页面的title,这个也可以作为测试结果的依据,通过得到的title和预期的值对比,可以支持我们判断页面跳转正确. 相关脚本代码如下: ...
- Python+Selenium练习篇之6-利用class name定位元素
有时候,我们在用firepath(不会的请点这里)查看元素的XPath信息,发现没有可以用来定位的id信息,这个时候我们就需要考虑用其他的可用的来定位元素.本文介绍如何通过元素节点中class nam ...
- Python3.0-3.6的版本变化
Table of Contents Python3.0 简单的变化 语法的变化 新语法 改动的语法 剩下的变化 Python3.1 Python3.2 Python3.3 Python3.4 Pyth ...
- python-day3-内置函数与字符字节之间的转换
#三元运算 1 if True else 0 >>>1 1 if False else 0 >>>0 #内置函数lambda def f1(a1): return ...
- icpc南昌邀请赛 比赛总结
上周末,我参加了icpc南昌区域赛邀请赛,这也是我的第一次外出参赛. 星期五晚上,在6个小时的火车和1个小时的公交后,我们终于抵达了江西师范大学,这次的比赛场地.江西师范大学周围的设施很齐全,各种烧烤 ...
- 用nc+简单bat/vbs脚本+winrar制作迷你远控后门
前言 某大佬某天和我聊起了nc,并且提到了nc正反向shell这个概念. 我对nc之前的了解程度仅局限于:可以侦听TCP/UDP端口,发起对应的连接. 真正的远控还没实践过,所以决定写个小后门试一试. ...