Handling Missing Values

1) A Simple Option: Drop Columns with Missing Values

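If those columns hold useful information (in the places where values are not missing), the model loses access to that information when the column is dropped. Also, if your test data has missing values in places where your training data did not, this will result in an error.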

data_without_missing_values = original_data.dropna(axis=1)
 
# Apply the same column drop to both the training and test data
cols_with_missing = [col for col in original_data.columns
                     if original_data[col].isnull().any()]
reduced_original_data = original_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)
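
The failure mode described above (test data with gaps where the training data has none) can be illustrated with a small sketch. The frames `train` and `test` and their column names are hypothetical toy data, not part of the original example. Choosing the columns from both frames is only one possible workaround, and it assumes the test data is available when the columns are chosen.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" has a gap in training, column "c" only in test.
train = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                      "b": [4.0, np.nan, 6.0],
                      "c": [7.0, 8.0, 9.0]})
test = pd.DataFrame({"a": [1.5, 2.5],
                     "b": [4.5, 5.5],
                     "c": [np.nan, 8.5]})

# Columns chosen from the training data alone.
cols_from_train = [col for col in train.columns if train[col].isnull().any()]

# Columns with a gap in either frame, so neither reduced frame contains NaN.
cols_from_both = [col for col in train.columns
                  if train[col].isnull().any() or test[col].isnull().any()]

reduced_train = train.drop(cols_from_both, axis=1)
reduced_test = test.drop(cols_from_both, axis=1)

print(cols_from_train)   # ['b']
print(cols_from_both)    # ['b', 'c']
# Dropping only the training-derived columns leaves a gap in test column "c".
print(test.drop(cols_from_train, axis=1).isnull().any().any())  # True
print(reduced_test.isnull().any().any())                         # False
```

With only `cols_from_train` dropped, the test frame still contains a NaN, which many scikit-learn estimators reject at prediction time.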

2) A Better Option: Imputation

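By default, the imputer fills each missing value with the mean of its column. Statisticians have researched more complex strategies, but those strategies typically give no benefit once the results are plugged into a sophisticated machine learning model.

One (of many) nice things about Imputation is that it can be included in a scikit-learn Pipeline. Pipelines simplify model building, model validation and model deployment.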

from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)
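
As a rough sketch of how imputation fits into a scikit-learn Pipeline, the example below chains a `SimpleImputer` with a `RandomForestRegressor` using `make_pipeline`. The toy frame `X`, the target `y`, the column names and the model settings are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data with gaps in both predictor columns.
X = pd.DataFrame({"rooms": [2.0, 3.0, np.nan, 4.0, 3.0, 2.0],
                  "area": [50.0, np.nan, 80.0, 120.0, 75.0, 55.0]})
y = pd.Series([300.0, 420.0, 500.0, 800.0, 450.0, 320.0])

# The imputer learns the column means during fit and reuses them at predict time.
pipeline = make_pipeline(SimpleImputer(strategy="mean"),
                         RandomForestRegressor(n_estimators=10, random_state=0))
pipeline.fit(X, y)

# New rows may contain missing values too; the pipeline imputes them first.
X_new = pd.DataFrame({"rooms": [np.nan, 3.0], "area": [60.0, np.nan]})
preds = pipeline.predict(X_new)

# Column means learned from the training data: 2.8 for rooms, 76.0 for area.
print(pipeline.named_steps["simpleimputer"].statistics_)
print(preds.shape)  # (2,)
```

Because the imputer lives inside the pipeline, the means are always computed from whatever data the pipeline is fitted on, which keeps validation folds and deployment consistent.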

3) An Extension To Imputation

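Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which were not collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model can make better predictions by considering which values were originally missing.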

import pandas as pd
from sklearn.impute import SimpleImputer

# make copy to avoid changing original data (when Imputing)
new_data = original_data.copy()

# make new columns indicating what will be imputed
cols_with_missing = [col for col in new_data.columns
                     if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()

# Imputation (keep the expanded column names, including the indicator columns;
# reusing original_data.columns here would fail with a length mismatch)
my_imputer = SimpleImputer()
new_data = pd.DataFrame(my_imputer.fit_transform(new_data),
                        columns=new_data.columns, index=new_data.index)
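
scikit-learn can also produce the indicator columns itself through the `add_indicator` parameter of `SimpleImputer`. The sketch below is an alternative to the manual loop, not the method used in this post; the toy frame `data` is hypothetical, and the `_was_missing` suffix is chosen here only to mirror the naming above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: only column "b" has a missing value.
data = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                     "b": [4.0, np.nan, 8.0]})

# add_indicator=True appends one 0/1 column per feature that had
# missing values during fit.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
imputed = imputer.fit_transform(data)

# Rebuild a DataFrame; indicator_.features_ holds the indices of the
# features that received an indicator column.
missing_cols = [data.columns[i] + "_was_missing"
                for i in imputer.indicator_.features_]
result = pd.DataFrame(imputed,
                      columns=list(data.columns) + missing_cols,
                      index=data.index)
print(result)
```

The gap in `b` is filled with the column mean (6.0), and `b_was_missing` is 1.0 only for that row.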

Example (Comparing All Solutions)

import pandas as pd
 
# Load data
melb_data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')
 
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
 
melb_target = melb_data.Price
melb_predictors = melb_data.drop(['Price'], axis=1)
 
# For the sake of keeping the example simple, we'll use only numeric predictors.
melb_numeric_predictors = melb_predictors.select_dtypes(exclude=['object'])
 
X_train, X_test, y_train, y_test = train_test_split(melb_numeric_predictors,
                                                    melb_target,
                                                    train_size=0.7,
                                                    test_size=0.3,
                                                    random_state=0)
 
def score_dataset(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)
 
# Get Model Score from Dropping Columns with Missing Values
# Drop the columns that contain missing values outright
cols_with_missing = [col for col in X_train.columns
                                 if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test  = X_test.drop(cols_with_missing, axis=1)
print("Mean Absolute Error from dropping columns with Missing Values:")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))
 
# Get Model Score from Imputation
# Fill in the missing values by imputation
from sklearn.impute import SimpleImputer
 
my_imputer = SimpleImputer()
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)
print("Mean Absolute Error from Imputation:")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))
 
# Get Score from Imputation with Extra Columns Showing What Was Imputed
# Add extra columns flagging which values were missing
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()

cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()
 
# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)
 
print("Mean Absolute Error from Imputation while Tracking What Was Imputed:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))
