Handling Missing Values

1) A Simple Option: Drop Columns with Missing Values

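If those columns contain useful information (in the places where values are not missing), the model loses access to that information when the columns are dropped. In addition, if your test data has missing values in places where your training data did not, this will result in an error.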

data_without_missing_values = original_data.dropna(axis=1)

# Apply the same column drop to both the training and test data

cols_with_missing = [col for col in original_data.columns

                                 if original_data[col].isnull().any()]

reduced_original_data = original_data.drop(cols_with_missing, axis=1)

reduced_test_data = test_data.drop(cols_with_missing, axis=1)
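The caveat about test data can be seen on a small toy example. This is only a sketch: the frames `train` and `test` and their column names are made up for illustration. The columns to drop are chosen from the training data alone, so a gap that appears only in the test data survives the drop.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'b' has a gap in training, 'c' has a gap only in test.
train = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                      "b": [4.0, np.nan, 6.0],
                      "c": [7.0, 8.0, 9.0]})
test = pd.DataFrame({"a": [1.5, 2.5],
                     "b": [4.5, 5.5],
                     "c": [np.nan, 8.5]})

# The columns to drop are decided from the training data only.
cols_with_missing = [col for col in train.columns
                     if train[col].isnull().any()]

reduced_train = train.drop(cols_with_missing, axis=1)
reduced_test = test.drop(cols_with_missing, axis=1)

print(cols_with_missing)                   # ['b']
print(reduced_test.isnull().any().any())   # True: 'c' still has a gap in test
```

A model fitted on `reduced_train` would then meet a missing value in `reduced_test`, which is the situation described above.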

2) A Better Option: Imputation

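By default, imputation fills in each missing value with the column mean. Statisticians have researched more complex strategies, but those strategies typically give no benefit once the results are plugged into a sophisticated machine learning model.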

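One of the (many) nice things about imputation is that it can be included in a scikit-learn Pipeline. Pipelines simplify model building, model validation and model deployment.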

from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()

data_with_imputed_values = my_imputer.fit_transform(original_data)
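As a rough sketch of the Pipeline point above, the imputer and a model can be chained so that the column means are learned during `fit` and reused during `predict`. The toy frame `X`, the target `y`, and the column names are invented for illustration; the choice of `RandomForestRegressor` with 10 trees is likewise just an assumption to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data with gaps in both columns.
X = pd.DataFrame({"rooms": [2.0, 3.0, np.nan, 4.0],
                  "area": [50.0, np.nan, 80.0, 100.0]})
y = pd.Series([200.0, 300.0, 320.0, 400.0])

# Imputation and the model are fitted together as one estimator.
pipeline = make_pipeline(SimpleImputer(),
                         RandomForestRegressor(n_estimators=10, random_state=0))
pipeline.fit(X, y)

# New rows with missing values are imputed with the means learned from X.
X_new = pd.DataFrame({"rooms": [np.nan], "area": [70.0]})
prediction = pipeline.predict(X_new)
print(prediction)
```

Because the imputer lives inside the pipeline, the same object can be passed to cross-validation or used at prediction time without imputing by hand.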

3) An Extension To Imputation

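Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which were not collected in the dataset). Or rows with missing values may be unusual in some other way. In that case, your model can make better predictions by taking into account which values were originally missing.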

# make copy to avoid changing original data (when Imputing)

new_data = original_data.copy()

# make new columns indicating what will be imputed

cols_with_missing = [col for col in new_data.columns

                                 if new_data[col].isnull().any()]

for col in cols_with_missing:

    new_data[col + '_was_missing'] = new_data[col].isnull()

# Imputation

my_imputer = SimpleImputer()

# keep the column labels (including the new '_was_missing' columns) and the index

new_data = pd.DataFrame(my_imputer.fit_transform(new_data),

                        columns=new_data.columns,

                        index=new_data.index)
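As an alternative to building the `_was_missing` columns by hand, `SimpleImputer` accepts an `add_indicator=True` argument (available in scikit-learn 0.21 and later) that appends indicator columns for the features that had missing values during `fit`. This is a different route to the same idea, not the method used in the snippets here; the toy frame `data` is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: only column 'a' has a missing value.
data = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                     "b": [4.0, 5.0, 6.0]})

# add_indicator=True appends one 0/1 column per feature that had gaps in fit.
imputer = SimpleImputer(add_indicator=True)
result = imputer.fit_transform(data)

print(result)
# [[1. 4. 0.]
#  [2. 5. 1.]
#  [3. 6. 0.]]
```

The first two columns are the imputed `a` and `b`; the third flags the row where `a` was originally missing.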

Example (Comparing All Solutions)

import pandas as pd

# Load data

melb_data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_absolute_error

from sklearn.model_selection import train_test_split

melb_target = melb_data.Price

melb_predictors = melb_data.drop(['Price'], axis=1)

# For the sake of keeping the example simple, we'll use only numeric predictors.

melb_numeric_predictors = melb_predictors.select_dtypes(exclude=['object'])

X_train, X_test, y_train, y_test = train_test_split(melb_numeric_predictors,

                                                    melb_target,

                                                    train_size=0.7,

                                                    test_size=0.3,

                                                    random_state=0)

def score_dataset(X_train, X_test, y_train, y_test):

    model = RandomForestRegressor()

    model.fit(X_train, y_train)

    preds = model.predict(X_test)

    return mean_absolute_error(y_test, preds)

# Get Model Score from Dropping Columns with Missing Values
# Drop the columns that contain missing values outright

cols_with_missing = [col for col in X_train.columns

                                 if X_train[col].isnull().any()]

reduced_X_train = X_train.drop(cols_with_missing, axis=1)

reduced_X_test  = X_test.drop(cols_with_missing, axis=1)

print("Mean Absolute Error from dropping columns with Missing Values:")

print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))

# Get Model Score from Imputation
# Fill in the missing values by imputation

from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()

imputed_X_train = my_imputer.fit_transform(X_train)

imputed_X_test = my_imputer.transform(X_test)

print("Mean Absolute Error from Imputation:")

print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))

# Get Score from Imputation with Extra Columns Showing What Was Imputed
# Add extra columns that flag which values were missing

imputed_X_train_plus = X_train.copy()

imputed_X_test_plus = X_test.copy()

cols_with_missing = [col for col in X_train.columns

                                 if X_train[col].isnull().any()]

for col in cols_with_missing:

    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()

    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()

# Imputation

my_imputer = SimpleImputer()

imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)

imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)

print("Mean Absolute Error from Imputation while Tracking What Was Imputed:")

print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))
