Handling Missing Values
1) A Simple Option: Drop Columns with Missing Values
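If those columns contain useful information (in the rows where values are present), the model loses access to that information once the columns are dropped. In addition, if your test data has missing values in places where your training data does not, this approach will result in an error.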
# Drop every column that contains at least one missing value.
data_without_missing_values = original_data.dropna(axis=1)

# To treat the training and test sets consistently, drop the same columns from both.
cols_with_missing = [col for col in original_data.columns
                     if original_data[col].isnull().any()]
reduced_original_data = original_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)
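The error described in this section appears when a column is complete in the training data but has gaps in the test data, so it survives the drop and still carries missing values at prediction time. One possible guard, sketched here with small made-up frames (`train_df`, `test_df` and the helper `columns_with_missing` are hypothetical names, not part of the original example), is to collect the incomplete columns from both sets before dropping.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" is complete in train but not in test.
train_df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                         "b": [4.0, 5.0, 6.0],
                         "c": [7.0, 8.0, 9.0]})
test_df = pd.DataFrame({"a": [1.0, 2.0],
                        "b": [np.nan, 5.0],
                        "c": [7.0, 8.0]})


def columns_with_missing(df):
    """Return the names of columns holding at least one missing value."""
    return [col for col in df.columns if df[col].isnull().any()]


# Union of the columns that are incomplete in either set, in train column order.
missing_in_train = set(columns_with_missing(train_df))
missing_in_test = set(columns_with_missing(test_df))
cols_to_drop = [col for col in train_df.columns
                if col in missing_in_train or col in missing_in_test]

reduced_train = train_df.drop(cols_to_drop, axis=1)
reduced_test = test_df.drop(cols_to_drop, axis=1)
print(cols_to_drop)  # ['a', 'b']
```

This only helps when the test set is available up front. For data that arrives after the model is built, imputation is the more robust choice.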
2) A Better Option: Imputation
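By default, imputation fills each missing value with the mean of its column. Statisticians have studied more sophisticated strategies, but once the results are fed into a complex machine learning model, those strategies usually give no additional benefit.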
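To make the default behaviour concrete, here is a minimal sketch on made-up data (the frame `toy` and its column names are hypothetical). It shows that `SimpleImputer()` fills gaps with the column mean, and that another built-in strategy such as the median can be selected through the `strategy` parameter.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one gap per column.
toy = pd.DataFrame({"rooms": [1.0, np.nan, 2.0, 9.0],
                    "area": [50.0, 70.0, np.nan, 90.0]})

# strategy="mean" is the default.
mean_imputer = SimpleImputer()
mean_filled = mean_imputer.fit_transform(toy)

# The median is less sensitive to the outlier in "rooms".
median_imputer = SimpleImputer(strategy="median")
median_filled = median_imputer.fit_transform(toy)

print(mean_imputer.statistics_)    # per-column means learned during fit
print(median_imputer.statistics_)  # per-column medians learned during fit
```

In this toy frame the mean of `rooms` is 4.0 while its median is 2.0, so the two strategies fill the same gap with different values.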
One of the (many) benefits of imputation is that it can be included in a scikit-learn Pipeline. Pipelines simplify model building, model validation, and model deployment.
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)
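As a rough illustration of the Pipeline point, the sketch below chains an imputer and a model with `make_pipeline`. The data is synthetic and purely hypothetical. The choice of `RandomForestRegressor` mirrors the model used in the comparison later in this post.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Hypothetical synthetic data: 200 rows, 3 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

# Remove roughly 10% of the feature values after the target is computed.
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer learns its column means only from the data passed to fit(),
# and the same means are reused when predict() is called.
pipeline = make_pipeline(SimpleImputer(),
                         RandomForestRegressor(n_estimators=50, random_state=0))
pipeline.fit(X[:150], y[:150])
preds = pipeline.predict(X[150:])
print(preds.shape)  # (50,)
```

Because the imputer lives inside the pipeline, there is no separate `transform` step to forget when the model is validated or deployed.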
3) An Extension To Imputation
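Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which were not collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model can make better predictions by taking into account which values were originally missing.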
import pandas as pd

# Make a copy to avoid changing the original data when imputing.
new_data = original_data.copy()

# Make new columns indicating what will be imputed.
cols_with_missing = [col for col in new_data.columns
                     if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()

# Imputation. Save the column names first, including the new indicator columns,
# because fit_transform returns a plain NumPy array.
imputed_columns = new_data.columns
my_imputer = SimpleImputer()
new_data = pd.DataFrame(my_imputer.fit_transform(new_data), columns=imputed_columns)
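scikit-learn can also build the indicator columns itself: `SimpleImputer` accepts `add_indicator=True`, which appends one 0/1 column for every feature that had missing values when the imputer was fitted. This is an alternative to the manual loop, not what the original example uses. The following sketch shows it on made-up data (the frame `toy` and the `_was_missing` naming are assumptions chosen to match the convention above).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: only "rooms" has a missing value.
toy = pd.DataFrame({"rooms": [1.0, np.nan, 3.0],
                    "area": [50.0, 60.0, 70.0]})

# add_indicator=True appends indicator columns after the imputed features.
imputer = SimpleImputer(add_indicator=True)
filled = imputer.fit_transform(toy)

# indicator_.features_ holds the positions of the features that had gaps.
missing_cols = [toy.columns[i] + "_was_missing"
                for i in imputer.indicator_.features_]
result = pd.DataFrame(filled,
                      columns=list(toy.columns) + missing_cols,
                      index=toy.index)
print(result)
```

The built-in indicator keeps the flag columns consistent between `fit_transform` on the training data and `transform` on the test data, since both use the same fitted imputer.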
Example (Comparing All Solutions)
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Load data
melb_data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

melb_target = melb_data.Price
melb_predictors = melb_data.drop(['Price'], axis=1)

# For the sake of keeping the example simple, we'll use only numeric predictors.
melb_numeric_predictors = melb_predictors.select_dtypes(exclude=['object'])

X_train, X_test, y_train, y_test = train_test_split(melb_numeric_predictors,
                                                    melb_target,
                                                    train_size=0.7,
                                                    test_size=0.3,
                                                    random_state=0)

def score_dataset(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)

# Get Model Score from Dropping Columns with Missing Values
# Drop the columns that contain missing values outright.
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test = X_test.drop(cols_with_missing, axis=1)
print("Mean Absolute Error from dropping columns with Missing Values:")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))

# Get Model Score from Imputation
# Fill in the missing values.
from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)
print("Mean Absolute Error from Imputation:")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))

# Get Score from Imputation with Extra Columns Showing What Was Imputed
# Add extra columns that flag which values were missing.
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()

cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)

print("Mean Absolute Error from Imputation while Tracking What Was Imputed:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))
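The comparison above needs the Kaggle housing file. For readers without it, here is a self-contained sketch of the same three-way comparison on synthetic data. Everything about the data (feature names `f0` to `f3`, the target formula, the 20% missing rate) is made up, so the scores say nothing about the Melbourne results. Two small deviations from the original are assumptions of this sketch: the forest gets a fixed `random_state` for repeatability, and the indicator columns are cast to float.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for the housing file.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["f0", "f1", "f2", "f3"])
y = 3.0 * X["f0"] + 2.0 * X["f1"] + rng.normal(scale=0.1, size=500)

# Knock out about 20% of the values in two columns (after computing y).
for col in ["f0", "f2"]:
    X.loc[rng.random(500) < 0.2, col] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0)


def score_dataset(X_train, X_test, y_train, y_test):
    # Fixed random_state so repeated runs give the same score.
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_test, model.predict(X_test))


cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
scores = {}

# 1) Drop columns with missing values.
scores["drop columns"] = score_dataset(
    X_train.drop(cols_with_missing, axis=1),
    X_test.drop(cols_with_missing, axis=1), y_train, y_test)

# 2) Plain mean imputation, fitted on the training split only.
imputer = SimpleImputer()
scores["imputation"] = score_dataset(
    imputer.fit_transform(X_train), imputer.transform(X_test), y_train, y_test)

# 3) Imputation plus columns that track what was imputed.
X_train_plus, X_test_plus = X_train.copy(), X_test.copy()
for col in cols_with_missing:
    X_train_plus[col + "_was_missing"] = X_train_plus[col].isnull().astype(float)
    X_test_plus[col + "_was_missing"] = X_test_plus[col].isnull().astype(float)
plus_imputer = SimpleImputer()
scores["imputation + tracking"] = score_dataset(
    plus_imputer.fit_transform(X_train_plus),
    plus_imputer.transform(X_test_plus), y_train, y_test)

for name, mae in scores.items():
    print(f"{name}: {mae:.3f}")
```

Which approach wins depends on the data, so it is worth running all three on your own dataset rather than assuming one ordering.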