[sklearn] Official example: Imputing missing values before building an estimator

Official link: http://scikit-learn.org/dev/auto_examples/plot_missing_values.html#sphx-glr-auto-examples-plot-missing-values-py

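This example shows that an estimator trained on data whose missing values have been imputed (here, filled in with the column mean) can score better than an estimator trained after simply discarding every sample that contains a missing value.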
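To make the two strategies concrete, here is a minimal sketch on a made-up 5x2 array. The array, the variable names, and the use of np.nan as the missing-value marker are assumptions for illustration only (the official example uses 0 as the marker); it only shows what "dropping" and "mean imputation" do to the data, not how a model scores afterwards.

```python
import numpy as np

# Made-up feature matrix; np.nan marks the missing entries (an assumption
# of this sketch; the official example uses 0 as the marker instead).
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [np.nan, 30.0],
              [4.0, 40.0],
              [5.0, np.nan]])

# Strategy 1: drop every row that contains a missing value.
complete_rows = ~np.isnan(X).any(axis=1)
X_dropped = X[complete_rows]

# Strategy 2: replace each missing value with the mean of its column,
# computed from the observed (non-NaN) entries only.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

print("rows kept after dropping:", X_dropped.shape[0])
print("rows kept after imputing:", X_imputed.shape[0])
```

Dropping keeps only the rows that were complete to begin with, while imputation keeps every row at the cost of replacing the unknown entries with an estimate.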

The example code, with added comments, follows:

－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－

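import numpy as np
# Note: this listing targets older scikit-learn releases. load_boston was
# removed in scikit-learn 1.2, and Imputer was removed in 0.22
# (sklearn.impute.SimpleImputer replaces it).
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.model_selection import cross_val_score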


# Create a seeded random number generator so the results are reproducible
rng = np.random.RandomState(0)

# Load the data: Boston house prices
dataset = load_boston()
X_full, y_full = dataset.data, dataset.target
n_samples = X_full.shape[0]
n_features = X_full.shape[1]

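# Estimate the score on the entire dataset, with no missing values
# Random forest regressor: random_state is the random seed,
# n_estimators is the number of trees in the forest
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
# Cross-validated score averaged over the folds
# (for a regressor this is R^2, not classification accuracy)
score = cross_val_score(estimator, X_full, y_full).mean()
print("Score with the entire dataset = %.2f" % score)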

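# Add missing values in 75% of the lines
missing_rate = 0.75
n_missing_samples = int(np.floor(n_samples * missing_rate))
# hstack joins the two 1-D arrays end to end into one boolean mask of length
# n_samples (for 2-D inputs the row counts would have to match).
# dtype=bool is used because the np.bool alias was removed in NumPy 1.24.
missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples,
                                      dtype=bool),
                             np.ones(n_missing_samples,
                                     dtype=bool)))

# Shuffle the mask in place so the affected samples are chosen at random
rng.shuffle(missing_samples)
# For each affected sample, pick one feature (column) index at random
missing_features = rng.randint(0, n_features, n_missing_samples)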

# Estimate the score without the lines containing missing values
X_filtered = X_full[~missing_samples, :]
y_filtered = y_full[~missing_samples]
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_filtered, y_filtered).mean()
print("Score without the samples containing missing values = %.2f" % score)

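# Estimate the score after imputation of the missing values
X_missing = X_full.copy()
# np.where(missing_samples)[0] gives the row indices of the affected samples;
# paired with missing_features, this sets one entry per affected row to 0,
# which serves as the "missing" marker
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()
# The imputer replaces every 0 in a column (including zeros that were already
# in the data) with the mean of that column's remaining values; the forest is
# then fitted on the imputed data
estimator = Pipeline([("imputer", Imputer(missing_values=0,
                                          strategy="mean",
                                          axis=0)),
                      ("forest", RandomForestRegressor(random_state=0,
                                                       n_estimators=100))])
score = cross_val_score(estimator, X_missing, y_missing).mean()
print("Score after imputation of the missing values = %.2f" % score)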

－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－
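The listing above relies on Imputer and load_boston, which recent scikit-learn releases no longer ship. As a rough sketch of the same three-way comparison on a current install, the block below swaps in sklearn.impute.SimpleImputer and uses np.nan as the missing-value marker. The synthetic data from make_regression, the smaller forest, cv=3, and the helper name mean_cv_score are all assumptions made for this sketch, so the scores it prints are not comparable to those of the official example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)

# Synthetic stand-in for the housing data (an assumption of this sketch).
X_full, y_full = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=0)
n_samples, n_features = X_full.shape

# Mark 75% of the samples as affected, one random feature each.
n_missing = int(np.floor(n_samples * 0.75))
missing_samples = np.zeros(n_samples, dtype=bool)
missing_samples[:n_missing] = True
rng.shuffle(missing_samples)
missing_features = rng.randint(0, n_features, n_missing)

# Use np.nan (not 0) as the marker, so genuine zeros are left alone.
X_missing = X_full.copy()
X_missing[np.where(missing_samples)[0], missing_features] = np.nan


def mean_cv_score(model, X, y):
    """Hypothetical helper: mean cross-validated score (R^2 for regressors)."""
    return cross_val_score(model, X, y, cv=3).mean()


forest = RandomForestRegressor(random_state=0, n_estimators=20)

# 1) Full data, 2) affected rows dropped, 3) affected entries mean-imputed.
score_full = mean_cv_score(forest, X_full, y_full)
score_dropped = mean_cv_score(forest, X_full[~missing_samples],
                              y_full[~missing_samples])
pipeline = Pipeline([
    ("imputer", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("forest", RandomForestRegressor(random_state=0, n_estimators=20)),
])
score_imputed = mean_cv_score(pipeline, X_missing, y_full)

print("Score with the entire dataset = %.2f" % score_full)
print("Score without the samples containing missing values = %.2f" % score_dropped)
print("Score after imputation of the missing values = %.2f" % score_imputed)
```

Putting the imputer inside the Pipeline means the column means are recomputed from the training folds only during cross-validation, so no information from the held-out fold leaks into the imputation.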
Supplement:
A. Usage of numpy.where():
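A short sketch of the two forms of numpy.where() may help here. The arrays and variable names are made up for illustration. With a single argument, np.where returns a tuple holding one index array per dimension, which is why the example code takes element [0] of the result; with three arguments it chooses element-wise between two values.

```python
import numpy as np

mask = np.array([False, True, True, False, True])

# One-argument form: a tuple with one index array per dimension.
# For a 1-D mask, element [0] holds the positions where the mask is True.
rows = np.where(mask)[0]
print(rows)  # positions 1, 2 and 4

# Pairing row indices with column indices addresses exactly one cell per
# row, which is how the example code plants one "missing" value per sample.
X = np.ones((5, 3))
cols = np.array([0, 2, 1])
X[rows, cols] = 0
print(X)

# Three-argument form: where the condition holds take the second argument,
# elsewhere take the third.
a = np.array([3, 0, 5, 0])
filled = np.where(a == 0, -1, a)
print(filled)
```

In the example code, `np.where(missing_samples)[0]` plays the role of `rows` and `missing_features` plays the role of `cols`.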
