Outliers

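Introduction

In data mining we often run into data points that fall well outside the predicted trend. These points are called outliers.

Such points are often put down to error, but not every cause is the same, and each case has to be handled on its own terms: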

Sensor malfunction　　->　　ignore

Data entry error　　　->　　ignore

Freak event　　　　　->　　pay attention

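Outlier detection/removal algorithm

1. Train on the data

2. Detect outliers: find the points in the training set with the largest residual error (the gap between the prediction and the actual value) and remove them (typically about 10% of the data)

3. Train again

Steps 2 and 3 usually need to be repeated several times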
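The train / remove / retrain loop above can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic, `fit_line` and `drop_worst` are hypothetical helper names, and `numpy.polyfit` stands in for the scikit-learn regression used later in this post.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares straight-line fit; returns (slope, intercept)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept

def drop_worst(x, y, slope, intercept, fraction=0.1):
    """Drop the `fraction` of points with the largest absolute residuals."""
    residuals = np.abs(y - (slope * x + intercept))
    n_drop = int(np.ceil(len(x) * fraction))
    keep = np.argsort(residuals)[:len(x) - n_drop]
    return x[keep], y[keep]

# Synthetic data (an assumption for illustration): y = 6x plus noise,
# with the ten lowest-x points pushed far off the trend.
rng = np.random.default_rng(0)
x = np.linspace(20, 65, 100)
y = 6.0 * x + rng.normal(0, 5, size=100)
y[:10] += 300.0

# Step 1: train on everything.
slope_before, intercept_before = fit_line(x, y)

# Steps 2 and 3, repeated twice: remove the worst 10%, then retrain.
x_clean, y_clean = x, y
for _ in range(2):
    slope, intercept = fit_line(x_clean, y_clean)
    x_clean, y_clean = drop_worst(x_clean, y_clean, slope, intercept)
slope_after, intercept_after = fit_line(x_clean, y_clean)

print("slope before cleaning:", slope_before)
print("slope after cleaning: ", slope_after)
print("points kept:", len(x_clean))
```

The fraction removed per pass and the number of passes are both tuning choices. Each pass removes points whether or not they are truly outliers, so repeating too often starts to discard good data.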

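Example: the fit after the first regression on the data

The outliers pull the fitted line away from the trend. After removing them, run the regression once more:

With the outliers removed, the regression line fits the data well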

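Code implementation

Environment: macOS Mojave 10.14.3

Python 3.7.0

Library: scikit-learn 0.19.2

Original data set:

One regression on the original data:

One regression after removing the 10% of points with the largest errors:

outlier_removal_regression.py　　main program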

#!/usr/bin/python

import pickle

import numpy
import matplotlib.pyplot as plt
from sklearn import linear_model
# sklearn.cross_validation was removed in scikit-learn 0.20;
# sklearn.model_selection provides the same train_test_split
from sklearn.model_selection import train_test_split

from outlier_cleaner import outlierCleaner


class StrToBytes:
    """Wrap a file opened in text mode so that pickle.load() receives bytes."""

    def __init__(self, fileobj):
        self.fileobj = fileobj

    def read(self, size):
        return self.fileobj.read(size).encode()

    def readline(self, size=-1):
        return self.fileobj.readline(size).encode()


### load up some practice data with outliers in it
with open("practice_outliers_ages.pkl", "r") as f:
    ages = pickle.load(StrToBytes(f))
with open("practice_outliers_net_worths.pkl", "r") as f:
    net_worths = pickle.load(StrToBytes(f))

### ages and net_worths need to be reshaped into 2D numpy arrays
### second argument of reshape command is a tuple of integers: (n_rows, n_columns)
### by convention, n_rows is the number of data points
### and n_columns is the number of features
ages       = numpy.reshape( numpy.array(ages), (len(ages), 1))
net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))

ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(ages, net_worths, test_size=0.1, random_state=42)

### fill in a regression here!  Name the regression object reg so that
### the plotting code below works, and you can see what your regression looks like
reg = linear_model.LinearRegression()
reg.fit(ages_train, net_worths_train)

### slope, intercept, and R^2 score on the test set
print (reg.coef_)
print (reg.intercept_)
print (reg.score(ages_test, net_worths_test) )

try:
    plt.plot(ages, reg.predict(ages), color="blue")
except NameError:
    pass
plt.scatter(ages, net_worths)
plt.show()

### identify and remove the most outlier-y points
cleaned_data = []
try:
    predictions = reg.predict(ages_train)
    cleaned_data = outlierCleaner( predictions, ages_train, net_worths_train )
except NameError:
    print ("your regression object doesn't exist, or isn't named reg")
    print ("can't make predictions to use in identifying outliers")

### only run this code if cleaned_data is returning data
if len(cleaned_data) > 0:
    ages, net_worths, errors = zip(*cleaned_data)
    ages       = numpy.reshape( numpy.array(ages), (len(ages), 1))
    net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))

    ### refit your cleaned data!
    try:
        reg.fit(ages, net_worths)
        plt.plot(ages, reg.predict(ages), color="blue")
        print (reg.coef_)
        print (reg.intercept_)
        print (reg.score(ages_test, net_worths_test) )
    except NameError:
        print ("you don't seem to have regression imported/created,")
        print ("   or else your regression object isn't named reg")
        print ("   either way, only draw the scatter plot of the cleaned data")

    plt.scatter(ages, net_worths)
    plt.xlabel("ages")
    plt.ylabel("net worths")
    plt.show()

else:
    print ("outlierCleaner() is returning an empty list, no refitting to be done")

outlier_cleaner.py　　removes the 10% of points with the largest errors

import math


def outlierCleaner(predictions, ages, net_worths):
    """
        Clean away the 10% of points that have the largest
        residual errors (difference between the prediction
        and the actual net worth).

        Return a list of tuples named cleaned_data where
        each tuple is of the form (age, net_worth, error).
    """

    # flatten the (n, 1) column arrays into 1D arrays
    ages = ages.reshape((1, len(ages)))[0]
    net_worths = net_worths.reshape((1, len(ages)))[0]
    predictions = predictions.reshape((1, len(ages)))[0]

    # zip() packs the corresponding elements of its arguments into
    # (age, net_worth, error) tuples
    cleaned_data = zip(ages, net_worths, abs(net_worths - predictions))

    # sort by error, smallest first
    cleaned_data = sorted(cleaned_data, key=lambda x: x[2])

    # ceil() rounds up: the number of points to drop, negated for slicing
    cleaned_num = int(-1 * math.ceil(len(cleaned_data) * 0.1))

    # slice off the points with the largest errors
    cleaned_data = cleaned_data[:cleaned_num]

    return cleaned_data
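To make the slice arithmetic in the cleaner concrete, here is a small worked example. The helper `n_kept` and the sample sizes are hypothetical, chosen only to show how many points survive the `[:cleaned_num]` slice for a given input length.

```python
import math

def n_kept(n_points, fraction=0.1):
    """How many points remain after slicing with the negated ceil."""
    cut = int(-1 * math.ceil(n_points * fraction))
    # Stand-in for the sorted list of (age, net_worth, error) tuples.
    return len(list(range(n_points))[:cut])

print(n_kept(90))  # ceil(9.0) = 9 dropped,  81 kept
print(n_kept(95))  # ceil(9.5) = 10 dropped, 85 kept
print(n_kept(5))   # ceil(0.5) = 1 dropped,  4 kept
```

Because of the rounding up, at least one point is dropped from any non-empty input, so slightly more than 10% is removed whenever the length is not a multiple of ten.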

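The two regressions also give the following goodness of fit (the R² returned by reg.score on the test set):

First: 0.8782624703664675

Second: 0.983189455395532

Removing the outliers raised the score from about 0.878 to about 0.983, which shows how much outlier removal can matter for prediction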
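For reference, the score that `LinearRegression.score` reports is the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up values. `r_squared` is a hypothetical helper, not part of the scripts above.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, [1.0, 2.0, 3.0, 4.0])    # predictions match exactly
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean

print(perfect)    # 1.0
print(mean_only)  # 0.0
```

A score near 1 means the line explains most of the variation in the test targets; a model that always predicts the mean scores 0.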
