Three ways to detect outliers
Z-score
import numpy as np def outliers_z_score(ys):
threshold = 3 mean_y = np.mean(ys)
stdev_y = np.std(ys)
z_scores = [(y - mean_y) / stdev_y for y in ys]
return np.where(np.abs(z_scores) > threshold)
Modified Z-score
import numpy as np def outliers_modified_z_score(ys):
threshold = 3.5 median_y = np.median(ys)
median_absolute_deviation_y = np.median([np.abs(y - median_y) for y in ys])
modified_z_scores = [0.6745 * (y - median_y) / median_absolute_deviation_y
for y in ys]
return np.where(np.abs(modified_z_scores) > threshold)
IQR(interquartile range)
import numpy as np def outliers_iqr(ys):
quartile_1, quartile_3 = np.percentile(ys, [25, 75])
iqr = quartile_3 - quartile_1
lower_bound = quartile_1 - (iqr * 1.5)
upper_bound = quartile_3 + (iqr * 1.5)
return np.where((ys > upper_bound) | (ys < lower_bound))
Conclusion
It is important to reiterate that these methods should not be used mechanically.
They should be used to explore the data – they let you know which points might be worth a closer look.
What to do with this information depends heavily on the situation.
Sometimes it is appropriate to exclude outliers from a dataset to make a model trained on that dataset more predictive.
Sometimes, however,
the presence of outliers is a warning sign that the real-world process generating the data is more complicated than expected. As an astute commenter on CrossValidated put it:
“Sometimes outliers are bad data, and should be excluded, such as typos.
Sometimes they are Wayne Gretzky or Michael Jordan, and should be kept.” Domain knowledge and practical wisdom are the only ways to tell the difference.
摘自:http://colingorrie.github.io/outlier-detection.html
Three ways to detect outliers的更多相关文章
- MATLAB remove outliers.
Answer by Richard Willey on 9 Jan 2012 Hi Michael MATLAB doesn't provide a specific function to remo ...
- 异常值处理outlier
python信用评分卡(附代码,博主录制) https://study.163.com/course/introduction.htm?courseId=1005214003&utm_camp ...
- 深度学习图像配准 Image Registration: From SIFT to Deep Learning
Image Registration is a fundamental step in Computer Vision. In this article, we present OpenCV feat ...
- (转)利用libcurl和国内著名的两个物联网云端通讯的例程, ubuntu和openwrt下调试成功(四)
1. libcurl 的参考文档如下 CURLOPT_HEADERFUNCTION Pass a pointer to a function that matches the following pr ...
- 如何查找物理cpu,cpu核心和逻辑cpu的数量
环境 Red Hat Enterprise Linux 4 Red Hat Enterprise Linux 5 Red Hat Enterprise Linux 6 Red Hat Enterpri ...
- Gartner 2018 年WAF魔力象限报告:云WAF持续增长,Bot管理与API安全拥有未来
Gartner 2018 年WAF魔力象限报告:云WAF持续增长,Bot管理与API安全拥有未来 来源 https://www.freebuf.com/articles/paper/184903.ht ...
- A Gentle Guide to Machine Learning
A Gentle Guide to Machine Learning Machine Learning is a subfield within Artificial Intelligence tha ...
- [linux]linuxI/O测试的方法之dd
参考http://www.thomas-krenn.com/en/wiki/Linux_I/O_Performance_Tests_using_dd Measuring Write Performan ...
- WGCNA构建基因共表达网络详细教程
这篇文章更多的是对于混乱的中文资源的梳理,并补充了一些没有提到的重要参数,希望大家不会踩坑. 1. 简介 1.1 背景 WGCNA(weighted gene co-expression networ ...
随机推荐
- ssr
使用Nuxt.js改造已有项目的方法 https://www.jb51.net/article/145203.htm
- audio
// media.cpp : 定义控制台应用程序的入口点. // https://wenku.baidu.com/view/e910c474c5da50e2524d7fb4.html https:// ...
- Java线程锁,synchronized、wait、notify详解
(原) JAVA多线程这一块有点绕,特别是对于锁,对锁机制理解不清的话,程序出现了问题也很难找到原因,在此记录一下线程的执行以及各种锁. 1.JAVA中,每个对象有且只有一把锁(lock),也叫监视器 ...
- [已决解]关于Hadoop start-all.sh启动问题
问题一:出现Attempting to operate on hdfs namenode as root 写在最前注意: 1.master,slave都需要修改start-dfs.sh,stop-df ...
- php常用数组array函数实例总结【赋值,拆分,合并,计算,添加,删除,查询,判断,排序】
本文实例总结了php常用数组array函数.分享给大家供大家参考,具体如下: array_combine 功能:用一个数组的值作为新数组的键名,另一个数组的值作为新数组的值 案例: <?php ...
- wxWidgets 和 QT 之间的选择
(非原创,网络摘抄) 跨平台的C++ GUI工具库很多,可是应用广泛的也就那么几个,Qt.wxWidgets便是其中的翘楚这里把GTK+排除在外,以C实现面向对象,上手相当困难,而且Windows平台 ...
- SpringBoot前端模板
Springboot支持thymeleaf.freemarker.JSP,但是官方不建议使用JSP,因为有些功能会受限制,这里介绍thymeleaf和freemarker. 一.thymeleaf模板 ...
- SpringBoot标准Properties
# =================================================================== # COMMON SPRING BOOT PROPERTIE ...
- springboot打jar包正常无法访问页面
网上看到太多说版本换成 1.4.2.RELEASE. 可以将程序打成war包发布, 1.启动类改为 @Overrideprotected SpringApplicationBuilder config ...
- matplotlib使用
import numpy as np import matplotlib.pyplot as plt 生成数据 mean1=[5,5] cov1=[[1,1],[1,1.5]] data=np.ran ...