Three ways to detect outliers
Z-score
import numpy as np def outliers_z_score(ys):
threshold = 3 mean_y = np.mean(ys)
stdev_y = np.std(ys)
z_scores = [(y - mean_y) / stdev_y for y in ys]
return np.where(np.abs(z_scores) > threshold)
Modified Z-score
import numpy as np def outliers_modified_z_score(ys):
threshold = 3.5 median_y = np.median(ys)
median_absolute_deviation_y = np.median([np.abs(y - median_y) for y in ys])
modified_z_scores = [0.6745 * (y - median_y) / median_absolute_deviation_y
for y in ys]
return np.where(np.abs(modified_z_scores) > threshold)
IQR(interquartile range)
import numpy as np def outliers_iqr(ys):
quartile_1, quartile_3 = np.percentile(ys, [25, 75])
iqr = quartile_3 - quartile_1
lower_bound = quartile_1 - (iqr * 1.5)
upper_bound = quartile_3 + (iqr * 1.5)
return np.where((ys > upper_bound) | (ys < lower_bound))
Conclusion
It is important to reiterate that these methods should not be used mechanically.
They should be used to explore the data – they let you know which points might be worth a closer look.
What to do with this information depends heavily on the situation.
Sometimes it is appropriate to exclude outliers from a dataset to make a model trained on that dataset more predictive.
Sometimes, however,
the presence of outliers is a warning sign that the real-world process generating the data is more complicated than expected. As an astute commenter on CrossValidated put it:
“Sometimes outliers are bad data, and should be excluded, such as typos.
Sometimes they are Wayne Gretzky or Michael Jordan, and should be kept.” Domain knowledge and practical wisdom are the only ways to tell the difference.
摘自:http://colingorrie.github.io/outlier-detection.html
Three ways to detect outliers的更多相关文章
- MATLAB remove outliers.
Answer by Richard Willey on 9 Jan 2012 Hi Michael MATLAB doesn't provide a specific function to remo ...
- 异常值处理outlier
python信用评分卡(附代码,博主录制) https://study.163.com/course/introduction.htm?courseId=1005214003&utm_camp ...
- 深度学习图像配准 Image Registration: From SIFT to Deep Learning
Image Registration is a fundamental step in Computer Vision. In this article, we present OpenCV feat ...
- (转)利用libcurl和国内著名的两个物联网云端通讯的例程, ubuntu和openwrt下调试成功(四)
1. libcurl 的参考文档如下 CURLOPT_HEADERFUNCTION Pass a pointer to a function that matches the following pr ...
- 如何查找物理cpu,cpu核心和逻辑cpu的数量
环境 Red Hat Enterprise Linux 4 Red Hat Enterprise Linux 5 Red Hat Enterprise Linux 6 Red Hat Enterpri ...
- Gartner 2018 年WAF魔力象限报告:云WAF持续增长,Bot管理与API安全拥有未来
Gartner 2018 年WAF魔力象限报告:云WAF持续增长,Bot管理与API安全拥有未来 来源 https://www.freebuf.com/articles/paper/184903.ht ...
- A Gentle Guide to Machine Learning
A Gentle Guide to Machine Learning Machine Learning is a subfield within Artificial Intelligence tha ...
- [linux]linuxI/O测试的方法之dd
参考http://www.thomas-krenn.com/en/wiki/Linux_I/O_Performance_Tests_using_dd Measuring Write Performan ...
- WGCNA构建基因共表达网络详细教程
这篇文章更多的是对于混乱的中文资源的梳理,并补充了一些没有提到的重要参数,希望大家不会踩坑. 1. 简介 1.1 背景 WGCNA(weighted gene co-expression networ ...
随机推荐
- Redis其他常用操作
详细Redis操作手册: http://doc.redisfans.com/ ============================================================= ...
- Linux查询进程和结束进程
1. ps -ef |grep redis ps:将某个进程显示出来-A 显示所有程序. -e 此参数的效果和指定"A"参数相同.-f 显示UID,PPIP,C与STIME栏位. ...
- Vue 自定义一个插件的用法、小案例及在项目中的应用
1.开发插件 install有两个参数,第一个是Vue构造器,第二个参数是一个可选的选项对象 MyPlugin.install = function (Vue, options) { // 1 ...
- Dockerfile 规范
https://time-track.cn/compile-docker-from-source.html 参考 https://time-track.cn/install-docker-on-ubu ...
- 日志切割之Logrotate
1.关于日志切割 日志文件包含了关于系统中发生的事件的有用信息,在排障过程中或者系统性能分析时经常被用到.对于忙碌的服务器,日志文件大小会增长极快,服务器会很快消耗磁盘空间,这成了个问题.除此之外,处 ...
- Spring Security(二十九):9.4.1 ExceptionTranslationFilter
ExceptionTranslationFilter is a Spring Security filter that has responsibility for detecting any Spr ...
- jeecg字典表-系统字典
新建字典 录入字典信息 添加类型 录入完类型后效果 新建列表用户 保存 同步数据库 同步完之后,对应的数据库中会创建对应的表. 测试表功能 保存之后,数据库保存对应的字段 生成代码 刷新工程之后,生成 ...
- Linux下部署开源版“禅道”项目管理系统
1.开源版安装包下载 [root@iZbp ~]# wget http://dl.cnezsoft.com/zentao/9.0.1/ZenTaoPMS.9.0.1.zbox_64.tar.gz 2. ...
- MySQL中的float和decimal类型有什么区别
decimal 类型可以精确地表示非常大或非常精确的小数.大至 1028(正或负)以及有效位数多达 28 位的数字可以作为 decimal类型存储而不失其精确性.该类型对于必须避免舍入错误的应用程序( ...
- SpringBoot 统一时区的方案
系统采用多时区设计的时候,往往我们需要统一时区,需要统一的地方如下: 服务器(Tomcat服务) 数据库(JPA + Hibernate) 前端数据(前端采用Vuejs) 思路为:将数据库和服务器的时 ...