Anomalies are data points that are few and different. As a result of these properties, we show that anomalies are susceptible to a mechanism called isolation. This paper proposes a method called Isolation Forest (iForest), which detects anomalies purely based on the concept of isolation without employing any distance or density measure, and is therefore fundamentally different from all existing methods.

As a result, iForest is able to exploit subsampling (i) to achieve a low linear time complexity and a small memory requirement, and (ii) to deal with the effects of swamping and masking effectively. Our empirical evaluation shows that iForest outperforms ORCA, one-class SVM, LOF and Random Forests in terms of AUC and processing time, and that it is robust against masking and swamping effects. iForest also works well in high-dimensional problems containing a large number of irrelevant attributes, and when anomalies are not available in the training sample.

1. INTRODUCTION

Anomalies are data patterns that have different data characteristics from normal instances. The ability to detect anomalies has significant relevance, and anomalies often provide critical and actionable information in various application domains. For example, anomalies in credit card transactions could signify fraudulent use of credit cards. An anomalous spot in an astronomy image could indicate the discovery of a new star. An unusual computer network traffic pattern could indicate unauthorised access. These applications demand anomaly detection algorithms with high detection accuracy and fast execution.

Most existing anomaly detection approaches, including classification-based methods, Replicator Neural Network (RNN), one-class SVM and clustering-based methods, construct a profile of normal instances, then identify anomalies as those that do not conform to the normal profile. Their anomaly detection abilities are usually a 'side-effect' or by-product of an algorithm originally designed for a purpose other than anomaly detection (such as classification or clustering). This leads to two major drawbacks: (i) these approaches are not optimized to detect anomalies - as a consequence, these approaches often under-perform resulting in too many false alarms (having normal instances identified as anomalies) or too few anomalies being detected; (ii) many existing methods are constrained to low dimensional data and small data size because of the legacy of their original algorithm.

This paper proposes a different approach that detects anomalies by isolating instances, without relying on any distance or density measure. To achieve this, our proposed method takes advantage of two quantitative properties of anomalies: (i) they are the minority consisting of few instances, and (ii) they have attribute-values that are very different from those of normal instances. In other words, anomalies are 'few and different', which makes them more susceptible to a mechanism we call Isolation. Isolation can be implemented by any means that separates instances. We opt to use a binary tree structure called isolation tree (iTree), which can be constructed effectively to isolate instances. Because of the susceptibility to isolation, anomalies are more likely to be isolated closer to the root of an iTree, whereas normal points are more likely to be isolated at the deeper end of an iTree. This forms the basis of our method to detect anomalies. Although this is a very simple mechanism, we show in this paper that it is both effective and efficient in detecting anomalies.

The proposed method, called Isolation Forest (iForest), builds an ensemble of iTrees for a given data set; anomalies are those instances which have short average path lengths on the iTrees. There are two training parameters and one evaluation parameter in this method: the training parameters are the number of trees to build and subsampling size; the evaluation parameter is the tree height limit during evaluation. We show that iForest's detection accuracy converges quickly with a very small number of trees; it only requires a small subsampling size to achieve high detection accuracy with high efficiency; and the different height limits are used to cater for anomaly clusters of different density.

2. ISOLATION AND ISOLATION TREES

In this paper, the term isolation means 'separating an instance from the rest of the instances'. In general, an isolation-based method measures individual instances' susceptibility to being isolated, and anomalies are those that have the highest susceptibility. To realize the idea of isolation, we turn to a data structure that naturally isolates data. In randomly generated binary trees where instances are recursively partitioned, these trees produce noticeably shorter paths for anomalies since (a) in the regions occupied by anomalies, fewer anomalies result in a smaller number of partitions, i.e., shorter paths in a tree structure, and (b) instances with distinguishable attribute-values are more likely to be separated early in the partitioning process. Hence, when a forest of random trees collectively produces shorter path lengths for some particular points, they are highly likely to be anomalies.
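The partitioning argument above can be illustrated with a minimal sketch. It does not build a tree; it only follows the branch that contains one chosen point and counts the random partitions needed until that point is alone. The function names `partitions_to_isolate` and `mean_partitions`, the one-dimensional toy data and the 200 repetitions are assumptions made for illustration, and the sketch assumes all values are distinct.

```python
import random

def partitions_to_isolate(x, data, rng):
    """Count the random splits needed until x is alone in its partition.

    Assumes all values in data are distinct and that x is one of them.
    """
    region = list(data)
    count = 0
    while len(region) > 1:
        lo, hi = min(region), max(region)
        p = rng.uniform(lo, hi)             # random split value in the current range
        left = [v for v in region if v < p]
        right = [v for v in region if v >= p]
        if not left or not right:           # degenerate split (p == lo): draw again
            continue
        region = left if x < p else right   # keep only the side that contains x
        count += 1
    return count

def mean_partitions(x, data, trials, rng):
    """Average the partition count over several independent random runs."""
    return sum(partitions_to_isolate(x, data, rng) for _ in range(trials)) / trials

rng = random.Random(0)
normal = [i / 100 for i in range(100)]   # 100 evenly spaced "normal" values in [0, 1)
anomaly = 10.0                           # a single value far away from the rest
data = normal + [anomaly]

avg_anomaly = mean_partitions(anomaly, data, 200, rng)
avg_normal = mean_partitions(0.5, data, 200, rng)
print(f"anomaly: {avg_anomaly:.2f} partitions, normal point: {avg_normal:.2f} partitions")
```

On this toy data the far-away value should need far fewer partitions on average than a value in the middle of the dense range, which is the behaviour the two reasons (a) and (b) describe.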

Definition: Isolation Tree.

An iTree is a proper binary tree, where each node in the tree has exactly zero or two daughter nodes. Assuming all instances are distinct, each instance is isolated to an external node when an iTree is fully grown, in which case the number of external nodes equals the number of instances used to build the tree.
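A minimal sketch of an iTree as a proper binary tree follows. The class name `Node`, the field names `q` (split attribute) and `p` (split value), and the uniform choice of the split value between the attribute's minimum and maximum are assumptions chosen for illustration; the sketch also assumes all instances are distinct, as in the text.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]

@dataclass
class Node:
    size: int                         # number of instances that reached this node
    q: Optional[int] = None           # split attribute (internal nodes only)
    p: Optional[float] = None         # split value (internal nodes only)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_external(self) -> bool:
        return self.left is None and self.right is None

def build_itree(points: List[Point], rng: random.Random) -> Node:
    """Recursively partition points with random axis-parallel splits."""
    if len(points) <= 1:
        return Node(size=len(points))
    # Only attributes whose values differ can separate the instances.
    usable = [q for q in range(len(points[0]))
              if min(x[q] for x in points) < max(x[q] for x in points)]
    if not usable:                    # identical instances cannot be separated
        return Node(size=len(points))
    q = rng.choice(usable)
    lo = min(x[q] for x in points)
    hi = max(x[q] for x in points)
    while True:
        p = rng.uniform(lo, hi)
        left = [x for x in points if x[q] < p]
        right = [x for x in points if x[q] >= p]
        if left and right:            # both daughter nodes must receive instances
            break
    return Node(len(points), q, p, build_itree(left, rng), build_itree(right, rng))

def count_nodes(node: Node) -> Tuple[int, int]:
    """Return the (external, internal) node counts of the tree rooted at node."""
    if node.is_external():
        return 1, 0
    el, il = count_nodes(node.left)
    er, ir = count_nodes(node.right)
    return el + er, il + ir + 1

rng = random.Random(1)
# 50 distinct two-dimensional points (toy data).
points = [((37 * i % 100) / 100, (61 * i % 100) / 100) for i in range(50)]
tree = build_itree(points, rng)
external, internal = count_nodes(tree)
print(external, internal)
```

Because every internal node has exactly two daughters and every distinct instance ends in its own external node, the fully grown tree for 50 distinct points has 50 external nodes and 49 internal nodes, whatever random splits are drawn.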

The task of anomaly detection is to provide a ranking that reflects the degree of anomaly. Using iTrees, the way to detect anomalies is to sort data points according to their average path lengths; anomalies are the points ranked at the top of the list. We define path length as follows:

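Definition: Path Length. The path length of a point is the number of edges the point traverses in an iTree from the root node until the traversal terminates at an external node.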

We employ path length as a measure of the degree of susceptibility to isolation:

  • short path length means high susceptibility to isolation,

  • long path length means low susceptibility to isolation.
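The ranking by average path length can be sketched as follows. This is an illustrative toy rather than the authors' implementation: trees are stored as nested tuples, the helper names (`grow`, `path_length`, `average_path_length`) are made up for this sketch, each tree is grown on the full toy data set rather than on a subsample, and no height limit is applied.

```python
import random

def grow(points, rng):
    """Grow a random tree; a leaf is None, an internal node is (q, p, left, right)."""
    if len(points) <= 1:
        return None
    # Only attributes whose values differ can separate the points.
    usable = [q for q in range(len(points[0]))
              if min(x[q] for x in points) < max(x[q] for x in points)]
    if not usable:
        return None
    q = rng.choice(usable)
    lo = min(x[q] for x in points)
    hi = max(x[q] for x in points)
    while True:
        p = rng.uniform(lo, hi)
        left = [x for x in points if x[q] < p]
        right = [x for x in points if x[q] >= p]
        if left and right:            # retry the rare degenerate split
            return (q, p, grow(left, rng), grow(right, rng))

def path_length(x, tree):
    """Number of edges traversed from the root to the external node reached by x."""
    edges = 0
    while tree is not None:
        q, p, left, right = tree
        tree = left if x[q] < p else right
        edges += 1
    return edges

def average_path_length(x, forest):
    return sum(path_length(x, t) for t in forest) / len(forest)

rng = random.Random(7)
# 60 "normal" points on a small grid plus two far-away points (toy data).
far_points = [(5.0, 5.0), (-4.0, 6.0)]
data = [(i % 10 / 10, i // 10 / 10) for i in range(60)] + far_points
forest = [grow(data, rng) for _ in range(100)]

# Sort ascending by average path length: the most easily isolated points come first.
ranking = sorted(data, key=lambda x: average_path_length(x, forest))
print(ranking[:2])
```

On this toy data the two far-away points should head the ranking, since short average path lengths correspond to high susceptibility to isolation.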

3. ISOLATION, DENSITY AND DISTANCE MEASURES

In this paper, we assert that path-length-based isolation is more appropriate for the task of anomaly detection than the basic density and distance measures.

Using basic density measures, the assumption is that 'normal points occur in dense regions, while anomalies occur in sparse regions'. Using basic distance measures, the assumption is that 'a normal point is close to its neighbours, while an anomaly is far from its neighbours'.

There are violations of these assumptions: high density and short distance do not always imply normal instances; likewise, low density and long distance do not always imply anomalies. When density or distance is measured in a local context, which is often the case, points with high density or short distance could be anomalies in the global context of the entire data set. However, there is no such ambiguity in path-length-based isolation, and we demonstrate that in the following three paragraphs.

In density based anomaly detection, anomalies are defined to be data points in regions of low density. Density is commonly measured as (a) the reciprocal of the average distance to the

In distance based anomaly detection, anomalies are defined to be data points which are distant from all other points. Two common ways to define distance-based anomaly score are (i) the distance to

On the surface, the function of an isolation measure is similar to that of a density measure or a distance measure, i.e., isolation ranks scattered outlying points higher than normal points. However, we find that path-length-based isolation behaves differently from a density or distance measure under data with different distributions. Path length, however, is able to address this situation by giving the isolated dense points shorter path lengths. The main reason for this is that path length is grown in an adaptive context, in which the context of each partitioning is different, from the first partition (the root node) in the context of the entire data set, to the last partition (the leaf node) in the context of local data points. However, density (

In summary, we have compared three fundamental approaches to detect anomalies; they are isolation, density and distance. We find that the isolation measure (path length) is able to detect both clustered and scattered anomalies; whereas both distance and density measures can only detect scattered anomalies. While there are many ways to enhance the basic distance and density measures, the isolation measure is better because no further 'adjustment' to the basic measure is required to detect both clustered and scattered anomalies.
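To make the limitation of the basic measures concrete, the following sketch scores one-dimensional toy data with the distance to the k-th nearest neighbour and with a density taken as the reciprocal of the average distance to the k nearest neighbours. The value k = 3, the helper names and the data are assumptions for illustration only.

```python
def knn_distances(x, data, k):
    """Sorted distances from x to its k nearest neighbours (x itself excluded once)."""
    dists = sorted(abs(x - v) for v in data)
    return dists[1:k + 1]            # dists[0] is the zero distance from x to itself

def kth_nn_distance(x, data, k):
    """Basic distance score: distance to the k-th nearest neighbour."""
    return knn_distances(x, data, k)[-1]

def knn_density(x, data, k):
    """Basic density score: reciprocal of the average distance to the k nearest neighbours."""
    neighbours = knn_distances(x, data, k)
    return 1.0 / (sum(neighbours) / k)

k = 3                                             # assumed neighbourhood size
normal = [i / 200 for i in range(200)]            # spacing 0.005 in [0, 1)
clustered = [5.0 + j / 1000 for j in range(5)]    # five anomalies, spacing 0.001
scattered = [10.0]                                # one isolated anomaly
data = normal + clustered + scattered

samples = [("normal", normal[100]),
           ("clustered anomaly", clustered[2]),
           ("scattered anomaly", scattered[0])]
for label, x in samples:
    print(label, kth_nn_distance(x, data, k), knn_density(x, data, k))
```

With these settings the tight cluster of five anomalies receives a smaller k-th neighbour distance and a higher density than a normal point, so both basic measures rank it as normal, while the single scattered anomaly is flagged by both.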
