[Repost] A warning about the KDD Cup '99 dataset; colleagues working in this area, please take note
From: Terry Brugger
Date: 15 Sep 2007
Subject: KDD Cup '99 dataset (Network Intrusion) considered harmful
Oftentimes in the scientific community, we become interested in new techniques or approaches based on characteristics of the technique or approach itself. While such investigation may be informative from a pure research standpoint, the general public -- and particularly most research sponsors -- tend to be more interested in the application of this technology. To this end, the KDD Cup Challenge has, for over ten years, provided the KDD community with datasets from real-world problems to demonstrate the applicability and performance of different knowledge discovery techniques. Researchers in the computer security community (based on the tone of papers published at the time) were initially excited to see a problem from their domain adopted for the 1999 KDD Cup Challenge. Since then, however, the dataset has become widely discredited. This letter is intended to briefly outline the problems that have been cited with the KDD Cup '99 dataset, and to discourage its further use.
The KDD Cup '99 dataset was created by processing the tcpdump portions of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset, created by Lincoln Lab under contract to DARPA [Lippmann et al.]. Since one cannot know the intention (benign or malicious) of every connection on a real-world network (if we could, we would not need research in intrusion detection), the artificial data was generated using a closed network, some proprietary network traffic generators, and hand-injected attacks. It was intended to simulate the traffic seen in a medium-sized US Air Force base (and was created in collaboration with the AFRL in Rome, NY, which could be characterized as a medium-sized US Air Force base).
Based on the published description of how the data was generated, McHugh published a fairly harsh criticism of the dataset [McHugh]. Among the issues raised, the most important seemed to be that no validation was ever performed to show that the DARPA dataset actually looked like real network traffic. Indeed, even a cursory examination of the data showed that the data rates were far below what would be experienced in a real medium-sized network. Nevertheless, IDS researchers continued to use the dataset (and the KDD Cup dataset that was derived from it) for lack of anything better.
In 2003, Mahoney and Chan built a trivial intrusion detection system and ran it against the DARPA tcpdump data [Mahoney and Chan]. They found numerous irregularities, including that -- due to the way the data was generated -- all the malicious packets had a TTL of 126 or 253 whereas almost all the benign packets had a TTL of 127 or 254. This served to demonstrate to most people in the network security research community that the DARPA dataset (and by extension, the KDD Cup '99 dataset) was fundamentally broken, and that one could not draw any conclusions from any experiments run using them. Numerous researchers indicated to us (in personal conversations) that if they were reviewing a paper based solely on the DARPA dataset, they would reject it on that basis alone.
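To make the TTL irregularity concrete, here is a minimal sketch of a "detector" that looks at nothing but the TTL field. It is not Mahoney and Chan's system; the `Packet` type, the function names, and the five-packet sample are hypothetical, and only the TTL values come from the letter.

```python
from dataclasses import dataclass

# TTL values the letter reports for malicious packets in the DARPA tcpdump data.
ATTACK_TTLS = frozenset({126, 253})


@dataclass(frozen=True)
class Packet:
    """Hypothetical minimal packet record; real traces carry far more fields."""
    ttl: int
    is_attack: bool  # ground-truth label, used only for scoring


def flag_by_ttl(packet: Packet) -> bool:
    """Flag a packet as malicious purely from its TTL."""
    return packet.ttl in ATTACK_TTLS


def accuracy(packets: list[Packet]) -> float:
    """Fraction of packets whose TTL-based flag matches the label."""
    correct = sum(flag_by_ttl(p) == p.is_attack for p in packets)
    return correct / len(packets)


# Invented toy sample that mimics the reported artifact; not real DARPA data.
sample = [
    Packet(ttl=126, is_attack=True),
    Packet(ttl=253, is_attack=True),
    Packet(ttl=127, is_attack=False),
    Packet(ttl=254, is_attack=False),
    Packet(ttl=127, is_attack=False),
]

print(f"TTL-only accuracy on the toy sample: {accuracy(sample):.2f}")
```

A rule like this scores perfectly on data with the artifact while knowing nothing about attacks, which is why a learned model that does well on the dataset may simply have picked up simulation artifacts of this kind.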
Indeed, at the time we were conducting our own assessment of the DARPA dataset, using Snort [Caswell and Roesch]. Trivial detection using the TTL aside, we found that it was still useful to evaluate the true positive performance of a network IDS; however, any false positive results were meaningless [Brugger and Chow]. Anonymous reviewers at respectable information security conferences were unimpressed; one noted, "is there any interest to study the capacities of SNORT on such data?". A reviewer from another conference summarized their review with "The content of the paper is really out of date. If this paper appears five years ago, there is some value, but not much now."
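One way to read the true positive versus false positive distinction, offered here as an assumption rather than the technical report's stated argument: the true positive rate is computed only over attack connections, while the false positive rate is computed only over benign ones, so unrealistic benign background traffic distorts the second but not the first. The function name and the toy labels below are hypothetical.

```python
def detection_rates(labels: list[bool], alerts: list[bool]) -> tuple[float, float]:
    """Return (true positive rate, false positive rate).

    labels[i] is True if connection i is an attack;
    alerts[i] is True if the IDS raised an alert on it.
    """
    pairs = list(zip(labels, alerts))
    tp = sum(1 for is_attack, alerted in pairs if is_attack and alerted)
    fn = sum(1 for is_attack, alerted in pairs if is_attack and not alerted)
    fp = sum(1 for is_attack, alerted in pairs if not is_attack and alerted)
    tn = sum(1 for is_attack, alerted in pairs if not is_attack and not alerted)

    # TPR uses only attack connections; FPR uses only benign connections.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return tpr, fpr


# Invented toy data: 4 attack connections followed by 6 benign ones.
labels = [True, True, True, True, False, False, False, False, False, False]
alerts = [True, True, True, False, True, False, False, False, False, False]

tpr, fpr = detection_rates(labels, alerts)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

Under this reading, replacing the benign half of the data with realistic traffic could change the false positive rate arbitrarily while leaving the true positive rate untouched.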
While the DARPA (and KDD Cup '99) dataset has fallen from grace in the network security community, we still see it widely used in the greater KDD community. Examples in the past couple of years include [Kayacik et al.], [Sarasamma et al.], [Gao et al.], [Chan et al.], and [Zhang et al.]. While this sample doesn't necessarily represent the top-tier journals and conferences in the KDD community, these are, to the best of our knowledge, respectable, peer-reviewed publications. Obviously, the knowledge discovery researchers are well intentioned in wanting to show the usefulness of every technique imaginable to the network intrusion detection domain. Unfortunately, due to the problems with the dataset, such conclusions cannot be drawn. As a result, we strongly recommend that (1) all researchers stop using the KDD Cup '99 dataset, (2) the KDD Cup and UCI websites include a warning on the KDD Cup '99 dataset webpage informing researchers that there are known problems with the dataset, and (3) peer reviewers for conferences and journals ding papers (or even outright reject them, as is common in the network security community) with results drawn solely from the KDD Cup '99 dataset.
S Terry Brugger, zow at acm dot org
UC Davis, Department of Computer Science
References
- Brugger, S. T. and J. Chow (January 2007). An assessment of the DARPA IDS Evaluation Dataset using Snort. Technical Report CSE-2007-1, University of California, Davis, Department of Computer Science, Davis, CA. http://www.cs.ucdavis.edu/research/tech-reports/2007/CSE-2007-1.pdf.
- Caswell, B. and M. Roesch (16 May 2004). Snort: The open source network intrusion detection system. http://www.snort.org/.
- Chan, A. P., W. W. Y. Ng, D. S. Yeung, and E. C. C. Tsang (19-21 August 2005). Comparison of different fusion approaches for network intrusion detection using ensemble of RBFNN. In Proc. of 2005 Intl. Conf. on Machine Learning and Cybernetics, Volume 6, Guangzhou, China, pp. 3846-3851. IEEE.
- Gao, H.-H., H.-H. Yang, and X.-Y. Wang (27-29 August 2005). Principal component neural networks based intrusion feature extraction and detection using SVM. In Advances in Natural Computation, Volume 3611 of Lecture Notes in Computer Science, Changsha, China, pp. 21-27. Springer.
- Kayacik, H. G., A. N. Zincir-Heywood, and M. I. Heywood (June 2007). A hierarchical SOM-based intrusion detection system. Engineering Applications of Artificial Intelligence 20 (4), 439-451. Full text not available; analysis based on detailed abstract.
- Lippmann, R. P., D. J. Fried, I. Graf, J. W. Haines, K. Kendall, D. McClung, D. Weber, S. Webster, D. Wyschogrod, R. K. Cunningham, and M. Zissman (January 2000). Evaluating intrusion detection systems: The 1998 DARPA off-line intrusion detection evaluation. In Proc. of the DARPA Information Survivability Conference and Exposition, Los Alamitos, CA. IEEE Computer Society Press.
- Mahoney, M. V. and P. K. Chan (8-10 September 2003). An analysis of the 1999 DARPA/Lincoln Laboratory Evaluation Data for network anomaly detection. In G. Vigna, E. Jonsson, and C. Krugel (Eds.), Proc. 6th Intl. Symp. on Recent Advances in Intrusion Detection (RAID 2003), Volume 2820 of Lecture Notes in Computer Science, Pittsburgh, PA, pp. 220-237. Springer.
- McHugh, J. (2000). Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory. ACM Trans. Information System Security 3 (4), 262-294.
- Sarasamma, S. T., Q. A. Zhu, and J. Huff (April 2005). Hierarchical Kohonenen net for anomaly detection in network security. IEEE Trans. Syst., Man, Cybern. B 35 (2), 302-312.
- Zhang, C., J. Jiang, and M. Kamel (May 2005). Intrusion detection using hierarchical neural networks. Pattern Recognition Letters 26 (6), 779-791.
All opinions expressed are solely the view of the author(s), and are not necessarily shared or endorsed by The University of California, Davis, or their employer(s).
Original article:
http://www.kdnuggets.com/news/2007/n18/4i.html