SLIQ/SPRINT

*/-->

SLIQ/SPRINT

Before SLIQ, most classification alogrithms have the problem that they do not scale. Because these alogrithms have the limit that the traning data should fit in memory. That's why SLIQ was raised.

1 Generic Decision-Tree Classification

Most decision-tree classifiers perform classification in two phases: Tree Building and Tree Pruning.

1.1 Tree Building

MakeTree(Training Data T)
Partition(T); Partition(Data S)
if(all points in S are in the same class) then return;
Evaluate splits for each attribute A
Use the best split found to partition S into S1 and S2
Partition(S1);
Partition(S2);

1.2 Tree Pruning

As we have known, no matter how your preprocess works, there always exist "noise" data or other bad data. So, when we use the traning data to build the decision-tree classification, it also create branches for thos bad data. These branches can lead to errors when classifying test data. Tree pruning is aimed at removing these braches from decision tree by selecting the subtree with the least estimated error rate.

2 Scalability Issues

2.1 Tree Building

As I mentioned, ID3/C4.5/Gini1 is used to evaluate the "goodness" of the alternative splits for an attribute.

2.1.1 Splits for Numeric Attribute

The cost of evaluating splits for a numeric attribute is dominated by the cost of sorting the values. Therefore, an important scalability issue is the reduction of sorting costs for numeric attributes.

2.1.2 Splits for Categorical Attribute

2.2 Tree Pruning

3 SLIQ Classifier

To achieve this pre-sorting, we use the following data structures. We create a separate list for each attribute of the training data. Additionally, a separate list,called class list , is created for the class labels attached to the examples. An entry in an attribute list has two fields: one contains an attribute value, the other anindex into the class list. An entry of the class list also has two fields: one contains a class label, the other a reference to a leaf node of the decision tree. The i th entry of the class list corresponds to the i th example in the training data. Each leaf node of the decision tree represents a partition of the training data, the partition being defined by the conjunction of the predicates on the path from the node to the root. Thus, the class list can at any time identify the partition to which an example belongs. We assume that there is enough memory to keep the class list memory-resident. Attribute lists are written to disk if necessary.

Author: mlhy

Created: 2015-10-08 四 21:29

Emacs 24.5.1 (Org mode 8.2.10)

SLIQ/SPRINT的更多相关文章

  1. TFS 2015 敏捷开发实践 – 在Kanban上运行一个Sprint

    前言:在 上一篇 TFS2015敏捷开发实践 中,我们给大家介绍了TFS2015中看板的基本使用和功能,这一篇中我们来看一个具体的场景,如何使用看板来运行一个sprint.Sprint是Scrum对迭 ...

  2. Sprint计划

    团队: 郭志豪:http://www.cnblogs.com/gzh13692021053/ 杨子健:http://www.cnblogs.com/yzj666/ 刘森松:http://www.cnb ...

  3. 计应152第六组Sprint计划会议

    Sprint计划会议 会议时间:2016年12月8下午16:00 会议地点:宿舍 会议进程 • 首先我们讨论了排球计分规则程序完成需要做的一些工作:程序的初期设计,数据分析,典型用户,场景,代码的编写 ...

  4. HOW TO RUN A SPRINT PLANNING MEETING (THE WAY I LIKE IT)

    This is a sample agenda for a sprint planning meeting. Depending on your context you will have to ch ...

  5. Sprint

    Sprint冲刺 1.选题 <寿司点餐系统> 2.app名 <Sushi> 3.团名 ZEG 4.目标 制作一个成型的人性化的寿司点餐系统,介绍各种寿司的材料做法吃法以及价格, ...

  6. sprint 3 总结

    1.要求: 演示可参考毕业设计答辩,包含两部分内容: 项目陈述,可综述项目.团队.开发过程等. 运行演示,实现的功能.业务.用户反馈等. 希望各组认真准备,拿出最好的阵容最好的状态,展示一学期的学习与 ...

  7. [课程设计]Sprint Three 回顾与总结&发表评论&团队贡献分

    Sprint Three 回顾与总结&发表评论&团队贡献分 ● 一.回顾与总结 (1)回顾 燃尽图: Sprint计划-流程图: milestones完成情况如下: (2)总结 本次冲 ...

  8. TFS二次开发系列:八、TFS二次开发的数据统计以PBI、Bug、Sprint等为例(二)

    上一篇文章我们编写了此例的DTO层,本文将数据访问层封装为逻辑层,提供给界面使用. 1.获取TFS Dto实例,并且可以获取项目集合,以及单独获取某个项目实体 public static TFSSer ...

  9. TFS二次开发系列:七、TFS二次开发的数据统计以PBI、Bug、Sprint等为例(一)

    在TFS二次开发中,我们可能会根据某一些情况对各个项目的PBI.BUG等工作项进行统计.在本文中将大略讲解如果进行这些数据统计. 一:连接TFS服务器,并且得到之后需要使用到的类方法. /// < ...

随机推荐

  1. UML-基于GRASP对象设计步骤

    在OO设计建模的时候,在最后考虑系统启动时需要初始化的内容. 1.从用例开始,以下是一步步设计用例实现 处理销售 2.SSD 我们选择: makeNewSale 3.编写操作契约(复杂用例场景时) 4 ...

  2. Java学习十四

    学习内容: 1.Junit 2.maven安装配置环境 一.Junit实例演示步骤 1.引入jar包 junit包需要引入hamcrest-core包,否则会报错 2.测试如下代码 package c ...

  3. POJ 1979:Red and Black

    Red and Black Time Limit: 1000MS   Memory Limit: 30000K Total Submissions: 26058   Accepted: 14139 D ...

  4. Codeforces Round #571 (Unrated for Div. 1+Div. 2)

    A 略 B 被删了,被这个假题搞自闭了,显然没做出来. C 开始莽了个NTT,后来发现会TLE,其实是个SB前缀和,对于这题,我无**说. #include<bits/stdc++.h> ...

  5. Hadoop的FlieSystem类的使用

    1.使用FileSystem类需要导入jar包 解压hadoop-2.7.7.tar.gz 复制如下三个jar包和lib下所有jar包到项目文件下的lib文件 2.查看文件信息 @Test publi ...

  6. 深入分析Java反射(六)-反射调用异常处理

    前提 Java反射的API在JavaSE1.7的时候已经基本完善,但是本文编写的时候使用的是Oracle JDK11,因为JDK11对于sun包下的源码也上传了,可以直接通过IDE查看对应的源码和进行 ...

  7. WAMP常用环境配置

    自定义网站目录 修改目录位置 如下图,打开httpd.conf文件. 查找DocumentRoot(两处),做如下修改: #demo为自定义网站目录,下面不再说明 DocumentRoot " ...

  8. thinkCMF图片上传选择已上传图片

    1.找到上传图片的模板页面 webuploader.html 在上传文件标签后面 添加 <li class=""><a href="#explorer& ...

  9. 同时运行两个版本相同的tomcat

    由于项目需要,代理集群和一个节点都部署在本地,那么就需要有两个tomcat,一个部署集群,一个部署项目,我都用了7.0.34版本的tomcat 当启动代理的tomcat成功时,再启动节点的tomcat ...

  10. Python上楼梯

    假设一段楼梯共n(n>1)个台阶,小朋友一步最多能上3个台阶,那么小朋友上这段楼梯一共有多少种方法. (小朋友真的累,我选择电梯) 大体思路用到了递归,假如说楼梯有12阶,那么11阶时有只有一种 ...