SLIQ/SPRINT
Before SLIQ, most classification algorithms did not scale, because they required the training data to fit in memory. SLIQ was proposed to remove this limitation.
1 Generic Decision-Tree Classification
Most decision-tree classifiers perform classification in two phases: Tree Building and Tree Pruning.
1.1 Tree Building
MakeTree(Training Data T)
    Partition(T);

Partition(Data S)
    if (all points in S are in the same class) then return;
    Evaluate splits for each attribute A
    Use the best split found to partition S into S1 and S2
    Partition(S1);
    Partition(S2);
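The pseudocode above can be sketched in Python as follows. This is only a minimal in-memory illustration, not SLIQ itself: it assumes numeric attributes, binary tests of the form `attribute <= threshold`, and the gini index as the split criterion. The names `gini`, `best_split`, `partition`, `make_tree` and `classify`, as well as the dict-based tree layout, are chosen here for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every 'attribute <= threshold' test; return the best one, or None."""
    best, best_score = None, float("inf")
    for a in range(len(rows[0])):
        # The largest value is skipped so that S2 is never empty.
        for t in sorted({r[a] for r in rows})[:-1]:
            s1 = [y for r, y in zip(rows, labels) if r[a] <= t]
            s2 = [y for r, y in zip(rows, labels) if r[a] > t]
            score = (len(s1) * gini(s1) + len(s2) * gini(s2)) / len(rows)
            if score < best_score:
                best, best_score = (a, t), score
    return best

def partition(rows, labels):
    # All points in S are in the same class: stop and emit a leaf.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    split = best_split(rows, labels)
    if split is None:  # identical points with different labels: majority leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    a, t = split
    s1 = [(r, y) for r, y in zip(rows, labels) if r[a] <= t]
    s2 = [(r, y) for r, y in zip(rows, labels) if r[a] > t]
    return {
        "attr": a,
        "threshold": t,
        "left": partition([r for r, _ in s1], [y for _, y in s1]),
        "right": partition([r for r, _ in s2], [y for _, y in s2]),
    }

def make_tree(rows, labels):
    return partition(rows, labels)

def classify(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while "leaf" not in tree:
        go_left = row[tree["attr"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["leaf"]

tree = make_tree([(1.0,), (2.0,), (8.0,), (9.0,)], ["a", "a", "b", "b"])
print(tree)
```

Note that `best_split` rescans the data for every candidate threshold, which is exactly the kind of cost that becomes a problem on large training sets.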
1.2 Tree Pruning
No matter how carefully the data is preprocessed, the training data always contains some "noise" or otherwise bad records. When the training data is used to build the decision tree, the tree also grows branches that fit these bad records. Such branches can lead to errors when classifying test data. Tree pruning aims to remove these branches from the decision tree by selecting the subtree with the least estimated error rate.
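One simple way to picture "selecting the subtree with the least estimated error" is a bottom-up pass that collapses a node into a leaf whenever the leaf is estimated to make no more errors than the subtree below it. The sketch below assumes each node already carries a `leaf_errors` estimate (for instance, counted on held-out data); that field, the dict layout and the function name `prune` are hypothetical, and this is not the specific pruning strategy of SLIQ.

```python
def prune(node):
    """Return (pruned_node, estimated_errors) for the subtree rooted at node."""
    if "left" not in node:  # already a leaf
        return node, node["leaf_errors"]
    left, left_err = prune(node["left"])
    right, right_err = prune(node["right"])
    subtree_err = left_err + right_err
    if node["leaf_errors"] <= subtree_err:
        # Collapsing to a leaf is estimated to be at least as good.
        leaf = {"label": node["label"], "leaf_errors": node["leaf_errors"]}
        return leaf, node["leaf_errors"]
    return dict(node, left=left, right=right), subtree_err

tree = {
    "label": "a", "leaf_errors": 5,
    "left": {"label": "a", "leaf_errors": 1},
    "right": {
        "label": "b", "leaf_errors": 3,
        "left": {"label": "b", "leaf_errors": 2},
        "right": {"label": "a", "leaf_errors": 2},
    },
}
pruned, errors = prune(tree)
print(pruned, errors)
```

In this example the right subtree is collapsed (3 estimated errors as a leaf versus 4 for its two children), while the root is kept.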
2 Scalability Issues
2.1 Tree Building
A splitting index is used to evaluate the "goodness" of the alternative splits for an attribute, for example information gain (ID3), gain ratio (C4.5), or the gini index.
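As an illustration, take the gini index in its usual form (the notation below is assumed here, not taken from the text above). For a data set S with n records, where p_j is the relative frequency of class j in S, and a split of S into S1 and S2 with n1 and n2 records:

```latex
\begin{align*}
gini(S) &= 1 - \sum_j p_j^2 \\
gini_{split}(S) &= \frac{n_1}{n}\, gini(S_1) + \frac{n_2}{n}\, gini(S_2)
\end{align*}
% Worked example: S1 holds 3 records of class a and 1 of class b,
% S2 holds 2 records of class b (n1 = 4, n2 = 2, n = 6).
\begin{align*}
gini(S_1) &= 1 - (3/4)^2 - (1/4)^2 = 0.375 \\
gini(S_2) &= 1 - (2/2)^2 = 0 \\
gini_{split}(S) &= \tfrac{4}{6} \cdot 0.375 + \tfrac{2}{6} \cdot 0 = 0.25
\end{align*}
```

A lower value means purer partitions, so the split with the smallest index is preferred.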
2.1.1 Splits for Numeric Attribute
The cost of evaluating splits for a numeric attribute is dominated by the cost of sorting the values. Therefore, an important scalability issue is the reduction of sorting costs for numeric attributes.
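To see why sorting dominates, here is a minimal sketch of evaluating all splits of one numeric attribute: sort once, then make a single pass while updating class counts on both sides of the candidate split point. The gini index as the criterion, the midpoint between adjacent values as the threshold, and the names `gini_from_counts` and `best_numeric_split` are assumptions made for this example.

```python
from collections import Counter

def gini_from_counts(counts, total):
    """Gini index computed from a class histogram."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def best_numeric_split(values, labels):
    """Return (threshold, weighted_gini) of the best 'value <= threshold' test."""
    # Sorting is the O(n log n) step; the scan below is only O(n).
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    below = Counter()        # class histogram of records at or below the split
    above = Counter(labels)  # class histogram of records above the split
    best = (None, float("inf"))
    for i in range(n - 1):
        value, label = pairs[i]
        below[label] += 1
        above[label] -= 1
        next_value = pairs[i + 1][0]
        if value == next_value:
            continue  # no split point between equal values
        n_below = i + 1
        score = (n_below * gini_from_counts(below, n_below)
                 + (n - n_below) * gini_from_counts(above, n - n_below)) / n
        if score < best[1]:
            best = ((value + next_value) / 2, score)
    return best

print(best_numeric_split([9, 1, 8, 2], ["b", "a", "b", "a"]))
```

If this sort were repeated at every node of the tree, its cost would be paid again and again, which is the overhead a scalable classifier has to avoid.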
2.1.2 Splits for Categorical Attribute
2.2 Tree Pruning
3 SLIQ Classifier
SLIQ sorts the values of each numeric attribute only once, before tree building starts, instead of re-sorting the data at every node. To achieve this pre-sorting, we use the following data structures. We create a separate list for each attribute of the training data. Additionally, a separate list, called the class list, is created for the class labels attached to the examples. An entry in an attribute list has two fields: one contains an attribute value, the other an index into the class list. An entry of the class list also has two fields: one contains a class label, the other a reference to a leaf node of the decision tree. The i-th entry of the class list corresponds to the i-th example in the training data.

Each leaf node of the decision tree represents a partition of the training data, the partition being defined by the conjunction of the predicates on the path from the node to the root. Thus, the class list can at any time identify the partition to which an example belongs. We assume that there is enough memory to keep the class list memory-resident. Attribute lists are written to disk if necessary.
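The lists described above can be sketched as follows. This is an in-memory illustration under a few assumptions: all attributes are numeric, leaf nodes are identified by a plain integer id, and everything fits in memory (whereas the attribute lists may really live on disk). The names `AttributeEntry`, `ClassEntry` and `build_lists` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AttributeEntry:
    value: float       # attribute value of one example
    class_index: int   # index of that example in the class list

@dataclass
class ClassEntry:
    label: str         # class label of the example
    leaf: int          # id of the leaf node the example currently belongs to

def build_lists(rows, labels, root_leaf=0):
    """Build one pre-sorted attribute list per attribute, plus the class list."""
    # Initially every example belongs to the root, which is the only leaf.
    class_list = [ClassEntry(label, root_leaf) for label in labels]
    attribute_lists = []
    for a in range(len(rows[0])):
        entries = [AttributeEntry(row[a], i) for i, row in enumerate(rows)]
        entries.sort(key=lambda e: e.value)  # sorted once, up front
        attribute_lists.append(entries)
    return attribute_lists, class_list

rows = [(30, 65.0), (23, 15.0), (40, 75.0)]
labels = ["G", "B", "G"]
attribute_lists, class_list = build_lists(rows, labels)
for entry in attribute_lists[0]:
    print(entry.value, class_list[entry.class_index].label)
```

Because each attribute entry carries an index into the class list, the attribute lists can be sorted independently without losing track of which example, class label and leaf a value belongs to.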