ID3/C4.5/Gini Index

*/-->

ID3/C4.5/Gini Index

1 ID3

Select the attribute with the highest information gain.

1.1 Formula

1.2 Formula

Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |C(i,D)|/|D|.
Expected information need to classify a tuple in D:
$$Info(D) = -\sum\limits_{i=1}^n{P_i}log_2P_i$$
Information needed (after using A to split D into v partitions)to classify D:
$$Info_A(D) = \sum\limits_{j=1}^v\frac{|D_j|}{|D|}Info(D_j)$$
So, information gained by branching on attribute A:
$$Gain(A) = Info(D)-Info_A(D)$$

1.3 Example

age income Student creditrating buyscomputer
<= 30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes excellent yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no

Class P:buyscomputer = "yes"
Class N:buyscomputer = "no"
the number of classs P is 9,while the number of class N is 5.

So,$$Info(D) = -\frac{9}{14}log_2\frac{9}{14} - \frac{5}{14}log_2\frac{5}{14} = 0.940$$

In Attribute age, the number of class P is 2,while the number of class N is 3.
So,$$Info(D_{
Similarly,
$$Info(D_{31...40}) = 0$$,$$Info(D_{>40}) = 0.971$$
Then,$$Info_{age}(D) = \frac{5}{14}Info(D_{40}) = 0.694$$
Therefore,$$Gain(age) = Info(D) - Info_age(D) = 0.246$$
Similarly,
$$Gain(income) = 0.029$$
$$Gain(Student) = 0.151$$
$$Gain(credit_rating) = 0.048$$

1.4 Question

What if the attribute's value is continuous? How can we handle it?
1.The best split point for A
+Sort the value A in increasing order
+Typically, the midpoint between each pair of adjacent values is considered as a possible split point
-(a i +a i+1 )/2 is the midpoint between the values of a i and a i+1
+The point with the minimum expected information requirement for A is selected as the split-point for A
2.Split:
+D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point.

2 C4.5

C4.5 is a successor of ID3.

2.1 Formula

$$SpiltInfo_A(D) = -\sum\limits_{j=1}^v\frac{|D_j|}{|D|}*log_2\frac{|D_j|}{|D|}$$
Then the GainRatio equals to,
$$GainRatio(A=Gain(A)/SplitInfo(A)$$
The attribute with the maximun gain ratio is selected as the splitting attribute.

3 Gini Index

3.1 Formula

If a data set D contains examples from n classes, gini index gini(D) is defined as
$$gini(D) = 1 - \sum\limits_{j=1}^nP_j^2$$
where pj is the relative frequency of class j in D.
If Data set D is split on A which have n classes.Then
$$gini_A(D) = \sum\limits_{i=1}^n\frac{D_i}{D}gini(D_i)$$
Reduction in Impurity
$$\Delta gini(A) = gini(D)-gini_A(D)$$

4 Summary

ID3/C4.5 isn't suitable for large amount of trainning set,because they have to repeat to sort and scan training set for many times. That will cost much time than other classification alogrithms.
The three measures,in general, return good results but
1.Information gain:
-biased towards multivalued attributes
2.Gain ratio:
-tends to prefer unbalanced splits in which one partition is much smaller than the other.
3.Gini index:
-biased towards multivalued attributes
-has difficulty when # of classes is large
-tends to favor tests that result in equal-sized partitions and purity in both partitions.

5 Other Attribute Selection Measures

1.CHAID: a popular decision tree algorithm, measure based on χ 2 test for independence
2.C-SEP: performs better than info. gain and gini index in certain cases
3.G-statistics: has a close approximation to χ 2 distribution
4.MDL (Minimal Description Length) principle (i.e., the simplest solution
is preferred):
The best tree as the one that requires the fewest # of bits to both
(1) encode the tree, and (2) encode the exceptions to the tree
5.Multivariate splits (partition based on multiple variable combinations)
CART: finds multivariate splits based on a linear comb. of attrs.
Which attribute selection measure is the best?
Most give good results, none is significantly superior than others

Author: mlhy

Created: 2015-10-06 二 14:39

Emacs 24.5.1 (Org mode 8.2.10)

ID3/C4.5/Gini Index的更多相关文章

  1. Theoretical comparison between the Gini Index and Information Gain criteria

    Knowledge Discovery in Databases (KDD) is an active and important research area with the promise for ...

  2. 决策树(ID3,C4.5,CART)原理以及实现

    决策树 决策树是一种基本的分类和回归方法.决策树顾名思义,模型可以表示为树型结构,可以认为是if-then的集合,也可以认为是定义在特征空间与类空间上的条件概率分布. [图片上传失败...(image ...

  3. ID3\C4.5\CART

    目录 树模型原理 ID3 C4.5 CART 分类树 回归树 树创建 ID3.C4.5 多叉树 CART分类树(二叉) CART回归树 ID3 C4.5 CART 特征选择 信息增益 信息增益比 基尼 ...

  4. 多分类度量gini index

    第一份工作时, 基于 gini index 写了一份决策树代码叫ctree, 用于广告推荐. 今天想起来, 好像应该有开源的其他方法了. 参考 https://www.cnblogs.com/mlhy ...

  5. 决策树模型 ID3/C4.5/CART算法比较

    决策树模型在监督学习中非常常见,可用于分类(二分类.多分类)和回归.虽然将多棵弱决策树的Bagging.Random Forest.Boosting等tree ensembel 模型更为常见,但是“完 ...

  6. 用于分类的决策树(Decision Tree)-ID3 C4.5

    决策树(Decision Tree)是一种基本的分类与回归方法(ID3.C4.5和基于 Gini 的 CART 可用于分类,CART还可用于回归).决策树在分类过程中,表示的是基于特征对实例进行划分, ...

  7. 决策树 ID3 C4.5 CART(未完)

    1.决策树 :监督学习 决策树是一种依托决策而建立起来的一种树. 在机器学习中,决策树是一种预测模型,代表的是一种对象属性与对象值之间的一种映射关系,每一个节点代表某个对象,树中的每一个分叉路径代表某 ...

  8. 21.决策树(ID3/C4.5/CART)

    总览 算法   功能  树结构  特征选择  连续值处理 缺失值处理  剪枝  ID3  分类  多叉树  信息增益   不支持 不支持  不支持 C4.5  分类  多叉树  信息增益比   支持 ...

  9. ID3,C4.5和CART三种决策树的区别

    ID3决策树优先选择信息增益大的属性来对样本进行划分,但是这样的分裂节点方法有一个很大的缺点,当一个属性可取值数目较多时,可能在这个属性对应值下的样本只有一个或者很少个,此时它的信息增益将很高,ID3 ...

随机推荐

  1. 77.Q表达式详解

    Q表达式可以包裹查询条件,可以在多个条件之间进行操作:与或非等.Q表达式一般会放在filter()中进行使用,F表达式一般是放在update()中进行使用. 定义模型的models.py文件中,示例代 ...

  2. java课程课后作业190530之用户体验评价

    每个人评价一下大家手头正在使用输入法或者搜索类的软件产品. 从用户界面.记住用户选择.短期刺激.长期使用的好处坏处.不要让用户犯简单的错误四个方面发表一篇博客. 输入法:苹果自带的输入法 用户界面:简 ...

  3. Python Learning Day6

    selenium操作 点击.清除操作 from selenium import webdriver from selenium.webdriver.common.keys import Keys im ...

  4. 二十八、CI框架之自己写分页类,符合CI的路径规范

    一.参照了CSDN上某个前辈写的一个CI分页类,自己删删改改仿写了一个类似的分页类,代码如下: 二.我们在模型里面写2个数据查询的函数,一个用于查询数据数量,一个用于查询出具体数据 三.我们在控制器里 ...

  5. Swift 3必看:从使用场景了解GCD新API

    https://www.jianshu.com/p/fc78dab5736f 2016.10.06 21:59* 在学习Swift 3的过程中整理了一些笔记,如果想看其他相关文章可前往<Swif ...

  6. argv从控制台输入多个参数

    arg多个参数: #!/usr/bin/env python3 import sys #控制台要输入的两个参数格式为:python script_name.py 参数1 参数2 input_file= ...

  7. E - Third-Party Software - 2 Gym - 102215E (贪心)

    Pavel is developing another game. To do that, he again needs functions available in a third-party li ...

  8. 修改完Apache的配置文件,重启Apache后,仍无法打开网页

    在修改Apache的配置文件时,由于某些非正常操作,导致httpd.conf文件非正常打开,需要继续enter进入, 这是会在httpd.conf同级目录中产生一个隐藏文件,.httpd.conf.s ...

  9. ACwing算法基础课听课笔记(第一章,基础算法二)(差分)

    前缀和以及二维前缀和在这里就不写了. 差分:是前缀和的逆运算 ACWING二维差分矩阵    每一个二维数组上的元素都可以用(x,y)表示,对于某一元素(x0,y0),其前缀和就是以该点作为右下角以整 ...

  10. GTK入门

    环境准备 官网下载 GTK 源码包,因为本机 GLib 版本不够,下载一个非最新版的 GTK3.8.0 先学习用 直接阅读 "/gtk+-3.8.0/docs/reference/gtk/h ...