ID3/C4.5/Gini Index

*/-->

ID3/C4.5/Gini Index

1 ID3

Select the attribute with the highest information gain.

1.1 Formula

1.2 Formula

Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |C(i,D)|/|D|.
Expected information need to classify a tuple in D:
$$Info(D) = -\sum\limits_{i=1}^n{P_i}log_2P_i$$
Information needed (after using A to split D into v partitions)to classify D:
$$Info_A(D) = \sum\limits_{j=1}^v\frac{|D_j|}{|D|}Info(D_j)$$
So, information gained by branching on attribute A:
$$Gain(A) = Info(D)-Info_A(D)$$

1.3 Example

age income Student creditrating buyscomputer
<= 30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes excellent yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no

Class P:buyscomputer = "yes"
Class N:buyscomputer = "no"
the number of classs P is 9,while the number of class N is 5.

So,$$Info(D) = -\frac{9}{14}log_2\frac{9}{14} - \frac{5}{14}log_2\frac{5}{14} = 0.940$$

In Attribute age, the number of class P is 2,while the number of class N is 3.
So,$$Info(D_{
Similarly,
$$Info(D_{31...40}) = 0$$,$$Info(D_{>40}) = 0.971$$
Then,$$Info_{age}(D) = \frac{5}{14}Info(D_{40}) = 0.694$$
Therefore,$$Gain(age) = Info(D) - Info_age(D) = 0.246$$
Similarly,
$$Gain(income) = 0.029$$
$$Gain(Student) = 0.151$$
$$Gain(credit_rating) = 0.048$$

1.4 Question

What if the attribute's value is continuous? How can we handle it?
1.The best split point for A
+Sort the value A in increasing order
+Typically, the midpoint between each pair of adjacent values is considered as a possible split point
-(a i +a i+1 )/2 is the midpoint between the values of a i and a i+1
+The point with the minimum expected information requirement for A is selected as the split-point for A
2.Split:
+D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point.

2 C4.5

C4.5 is a successor of ID3.

2.1 Formula

$$SpiltInfo_A(D) = -\sum\limits_{j=1}^v\frac{|D_j|}{|D|}*log_2\frac{|D_j|}{|D|}$$
Then the GainRatio equals to,
$$GainRatio(A=Gain(A)/SplitInfo(A)$$
The attribute with the maximun gain ratio is selected as the splitting attribute.

3 Gini Index

3.1 Formula

If a data set D contains examples from n classes, gini index gini(D) is defined as
$$gini(D) = 1 - \sum\limits_{j=1}^nP_j^2$$
where pj is the relative frequency of class j in D.
If Data set D is split on A which have n classes.Then
$$gini_A(D) = \sum\limits_{i=1}^n\frac{D_i}{D}gini(D_i)$$
Reduction in Impurity
$$\Delta gini(A) = gini(D)-gini_A(D)$$

4 Summary

ID3/C4.5 isn't suitable for large amount of trainning set,because they have to repeat to sort and scan training set for many times. That will cost much time than other classification alogrithms.
The three measures,in general, return good results but
1.Information gain:
-biased towards multivalued attributes
2.Gain ratio:
-tends to prefer unbalanced splits in which one partition is much smaller than the other.
3.Gini index:
-biased towards multivalued attributes
-has difficulty when # of classes is large
-tends to favor tests that result in equal-sized partitions and purity in both partitions.

5 Other Attribute Selection Measures

1.CHAID: a popular decision tree algorithm, measure based on χ 2 test for independence
2.C-SEP: performs better than info. gain and gini index in certain cases
3.G-statistics: has a close approximation to χ 2 distribution
4.MDL (Minimal Description Length) principle (i.e., the simplest solution
is preferred):
The best tree as the one that requires the fewest # of bits to both
(1) encode the tree, and (2) encode the exceptions to the tree
5.Multivariate splits (partition based on multiple variable combinations)
CART: finds multivariate splits based on a linear comb. of attrs.
Which attribute selection measure is the best?
Most give good results, none is significantly superior than others

Author: mlhy

Created: 2015-10-06 二 14:39

Emacs 24.5.1 (Org mode 8.2.10)

ID3/C4.5/Gini Index的更多相关文章

  1. Theoretical comparison between the Gini Index and Information Gain criteria

    Knowledge Discovery in Databases (KDD) is an active and important research area with the promise for ...

  2. 决策树(ID3,C4.5,CART)原理以及实现

    决策树 决策树是一种基本的分类和回归方法.决策树顾名思义,模型可以表示为树型结构,可以认为是if-then的集合,也可以认为是定义在特征空间与类空间上的条件概率分布. [图片上传失败...(image ...

  3. ID3\C4.5\CART

    目录 树模型原理 ID3 C4.5 CART 分类树 回归树 树创建 ID3.C4.5 多叉树 CART分类树(二叉) CART回归树 ID3 C4.5 CART 特征选择 信息增益 信息增益比 基尼 ...

  4. 多分类度量gini index

    第一份工作时, 基于 gini index 写了一份决策树代码叫ctree, 用于广告推荐. 今天想起来, 好像应该有开源的其他方法了. 参考 https://www.cnblogs.com/mlhy ...

  5. 决策树模型 ID3/C4.5/CART算法比较

    决策树模型在监督学习中非常常见,可用于分类(二分类.多分类)和回归.虽然将多棵弱决策树的Bagging.Random Forest.Boosting等tree ensembel 模型更为常见,但是“完 ...

  6. 用于分类的决策树(Decision Tree)-ID3 C4.5

    决策树(Decision Tree)是一种基本的分类与回归方法(ID3.C4.5和基于 Gini 的 CART 可用于分类,CART还可用于回归).决策树在分类过程中,表示的是基于特征对实例进行划分, ...

  7. 决策树 ID3 C4.5 CART(未完)

    1.决策树 :监督学习 决策树是一种依托决策而建立起来的一种树. 在机器学习中,决策树是一种预测模型,代表的是一种对象属性与对象值之间的一种映射关系,每一个节点代表某个对象,树中的每一个分叉路径代表某 ...

  8. 21.决策树(ID3/C4.5/CART)

    总览 算法   功能  树结构  特征选择  连续值处理 缺失值处理  剪枝  ID3  分类  多叉树  信息增益   不支持 不支持  不支持 C4.5  分类  多叉树  信息增益比   支持 ...

  9. ID3,C4.5和CART三种决策树的区别

    ID3决策树优先选择信息增益大的属性来对样本进行划分,但是这样的分裂节点方法有一个很大的缺点,当一个属性可取值数目较多时,可能在这个属性对应值下的样本只有一个或者很少个,此时它的信息增益将很高,ID3 ...

随机推荐

  1. 5 ~ express ~ 连接数据库

    1, 在schema 目录创建 users.js 文件,通过 mongoose 模块来操作数据库 2,  在定义 users 表结构之前,需要让应用支持或连接数据库 . 所以要在应用的入口文件 app ...

  2. html特殊字符的写法

    符号 写法 (空格)   <(小于号) < >(大于号) > " " ®(已注册) ® ©(版权) © ™(商标) ™ (半方大的空白)   (全方大的空白 ...

  3. python3.7使用etree遇到的问题

    使用python3.6时安装好lxml时按照许多网上的教程来引入会发现etree没被引入进来 解决办法: 一.import lxml.htmletree = lxml.html.etree这样就可以使 ...

  4. 颜色设置 <color name="white">#FFFFFF</color><!--白色 -->

    <?xml version="1.0" encoding="utf-8"?> <resources> <color name=&q ...

  5. JAVAEE 和项目开发(第二课:HTTP协议的特点和交互流程)

    HTTP 的概念和介绍 概念:超文本传输协议(Hyper Text Transfer Protocol) 作用:规范了浏览器和服务器的数据交互 特点: 简单快速:客户向服务器请求服务时,只需传送请求方 ...

  6. JavaBean和json数据之间的转换(一)简单的JavaBean转换

    1.为什么要使用json? JSON(JavaScript Object Notation) 是一种轻量级的数据交换格式,因为其高性能.可读性强的原因,成为了现阶段web开发中前后端交互数据的主要数据 ...

  7. 一天一个设计模式——Prototype 原型模式

    一.模式说明 看了比较多的资料,对原型模式写的比较复杂,个人的理解就是模型复制,根据现有的类来直接创建新的类,而不是调用类的构造函数. 那为什么不直接调用new方法来创建类的实例呢,主要一个原因是如果 ...

  8. soupui--替换整个case的url

    添加新的URL 随便进入一个case的[REST]step,添加新的url 更换URL 添加完之后双击想要更换url的case,在弹出的窗口中点击URL按钮 在弹出的set endpoint窗口中选择 ...

  9. 备份 分区表 mbr

    备份方法:   1.备份分区表信息 sudo fdisk -l >hda.txt  #分区表信息重定向输出到文件中 2.备份MBR linux@linux-desktop:~/ex$ sudo ...

  10. 强大的代码生成器——T4模板

    T4 Editor工具下载地址 tangible T4 Editor 2.5.0 plus modeling tools for VS 2019 https://marketplace.visuals ...