1. ID3 Algorithm

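ID3 is a classic decision tree algorithm; C4.5 was developed directly from it, and CART is a closely related method. In a decision tree, each leaf node carries a class label and each non-leaf node holds an attribute test condition. To classify a record, start at the root, apply the node's test condition to the record, and follow the branch that matches the outcome; repeat until a leaf is reached. The class label of that leaf is the record's class.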
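
The traversal described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree is assumed to be stored as nested dictionaries of the form {feature_name: {feature_value: subtree_or_label}}, which is the shape that createTree in the source code of this post returns, and the names classify, tree, feature_names and record are hypothetical.

```python
def classify(tree, feature_names, record):
    """Walk a nested-dict decision tree from the root to a leaf.

    Assumed layout (for illustration only):
    {feature_name: {feature_value: subtree_or_class_label}}
    """
    node = tree
    # A dict is an internal node (an attribute test); anything else is a leaf.
    while isinstance(node, dict):
        feature = next(iter(node))                    # attribute tested at this node
        value = record[feature_names.index(feature)]  # the record's value for it
        node = node[feature][value]                   # follow the matching branch
    return node                                       # the leaf holds the class label


# Hand-written example tree over two binary features.
tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
feature_names = ['no surfacing', 'flippers']

print(classify(tree, feature_names, [1, 1]))  # yes
print(classify(tree, feature_names, [1, 0]))  # no
print(classify(tree, feature_names, [0, 1]))  # no
```

Note that this sketch raises a KeyError when a record has a feature value that never appeared on that branch during training; a real implementation would need a fallback for that case.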

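ID3 uses information gain as the measure for choosing the attribute to split on: the best split is the one that maximizes information gain.

information gain = entropy of the parent node - weighted sum of the entropies of the child nodes, where each child's entropy is weighted by the fraction of the parent's records that fall into it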
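
As a worked example of this formula, the sketch below computes the information gain of each feature on the five-record toy dataset used in the source code of this post. The helper names entropy and information_gain are hypothetical and are not taken from [1]; the last column of each row is assumed to be the class label.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, axis):
    """Parent entropy minus the size-weighted entropy of the children
    obtained by splitting `rows` on the feature in column `axis`."""
    parent = entropy([row[-1] for row in rows])
    # Group the class labels by the value of the chosen feature.
    children = {}
    for row in rows:
        children.setdefault(row[axis], []).append(row[-1])
    weighted = sum(len(c) / len(rows) * entropy(c) for c in children.values())
    return parent - weighted


rows = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(round(information_gain(rows, 0), 3))  # 0.42
print(round(information_gain(rows, 1), 3))  # 0.171
```

The first feature has the larger gain, so ID3 would split on it at the root.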

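The ID3 procedure is as follows:

1. If all records at the node have the same class label, stop splitting;

2. If no features are left to split on, label the node by majority vote and stop splitting;

3. Otherwise, choose the feature that gives the best split, create one branch for each value of that feature, and build the subtrees recursively.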

Source code (taken from [1]):

from math import log
import operator
import matplotlib.pyplot as plt

def calcEntropy(dataSet):
    """calculate the shannon entropy of the class labels (last column)"""
    numEntries=len(dataSet)
    #count how many entries carry each class label
    labelCounts={}
    for entry in dataSet:
        entry_label=entry[-1]
        if entry_label not in labelCounts:
            labelCounts[entry_label]=0
        labelCounts[entry_label]+=1
    #entropy = -sum(p*log2(p)) over the label frequencies
    entropy=0.0
    for key in labelCounts:
        prob=float(labelCounts[key])/numEntries
        entropy-=prob*log(prob,2)
    return entropy

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing','flippers']
    return dataSet, labels

def splitDataSet(dataSet,axis,pivot):
    """keep the entries whose feature `axis` equals `pivot`, dropping that feature column"""
    retDataSet=[]
    for entry in dataSet:
        if entry[axis]==pivot:
            #copy the entry without the column that was split on
            reduced_entry=entry[:axis]
            reduced_entry.extend(entry[axis+1:])
            retDataSet.append(reduced_entry)
    return retDataSet

def bestFeatureToSplit(dataSet):
    """choose the best feature to split on (returns -1 if no split has positive gain)"""
    numFeatures=len(dataSet[0])-1
    baseEntropy=calcEntropy(dataSet)
    bestInfoGain=0.0; bestFeature=-1
    for axis in range(numFeatures):
        #collect the distinct values of this feature
        featureList=[entry[axis] for entry in dataSet]
        uniqueFeaList=set(featureList)
        #weighted entropy of the subsets produced by splitting on this feature
        newEntropy=0.0
        for value in uniqueFeaList:
            subDataSet=splitDataSet(dataSet,axis,value)
            prob=float(len(subDataSet))/len(dataSet)
            newEntropy+=prob*calcEntropy(subDataSet)
        infoGain=baseEntropy-newEntropy
        #keep the feature with the largest information gain
        if infoGain>bestInfoGain:
            bestInfoGain=infoGain
            bestFeature=axis
    return bestFeature

def majorityVote(classList):
    """take a majority vote"""
    classCount={}
    for vote in classList:
        if vote not in classCount:
            classCount[vote]=0
        classCount[vote]+=1
    #sort (label, count) pairs by count, largest first
    sortedClassCount=sorted(classCount.items(),
            key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet,labels):
    classList=[entry[-1] for entry in dataSet]
    #stop when all classes are equal
    if classList.count(classList[0])==len(classList):
        return classList[0]
    #when no more features, return majority vote
    if len(dataSet[0])==1:
        return majorityVote(classList)
    bestFeature=bestFeatureToSplit(dataSet)
    #no feature gives a positive information gain: fall back to majority vote
    if bestFeature==-1:
        return majorityVote(classList)
    bestFeatLabel=labels[bestFeature]
    myTree={bestFeatLabel:{}}
    #copy the labels without the chosen one, so that neither the caller's
    #list nor the label list seen by sibling branches is modified
    subLabels=labels[:bestFeature]+labels[bestFeature+1:]
    featureList=[entry[bestFeature] for entry in dataSet]
    uniqueFeaList=set(featureList)
    #split dataset according to the values of the best feature
    for value in uniqueFeaList:
        subDataSet=splitDataSet(dataSet,bestFeature,value)
        myTree[bestFeatLabel][value]=createTree(subDataSet,subLabels)
    return myTree

[Figure: visualization of the classification result]
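
One lightweight way to inspect a tree without plotting is to print it as indented text. The sketch below is an assumption-laden illustration rather than the visualization from [1]: format_tree is a hypothetical helper, and the tree literal is written by hand in the nested-dictionary shape that createTree returns.

```python
def format_tree(tree, indent=0):
    """Return the lines of an indented text rendering of a nested-dict tree."""
    pad = '  ' * indent
    if not isinstance(tree, dict):            # leaf: a class label
        return [pad + '-> ' + str(tree)]
    lines = []
    for feature, branches in tree.items():    # one tested feature per internal node
        for value, subtree in branches.items():
            lines.append('{}{} = {}'.format(pad, feature, value))
            lines.extend(format_tree(subtree, indent + 1))
    return lines


tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
print('\n'.join(format_tree(tree)))
# no surfacing = 0
#   -> no
# no surfacing = 1
#   flippers = 0
#     -> no
#   flippers = 1
#     -> yes
```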

2. References

[1] Peter Harrington, Machine Learning in Action.
