Machine Learning in Action (2): The Decision Tree Algorithm

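A decision tree is another supervised machine learning method. In the film Inglourious Basterds there is a scene in a German tavern where several people play a game of twenty questions. One player (the setter) draws a card naming a target, which may be a person or a thing. The guesser asks questions, and the setter may answer only yes or no. After a few questions (twenty at most), the guesser has narrowed the range step by step and pins down the answer. A decision tree works in much the same way. (Figure 1) shows how a tree decides the category of an email. The classification procedure itself is simple, mostly a series of threshold tests; the key question is how to construct the tree, that is, how to train it.

(Figure 1)

The pseudocode for building a decision tree is as follows: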

    Check if every item in the dataset is in the same class:
        If so, return the class label
        Else
            find the best feature to split the data
            split the dataset
            create a branch node
            for each split
                call createBranch and add the result to the branch node
            return branch node

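There is only one guiding principle: the samples in each node should carry as few distinct class labels as possible. Note the line in the pseudocode above that says "find the best feature to split the data". How do we find the best feature? The usual criterion is to make the child nodes as pure as possible after the split, in other words, to make the split as accurate as possible. As shown in (Figure 2), five animals are taken from the sea and we must decide whether each one is a fish. Which feature should we test first?

(Figure 2)

To improve accuracy, should we test "can survive on land" first, or "has flippers"? We need a measure to decide. Common choices include information-theoretic measures (entropy) and Gini impurity; here we use the former. The goal is to pick the feature that gives the largest information gain on the labels after the split. Information gain is the entropy of the labels in the original dataset (the base entropy) minus the entropy of the labels in the subsets produced by the split, each subset weighted by its share of the samples. In other words, a large information gain means the entropy has dropped and the dataset has become more ordered. Entropy is computed as shown in (Formula 1):

(Formula 1)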
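(Formula 1) was an image in the original and is not reproduced here. For reference, the standard Shannon entropy, which is what the code below computes, and the information gain described above can be written as follows. The symbols $D$ (dataset), $p_k$ (fraction of samples with label $k$), $A$ (a feature), and $D_v$ (the subset where $A = v$) are notation assumed for this sketch, not taken from the original figure.

```latex
H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k

\mathrm{Gain}(D, A) = H(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|} \, H(D_v)
```

A node whose samples all share one label has $H(D) = 0$; a two-class node split evenly has $H(D) = 1$ bit.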

With the guiding principle in place, let's move on to the code. First, the entropy calculation:

    from math import log

    def calcShannonEnt(dataSet):
        """Return the Shannon entropy of the class labels (last column) of dataSet."""
        numEntries = len(dataSet)
        labelCounts = {}
        for featVec in dataSet:  # count how many times each class label occurs
            currentLabel = featVec[-1]
            if currentLabel not in labelCounts:
                labelCounts[currentLabel] = 0
            labelCounts[currentLabel] += 1
        shannonEnt = 0.0
        for key in labelCounts:
            prob = float(labelCounts[key]) / numEntries
            shannonEnt -= prob * log(prob, 2)  # accumulate -p * log2(p)
        return shannonEnt
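To get a feel for the numbers, here is a minimal, self-contained sketch of the same entropy calculation applied to a small dataset. The five rows are an assumption: they follow the fish example described around Figure 2 (two features, last column is the label), but the original figure is not available, so the exact values are illustrative. The function name `shannon_entropy` is also hypothetical.

```python
from collections import Counter
from math import log2

def shannon_entropy(rows):
    """Entropy of the class labels stored in the last column of each row."""
    total = len(rows)
    counts = Counter(row[-1] for row in rows)
    return sum(-(n / total) * log2(n / total) for n in counts.values())

# Assumed sample data: [feature 0, feature 1, is_fish]
fish_data = [
    [1, 1, 'yes'],
    [1, 1, 'yes'],
    [1, 0, 'no'],
    [0, 1, 'no'],
    [0, 1, 'no'],
]

mixed = shannon_entropy(fish_data)      # 2 'yes' and 3 'no' labels
pure = shannon_entropy(fish_data[3:])   # only 'no' labels
print(round(mixed, 4))  # 0.971
print(pure == 0)        # True
```

The mixed set has an entropy of about 0.971 bits, while a subset with a single label has entropy zero, which is what a good split aims for.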

With the entropy code in place, next comes the code that chooses a feature by maximizing information gain:

    def splitDataSet(dataSet, axis, value):
        """Return the rows whose feature `axis` equals `value`, with that column removed."""
        retDataSet = []
        for featVec in dataSet:
            if featVec[axis] == value:
                reducedFeatVec = featVec[:axis]  # chop out the axis used for splitting
                reducedFeatVec.extend(featVec[axis+1:])
                retDataSet.append(reducedFeatVec)
        return retDataSet

    def chooseBestFeatureToSplit(dataSet):
        """Return the index of the feature with the largest information gain (-1 if no split helps)."""
        numFeatures = len(dataSet[0]) - 1  # the last column holds the labels
        baseEntropy = calcShannonEnt(dataSet)
        bestInfoGain = 0.0
        bestFeature = -1
        for i in range(numFeatures):  # iterate over all the features
            featList = [example[i] for example in dataSet]  # every value of this feature
            uniqueVals = set(featList)  # the distinct values of this feature
            newEntropy = 0.0
            for value in uniqueVals:
                subDataSet = splitDataSet(dataSet, i, value)
                prob = len(subDataSet) / float(len(dataSet))
                newEntropy += prob * calcShannonEnt(subDataSet)  # weighted entropy after the split
            infoGain = baseEntropy - newEntropy  # information gain, i.e. the reduction in entropy
            if infoGain > bestInfoGain:  # keep the feature with the largest gain so far
                bestInfoGain = infoGain
                bestFeature = i
        return bestFeature  # an integer index
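As a worked example of the selection step, the sketch below computes the information gain of each feature on a small dataset and picks the larger one. The data is the same assumed five-row fish example (the original Figure 2 is not available), and the helper names `entropy` and `info_gain` are hypothetical, written compactly so the block runs on its own.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, axis):
    """Base entropy minus the size-weighted entropy of the subsets split on column `axis`."""
    base = entropy([row[-1] for row in rows])
    subsets = {}
    for row in rows:
        subsets.setdefault(row[axis], []).append(row[-1])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return base - remainder

# Assumed sample data: [feature 0, feature 1, is_fish]
fish_data = [
    [1, 1, 'yes'],
    [1, 1, 'yes'],
    [1, 0, 'no'],
    [0, 1, 'no'],
    [0, 1, 'no'],
]

gains = [info_gain(fish_data, i) for i in range(2)]
best = max(range(2), key=lambda i: gains[i])
print([round(g, 3) for g in gains])  # [0.42, 0.171]
print(best)                          # 0
```

Splitting on feature 0 leaves one pure subset (two 'no' rows) and one mixed subset, for a gain of about 0.42 bits. Splitting on feature 1 leaves a four-row subset that is evenly mixed, so its gain is only about 0.171 bits, and feature 0 is chosen.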

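As the last if statement shows, the feature with the largest information gain is chosen as the splitting feature. With the splitting criterion settled, the next step is to build the decision tree. This simply follows the pseudocode at the top: split recursively until no further split is needed, and the tree is complete. The code is as follows: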

    import operator

    def majorityCnt(classList):
        """Return the class label that occurs most often in classList."""
        classCount = {}
        for vote in classList:
            if vote not in classCount:
                classCount[vote] = 0
            classCount[vote] += 1
        # items() replaces iteritems(), which exists only in Python 2
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]

    def createTree(dataSet, labels):
        """Recursively build the tree as nested dicts: {featureLabel: {featureValue: subtree or class}}."""
        classList = [example[-1] for example in dataSet]
        if classList.count(classList[0]) == len(classList):
            return classList[0]  # stop splitting when all of the classes are equal
        if len(dataSet[0]) == 1:  # stop splitting when there are no more features in dataSet
            return majorityCnt(classList)
        bestFeat = chooseBestFeatureToSplit(dataSet)
        if bestFeat == -1:  # no feature reduces the entropy, so fall back to a majority vote
            return majorityCnt(classList)
        bestFeatLabel = labels[bestFeat]
        myTree = {bestFeatLabel: {}}
        # build a new list instead of del(labels[bestFeat]) so the caller's labels are not modified
        subLabels = labels[:bestFeat] + labels[bestFeat+1:]
        featValues = [example[bestFeat] for example in dataSet]
        uniqueVals = set(featValues)
        for value in uniqueVals:
            myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
        return myTree

The decision tree built from the samples in Figure 2 is shown in (Figure 3):

(Figure 3)
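Since Figure 3 is not reproduced here, the following compact, self-contained sketch shows what the nested-dict form of such a tree looks like. It is a condensed rewrite of the recursive construction, not the code above verbatim. The five rows and the feature names `'no surfacing'` and `'flippers'` are assumptions modeled on the fish example, so the printed tree may differ in detail from the original figure.

```python
from collections import Counter
from math import log2

def entropy(rows):
    total = len(rows)
    counts = Counter(row[-1] for row in rows)
    return sum(-(n / total) * log2(n / total) for n in counts.values())

def split(rows, axis, value):
    """Rows whose column `axis` equals `value`, with that column removed."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]

def best_feature(rows):
    """Index of the feature with the largest information gain, or -1 if none helps."""
    base, best_gain, best = entropy(rows), 0.0, -1
    for i in range(len(rows[0]) - 1):
        remainder = 0.0
        for value in set(row[i] for row in rows):
            subset = split(rows, i, value)
            remainder += len(subset) / len(rows) * entropy(subset)
        if base - remainder > best_gain:
            best_gain, best = base - remainder, i
    return best

def build_tree(rows, labels):
    classes = [row[-1] for row in rows]
    best = best_feature(rows) if len(set(classes)) > 1 else -1
    if best == -1:  # pure node, no features left, or no useful split: return the majority class
        return Counter(classes).most_common(1)[0][0]
    remaining = labels[:best] + labels[best + 1:]
    values = sorted(set(row[best] for row in rows))
    return {labels[best]: {v: build_tree(split(rows, best, v), remaining) for v in values}}

# Assumed sample data: [no surfacing, flippers, is_fish]
fish_data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
labels = ['no surfacing', 'flippers']

tree = build_tree(fish_data, labels)
print(tree)
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```

Each dict key at an odd nesting level is a feature name, each key below it is a feature value, and any non-dict value is a leaf holding the predicted class.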

With a decision tree in hand, we can use it for classification. The classification code is as follows:

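    def classify(inputTree, featLabels, testVec):
        """Classify testVec by walking the nested-dict tree down to a leaf."""
        firstStr = next(iter(inputTree))  # feature tested at this node; keys()[0] fails in Python 3
        secondDict = inputTree[firstStr]
        featIndex = featLabels.index(firstStr)  # position of that feature in testVec
        key = testVec[featIndex]
        valueOfFeat = secondDict[key]
        if isinstance(valueOfFeat, dict):  # a dict is a subtree, so keep descending
            classLabel = classify(valueOfFeat, featLabels, testVec)
        else:
            classLabel = valueOfFeat  # anything else is a leaf holding the class label
        return classLabel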
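A short usage sketch may make the traversal concrete. The tree literal and the feature names below are assumptions (a hand-written tree in the nested-dict format used in this post), and the function is restated in condensed form so the block runs on its own.

```python
def classify(tree, feat_labels, test_vec):
    """Walk the nested-dict tree until a leaf (a non-dict value) is reached."""
    feature = next(iter(tree))  # the feature tested at this node
    branch = tree[feature][test_vec[feat_labels.index(feature)]]
    if isinstance(branch, dict):
        return classify(branch, feat_labels, test_vec)
    return branch

# Hypothetical tree and feature names, for illustration only
tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']

print(classify(tree, labels, [1, 0]))  # no
print(classify(tree, labels, [1, 1]))  # yes
print(classify(tree, labels, [0, 1]))  # no
```

Note that a feature value never seen during training (for example `2`) has no branch in the dict and raises a `KeyError`; handling that case is left out of this sketch.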

Finally, here is the code for serializing the decision tree (saving the model to disk):

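    import pickle

    def storeTree(inputTree, filename):
        """Save the tree to disk."""
        with open(filename, 'wb') as fw:  # pickle requires binary mode in Python 3
            pickle.dump(inputTree, fw)

    def grabTree(filename):
        """Load a tree previously saved with storeTree."""
        with open(filename, 'rb') as fr:
            return pickle.load(fr)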
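A quick round trip shows the idea in use. This is a minimal sketch: the tree literal, the file name `classifier_tree.pkl`, and the use of a temporary directory are all assumptions made for the example.

```python
import os
import pickle
import tempfile

def store_tree(tree, filename):
    with open(filename, 'wb') as fw:  # pickle writes bytes
        pickle.dump(tree, fw)

def grab_tree(filename):
    with open(filename, 'rb') as fr:
        return pickle.load(fr)

# Hypothetical tree, for illustration only
tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

path = os.path.join(tempfile.mkdtemp(), 'classifier_tree.pkl')
store_tree(tree, path)
restored = grab_tree(path)
print(restored == tree)  # True
```

Because unpickling can execute arbitrary code, only load pickle files from sources you trust.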

Advantage: classification is fast.

Disadvantage: the tree overfits easily; pruning can be used to reduce this as far as possible.
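One simple way to limit overfitting is pre-pruning: stop splitting early and return the majority class instead. The sketch below is an assumption-laden illustration, not the method of the original post. The parameters `max_depth` and `min_gain`, the function names, and the sample data are all hypothetical.

```python
from collections import Counter
from math import log2

def entropy(rows):
    total = len(rows)
    counts = Counter(row[-1] for row in rows)
    return sum(-(n / total) * log2(n / total) for n in counts.values())

def split(rows, axis, value):
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]

def best_split(rows):
    """Return (gain, index) of the best feature; index is -1 if no feature has positive gain."""
    base, best = entropy(rows), (0.0, -1)
    for i in range(len(rows[0]) - 1):
        subsets = [split(rows, i, v) for v in set(row[i] for row in rows)]
        remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets)
        if base - remainder > best[0]:
            best = (base - remainder, i)
    return best

def build_pruned_tree(rows, labels, max_depth, min_gain=0.0):
    """Stop at depth 0 or when the best gain is at most min_gain, returning the majority class."""
    majority = Counter(row[-1] for row in rows).most_common(1)[0][0]
    gain, best = best_split(rows) if max_depth > 0 else (0.0, -1)
    if best == -1 or gain <= min_gain:
        return majority
    remaining = labels[:best] + labels[best + 1:]
    values = sorted(set(row[best] for row in rows))
    return {labels[best]: {
        v: build_pruned_tree(split(rows, best, v), remaining, max_depth - 1, min_gain)
        for v in values
    }}

# Assumed sample data: [no surfacing, flippers, is_fish]
fish_data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
labels = ['no surfacing', 'flippers']

stump = build_pruned_tree(fish_data, labels, max_depth=1)
print(stump)  # {'no surfacing': {0: 'no', 1: 'yes'}}
```

With `max_depth=1` the second split is cut off and the mixed branch becomes a majority-vote leaf. On real data the depth or gain threshold would normally be chosen using a held-out validation set.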

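The content above comes from a group member's blog: http://blog.csdn.net/marvin521/article/details/9255977

PS: Thanks to their simplicity, clarity, and efficiency, decision trees are very widely used in traditional industries, commonly for churn early warning and customer target response models. They are also my favorite algorithm. The classic decision tree algorithms are CART and C5.0. The former builds binary trees and is well suited to parallelization; back in school I wrote a parallel version of it on the CUDA architecture. The latter can build multiway trees. Ensembles built on top of decision trees (bagging, AdaBoost, GBDT, bagging + random subspace = random forest, (PCA + subspace) + tree = rotation forest) improve performance considerably. As for pruning, different optimization metrics can be used, and you can choose between forward pruning (pre-pruning) and backward pruning (post-pruning).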
