Python ID3 Decision Tree Implementation
Environment: Ubuntu 16.04, Python 3.6
Data source: UCI wine_data (the classic wine classification dataset)
Key points of decision trees:
1. How to choose the split point (CART, ID3, and C4.5 each have their own splitting criterion; see the worked example after this intro)
2. How to handle non-continuous data, and how to handle missing data
3. Pruning
I tried implementing the algorithm myself partly to get familiar with Python, and partly to better understand the overall flow of the algorithm and how its key steps are handled.
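At each node, ID3 picks the feature with the largest information gain, Gain(D, A) = H(D) - sum_v |D_v|/|D| * H(D_v), where H is the Shannon entropy of the class labels. A minimal self-contained sketch on made-up toy data (the names and values here are illustrative, not from the wine set):

from math import log

def entropy(labels):
    m = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    # H(D) = -sum_k p_k * log2(p_k)
    return -sum((n / m) * log(n / m, 2) for n in counts.values())

labels  = ['yes', 'yes', 'no', 'no']   # class labels
feature = ['a', 'a', 'a', 'b']         # one discrete feature
base = entropy(labels)                 # H(D) = 1.0
groups = {}
for f, c in zip(feature, labels):
    groups.setdefault(f, []).append(c)
cond = sum(len(v) / len(labels) * entropy(v) for v in groups.values())
print(base - cond)                     # information gain, about 0.311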
from math import log
import operator
import pickle
import os
import numpy as np

# small helper: print a value together with its name while debugging
def debug(value_name, value):
    print("debugging for %s" % value_name)
    print(value)

# load the wine data; returns the samples (features plus label column) and the feature labels
def loadDateset():
    with open('./wine.data') as f:
        wine = [exam.strip().split(',') for exam in f.readlines()]
    # for i in range(len(wine)):
    #     wine[i] = list(map(float, wine[i]))
    wine = np.array(wine)
    wine_label = wine[..., :1]
    wine_data = wine[..., 1:]
    # build the feature "labels": one index per feature column
    featLabels = [i for i in range(len(wine_data[0]))]
    # move the class label to the last column
    wine_data = np.concatenate((wine_data, wine_label), axis=1)
    # the labels could be improved: ideally a dict mapping each index to an attribute name
    return wine_data, featLabels
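For reference, each line of wine.data is the class label followed by 13 continuous attributes; the first line of the file (whose attributes are reused as the test vector in __main__ below) reads:

1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065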
# Shannon entropy of the class labels in dataSet (last column is the label)
def informationEntropy(dataSet):
    m = len(dataSet)
    labelMap = {}
    for wine in dataSet:
        nowLabel = wine[-1]
        if nowLabel not in labelMap:
            labelMap[nowLabel] = 0
        labelMap[nowLabel] += 1
    shannonEnt = 0.0
    for key in labelMap:
        prob = float(labelMap[key]) / m
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
# return the subset of dataSet whose value at column `axis` equals `feature`,
# with that column removed (split out as a helper to improve reusability)
def splitDataSet(dataSet, axis, feature):
    subDataSet = []
    for featVec in dataSet:
        if featVec[axis] == feature:
            reduceVec = featVec[:axis]
            # rows may be numpy arrays; convert so we can extend() them
            if isinstance(reduceVec, np.ndarray):
                reduceVec = reduceVec.tolist()
            reduceVec.extend(featVec[axis + 1:])
            subDataSet.append(reduceVec)
    return subDataSet
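A quick sanity check of splitDataSet on a made-up list of rows (values are strings, matching how the wine rows are parsed):

rows = [['a', '1', 'yes'], ['b', '1', 'no'], ['a', '2', 'yes']]
print(splitDataSet(rows, 0, 'a'))  # [['1', 'yes'], ['2', 'yes']]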
# choose the best feature to split on: the one with the largest information gain
def chooseFeature(dataSet):
    numFeature = len(dataSet[0]) - 1
    baseEntropy = informationEntropy(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeature):
        # all values taken by feature i
        valueList = [value[i] for value in dataSet]
        uniqueVals = set(valueList)
        # conditional entropy after splitting on feature i
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            # note: the features are numeric, so each distinct value gets its own branch
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * informationEntropy(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

# majority vote over a list of class labels
def majorityCnt(classList):
    classMap = {}
    for vote in classList:
        if vote not in classMap:
            classMap[vote] = 0
        classMap[vote] += 1
    # tempMap = sorted(classMap.items(), key=operator.itemgetter(1), reverse=True)
    tempMap = sorted(classMap.items(), key=lambda x: x[1], reverse=True)
    return tempMap[0][0]
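For example, majorityCnt(['1', '1', '2']) returns '1'; when counts tie, Python's stable sort keeps the first-seen label in front.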
# the labels here are the feature indices built in loadDateset
def createTree(dataSet, featLabels):
    classList = [example[-1] for example in dataSet]
    # if all samples belong to the same class, return that class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # if no features are left, fall back to a majority vote
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseFeature(dataSet)
    bestFeatLabel = featLabels[bestFeat]
    # the tree is a nested dict keyed by feature index, then by feature value
    myTree = {bestFeatLabel: {}}
    # remove the label of the feature that has been used
    del featLabels[bestFeat]
    valueList = [example[bestFeat] for example in dataSet]
    uniqueVals = set(valueList)
    # if every sample takes the same value, there is nothing left to split on
    if len(uniqueVals) == 1:
        return majorityCnt(classList)
    for value in uniqueVals:
        # copy the labels so recursion on one branch does not affect the others
        subFeatLabels = featLabels[:]
        subDataSet = splitDataSet(dataSet, bestFeat, value)
        myTree[bestFeatLabel][value] = createTree(subDataSet, subFeatLabels)
    return myTree
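The returned tree is nested dicts: each inner key is a feature index, each second-level key a feature value (stored as a string), and each leaf a class label. A hypothetical fragment (indices and values made up for illustration, not actual output):

{6: {'3.06': '1',
     '2.76': {9: {'5.64': '2'}}}}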
# classify a test vector: featLabels maps a node's feature index back to a column of testVec
def classify(inputTree, featLabels, testVec):
    # the (single) key of the current dict is the feature to test
    nowNode = list(inputTree.keys())[0]
    featIndex = featLabels.index(nowNode)
    # look up the test vector's value for this feature
    keyValue = str(testVec[featIndex])  # tree keys are strings, so convert
    subTree = inputTree[nowNode][keyValue]
    if isinstance(subTree, dict):
        classLabel = classify(subTree, featLabels, testVec)
    else:
        classLabel = subTree
    return classLabel

if __name__ == '__main__':
    wine_data, featLabels = loadDateset()
    # pass a copy: createTree deletes labels as it consumes features
    myTree = createTree(wine_data, featLabels.copy())
    # mind the value types: str(.28) gives '0.28' while the file stores '.28',
    # so a lookup can miss for some features
    test = [14.23, 1.71, 2.43, 15.6, 127, 2.8, 3.06, .28, 2.29, 5.64, 1.04, 3.92, 1065]
    print(classify(myTree, featLabels, test))
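The pickle and os imports at the top are never actually used. One natural use for pickle, sketched here as an assumption rather than part of the original flow (storeTree/grabTree are hypothetical helper names), is persisting the trained tree so it does not have to be rebuilt every run:

def storeTree(inputTree, filename):
    # serialize the nested-dict tree to disk
    with open(filename, 'wb') as f:
        pickle.dump(inputTree, f)

def grabTree(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)

# e.g. storeTree(myTree, 'wine_tree.pkl'); myTree = grabTree('wine_tree.pkl')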
Calm down, and what you are looking for will come into view.