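Experimenting with a Decision Tree on the MNIST Dataset

In this Decision Tree, the gray values in MNIST are binarized to 0/1, which makes classification and entropy calculation easier.

Using a smaller set of test data, I also tested how accurate the classification is when the gray values are split into more than two categories. The results are as follows.

#Accuracy when pixel data is split into more categories than 0/1: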
#int(pixel)/50: 37%
#int(pixel)/64: 45.9%
#int(pixel)/96: 52.3%
#int(pixel)/128: 62.48%
#int(pixel)/152: 59.1%
#int(pixel)/176: 57.6%
#int(pixel)/192: 54.0%

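As shown, binarizing the gray data, i.e. the 0/1 processing with int(pixel)/128, gives the best result.

With 0/1 processing, the final results are as follows: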
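To make the divisors in the table easier to read, here is a minimal sketch of how integer division buckets a gray value. It assumes pixels lie in 0..255 (the usual MNIST range); the helper names quantize and num_levels are hypothetical and not part of the original script.

```python
def quantize(pixel, divisor):
    """Map a gray value to a bucket index by integer division."""
    return int(pixel) // divisor


def num_levels(divisor, max_pixel=255):
    """Number of distinct buckets for pixels in 0..max_pixel (assumed range)."""
    return max_pixel // divisor + 1


divisors = [50, 64, 96, 128, 152, 176, 192]
levels = {d: num_levels(d) for d in divisors}
for d in divisors:
    print("divisor %d -> %d levels" % (d, levels[d]))
```

Under that assumption, divisors 50, 64 and 96 give 6, 4 and 3 levels, while 128, 152, 176 and 192 all give two levels and differ only in where the threshold sits.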

#Result:
#Train with 10k, test with 60k: 77.79%
#Train with 60k, test with 10k: 87.3%
#Time cost: 3 hours.

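The final accuracy is 87.3%. Compared with the 95%+ achieved by SVM and KNN, the gap is considerable, and the decision tree also took longer to run.

Note that the decision tree in this experiment is not pruned, so there is still room for improvement.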
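As one possible direction, the sketch below shows pre-pruning (early stopping), which is a simpler relative of pruning an already built tree. It was not used in this experiment; the function names and the thresholds max_depth and min_samples are hypothetical and untuned.

```python
from collections import Counter


def majority_label(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def should_stop(labels, depth, max_depth=10, min_samples=5):
    """Hypothetical pre-pruning rule: stop on pure, deep, or small nodes."""
    is_pure = len(set(labels)) == 1
    return is_pure or depth >= max_depth or len(labels) < min_samples
```

A recursive builder could call should_stop at the top of each call and return majority_label(labels) as a leaf when it is true, instead of splitting until the features run out.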

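The Python code is at the bottom. In it:

calcShannonEntropy(dataSet): computes the entropy of the matrix from the class distribution of its rows, using the Shannon entropy formula;

splitDataSet(dataSet, axis, value): returns the matrix formed by all rows whose value in dimension axis equals value (with that column removed). For each value that occurs in dimension axis, compute the entropy of its splitDataSet matrix, multiply it by the probability of that value occurring, and sum the products; this gives the overall entropy after building the tree on dimension axis.

chooseBestFeatureToSplit(dataSet): using splitDataSet, compares the overall entropy after each split with the entropy of the original matrix and picks the dimension with the largest entropy reduction (the information gain). The tree is then built on this feature.

createDecisionTree(dataSet, features): builds the decision tree recursively. If a leaf node cannot be classified, majorityCnt(classList) picks the most frequent class as the result.

The code is as follows: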
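For reference, the Shannon entropy of a label list is H = -sum(p_k * log2(p_k)), where p_k is the fraction of rows with class k. The following is a small standalone sketch of that formula; shannon_entropy is a hypothetical name, and it takes a plain list of labels rather than the full rows used by calcShannonEntropy.

```python
from collections import Counter
from math import log2


def shannon_entropy(labels):
    """Entropy in bits of a non-empty list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * log2(c / total) for c in counts.values())


pure = shannon_entropy(['3', '3', '3', '3'])
balanced = shannon_entropy(['0', '0', '1', '1'])
four_way = shannon_entropy(['0', '1', '2', '3'])
```

A node with a single class has entropy 0, two equally likely classes give 1 bit, and four equally likely classes give 2 bits.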
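The split-and-compare step can be illustrated on a toy dataset. This is a sketch under the same row layout as the script (class label at index 0, features after it); the names entropy and info_gain and the four toy rows are made up for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Entropy in bits of a non-empty list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def info_gain(rows, axis):
    """Entropy of all labels minus the weighted entropy after splitting on axis."""
    base = entropy([row[0] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        groups[row[axis]].append(row[0])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Toy rows: [class, feature 1, feature 2]
rows = [
    ['0', 0, 1],
    ['0', 0, 0],
    ['1', 1, 1],
    ['1', 1, 0],
]
gains = {axis: info_gain(rows, axis) for axis in (1, 2)}
best = max(gains, key=gains.get)
```

Here feature 1 separates the two classes perfectly (gain of 1 bit), while feature 2 tells nothing about the class (gain of 0), so feature 1 is chosen.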
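The tree built this way is a nested dict: each internal node maps a feature label to a dict of {value: subtree or class}. The sketch below walks such a tree iteratively and shows the majority vote used at unsplittable leaves. The toy tree, the function names, and the '-1' default for unseen values are illustrative assumptions.

```python
from collections import Counter


def majority_label(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def walk(tree, feature_labels, row, default='-1'):
    """Follow the branches of a nested-dict tree until a leaf is reached."""
    while isinstance(tree, dict):
        feature = next(iter(tree))  # the single key of this node
        value = row[feature_labels.index(feature)]
        tree = tree[feature].get(value, default)
    return tree


# Toy tree over labels '#0' (class column), '#1' and '#2'.
toy_tree = {'#2': {0: '7', 1: {'#1': {0: '1', 1: '4'}}}}
feature_labels = ['#0', '#1', '#2']
first = walk(toy_tree, feature_labels, ['?', 1, 1])
second = walk(toy_tree, feature_labels, ['?', 0, 0])
unseen = walk(toy_tree, feature_labels, ['?', 0, 5])
```

The first row follows '#2' = 1 and then '#1' = 1 to reach class '4'; the second stops at '#2' = 0 with class '7'; a value with no branch falls back to the default.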

#Decision tree for the MNIST dataset, by arthur503.
#Data format: 'class label1:pixel label2:pixel ...'
#Warning: the tree is not pruned, so it may overfit!
#
#Accuracy when pixel data is split into more categories than 0/1:
#int(pixel)/50: 37%
#int(pixel)/64: 45.9%
#int(pixel)/96: 52.3%
#int(pixel)/128: 62.48%
#int(pixel)/152: 59.1%
#int(pixel)/176: 57.6%
#int(pixel)/192: 54.0%
#
#Result:
#Train with 10k, test with 60k: 77.79%
#Train with 60k, test with 10k: 87.3%
#Time cost: 3 hours.
from numpy import log2, shape
import operator

#Shannon entropy of the class labels (item 0 of every row).
def calcShannonEntropy(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featureVec in dataSet:
        currentLabel = featureVec[0]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 1
        else:
            labelCounts[currentLabel] += 1
    shannonEntropy = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEntropy -= prob * log2(prob)
    return shannonEntropy
#Get all rows whose item at index axis equals value, with that column removed.
def splitDataSet(dataSet, axis, value):
    subDataSet = []
    for featureVec in dataSet:
        if featureVec[axis] == value:
            #The slicing assumes axis >= 0; axis == -1 would build a wrong row.
            reducedFeatureVec = featureVec[:axis]
            reducedFeatureVec.extend(featureVec[axis+1:])
            subDataSet.append(reducedFeatureVec)
    return subDataSet
#Return the column index whose split gives the largest information gain.
def chooseBestFeatureToSplit(dataSet):
    #Note: index 0 of each row is the class label, not a feature.
    numFeatures = len(dataSet[0])
    baseEntropy = calcShannonEntropy(dataSet)
    bestInfoGain = 0.0
    #Default to the last column. Do NOT use -1, or splitDataSet(dataSet, -1, value) breaks.
    bestFeature = numFeatures - 1
    #Feature indices start at 1 (not 0).
    for i in range(1, numFeatures):
        featureList = [example[i] for example in dataSet]
        featureSet = set(featureList)
        newEntropy = 0.0
        for value in featureSet:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEntropy(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
#Pick the most frequent class; used at leaves that cannot be split further.
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
#Create the decision tree recursively.
def createDecisionTree(dataSet, features):
    print('create decision tree... length of features is:'+str(len(features)))
    classList = [example[0] for example in dataSet]
    #All rows share one class: return it as a leaf.
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    #Only the class column is left: fall back to a majority vote.
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeatureIndex = chooseBestFeatureToSplit(dataSet)
    bestFeatureLabel = features[bestFeatureIndex]
    myTree = {bestFeatureLabel:{}}
    #Note: this removes the label from the list passed in by the caller.
    del(features[bestFeatureIndex])
    featureValues = [example[bestFeatureIndex] for example in dataSet]
    featureSet = set(featureValues)
    for value in featureSet:
        subFeatures = features[:]
        myTree[bestFeatureLabel][value] = createDecisionTree(splitDataSet(dataSet, bestFeatureIndex, value), subFeatures)
    return myTree
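#Parse one line into [class, pixel1, pixel2, ...]; assumes every pixel is listed in order.
def line2Mat(line):
    mat = line.strip().split(' ')
    for i in range(len(mat)-1):
        pixel = mat[i+1].split(':')[1]
        #Change MNIST pixel data into 0/1 format (integer division).
        mat[i+1] = int(pixel)//128
    return mat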
#Return the matrix as a list of rows (instead of a numpy matrix).
#The features are the 28*28 pixels of the MNIST dataset.
def file2Mat(fileName):
    matrix = []
    with open(fileName) as f:
        for line in f:
            matrix.append(line2Mat(line))
    print('Read file '+str(fileName) + ' to array done! Matrix shape:'+str(shape(matrix)))
    return matrix
#Classify one test vector; returns '-1' if the tree has no branch for its value.
def classify(inputTree, featureLabels, testVec):
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featureIndex = featureLabels.index(firstStr)
    predictClass = '-1'
    for key in secondDict.keys():
        if testVec[featureIndex] == key:
            if isinstance(secondDict[key], dict):
                predictClass = classify(secondDict[key], featureLabels, testVec)
            else:
                predictClass = secondDict[key]
    return predictClass
#Classify every row of the test set and return the accuracy.
def classifyTestFile(inputTree, featureLabels, testDataSet):
    rightCnt = 0
    for i in range(len(testDataSet)):
        classLabel = testDataSet[i][0]
        predictClassLabel = classify(inputTree, featureLabels, testDataSet[i])
        if classLabel == predictClassLabel:
            rightCnt += 1
        if i % 200 == 0:
            print('num '+str(i)+'. ratio: ' + str(float(rightCnt)/(i+1)))
    return float(rightCnt)/len(testDataSet)
#Build the labels '#0', '#1', ... for each column ('#0' is the class column).
def getFeatureLabels(length):
    strs = []
    for i in range(length):
        strs.append('#'+str(i))
    return strs
#Normal file
trainFile = 'train_60k.txt'
testFile = 'test_10k.txt'
#Scaled file
#trainFile = 'train_60k_scale.txt'
#testFile = 'test_10k_scale.txt'
#Test file
#trainFile = 'test_only_1.txt'
#testFile = 'test_only_2.txt'
#Train the decision tree.
dataSet = file2Mat(trainFile)
#Note: item 0 is the class, not a feature label.
featureLabels = getFeatureLabels(len(dataSet[0]))
print('begin to create decision tree...')
myTree = createDecisionTree(dataSet, featureLabels)
print('create decision tree done.')
#Predict with the decision tree.
testDataSet = file2Mat(testFile)
featureLabels = getFeatureLabels(len(testDataSet[0]))
rightRatio = classifyTestFile(myTree, featureLabels, testDataSet)
print('total right ratio: ' + str(rightRatio))
