数据挖掘中决策树算法实现—

数据挖掘中决策树算法实现——Bash

博客分类：

数据挖掘决策树 bash 非递归实现标准信息熵

一、决策树简介：

关于决策树，几乎是数据挖掘分类算法中最先介绍到的。

决策树，顾名思义就是用来做决定的树，一个分支就是一个决策过程。

每个决策过程中涉及一个数据的属性，而且只涉及一个。然后递归地，贪心地直到满足决策条件(即可以得到明确的决策结果)。

决策树的实现首先要有一些先验（已经知道结果的历史）数据做训练，通过分析训练数据得到每个属性对结果的影响的大小，这里我们通过一种叫做信息增益的理论去描述它，期间也涉及到熵的概念。也可参考文章信息增益与熵.

下面我们结合实例说一下决策树实现过程中的上述关键概念：

假设我们有如下数据：

age	job	house	credit	class
1	0	0	1	0
1	0	0	2	0
1	1	0	2	1
1	1	1	1	1
1	0	0	1	0
2	0	0	1	0
2	0	0	2	0
2	1	1	2	1
2	0	1	3	1
2	0	1	3	1
3	0	1	3	1
3	0	1	2	1
3	1	0	2	1
3	1	0	3	1
3	0	0	1	0

(一)

我们首先要通过计算找到哪个属性的所有属性值能更好地表达class字段的不同。通过计算，我们发现house的属性值最能表现class字段的不同。这个衡量标准其实就是信息增益。计算方法是：首先计算全部数据的熵，然后除class之外的其他属性逐个遍历，找到熵最小的那个属性(house)，然后将全部数据的熵减去按照house属性划分数据之后的数据的熵。

这个值如果满足条件假如(>0.1)，我们认为数据应该按照这个节点进行分裂，也就是说这个属性(house)构成了我们的一次决策过程。

(二)

然后

在按照house分裂的每个数据集上，针对其他属性(house除外)进行与(一)相同的过程，直到信息增益不足以满足数据分裂的条件。

这样，我们就得到了一个关于属性数据划分的一棵树。可以作为class字段未知的数据的决策依据。

二、决策树代码实现：

具体计算代码如下：---假设上述数据我们保存为descision.dat文件，以及需要bash4.0及以上支持运行。

#!/home/admin/bin/bash_bin/bash_4
input=$1;
if [ -z $input ]; then
echo "please input the traning file";
exit 1;
fi
## pre calculate the log2 value for the later calculate operation
declare -a log2;
logi=0;
records=$(cat $input | wc -l);
for i in `awk -v n=$records 'BEGIN{for(i=1;i<n;i++) print log(i)/log(2);}'`
do
((logi+=1));
log2[$logi]=$i;
done
## function for calculating the entropy for the given distribution of the class
function getEntropy {
local input=`echo $1`;
if [[ $input == *" "* ]]; then
local current_entropy=0;
local sum=0;
local i;
for i in $input
do
((sum+=$i));
current_entropy=$(awk -v n=$i -v l=${log2[$i]} -v o=$current_entropy 'BEGIN{print n*l+o}');
done
current_entropy=$(awk -v n=$current_entropy -v b=$sum -v l=${log2[$sum]} 'BEGIN{print n/b*-1+l;}')
eval $2=$current_entropy;
else
eval $2=0;
fi
}
### the header title of the input data
declare -A header_info;
header=$(head -1 $input);
headers=(${header//,/ })
length=${#headers[@]};
for((i=0;i<length;i++))
do
attr=${headers[$i]};
header_info[$attr]=$i;
done
### the data content of the input data
data=${input}_dat;
sed -n '2,$p' $input > $data
## use an array to store the information of a descision tree
## the node structure is {child,slibling,parent,attr,attr_value,leaf,class}
## the root is a virtual node with none used attribute
## only the leaf node has class flag and the "leaf,class" is meaningfull
## the descision_tree
declare -a descision_tree;
## the root node with no child\slibing and anythings else
descision_tree[0]="0:0:0:N:N:0:0";
## use recursive algrithm to build the tree
## so we need a trace_stack to record the call level infomation
declare -a trace_stack;
## push the root node into the stack
trace_stack[0]=0;
stack_deep=1;
## begin to build the tree until the trace_stack is empty
while [ $stack_deep -ne 0 ]
do
((stack_deep-=1));
current_node_index=${trace_stack[$stack_deep]};
current_node=${descision_tree[$current_node_index]};
current_node_struct=(${current_node//:/ });
## select the current data set
## get used attr and their values
attrs=${current_node_struct[3]};
attrv=${current_node_struct[4]};
declare -a grepstra=();
if [ $attrs != "N" ];then
attr=(${attrs//,/ });
attrvs=(${attrv//,/ });
attrc=${#attr[@]};
for((i=0;i<attrc;i++))
do
a=${attr[$i]};
index=${header_info[$a]};
grepstra[$index]=${attrvs[$i]};
done
fi
for((i=0;i<length;i++))
do
if [ -z ${grepstra[$i]} ]; then
grepstra[$i]=".*";
fi
done
grepstrt=${grepstra[*]};
grepstr=${grepstrt// /,};
grep $grepstr $data > current_node_data
## calculate the entropy before split the records
entropy=0;
input=`cat current_node_data | cut -d "," -f 5 | sort | uniq -c | sed 's/^ \+//g' | cut -d " " -f 1`
getEntropy "$input" entropy;
## calculate the entropy for each of the rest attrs
## and select the min one
min_attr_entropy=1;
min_attr_name="";
min_attr_index=0;
for((i=0;i<length-1;i++))
do
## just use the rest attrs
if [[ "$attrs" != *"${headers[$i]}"* ]]; then
## calculate the entropy for the current attr
### get the different values for the headers[$i]
j=$((i+1));
cut -d "," -f $j,$length current_node_data > tmp_attr_ds
dist_values=`cut -d , -f 1 tmp_attr_ds | sort | uniq -c | sed 's/^ \+//g' | sed 's/ /,/g'`;
totle=0;
totle_entropy_attr=0;
for k in $dist_values
do
info=(${k//,/ });
((totle+=${info[0]}));
cur_class_input=`grep "^${info[1]}," tmp_attr_ds | cut -d "," -f 2 | sort | uniq -c | sed 's/^ \+//g' | cut -d " " -f 1`
cur_attr_value_entropy=0;
getEntropy "$cur_class_input" cur_attr_value_entropy;
totle_entropy_attr=$(awk -v c=${info[0]} -v e=$cur_attr_value_entropy -v o=$totle_entropy_attr 'BEGIN{print c*e+o;}');
done
attr_entropy=$(awk -v e=$totle_entropy_attr -v c=$totle 'BEGIN{print e/c;}');
if [ $(echo "$attr_entropy < $min_attr_entropy" | bc) = 1 ]; then
min_attr_entropy=$attr_entropy;
min_attr_name="${headers[$i]}";
min_attr_index=$j;
fi
fi
done
## calculate the gain between the original entropy of the current node
## and the entropy after split by the attribute which has the min_entropy
gain=$(awk -v b=$entropy -v a=$min_attr_entropy 'BEGIN{print b-a;}');
## when the gain is large than 0.1 and then put it as a branch
## and add the child nodes to the current node and push the index to the trace_stack
## otherwise make it as a leaf node and get the class flag
## and do not push trace_stack
if [ $(echo "$gain > 0.1" | bc) = 1 ]; then
### get the attribute values
attr_values_str=`cut -d , -f $min_attr_index current_node_data | sort | uniq`;
attr_values=($attr_values_str);
### generate the node and add to the tree and add their index to the trace_stack
tree_store_length=${#descision_tree[@]};
current_node_struct[0]=$tree_store_length;
parent_node_index=$current_node_index;
attr_value_c=${#attr_values[@]};
for((i=0;i<attr_value_c;i++))
do
tree_store_length=${#descision_tree[@]};
slibling=0;
if [ $i -lt $((attr_value_c-1)) ]; then
slibling=$((tree_store_length+1));
fi
new_attr="";
new_attrvalue="";
if [ $attrs != "N" ]; then
new_attr="$attrs,$min_attr_name";
new_attrvalue="$attrv,${attr_values[$i]}";
else
new_attr="$min_attr_name";
new_attrvalue="${attr_values[$i]}";
fi
new_node="0:$slibling:$parent_node_index:$new_attr:$new_attr_value:0:0";
descision_tree[$tree_store_length]="$new_node";
trace_stack[$stack_deep]=$tree_store_length;
((stack_deep+=1));
done
current_node_update=${current_node_struct[*]};
descision_tree[$current_node_index]=${current_node_update// /:};
else ## current node is a leaf node
current_node_struct[5]=1;
current_node_struct[6]=`cut -d , -f $length current_node_data | sort | uniq -c | sort -n -r | head -1 | sed 's/^ \+[^ ]* //g'`;
current_node_update=${current_node_struct[*]};
descision_tree[$current_node_index]=${current_node_update// /:};
fi
## output the descision tree after every step for split or leaf node generater
echo ${descision_tree[@]};
done

执行代码：

./descision.sh descision.dat

执行结果为：

1:0:0:N:N:0:0 0:2:0:house:0:0:0 0:0:0:house:1:0:0
1:0:0:N:N:0:0 0:2:0:house:0:0:0 0:0:0:house:1:1:1
1:0:0:N:N:0:0 3:2:0:house:0:0:0 0:0:0:house:1:1:1 0:4:1:house,job:0,0:0:0 0:0:1:house,job:0,1:0:0
1:0:0:N:N:0:0 3:2:0:house:0:0:0 0:0:0:house:1:1:1 0:4:1:house,job:0,0:0:0 0:0:1:house,job:0,1:1:1
1:0:0:N:N:0:0 3:2:0:house:0:0:0 0:0:0:house:1:1:1 0:4:1:house,job:0,0:1:0 0:0:1:house,job:0,1:1:1

输出结果中展示了决策树结构生成过程的细节，以及生成过程中树的变化过程

本代码中使用了一维数组结构来存储整棵决策树，输出的先后顺序按照数组下标输出。

输出结果中最后一行表示最终的决策树。它表示的树形结构其实是：

这样看着是不是就爽多了。

说明：

关于上述决策树结果中其实有部分存在误导：

默认根节点是放在数组的第一个位置的，即索引值为0，而子节点中的child与sibling为0时并不表示指向跟节点，而是表示无意义，即没有子节点或兄弟节点。

该决策树所代表的分类规则：

根据该决策树输出，我们挖掘的规则是这样的：

首先根据house属性判断，当house属性为1时，走到索引为2的节点，此时该节点是叶子节点，预测值class为1.

当house属性为0时，接着按照job属性来判断，当job属性为0时走到索引为3的节点，预测值class为0。如果job属性为1时走到索引值为4的节点，此时预测值class为1.

数据挖掘中决策树算法实现——Bash的更多相关文章

sklearn中决策树算法DesiciontTreeClassifier()调用以及sklearn自带的数据包sklearn.datasets.load_iris()的应用
决策树方法的简单调用记录一下 clf=tree.DecisionTreeClassifier() dataMat=[];labelMat=[] dataPath='D:/machinelearning ...
ID3和C4.5分类决策树算法 - 数据挖掘算法（7）
(2017-05-18 银河统计) 决策树(Decision Tree)是在已知各种情况发生概率的基础上,通过构成决策树来判断其可行性的决策分析方法,是直观运用概率分析的一种图解法.由于这种决策分支画 ...
《BI那点儿事》Microsoft 决策树算法
Microsoft 决策树算法是由 Microsoft SQL Server Analysis Services 提供的分类和回归算法,用于对离散和连续属性进行预测性建模.对于离散属性,该算法根据数据 ...
ID3决策树算法原理及C++实现(其中代码转自别人的博客)
分类是数据挖掘中十分重要的组成部分.分类作为一种无监督学习方式被广泛的使用. 之前关于"数据挖掘中十大经典算法"中,基于ID3核心思想的分类算法C4.5榜上有名.所以不难看出ID3 ...
决策树算法的Python实现—基于金融场景实操
决策树是最经常使用的数据挖掘算法,本次分享jacky带你深入浅出,走进决策树的世界基本概念决策树(Decision Tree) 它通过对训练样本的学习,并建立分类规则,然后依据分类规则,对新样本数 ...
吴裕雄--天生自然python机器学习：决策树算法
我们经常使用决策树处理分类问题’近来的调查表明决策树也是最经常使用的数据挖掘算法. 它之所以如此流行,一个很重要的原因就是使用者基本上不用了解机器学习算法,也不用深究它是如何工作的. K-近邻算法可 ...
scikit-learn决策树算法类库使用小结
之前对决策树的算法原理做了总结,包括决策树算法原理(上)和决策树算法原理(下).今天就从实践的角度来介绍决策树算法,主要是讲解使用scikit-learn来跑决策树算法,结果的可视化以及一些参数调参的 ...
4-Spark高级数据分析-第四章用决策树算法预测森林植被
预测是非常困难的,更别提预测未来. 4.1 回归简介随着现代机器学习和数据科学的出现,我们依旧把从“某些值”预测“另外某个值”的思想称为回归.回归是预测一个数值型数量,比如大小.收入和温度,而分类则 ...
就是要你明白机器学习系列--决策树算法之悲观剪枝算法(PEP)
前言在机器学习经典算法中,决策树算法的重要性想必大家都是知道的.不管是ID3算法还是比如C4.5算法等等,都面临一个问题,就是通过直接生成的完全决策树对于训练样本来说是“过度拟合”的,说白了是太精确 ...

随机推荐

使用 IntraWeb (13) - 基本控件之 TIWLabel、TIWLink、TIWURL、TIWURLWindow
TIWLabel // TIWLink //内部链接 TIWURL //外部链接 TIWURLWindow //页内框架, 就是 <iframe></iframe> TIWLa ...
c# 实现获取汉字十六进制Unicode编码字符串
1. 汉字转十六进制UNICODE编码字符串 /// <summary> /// //// /// </summary> /// & ...
IAR EWARM : Debugging with CMSIS-DAP
OpenOCD Debug Adapter Configuration
Correctly installing OpenOCD includes making your operating system give OpenOCD access to debug adap ...
AT91 USB Composite Driver Implementation
AT91 USB Composite Driver Implementation 1. Introduction The USB Composite Device is a general way t ...
linux下的c++filt 命令
我们知道, 在C++中, 是允许函数重载的, 也就引出了编译器的name mangling机制, 今天我们要介绍的c++filt命令便与此有关. 对于从事linux开发的人来说, 不可不知道c++fi ...
使用Oracle DBLink进行数据库之间对象的訪问操作
Oracle中自带了DBLink功能,它的作用是将多个oracle数据库逻辑上看成一个数据库,也就是说在一个数据库中能够操作还有一个数据库中的对象,比如我们新建了一个数据database1.我们须要操 ...
hdu1716排列2（stl:next_permutation+优先队列）
排列2 Time Limit: 1000/1000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/Others) Total Submis ...
android studio每次启动都要在fetching Android sdk compoment information停好久怎么解决？
网上有人给出了方案:1)进入刚安装的Android Studio目录下的bin目录.找到idea.properties文件,用文本编辑器打开.2)在idea.properties文件末尾添加一行: d ...
java容器HashMap原理
1.为什么需要HashMap 前面我们说了ArrayList和LinkedList,它们对容器内的对象都能实现增.删.改.查.遍历等操作, 并且对应不同的情况,我们可以选择不同的List,用以提高效率 ...

数据挖掘中 决策树算法实现——Bash

数据挖掘中 决策树算法实现——Bash

数据挖掘中 决策树算法实现——Bash的更多相关文章

随机推荐

热门专题

数据挖掘中决策树算法实现——Bash

数据挖掘中决策树算法实现——Bash

数据挖掘中决策树算法实现——Bash的更多相关文章