Data Mining: The ID3 Decision Tree Algorithm (C# Implementation)

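A decision tree is a classic classifier. It works much like a guessing game. Suppose we are trying to guess an animal:

Q: Is this animal a land animal?

A: Yes.

Q: Does this animal have gills?

A: No.

These two questions are asked in a poor order: land animals generally do not have gills (as far as I recall; corrections are welcome), so once the first answer is known the second question tells us almost nothing. In this kind of game the order of the questions matters, and each question should be chosen to yield as much information as possible.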

Class-labeled training tuples from the AllElectronics customer database
RID	age	income	student	credit_rating	Class: buys_computer
1	youth	high	no	fair	no
2	youth	high	no	excellent	no
3	middle_aged	high	no	fair	yes
4	senior	medium	no	fair	yes
5	senior	low	yes	fair	yes
6	senior	low	yes	excellent	no
7	middle_aged	low	yes	excellent	yes
8	youth	medium	no	fair	no
9	youth	low	yes	fair	yes
10	senior	medium	yes	fair	yes
11	youth	medium	yes	excellent	yes
12	middle_aged	medium	no	excellent	yes
13	middle_aged	high	yes	fair	yes
14	senior	medium	no	excellent	no

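Take the class-labeled training tuples from the AllElectronics customer database as an example. We want to use these samples as the training set for a decision tree model, and from it mine the decision pattern for whether a customer will buy a computer.

In the ID3 algorithm, the expected information still needed to classify a tuple after partitioning D on attribute A, where A has v distinct values and produces the partitions D_1, ..., D_v, is:

$$Info_A(D) = \sum_{j=1}^v\frac{|D_j|}{|D|} \times Info(D_j)$$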

The information gain is computed as:

$$Gain(A) = Info(D) - Info_A(D)$$
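
The formulas above use Info(D) without writing it out. For reference, it is the usual entropy of the class distribution in D. The symbols m (the number of classes) and p_i (the fraction of tuples in D that belong to class i) are introduced here for illustration and do not appear in the original text:

$$Info(D) = -\sum_{i=1}^{m} p_i log_2 p_i$$

By convention a term with p_i = 0 contributes 0 to the sum.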

Applying these formulas to the class column, which contains 5 "no" tuples and 9 "yes" tuples, the expected information is:

$$Info(D)=-\frac{9}{14}log_2\frac{9}{14}-\frac{5}{14}log_2\frac{5}{14}=0.940$$
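
The calculation above can be checked with a few lines of C#. This is a minimal sketch, not part of the original implementation; the class name `EntropySketch` and the method `Info` are hypothetical names chosen for illustration.

```csharp
using System;
using System.Linq;

public static class EntropySketch
{
    // Info(D) = -sum_i p_i * log2(p_i); classes with a zero count are skipped,
    // which matches the convention that 0 * log2(0) is 0.
    public static double Info(params int[] classCounts)
    {
        double total = classCounts.Sum();
        return classCounts
            .Where(c => c > 0)
            .Select(c => c / total)
            .Sum(p => -p * Math.Log(p, 2.0));
    }

    public static void Main()
    {
        // 9 "yes" tuples and 5 "no" tuples, as counted from the table.
        Console.WriteLine(Info(9, 5)); // about 0.940
    }
}
```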

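First, compute the expected information after splitting on the attribute age. The three terms correspond to youth (2 yes, 3 no), middle_aged (4 yes, 0 no) and senior (3 yes, 2 no); the term 0·log2(0) is taken to be 0:

$$Info_{age}(D)=\frac{5}{14} \times (-\frac{2}{5}log_2\frac{2}{5} - \frac{3}{5}log_2\frac{3}{5})+\frac{4}{14} \times (-\frac{4}{4}log_2\frac{4}{4} - \frac{0}{4}log_2\frac{0}{4})+\frac{5}{14} \times (-\frac{3}{5}log_2\frac{3}{5} - \frac{2}{5}log_2\frac{2}{5})=0.694$$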

Therefore, if we partition on age, the information gain is:

$$Gain(age) = Info(D)-Info_{age}(D) = 0.940-0.694=0.246$$
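
The weighted sum for age and the resulting gain can be sketched in the same way. The names `GainSketch`, `Info` and `InfoAfterSplit` are hypothetical, and the per-partition (yes, no) counts are read off the table by hand.

```csharp
using System;
using System.Linq;

public static class GainSketch
{
    // Entropy of a class distribution given as counts; zero counts are skipped.
    public static double Info(params int[] classCounts)
    {
        double total = classCounts.Sum();
        return classCounts.Where(c => c > 0)
                          .Select(c => c / total)
                          .Sum(p => -p * Math.Log(p, 2.0));
    }

    // Info_A(D): entropy of each partition, weighted by the partition's share of the tuples.
    public static double InfoAfterSplit(int[][] partitions)
    {
        double total = partitions.Sum(p => p.Sum());
        return partitions.Sum(p => p.Sum() / total * Info(p));
    }

    public static void Main()
    {
        // (yes, no) counts for age = youth, middle_aged, senior.
        var age = new[] { new[] { 2, 3 }, new[] { 4, 0 }, new[] { 3, 2 } };
        double infoD = Info(9, 5);
        double infoAge = InfoAfterSplit(age);
        Console.WriteLine("Info_age(D) = {0}", infoAge);
        Console.WriteLine("Gain(age)   = {0}", infoD - infoAge);
    }
}
```

At full precision the gain comes out at roughly 0.2467; the 0.246 in the text results from subtracting the two rounded values 0.940 and 0.694.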

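Compute the information gain of splitting on income, student and credit_rating in the same way, then choose the attribute with the largest gain and split the current node on its values. Applying this procedure recursively produces the decision tree. For more details, see: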

https://en.wikipedia.org/wiki/Decision_tree
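
The attribute selection step described above can be sketched directly on the table. This is an illustrative stand-alone sketch, separate from the implementation that follows; `BestSplitSketch`, `Info`, `Gain` and `BestColumn` are hypothetical names.

```csharp
using System;
using System.Linq;

public static class BestSplitSketch
{
    static readonly string[] Names = { "age", "income", "student", "credit_rating" };

    // The 14 training tuples; the last field of each row is the class label.
    static readonly string[][] Rows =
    {
        new[] { "youth", "high", "no", "fair", "no" },
        new[] { "youth", "high", "no", "excellent", "no" },
        new[] { "middle_aged", "high", "no", "fair", "yes" },
        new[] { "senior", "medium", "no", "fair", "yes" },
        new[] { "senior", "low", "yes", "fair", "yes" },
        new[] { "senior", "low", "yes", "excellent", "no" },
        new[] { "middle_aged", "low", "yes", "excellent", "yes" },
        new[] { "youth", "medium", "no", "fair", "no" },
        new[] { "youth", "low", "yes", "fair", "yes" },
        new[] { "senior", "medium", "yes", "fair", "yes" },
        new[] { "youth", "medium", "yes", "excellent", "yes" },
        new[] { "middle_aged", "medium", "no", "excellent", "yes" },
        new[] { "middle_aged", "high", "yes", "fair", "yes" },
        new[] { "senior", "medium", "no", "excellent", "no" }
    };

    // Entropy of the class labels (last column) of the given rows.
    public static double Info(string[][] rows)
    {
        double total = rows.Length;
        return rows.GroupBy(r => r[r.Length - 1])
                   .Select(g => g.Count() / total)
                   .Sum(p => -p * Math.Log(p, 2.0));
    }

    // Gain(A) = Info(D) - Info_A(D) for the attribute in column col.
    public static double Gain(int col)
    {
        double infoA = Rows.GroupBy(r => r[col])
                           .Sum(g => g.Count() / (double)Rows.Length * Info(g.ToArray()));
        return Info(Rows) - infoA;
    }

    // Index of the attribute column with the largest information gain.
    public static int BestColumn()
    {
        return Enumerable.Range(0, Names.Length).OrderByDescending(c => Gain(c)).First();
    }

    public static void Main()
    {
        for (int c = 0; c < Names.Length; c++)
            Console.WriteLine("Gain({0}) = {1}", Names[c], Gain(c));
        Console.WriteLine("Best attribute: " + Names[BestColumn()]);
    }
}
```

On this table age has the largest gain, so it becomes the first split; the gains for income, student and credit_rating come out at roughly 0.029, 0.152 and 0.048.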

The C# implementation is as follows:

 using System;
 using System.Collections.Generic;
 using System.Linq;

 namespace MachineLearning.DecisionTree
 {
     public class DecisionTreeID3<T> where T : IEquatable<T>
     {
         T[,] Data;           // training tuples, one row per tuple
         string[] Names;      // column names, used when printing the tree
         int Category;        // index of the class column
         T[] CategoryLabels;  // all possible class labels
         DecisionTreeNode<T> Root;

         public DecisionTreeID3(T[,] data, string[] names, T[] categoryLabels)
         {
             Data = data;
             Names = names;
             Category = data.GetLength(1) - 1; // the class column must be the last column
             CategoryLabels = categoryLabels;
         }

         public void Learn()
         {
             int nRows = Data.GetLength(0);
             int nCols = Data.GetLength(1);
             int[] rows = new int[nRows];
             int[] cols = new int[nCols];
             for (int i = 0; i < nRows; i++) rows[i] = i;
             for (int i = 0; i < nCols; i++) cols[i] = i;
             Root = new DecisionTreeNode<T>(-1, default(T)); // Label -1 marks the artificial root
             Learn(rows, cols, Root);
             DisplayNode(Root);
         }

         public void DisplayNode(DecisionTreeNode<T> Node, int depth = 0)
         {
             // The root (Label == -1) is not printed. The indent width per level
             // is only a display choice and does not affect the tree.
             if (Node.Label != -1)
                 Console.WriteLine("{0} {1}: {2}", new string('-', depth * 3), Names[Node.Label], Node.Value);
             foreach (var item in Node.Children)
                 DisplayNode(item, depth + 1);
         }

         private void Learn(int[] pnRows, int[] pnCols, DecisionTreeNode<T> Root)
         {
             var categoryValues = GetAttribute(Data, Category, pnRows);
             var categoryCount = categoryValues.Distinct().Count();
             if (categoryCount == 1)
             {
                 // All remaining tuples share one class: add a leaf.
                 var node = new DecisionTreeNode<T>(Category, categoryValues.First());
                 Root.Children.Add(node);
             }
             else
             {
                 if (pnRows.Length == 0) return;
                 else if (pnCols.Length == 1)
                 {
                     // Only the class column is left: decide by majority vote,
                     // i.e. take the most frequent class label.
                     var Vote = categoryValues.GroupBy(i => i).OrderByDescending(i => i.Count()).First();
                     var node = new DecisionTreeNode<T>(Category, Vote.First());
                     Root.Children.Add(node);
                 }
                 else
                 {
                     // Despite its name, MaxEntropy returns the column with the largest gain.
                     var maxCol = MaxEntropy(pnRows, pnCols);
                     var attributes = GetAttribute(Data, maxCol, pnRows).Distinct();
                     foreach (var attr in attributes)
                     {
                         int[] rows = pnRows.Where(irow => Data[irow, maxCol].Equals(attr)).ToArray();
                         int[] cols = pnCols.Where(i => i != maxCol).ToArray();
                         var node = new DecisionTreeNode<T>(maxCol, attr);
                         Root.Children.Add(node);
                         Learn(rows, cols, node); // recursively build the subtree
                     }
                 }
             }
         }

         // Info_A(D) for the attribute in column attrCol, restricted to pnRows.
         public double AttributeInfo(int attrCol, int[] pnRows)
         {
             var tuples = AttributeCount(attrCol, pnRows);
             var sum = (double)pnRows.Length;
             double Entropy = 0.0;
             foreach (var tuple in tuples)
             {
                 // Count the class labels among the rows having this attribute value.
                 int[] count = new int[CategoryLabels.Length];
                 foreach (var irow in pnRows)
                     if (Data[irow, attrCol].Equals(tuple.Item1))
                     {
                         int index = Array.IndexOf(CategoryLabels, Data[irow, Category]);
                         count[index]++; // currently the class column must be the last column
                     }
                 double k = 0.0;
                 for (int i = 0; i < count.Length; i++)
                 {
                     double frequency = count[i] / (double)tuple.Item2;
                     double t = -frequency * Log2(frequency);
                     k += t;
                 }
                 double freq = tuple.Item2 / sum;
                 Entropy += freq * k;
             }
             return Entropy;
         }

         // Info(D) for the class column, restricted to pnRows.
         public double CategoryInfo(int[] pnRows)
         {
             var tuples = AttributeCount(Category, pnRows);
             var sum = (double)pnRows.Length;
             double Entropy = 0.0;
             foreach (var tuple in tuples)
             {
                 double frequency = tuple.Item2 / sum;
                 double t = -frequency * Log2(frequency);
                 Entropy += t;
             }
             return Entropy;
         }

         private static IEnumerable<T> GetAttribute(T[,] data, int col, int[] pnRows)
         {
             foreach (var irow in pnRows)
                 yield return data[irow, col];
         }

         // log2 with the convention log2(0) = 0, so that 0 * log2(0) contributes 0.
         private static double Log2(double x)
         {
             return x == 0.0 ? 0.0 : Math.Log(x, 2.0);
         }

         public int MaxEntropy(int[] pnRows, int[] pnCols)
         {
             double cateEntropy = CategoryInfo(pnRows);
             int maxAttr = 0;
             double max = double.MinValue;
             foreach (var icol in pnCols)
                 if (icol != Category)
                 {
                     double Gain = cateEntropy - AttributeInfo(icol, pnRows);
                     if (max < Gain)
                     {
                         max = Gain;
                         maxAttr = icol;
                     }
                 }
             return maxAttr;
         }

         // Distinct values of a column together with their occurrence counts.
         public IEnumerable<Tuple<T, int>> AttributeCount(int col, int[] pnRows)
         {
             var tuples = from n in GetAttribute(Data, col, pnRows)
                          group n by n into i
                          select Tuple.Create(i.First(), i.Count());
             return tuples;
         }
     }
 }

The decision tree node is defined as follows:

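 using System.Collections.Generic;

 namespace MachineLearning.DecisionTree
 {
     public sealed class DecisionTreeNode<T>
     {
         // Column index this node refers to: an attribute column for a branch,
         // the class column for a leaf, and -1 for the root.
         public int Label { get; set; }

         // Attribute value on the branch leading to this node, or the class label for a leaf.
         public T Value { get; set; }

         public List<DecisionTreeNode<T>> Children { get; set; }

         public DecisionTreeNode(int label, T value)
         {
             Label = label;
             Value = value;
             Children = new List<DecisionTreeNode<T>>();
         }
     }
 }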
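
The original code only builds and prints the tree. As an illustration of how this node structure could be traversed to classify a new tuple, here is a minimal sketch. It is an assumption-laden example rather than part of the author's implementation: `Node<T>` is a stand-in for `DecisionTreeNode<T>` so that the sketch compiles on its own, `TryClassify` is a hypothetical helper, and the sample tree is built by hand to cover only the youth and middle_aged branches.

```csharp
using System;
using System.Collections.Generic;

// Minimal stand-in for DecisionTreeNode<T>, repeated so the sketch is self-contained.
public sealed class Node<T>
{
    public int Label { get; }
    public T Value { get; }
    public List<Node<T>> Children { get; } = new List<Node<T>>();
    public Node(int label, T value) { Label = label; Value = value; }
}

public static class ClassifySketch
{
    // Hypothetical helper: follow the child whose (column, value) pair matches the
    // tuple until a child labeled with the class column (a leaf) is reached.
    public static bool TryClassify<T>(Node<T> node, T[] tuple, int categoryCol, out T label)
        where T : IEquatable<T>
    {
        foreach (var child in node.Children)
        {
            if (child.Label == categoryCol) { label = child.Value; return true; }
            if (tuple[child.Label].Equals(child.Value))
                return TryClassify(child, tuple, categoryCol, out label);
        }
        label = default(T);
        return false; // no branch on this path matches the tuple's attribute value
    }

    // Hand-built fragment: age=youth splits on student, age=middle_aged is always "yes".
    // Columns: 0 = age, 2 = student, 4 = class.
    public static Node<string> BuildSampleTree()
    {
        var root = new Node<string>(-1, null);
        var youth = new Node<string>(0, "youth");
        var studentNo = new Node<string>(2, "no");
        var studentYes = new Node<string>(2, "yes");
        studentNo.Children.Add(new Node<string>(4, "no"));
        studentYes.Children.Add(new Node<string>(4, "yes"));
        youth.Children.Add(studentNo);
        youth.Children.Add(studentYes);
        var middle = new Node<string>(0, "middle_aged");
        middle.Children.Add(new Node<string>(4, "yes"));
        root.Children.Add(youth);
        root.Children.Add(middle);
        return root;
    }

    public static void Main()
    {
        string label;
        var query = new[] { "youth", "medium", "yes", "fair" };
        bool found = TryClassify(BuildSampleTree(), query, 4, out label);
        Console.WriteLine(found ? label : "(no matching branch)");
    }
}
```

Returning `false` for an unseen attribute value is one possible design choice; falling back to the majority class of the nearest ancestor would be another.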

The classifier is called as follows:

 using System;
 using MachineLearning.DecisionTree;

 namespace MachineLearning
 {
     class Program
     {
         static void Main(string[] args)
         {
             var da = new string[,]
             {
                 {"youth","high","no","fair","no"},
                 {"youth","high","no","excellent","no"},
                 {"middle_aged","high","no","fair","yes"},
                 {"senior","medium","no","fair","yes"},
                 {"senior","low","yes","fair","yes"},
                 {"senior","low","yes","excellent","no"},
                 {"middle_aged","low","yes","excellent","yes"},
                 {"youth","medium","no","fair","no"},
                 {"youth","low","yes","fair","yes"},
                 {"senior","medium","yes","fair","yes"},
                 {"youth","medium","yes","excellent","yes"},
                 {"middle_aged","medium","no","excellent","yes"},
                 {"middle_aged","high","yes","fair","yes"},
                 {"senior","medium","no","excellent","no"}
             };
             var names = new string[] { "age", "income", "student", "credit_rating", "Class: buys_computer" };
             var tree = new DecisionTreeID3<string>(da, names, new string[] { "yes", "no" });
             tree.Learn();
             Console.ReadKey();
         }
     }
 }

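Run result: the program prints the learned tree to the console, one node per line, with the number of leading dashes indicating the depth of the node. The first split is on age, the attribute with the largest information gain.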

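Note: the author is still learning and has limited ability; corrections of any errors or omissions are welcome. Please credit the author when reposting.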
