Data Mining: The ID3 Decision Tree Algorithm (C# Implementation)
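The decision tree is a classic classifier. It works a bit like a guessing game. Suppose we are guessing an animal:
Q: Is this animal a land animal?
A: Yes.
Q: Does this animal have gills?
A: No.
These two questions are asked in a poor order: land animals generally have no gills (as far as I remember; corrections are welcome), so the second answer is almost predictable from the first and tells us very little. In this kind of game the order of the questions matters, and each question should gain as much information as possible.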
Class-labeled training tuples from the AllElectronics customer database

| RID | age | income | student | credit_rating | Class: buys_computer |
|---|---|---|---|---|---|
| 1 | youth | high | no | fair | no |
| 2 | youth | high | no | excellent | no |
| 3 | middle_aged | high | no | fair | yes |
| 4 | senior | medium | no | fair | yes |
| 5 | senior | low | yes | fair | yes |
| 6 | senior | low | yes | excellent | no |
| 7 | middle_aged | low | yes | excellent | yes |
| 8 | youth | medium | no | fair | no |
| 9 | youth | low | yes | fair | yes |
| 10 | senior | medium | yes | fair | yes |
| 11 | youth | medium | yes | excellent | yes |
| 12 | middle_aged | medium | no | excellent | yes |
| 13 | middle_aged | high | yes | fair | yes |
| 14 | senior | medium | no | excellent | no |
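Take the class-labeled training tuples from the AllElectronics customer database as an example. We want to use these samples as the training set for a decision tree model that captures the pattern behind whether a customer buys a computer.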
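In ID3, the expected information (entropy) needed to classify a tuple in D is given below, where p_i is the fraction of tuples in D that belong to class i:
$$Info(D) = -\sum_{i=1}^m p_i \log_2 p_i$$
After D is partitioned on attribute A into subsets D_1, ..., D_v, the expected information still required is:
$$Info_A(D) = \sum_{j=1}^v\frac{|D_j|}{|D|} \times Info(D_j)$$
The information gain is then: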
$$Gain(A) = Info(D) - Info_A(D)$$
In this data set the class attribute has 9 "yes" tuples and 5 "no" tuples, so the expected information is:
$$Info(D)=-\frac{9}{14}\log_2\frac{9}{14}-\frac{5}{14}\log_2\frac{5}{14}=0.940$$
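The 0.940 figure can be checked with a few lines of C#. This is a minimal sketch, separate from the implementation further down: the names `EntropyDemo` and `Entropy` are made up for this illustration, and the input is simply the number of tuples in each class.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class EntropyDemo
{
    // Info(D) = -sum(p_i * log2(p_i)), where p_i = count_i / total.
    // An empty class contributes nothing (0 * log2(0) is taken as 0).
    public static double Entropy(params int[] classCounts)
    {
        double total = classCounts.Sum();
        double info = 0.0;
        foreach (int count in classCounts)
        {
            if (count == 0) continue;
            double p = count / total;
            info -= p * Math.Log(p, 2.0);
        }
        return info;
    }

    public static void Main()
    {
        // 9 "yes" and 5 "no" tuples in the AllElectronics table
        double info = Entropy(9, 5);
        Console.WriteLine(info.ToString("F3", CultureInfo.InvariantCulture)); // prints 0.940
    }
}
```

A pure subset such as `Entropy(4, 0)` returns 0, which is why a branch whose tuples all share one class needs no further questions.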
First compute the expected information for the attribute age, taking 0 log 0 as 0:
$$Info_{age}(D)=\frac{5}{14} \times (-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5})+\frac{4}{14} \times (-\frac{4}{4}\log_2\frac{4}{4} - \frac{0}{4}\log_2\frac{0}{4})+\frac{5}{14} \times (-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5})=0.694$$
So, if we split on age, the information gain is:
$$Gain(age) = Info(D)-Info_{age}(D) = 0.940-0.694=0.246$$
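For readers who want to verify the 0.694 used here, the three age branches can be evaluated one at a time. The youth and senior branches have the same 2/3 class split, and the middle_aged branch is pure, so it contributes nothing:

$$-\frac{2}{5}\log_2\frac{2}{5}-\frac{3}{5}\log_2\frac{3}{5}=0.971$$
$$Info_{age}(D)=\frac{5}{14}\times 0.971+\frac{4}{14}\times 0+\frac{5}{14}\times 0.971=0.694$$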
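Compute the information gain of splitting on income, student and credit_rating in the same way, pick the attribute with the largest information gain, and split the current node on the values of that attribute. Applying this step recursively produces the decision tree. For more details, see: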
https://en.wikipedia.org/wiki/Decision_tree
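As a worked check of this selection step, the same arithmetic can be applied to the remaining three attributes of the table above. The values below were computed by hand from the 14 tuples, with intermediate results rounded to three decimals, so the last digit may differ slightly from other sources:

$$Gain(income) = 0.940 - 0.911 = 0.029$$
$$Gain(student) = 0.940 - 0.788 = 0.152$$
$$Gain(credit\_rating) = 0.940 - 0.892 = 0.048$$

Since Gain(age) = 0.246 is the largest of the four, age is the attribute chosen at the root.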
The C# implementation is as follows:
using System;
using System.Collections.Generic;
using System.Linq;

namespace MachineLearning.DecisionTree
{
    public class DecisionTreeID3<T> where T : IEquatable<T>
    {
        T[,] Data;                  // training tuples, one row per sample
        string[] Names;             // column names
        int Category;               // index of the class column
        T[] CategoryLabels;         // all possible class labels
        DecisionTreeNode<T> Root;

        public DecisionTreeID3(T[,] data, string[] names, T[] categoryLabels)
        {
            Data = data;
            Names = names;
            Category = data.GetLength(1) - 1; // the class column must be the last column
            CategoryLabels = categoryLabels;
        }

        public void Learn()
        {
            int nRows = Data.GetLength(0);
            int nCols = Data.GetLength(1);
            int[] rows = new int[nRows];
            int[] cols = new int[nCols];
            for (int i = 0; i < nRows; i++) rows[i] = i;
            for (int i = 0; i < nCols; i++) cols[i] = i;
            Root = new DecisionTreeNode<T>(-1, default(T)); // label -1 marks the artificial root
            Learn(rows, cols, Root);
            DisplayNode(Root);
        }

        public void DisplayNode(DecisionTreeNode<T> Node, int depth = 0)
        {
            if (Node.Label != -1) // skip the artificial root
                Console.WriteLine("{0} {1}: {2}", new string('-', depth * 2), Names[Node.Label], Node.Value); // indent width is arbitrary
            foreach (var item in Node.Children)
                DisplayNode(item, depth + 1);
        }

        private void Learn(int[] pnRows, int[] pnCols, DecisionTreeNode<T> parent)
        {
            var categoryValues = GetAttribute(Data, Category, pnRows);
            var categoryCount = categoryValues.Distinct().Count();
            if (categoryCount == 1)
            {
                // all remaining tuples share one class: emit a leaf
                var node = new DecisionTreeNode<T>(Category, categoryValues.First());
                parent.Children.Add(node);
            }
            else
            {
                if (pnRows.Length == 0) return;
                else if (!pnCols.Any(c => c != Category))
                {
                    // pnCols still contains the class column, so test for remaining attribute columns;
                    // none left to split on: decide by majority vote
                    var Vote = categoryValues.GroupBy(i => i).OrderByDescending(i => i.Count()).First();
                    var node = new DecisionTreeNode<T>(Category, Vote.First());
                    parent.Children.Add(node);
                }
                else
                {
                    var maxCol = MaxEntropy(pnRows, pnCols);
                    var attributes = GetAttribute(Data, maxCol, pnRows).Distinct();
                    foreach (var attr in attributes)
                    {
                        int[] rows = pnRows.Where(irow => Data[irow, maxCol].Equals(attr)).ToArray();
                        int[] cols = pnCols.Where(i => i != maxCol).ToArray();
                        var node = new DecisionTreeNode<T>(maxCol, attr);
                        parent.Children.Add(node);
                        Learn(rows, cols, node); // recurse to grow the subtree
                    }
                }
            }
        }

        // Info_A(D): expected information after splitting the given rows on column attrCol
        public double AttributeInfo(int attrCol, int[] pnRows)
        {
            var tuples = AttributeCount(attrCol, pnRows);
            var sum = (double)pnRows.Length;
            double Entropy = 0.0;
            foreach (var tuple in tuples)
            {
                // class distribution among the rows where the attribute equals tuple.Item1
                int[] count = new int[CategoryLabels.Length];
                foreach (var irow in pnRows)
                    if (Data[irow, attrCol].Equals(tuple.Item1))
                    {
                        int index = Array.IndexOf(CategoryLabels, Data[irow, Category]);
                        count[index]++; // currently the class column must be the last column
                    }
                double k = 0.0;
                for (int i = 0; i < count.Length; i++)
                {
                    double frequency = count[i] / (double)tuple.Item2;
                    double t = -frequency * Log2(frequency);
                    k += t;
                }
                double freq = tuple.Item2 / sum; // weight |D_j| / |D|
                Entropy += freq * k;
            }
            return Entropy;
        }

        // Info(D): entropy of the class column over the given rows
        public double CategoryInfo(int[] pnRows)
        {
            var tuples = AttributeCount(Category, pnRows);
            var sum = (double)pnRows.Length;
            double Entropy = 0.0;
            foreach (var tuple in tuples)
            {
                double frequency = tuple.Item2 / sum;
                double t = -frequency * Log2(frequency);
                Entropy += t;
            }
            return Entropy;
        }

        private static IEnumerable<T> GetAttribute(T[,] data, int col, int[] pnRows)
        {
            foreach (var irow in pnRows)
                yield return data[irow, col];
        }

        // log2 with the convention 0 * log2(0) = 0
        private static double Log2(double x)
        {
            return x == 0.0 ? 0.0 : Math.Log(x, 2.0);
        }

        // Returns the attribute column with the largest information gain
        public int MaxEntropy(int[] pnRows, int[] pnCols)
        {
            double cateEntropy = CategoryInfo(pnRows);
            int maxAttr = 0;
            double max = double.MinValue;
            foreach (var icol in pnCols)
                if (icol != Category)
                {
                    double Gain = cateEntropy - AttributeInfo(icol, pnRows);
                    if (max < Gain)
                    {
                        max = Gain;
                        maxAttr = icol;
                    }
                }
            return maxAttr;
        }

        // Distinct values of a column paired with their occurrence counts
        public IEnumerable<Tuple<T, int>> AttributeCount(int col, int[] pnRows)
        {
            var tuples = from n in GetAttribute(Data, col, pnRows)
                         group n by n into i
                         select Tuple.Create(i.First(), i.Count());
            return tuples;
        }
    }
}
The decision tree node class:
using System.Collections.Generic;

namespace MachineLearning.DecisionTree
{
    public sealed class DecisionTreeNode<T>
    {
        public int Label { get; set; }
        public T Value { get; set; }
        public List<DecisionTreeNode<T>> Children { get; set; }

        public DecisionTreeNode(int label, T value)
        {
            Label = label;
            Value = value;
            Children = new List<DecisionTreeNode<T>>();
        }
    }
}
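The code in this article only builds and prints the tree. As a rough sketch of how a tree with this node layout could classify a new tuple, the snippet below walks from the root along the branch whose value matches the tuple until it reaches a leaf. Everything here is an assumption for illustration: `ClassifyDemo`, `Classify`, `BuildExampleTree` and the `Add` helper are hypothetical names, the node class is a trimmed copy so that the snippet compiles on its own, and the tree is built by hand to mirror the splits ID3 selects on the AllElectronics data.

```csharp
using System;
using System.Collections.Generic;

// Trimmed copy of the node type, plus a hypothetical Add helper for building trees by hand.
public sealed class DecisionTreeNode<T>
{
    public int Label { get; set; }   // attribute column, or the class column for a leaf
    public T Value { get; set; }     // attribute value of the branch, or the class label
    public List<DecisionTreeNode<T>> Children { get; set; }

    public DecisionTreeNode(int label, T value)
    {
        Label = label;
        Value = value;
        Children = new List<DecisionTreeNode<T>>();
    }

    public DecisionTreeNode<T> Add(DecisionTreeNode<T> child)
    {
        Children.Add(child);
        return this;
    }
}

public static class ClassifyDemo
{
    const int ClassCol = 4; // columns: age=0, income=1, student=2, credit_rating=3, class=4

    // Follow the matching branch at each level; a child labeled with the class column is a leaf.
    public static T Classify<T>(DecisionTreeNode<T> node, T[] tuple, int classCol)
        where T : IEquatable<T>
    {
        foreach (var child in node.Children)
        {
            if (child.Label == classCol) return child.Value;
            if (child.Value.Equals(tuple[child.Label]))
                return Classify(child, tuple, classCol);
        }
        return default(T); // attribute value not seen on this path during training
    }

    static DecisionTreeNode<string> Leaf(string label)
    {
        return new DecisionTreeNode<string>(ClassCol, label);
    }

    // Hand-built tree: root splits on age, then on student (youth) or credit_rating (senior).
    public static DecisionTreeNode<string> BuildExampleTree()
    {
        return new DecisionTreeNode<string>(-1, null)
            .Add(new DecisionTreeNode<string>(0, "youth")
                .Add(new DecisionTreeNode<string>(2, "no").Add(Leaf("no")))
                .Add(new DecisionTreeNode<string>(2, "yes").Add(Leaf("yes"))))
            .Add(new DecisionTreeNode<string>(0, "middle_aged").Add(Leaf("yes")))
            .Add(new DecisionTreeNode<string>(0, "senior")
                .Add(new DecisionTreeNode<string>(3, "fair").Add(Leaf("yes")))
                .Add(new DecisionTreeNode<string>(3, "excellent").Add(Leaf("no"))));
    }

    public static void Main()
    {
        var tree = BuildExampleTree();
        var customer = new[] { "youth", "medium", "yes", "fair" };
        Console.WriteLine(Classify(tree, customer, ClassCol)); // prints yes
    }
}
```

Returning `default(T)` for an unseen attribute value is only one possible choice; falling back to the majority class of the current node would be another.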
The classes are used as follows:
using System;
using MachineLearning.DecisionTree;

namespace MachineLearning
{
    class Program
    {
        static void Main(string[] args)
        {
            var da = new string[,]
            {
                {"youth","high","no","fair","no"},
                {"youth","high","no","excellent","no"},
                {"middle_aged","high","no","fair","yes"},
                {"senior","medium","no","fair","yes"},
                {"senior","low","yes","fair","yes"},
                {"senior","low","yes","excellent","no"},
                {"middle_aged","low","yes","excellent","yes"},
                {"youth","medium","no","fair","no"},
                {"youth","low","yes","fair","yes"},
                {"senior","medium","yes","fair","yes"},
                {"youth","medium","yes","excellent","yes"},
                {"middle_aged","medium","no","excellent","yes"},
                {"middle_aged","high","yes","fair","yes"},
                {"senior","medium","no","excellent","no"}
            };
            var names = new string[] { "age", "income", "student", "credit_rating", "Class: buys_computer" };
            var tree = new DecisionTreeID3<string>(da, names, new string[] { "yes", "no" });
            tree.Learn();
            Console.ReadKey();
        }
    }
}
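Output: the program prints the learned tree, one node per line, indented by depth. For this data the root splits on age; the youth branch then splits on student, the middle_aged branch is a pure "yes" leaf, and the senior branch splits on credit_rating.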
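Note: I am still learning this material myself and my knowledge is limited, so please point out any mistakes or omissions. Please credit the author when reposting.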