We are now going to consider a rather different approach to machine learning, starting with one of the most common and powerful data structures in the whole of computer science: the binary tree. The computational cost of making the tree is fairly low, but the cost of using it is even lower: classifying a new example just means following a single path from the root down to a leaf, which takes O(log N) comparisons for a reasonably balanced tree built from N datapoints.

For these reasons, classification by decision trees has grown in popularity over recent years. You are very likely to have been subjected to decision trees if you've ever phoned a helpline, for example for computer faults. The phone operators are guided through the decision tree by your answers to their questions.

The idea of a decision tree is that we break classification down into a set of choices about each feature in turn, starting at the root (base) of the tree and progressing down to the leaves, where we receive the classification decision. The trees are very easy to understand, and can even be turned into a set of if-then rules, suitable for use in a rule induction system.

In terms of optimisation and search, decision trees use a greedy heuristic to perform search, evaluating the possible options at the current stage of learning and choosing the one that seems optimal at that point. This approach works well surprisingly often.

12.1 USING DECISION TREES

As a student it can be difficult to decide what to do in the evening. There are four things that you actually quite enjoy doing, or have to do: going to the pub, watching TV, going to a party, or even (gasp) studying. The choice is sometimes made for you: if you have an assignment due the next day, then you need to study; if you are feeling lazy, then the pub isn't for you; and if there isn't a party, then you can't go to it. You are looking for a nice algorithm that will let you decide what to do each evening without having to think about it every night.

Each evening you start at the top (root) of the tree and check whether any of your friends know about a party that night. If there is one, then you need to go, regardless. Only if there is not a party do you worry about whether or not you have an assignment deadline coming up. If there is a crucial deadline, then you have to study, but if there is nothing that is urgent for the next few days, you think about how you feel. A sudden burst of energy might make you study, but otherwise you'll be slumped in front of the TV indulging your secret love of Shortland Street (or other soap opera of your choice) rather than studying. Of course, near the start of the semester when there are no assignments to do, and you are feeling rich, you'll be in the pub.

One of the reasons that decision trees are popular is that we can turn them into a set of logical disjunctions (if … then rules) that then go into program code very simply; the first part of the tree above can be turned into:

if there is a party then go to it

if there is not a party and you have an urgent deadline then study

etc.
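To make this concrete, here is a minimal sketch of how the whole evening-activity tree might be written directly as nested if-then statements in Python. The feature names and the exact branching are illustrative, based on the narrative above rather than on any fixed dataset or on the book's own code.

```python
# A minimal sketch of the evening-activity tree coded directly as if-then rules.
# Feature names and values here are illustrative, not from any fixed dataset.
def choose_activity(party, deadline, lazy):
    """Return an activity for tonight.

    party    -- True if there is a party on
    deadline -- 'urgent', 'near', or 'none'
    lazy     -- True if you are feeling lazy
    """
    if party:
        return "go to party"
    if deadline == "urgent":
        return "study"
    if deadline == "near":
        # a burst of energy means studying, otherwise TV
        return "watch TV" if lazy else "study"
    # no party, no pressing deadline
    return "watch TV" if lazy else "go to pub"


print(choose_activity(party=False, deadline="urgent", lazy=True))  # study
```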

12.2 CONSTRUCTING DECISION TREES

In the example above, the three features that we need for the algorithm are the state of your energy level, the date of your nearest deadline, and whether or not there is a party tonight. The question we need to ask is how, based on those features, we can construct the tree. There are a few different decision tree algorithms, but they are almost all variants of the same principle: the algorithms build the tree in a greedy manner starting at the root, choosing the most informative feature at each step. We are going to start by focusing on the most common: Quinlan's ID3, although we'll also mention its extension, known as C4.5, and another known as CART.

There was an important word hidden in the sentence above about how the trees work, which was informative. Choosing which feature to use next in the decision tree can be thought of as playing the game '20 Questions', where you try to elicit the item your opponent is thinking about by asking questions about it. At each stage, you choose a question that gives you the most information given what you know already. Thus, you would ask 'Is it an animal?' before you ask 'Is it a cat?'. The idea is to quantify this question of how much information is provided to you by knowing certain facts. Encoding this mathematically is the task of information theory.

12.2.1 Quick Aside: Entropy in Information Theory

Information theory was 'born' in 1948 when Claude Shannon published a paper called "A Mathematical Theory of Communication". In that paper, he proposed the measure of information entropy, which describes the amount of impurity in a set of features. The entropy of a set of probabilities $p_i$ is:

$$\text{Entropy}(p) = -\sum_i p_i \log_2 p_i,$$

where the logarithm is base 2 because we imagine encoding everything using binary digits (bits), and we define $0 \log_2 0 = 0$.
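As a quick illustrative sketch, the entropy of a set of class labels can be computed with a few lines of NumPy; the function name and interface here are our own choices, not from any particular library.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a list/array of class labels."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    # classes with zero count never appear, so 0 log 0 = 0 is handled implicitly
    return float(-np.sum(p * np.log2(p)))


print(entropy(["yes", "no", "no", "no"]))  # about 0.811
```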

For our decision tree, the best feature to pick as the one to classify on now is the one that gives you the most information, i.e., the one that produces the greatest reduction in entropy (made precise as the information gain below). After using that feature, we re-evaluate the remaining features and again pick the most informative one.

Information theory is a very interesting subject. It is possible to download Shannon's 1948 paper from the Internet, and also to find many resources showing where it has been applied. There are now whole journals devoted to information theory because it is relevant to so many areas such as computer and telecommunication networks, machine learning, and data storage.

12.2.2 ID3

Now that we have a suitable measure for choosing which feature to pick next, entropy, we just have to work out how to apply it. The important idea is to work out how much the entropy of the whole training set would decrease if we choose each particular feature for the next classification step. This is known as the information gain, and it is defined as the entropy of the whole set minus the entropy when a particular feature is chosen:

$$\text{Gain}(S, F) = \text{Entropy}(S) - \sum_{f \in \text{values}(F)} \frac{|S_f|}{|S|}\,\text{Entropy}(S_f),$$

where $S$ is the set of examples, $F$ is a possible feature, $S_f$ is the subset of $S$ for which feature $F$ takes value $f$, and $|S|$ is the number of elements in $S$.

As an example, suppose that we have four training examples with outcomes $S = \{\text{true}, \text{false}, \text{false}, \text{false}\}$ and a binary feature $F$ that takes the value $f_1$ for the first two examples and $f_2$ for the last two (purely illustrative data). Then $\text{Entropy}(S) = -\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4} \approx 0.811$. The subset $S_{f_1}$ contains one true and one false example, so its entropy is 1, while $S_{f_2}$ contains only false examples, so its entropy is 0. The information gain is therefore $\text{Gain}(S, F) \approx 0.811 - \frac{2}{4}\times 1 - \frac{2}{4}\times 0 = 0.311$.
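A short sketch of the information gain computation reproduces the numbers in this worked example; the small entropy helper is repeated here so the snippet runs on its own, and the data is the same illustrative toy set.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels, feature_values):
    """Gain(S, F): entropy of S minus the weighted entropy of each subset S_f."""
    labels = np.asarray(labels)
    feature_values = np.asarray(feature_values)
    gain = entropy(labels)
    for f in np.unique(feature_values):
        subset = labels[feature_values == f]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain


labels = ["true", "false", "false", "false"]
feature = ["f1", "f1", "f2", "f2"]
print(information_gain(labels, feature))  # about 0.311
```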

The ID3 algorithm computes this information gain for each feature and chooses the one that produces the highest value. In essence, that is all there is to the algorithm. It searches the space of possible trees in a greedy way by choosing the feature with the highest information gain at each stage. The output of the algorithm is the tree, i.e., a list of nodes, edges, and leaves. As with any tree in computer science, it can be constructed recursively. At each stage the best feature is selected and then removed from the dataset, and the algorithm is recursively called on the rest. The recursion stops when either there is only one class remaining in the data (in which case a leaf is added with that class as its label), or there are no features left, in which case the most common label in the remaining data is used.

The ID3 Algorithm

  • If all examples have the same label:

    - return a leaf with that label

  • Else if there are no features left to test:

    - return a leaf with the most common label

  • Else:

- choose the feature $\hat{F}$ that maximises the information gain of $S$ to be the next node

- add a branch from the node for each possible value $f$ of $\hat{F}$

- for each branch:

  • calculate $S_f$, the subset of examples for which $\hat{F} = f$, and remove $\hat{F}$ from the set of features

  • recursively call the algorithm on $S_f$ with the reduced feature set
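The pseudocode above translates fairly directly into a recursive function. The following is a compact sketch, assuming categorical features stored as dicts with labels in a parallel list; representing the tree as nested dictionaries is our own choice, not a requirement of ID3.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def id3(examples, labels, features):
    """Build a decision tree as nested dicts: {feature: {value: subtree_or_label}}.

    examples -- list of dicts mapping feature name -> categorical value
    labels   -- list of class labels, one per example
    features -- list of feature names still available to split on
    """
    if len(set(labels)) == 1:                 # all examples share one label
        return labels[0]
    if not features:                          # no features left: majority label
        return Counter(labels).most_common(1)[0][0]

    def gain(f):
        # entropy of the whole set minus the weighted entropy of each subset
        g = entropy(labels)
        for v in set(ex[f] for ex in examples):
            sub = [lab for ex, lab in zip(examples, labels) if ex[f] == v]
            g -= len(sub) / len(labels) * entropy(sub)
        return g

    best = max(features, key=gain)            # greedy choice of feature
    remaining = [f for f in features if f != best]
    tree = {best: {}}
    for v in set(ex[best] for ex in examples):
        sub_ex = [ex for ex in examples if ex[best] == v]
        sub_lab = [lab for ex, lab in zip(examples, labels) if ex[best] == v]
        tree[best][v] = id3(sub_ex, sub_lab, remaining)
    return tree
```

The nested-dictionary representation makes it easy to walk the tree for a new example: look up the root feature, follow the branch matching the example's value, and repeat until a label (a leaf) is reached.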

Owing to the focus on classification for real-world examples, trees are often used with text features rather than numeric values.

12.3 CLASSIFICATION AND REGRESSION TREES (CART)

There is another well-known tree-based algorithm, CART, whose name indicates that it can be used for both classification and regression. Classification is not wildly different in CART, although it is usually constrained to construct binary trees. This might seem odd at first, but there are sound computer science reasons why binary trees are good, as suggested in the computation cost discussion above, and it is not a real limitation. Even in the example that we started the chapter with, we can always turn questions into binary decisions by splitting the question up a little. Thus, a question that has three answers (say the question about when your nearest assignment deadline is, which is either 'urgent', 'near' or 'none') can be split into two questions: first, 'is the deadline urgent?', and then if the answer to that is 'no', second 'is the deadline near?' The only real difference with classification in CART is that a different information measure is commonly used.
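As an illustrative sketch of this splitting, the three-valued deadline question from the earlier example can be recast as a pair of binary questions; the encoding below is just one possible choice.

```python
def deadline_questions(deadline):
    """Recast a three-valued feature ('urgent', 'near', 'none') as two binary questions."""
    is_urgent = (deadline == "urgent")
    # the second question only matters when the answer to the first is 'no'
    is_near = (deadline == "near")
    return is_urgent, is_near


print(deadline_questions("near"))  # (False, True)
```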

12.3.1 Gini Impurity

The entropy that was used in ID3 as the information measure is not the only way to pick features. Another possibility is something known as the Gini impurity. The 'impurity' in the name suggests that the aim of the decision tree is to have each leaf node represent a set of datapoints that are in the same class, so that there are no mismatches. This is known as purity. If a leaf is pure then all of the training data within it have just one class. In that case, if we count the fraction of the datapoints at the node that belong to class $i$ (call it $p_i$), it should be zero for every class except one. The Gini impurity of the node is then:

$$G = \sum_i p_i (1 - p_i) = \sum_{i \neq j} p_i p_j = 1 - \sum_i p_i^2,$$

where $p_i$ is the fraction of the datapoints at the node that belong to class $i$; a pure node has $G = 0$.

Either way, the Gini impurity is equivalent to computing the expected error rate if the classification label were picked randomly according to the class distribution at the node. The information gain can then be measured in the same way as before, subtracting the weighted sum of the impurities of the child nodes, $\sum_f \frac{|S_f|}{|S|} G(S_f)$, from the impurity of the parent node.

The information measure can be changed in another way, which is to add a weight to the misclassifications. The idea is to consider the cost of misclassifying an instance of class $i$ as class $j$, encoded as a risk $\lambda_{ij}$, and to weight the terms of the Gini impurity accordingly:

$$G = \sum_{i \neq j} \lambda_{ij}\, p_i p_j,$$

so that expensive mistakes contribute more to the impurity of a node.
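A small sketch of the (optionally cost-weighted) Gini impurity follows; the interface, with costs supplied as a dictionary keyed by pairs of class labels, is our own illustrative choice.

```python
from collections import Counter

def gini_impurity(labels, costs=None):
    """Gini impurity of a node, optionally weighted by misclassification costs.

    labels -- class labels of the datapoints at the node
    costs  -- optional dict mapping (i, j) -> cost of labelling class i as class j;
              if None, all misclassifications cost 1 and the usual Gini impurity results
    """
    counts = Counter(labels)
    n = sum(counts.values())
    p = {c: k / n for c, k in counts.items()}
    total = 0.0
    for i in p:
        for j in p:
            if i != j:
                w = 1.0 if costs is None else costs.get((i, j), 1.0)
                total += w * p[i] * p[j]
    return total


print(gini_impurity(["a", "a", "b", "b"]))  # 0.5 for a 50/50 binary split
```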

12.3.2 Regression in Trees

The new part about CART is its application to regression. While it might seem strange to use trees for regression, it turns out to require only a simple modification to the algorithm. Suppose that the outputs are continuous, so that a regression model is appropriate. None of the node impurity measures that we have considered so far will work. Instead, we'll go back to our old favourite, the sum-of-squares error. To evaluate the choice of which feature to use next, we also need to find the value at which to split the dataset according to that feature. Remember that the output is a value at each leaf. In general, this is just a constant value for the output, computed as the mean of all the datapoints that end up in that leaf. This is the optimal choice in order to minimise the sum-of-squares error, and it also means that we can choose the split point for a given feature quickly, by picking the one that minimises the sum-of-squares error. We can then choose the feature whose best split point gives the lowest sum-of-squares error, and continue to use the algorithm as for classification.
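As a sketch of the regression case, the following function scans the candidate split points for a single continuous feature and returns the one that minimises the total sum-of-squares error when each side is predicted by its mean. The interface is our own; a full CART implementation would repeat this over all features and recurse on the two halves.

```python
import numpy as np

def best_split(x, y):
    """Best split point on one continuous feature under the sum-of-squares error.

    x -- 1D array of feature values
    y -- 1D array of continuous targets
    Returns (split_value, sse) where each side of the split is predicted by its mean.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, np.inf)
    # candidate splits lie midway between consecutive distinct feature values
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        split = 0.5 * (x[i] + x[i - 1])
        left, right = y[:i], y[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best[1]:
            best = (split, float(sse))
    return best


print(best_split([1, 2, 3, 10, 11, 12], [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]))  # splits at 6.5
```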
