We are now going to consider a rather different approach to machine learning, starting with one of the most common and powerful data structures in the whole of computer science: the binary tree. The computational cost of making the tree is fairly low, but the cost of using it is even lower: classifying a new example just means following a single path from the root down to a leaf, which takes O(log N) comparisons for a reasonably balanced tree built from N datapoints.

For these reasons, classification by decision trees has grown in popularity over recent years. You are very likely to have been subjected to decision trees if you've ever phoned a helpline, for example for computer faults. The phone operators are guided through the decision tree by your answers to their questions.

The idea of a decision tree is that we break classification down into a set of choices about each feature in turn, starting at the root (base) of the tree and progressing down to the leaves, where we receive the classification decision. The trees are very easy to understand, and can even be turned into a set of if-then rules, suitable for use in a rule induction system.

In terms of optimisation and search, decision trees use a greedy heuristic to perform search, evaluating the possible options at the current stage of learning and choosing the one that seems optimal at that point. This approach works well surprisingly often.

12.1 USING DECISION TREES

As a student it can be difficult to decide what to do in the evening. There are four things that you actually quite enjoy doing, or have to do: going to the pub, watching TV, going to a party, or even (gasp) studying. The choice is sometimes made for you: if you have an assignment due the next day, then you need to study; if you are feeling lazy, then the pub isn't for you; and if there isn't a party, then you can't go to it. You are looking for a nice algorithm that will let you decide what to do each evening without having to think about it every night.

Each evening you start at the top (root) of the tree and check whether any of your friends know about a party that night. If there is one, then you need to go, regardless. Only if there is not a party do you worry about whether or not you have an assignment deadline coming up. If there is a crucial deadline, then you have to study, but if there is nothing that is urgent for the next few days, you think about how you feel. A sudden burst of energy might make you study, but otherwise you'll be slumped in front of the TV indulging your secret love of Shortland Street (or other soap opera of your choice) rather than studying. Of course, near the start of the semester when there are no assignments to do, and you are feeling rich, you'll be in the pub.

One of the reasons that decision trees are popular is that we can turn them into a set of logical disjunctions (if … then rules) that then go into program code very simply; the first part of the tree above can be turned into:

if there is a party then go to it

if there is not a party and you have an urgent deadline then study

etc.
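To make this concrete, here is a minimal sketch of how the whole evening-activity tree might be written directly as nested if-then statements in Python. The feature names and the exact branching are illustrative, based on the narrative above rather than on any fixed dataset or on the book's own code.

```python
# A minimal sketch of the evening-activity tree coded directly as if-then rules.
# Feature names and values here are illustrative, not from any fixed dataset.
def choose_activity(party, deadline, lazy):
    """Return an activity for tonight.

    party    -- True if there is a party on
    deadline -- 'urgent', 'near', or 'none'
    lazy     -- True if you are feeling lazy
    """
    if party:
        return "go to party"
    if deadline == "urgent":
        return "study"
    if deadline == "near":
        # a burst of energy means studying, otherwise TV
        return "watch TV" if lazy else "study"
    # no party, no pressing deadline
    return "watch TV" if lazy else "go to pub"


print(choose_activity(party=False, deadline="urgent", lazy=True))  # study
```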

12.2 CONSTRUCTING DECISION TREES

In the example above, the three features that we need for the algorithm are the state of your energy level, the date of your nearest deadline, and whether or not there is a party tonight. The question we need to ask is how, based on those features, we can construct the tree. There are a few different decision tree algorithms, but they are almost all variants of the same principle: the algorithms build the tree in a greedy manner starting at the root, choosing the most informative feature at each step. We are going to start by focusing on the most common: Quinlan's ID3, although we'll also mention its extension, known as C4.5, and another known as CART.

There was an important word hidden in the sentence above about how the trees work, which was informative. Choosing which feature to use next in the decision tree can be thought of as playing the game '20 Questions', where you try to elicit the item your opponent is thinking about by asking questions about it. At each stage, you choose a question that gives you the most information given what you know already. Thus, you would ask 'Is it an animal?' before you ask 'Is it a cat?'. The idea is to quantify this question of how much information is provided to you by knowing certain facts. Encoding this mathematically is the task of information theory.

12.2.1 Quick Aside: Entropy in Information Theory

Information theory was 'born' in 1948 when Claude Shannon published a paper called "A Mathematical Theory of Communication". In that paper, he proposed the measure of information entropy, which describes the amount of impurity in a set of features. The entropy of a set of probabilities $p_i$ is:

$$\text{Entropy}(p) = -\sum_i p_i \log_2 p_i,$$

where the logarithm is base 2 because we imagine encoding everything using binary digits (bits), and we define $0 \log_2 0 = 0$.
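As a quick illustrative sketch, the entropy of a set of class labels can be computed with a few lines of NumPy; the function name and interface here are our own choices, not from any particular library.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a list/array of class labels."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    # classes with zero count never appear, so 0 log 0 = 0 is handled implicitly
    return float(-np.sum(p * np.log2(p)))


print(entropy(["yes", "no", "no", "no"]))  # about 0.811
```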

For our decision tree, the best feature to pick as the one to classify on now is the one that gives you the most information, i.e., the one that produces the greatest reduction in entropy (made precise as the information gain below). After using that feature, we re-evaluate the remaining features and again pick the most informative one.

Information theory is a very interesting subject. It is possible to download Shannon's 1948 paper from the Internet, and also to find many resources showing where it has been applied. There are now whole journals devoted to information theory because it is relevant to so many areas such as computer and telecommunication networks, machine learning, and data storage.

12.2.2 ID3

Now that we have a suitable measure for choosing which feature to pick next, entropy, we just have to work out how to apply it. The important idea is to work out how much the entropy of the whole training set would decrease if we choose each particular feature for the next classification step. This is known as the information gain, and it is defined as the entropy of the whole set minus the entropy when a particular feature is chosen:

$$\text{Gain}(S, F) = \text{Entropy}(S) - \sum_{f \in \text{values}(F)} \frac{|S_f|}{|S|}\,\text{Entropy}(S_f),$$

where $S$ is the set of examples, $F$ is a possible feature, $S_f$ is the subset of $S$ for which feature $F$ takes value $f$, and $|S|$ is the number of elements in $S$.

As an example, suppose that we have four training examples with outcomes $S = \{\text{true}, \text{false}, \text{false}, \text{false}\}$ and a binary feature $F$ that takes the value $f_1$ for the first two examples and $f_2$ for the last two (purely illustrative data). Then $\text{Entropy}(S) = -\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4} \approx 0.811$. The subset $S_{f_1}$ contains one true and one false example, so its entropy is 1, while $S_{f_2}$ contains only false examples, so its entropy is 0. The information gain is therefore $\text{Gain}(S, F) \approx 0.811 - \frac{2}{4}\times 1 - \frac{2}{4}\times 0 = 0.311$.
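A short sketch of the information gain computation reproduces the numbers in this worked example; the small entropy helper is repeated here so the snippet runs on its own, and the data is the same illustrative toy set.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels, feature_values):
    """Gain(S, F): entropy of S minus the weighted entropy of each subset S_f."""
    labels = np.asarray(labels)
    feature_values = np.asarray(feature_values)
    gain = entropy(labels)
    for f in np.unique(feature_values):
        subset = labels[feature_values == f]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain


labels = ["true", "false", "false", "false"]
feature = ["f1", "f1", "f2", "f2"]
print(information_gain(labels, feature))  # about 0.311
```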

The ID3 algorithm computes this information gain for each feature and chooses the one that produces the highest value. In essence, that is all there is to the algorithm. It searches the space of possible trees in a greedy way by choosing the feature with the highest information gain at each stage. The output of the algorithm is the tree, i.e., a list of nodes, edges, and leaves. As with any tree in computer science, it can be constructed recursively. At each stage the best feature is selected and then removed from the dataset, and the algorithm is recursively called on the rest. The recursion stops when either there is only one class remaining in the data (in which case a leaf is added with that class as its label), or there are no features left, in which case the most common label in the remaining data is used.

The ID3 Algorithm

  • If all examples have the same label:

    - return a leaf with that label

  • Else if there are no features left to test:

    - return a leaf with the most common label

  • Else:

- choose the feature $\hat{F}$ that maximises the information gain of $S$ to be the next node

- add a branch from the node for each possible value $f$ of $\hat{F}$

- for each branch:

  • calculate $S_f$, the subset of examples for which $\hat{F} = f$, and remove $\hat{F}$ from the set of features

  • recursively call the algorithm on $S_f$ with the reduced feature set
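The pseudocode above translates fairly directly into a recursive function. The following is a compact sketch, assuming categorical features stored as dicts with labels in a parallel list; representing the tree as nested dictionaries is our own choice, not a requirement of ID3.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def id3(examples, labels, features):
    """Build a decision tree as nested dicts: {feature: {value: subtree_or_label}}.

    examples -- list of dicts mapping feature name -> categorical value
    labels   -- list of class labels, one per example
    features -- list of feature names still available to split on
    """
    if len(set(labels)) == 1:                 # all examples share one label
        return labels[0]
    if not features:                          # no features left: majority label
        return Counter(labels).most_common(1)[0][0]

    def gain(f):
        # entropy of the whole set minus the weighted entropy of each subset
        g = entropy(labels)
        for v in set(ex[f] for ex in examples):
            sub = [lab for ex, lab in zip(examples, labels) if ex[f] == v]
            g -= len(sub) / len(labels) * entropy(sub)
        return g

    best = max(features, key=gain)            # greedy choice of feature
    remaining = [f for f in features if f != best]
    tree = {best: {}}
    for v in set(ex[best] for ex in examples):
        sub_ex = [ex for ex in examples if ex[best] == v]
        sub_lab = [lab for ex, lab in zip(examples, labels) if ex[best] == v]
        tree[best][v] = id3(sub_ex, sub_lab, remaining)
    return tree
```

The nested-dictionary representation makes it easy to walk the tree for a new example: look up the root feature, follow the branch matching the example's value, and repeat until a label (a leaf) is reached.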

Owing to the focus on classification for real-world examples, trees are often used with text features rather than numeric values.

12.3 CLASSIFICATION AND REGRESSION TREES (CART)

There is another well-known tree-based algorithm, CART, whose name indicates that it can be used for both classification and regression. Classification is not wildly different in CART, although it is usually constrained to construct binary trees. This might seem odd at first, but there are sound computer science reasons why binary trees are good, as suggested in the computation cost discussion above, and it is not a real limitation. Even in the example that we started the chapter with, we can always turn questions into binary decisions by splitting the question up a little. Thus, a question that has three answers (say the question about when your nearest assignment deadline is, which is either 'urgent', 'near' or 'none') can be split into two questions: first, 'is the deadline urgent?', and then if the answer to that is 'no', second 'is the deadline near?' The only real difference with classification in CART is that a different information measure is commonly used.
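As an illustrative sketch of this splitting, the three-valued deadline question from the earlier example can be recast as a pair of binary questions; the encoding below is just one possible choice.

```python
def deadline_questions(deadline):
    """Recast a three-valued feature ('urgent', 'near', 'none') as two binary questions."""
    is_urgent = (deadline == "urgent")
    # the second question only matters when the answer to the first is 'no'
    is_near = (deadline == "near")
    return is_urgent, is_near


print(deadline_questions("near"))  # (False, True)
```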

12.3.1 Gini Impurity

The entropy that was used in ID3 as the information measure is not the only way to pick features. Another possibility is something known as the Gini impurity. The 'impurity' in the name suggests that the aim of the decision tree is to have each leaf node represent a set of datapoints that are in the same class, so that there are no mismatches. This is known as purity. If a leaf is pure then all of the training data within it have just one class. In that case, if we count the fraction of the datapoints at the node that belong to class $i$ (call it $p_i$), it should be zero for every class except one. The Gini impurity of the node is then:

$$G = \sum_i p_i (1 - p_i) = \sum_{i \neq j} p_i p_j = 1 - \sum_i p_i^2,$$

where $p_i$ is the fraction of the datapoints at the node that belong to class $i$; a pure node has $G = 0$.

Either way, the Gini impurity is equivalent to computing the expected error rate if the classification label were picked randomly according to the class distribution at the node. The information gain can then be measured in the same way as before, subtracting the weighted sum of the impurities of the child nodes, $\sum_f \frac{|S_f|}{|S|} G(S_f)$, from the impurity of the parent node.

The information measure can be changed in another way, which is to add a weight to the misclassifications. The idea is to consider the cost of misclassifying an instance of class $i$ as class $j$, encoded as a risk $\lambda_{ij}$, and to weight the terms of the Gini impurity accordingly:

$$G = \sum_{i \neq j} \lambda_{ij}\, p_i p_j,$$

so that expensive mistakes contribute more to the impurity of a node.
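A small sketch of the (optionally cost-weighted) Gini impurity follows; the interface, with costs supplied as a dictionary keyed by pairs of class labels, is our own illustrative choice.

```python
from collections import Counter

def gini_impurity(labels, costs=None):
    """Gini impurity of a node, optionally weighted by misclassification costs.

    labels -- class labels of the datapoints at the node
    costs  -- optional dict mapping (i, j) -> cost of labelling class i as class j;
              if None, all misclassifications cost 1 and the usual Gini impurity results
    """
    counts = Counter(labels)
    n = sum(counts.values())
    p = {c: k / n for c, k in counts.items()}
    total = 0.0
    for i in p:
        for j in p:
            if i != j:
                w = 1.0 if costs is None else costs.get((i, j), 1.0)
                total += w * p[i] * p[j]
    return total


print(gini_impurity(["a", "a", "b", "b"]))  # 0.5 for a 50/50 binary split
```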

12.3.2 Regression in Trees

The new part about CART is its application to regression. While it might seem strange to use trees for regression, it turns out to require only a simple modification to the algorithm. Suppose that the outputs are continuous, so that a regression model is appropriate. None of the node impurity measures that we have considered so far will work. Instead, we'll go back to our old favourite, the sum-of-squares error. To evaluate the choice of which feature to use next, we also need to find the value at which to split the dataset according to that feature. Remember that the output is a value at each leaf. In general, this is just a constant value for the output, computed as the mean of all the datapoints that end up in that leaf. This is the optimal choice in order to minimise the sum-of-squares error, and it also means that we can choose the split point for a given feature quickly, by picking the one that minimises the sum-of-squares error. We can then choose the feature whose best split point gives the lowest sum-of-squares error, and continue to use the algorithm as for classification.
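As a sketch of the regression case, the following function scans the candidate split points for a single continuous feature and returns the one that minimises the total sum-of-squares error when each side is predicted by its mean. The interface is our own; a full CART implementation would repeat this over all features and recurse on the two halves.

```python
import numpy as np

def best_split(x, y):
    """Best split point on one continuous feature under the sum-of-squares error.

    x -- 1D array of feature values
    y -- 1D array of continuous targets
    Returns (split_value, sse) where each side of the split is predicted by its mean.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, np.inf)
    # candidate splits lie midway between consecutive distinct feature values
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        split = 0.5 * (x[i] + x[i - 1])
        left, right = y[:i], y[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best[1]:
            best = (split, float(sse))
    return best


print(best_split([1, 2, 3, 10, 11, 12], [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]))  # splits at 6.5
```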
