• Three different methods for parallel gradient boosting decision trees.
  • My algorithm and implementation are competitive with (and in many cases better than) the implementations in OpenCV and XGBoost (a parallel GBDT library with 750+ stars on GitHub).

 

Introduction

Gradient boosting is a machine learning technique for regression problems that produces a prediction model in the form of an ensemble of weak prediction models. Gradient Boosting Decision Trees (GBDT) use decision trees as the weak prediction models in gradient boosting, and GBDT is one of the most widely used learning algorithms in machine learning today. Thanks to its high accuracy, almost half of machine learning contests are won by GBDT models. An example of the model is shown below.

Figure from http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf

The general idea of the method is additive training. At each iteration, a new tree learns the negative gradients of the loss at the current predictions (for squared-error loss, these are simply the residuals between the target values and the current predicted values), and the algorithm then conducts a gradient-descent step based on the learned tree. The algorithm description from Wikipedia is shown below:
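To make the additive-training idea concrete, here is a minimal sketch of the training loop, assuming squared-error loss so that the negative gradient is just the residual. To keep the sketch self-contained, the weak learner is reduced to a hypothetical depth-0 stump that predicts a single constant; the real weak learner is the level-wise decision tree described in the rest of this article, and all names here are illustrative rather than taken from the actual code.

    // Sketch of gradient boosting as additive training (squared-error loss).
    #include <cstddef>
    #include <numeric>
    #include <vector>

    struct Stump { double value = 0.0; };   // stand-in for a regression tree

    // Fit the stand-in weak learner: predict the mean of the residuals.
    Stump fit_weak_learner(const std::vector<double>& residual) {
        Stump s;
        s.value = std::accumulate(residual.begin(), residual.end(), 0.0)
                  / static_cast<double>(residual.size());
        return s;
    }

    std::vector<Stump> train_gbdt(const std::vector<double>& y,
                                  int num_trees, double learning_rate) {
        std::vector<double> F(y.size(), 0.0);        // current ensemble prediction
        std::vector<Stump> ensemble;
        for (int t = 0; t < num_trees; ++t) {
            std::vector<double> residual(y.size());
            for (std::size_t i = 0; i < y.size(); ++i)
                residual[i] = y[i] - F[i];           // negative gradient of squared loss
            Stump tree = fit_weak_learner(residual);
            for (std::size_t i = 0; i < y.size(); ++i)
                F[i] += learning_rate * tree.value;  // gradient-descent step in function space
            ensemble.push_back(tree);
        }
        return ensemble;
    }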

We can see that this is an inherently sequential algorithm: each tree depends on the predictions of all previous trees, so we cannot build the trees in parallel as in Random Forests. We can only parallelize within the tree building step, so the problem reduces to parallel decision tree building.

Experiment Setting

Since some experimental results will be shown while introducing the algorithms, we first introduce the dataset and the experimental setting. The data we use comes from an IJCAI'15 competition, from which we extracted one small dataset and one large dataset. Their statistics are as follows:

  • Small dataset: 130K instances, each with 42 attributes.
  • Large dataset: 1M instances, each with 42 attributes, generated by duplicating the small dataset eight times.

Unless otherwise specified, the experimental results below are obtained on the small dataset. All running times are measured by growing 100 trees with a maximum tree depth of 8 and a minimum node weight of 10. All experiments are performed on a Debian machine with eight Intel E5-2650 2.0GHz cores and 64GB of memory.

Sequential Decision Tree Building

We first define the input and output of the decision tree building problem:

  • Input: N instances, each with m attributes and one target value.
  • Output: A fitted decision tree for the input.

We also recap the algorithm for sequential decision tree building. The building process grows a decision tree level by level. At each level, the algorithm enumerates the leaf nodes of that level and conducts a split finding process on each node. During split finding for a node, the algorithm enumerates each feature, sorts the instances in the node by that feature's values, and finds the best split for that feature by a linear scan. Finally, the algorithm chooses the best split among all the features to split the node.
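Assuming squared-error loss, so that the quality of a split can be scored from the sums of residuals on each side, a hedged sketch of this split finding step is given below. The names Split, best_split_for_feature, and find_best_split are illustrative rather than taken from the actual implementation; the later sketches reuse them.

    // Sketch of sequential split finding for one node (squared-error loss).
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Split { int feature = -1; double threshold = 0.0; double gain = 0.0; };

    // Sort the node's instances by one feature and linearly scan the candidate thresholds.
    Split best_split_for_feature(const std::vector<std::vector<double>>& X,  // X[i][f]
                                 const std::vector<double>& g,               // residuals
                                 std::vector<std::size_t> node,              // instances in the node (copied)
                                 std::size_t f) {
        std::sort(node.begin(), node.end(),
                  [&](std::size_t a, std::size_t b) { return X[a][f] < X[b][f]; });
        double total = 0.0;
        for (std::size_t i : node) total += g[i];
        Split best;
        double left = 0.0;
        for (std::size_t k = 0; k + 1 < node.size(); ++k) {
            left += g[node[k]];
            if (X[node[k]][f] == X[node[k + 1]][f]) continue;  // cannot split between equal values
            const double nL = static_cast<double>(k + 1);
            const double nR = static_cast<double>(node.size()) - nL;
            const double right = total - left;
            const double gain = left * left / nL + right * right / nR
                                - total * total / static_cast<double>(node.size());
            if (gain > best.gain) {
                best.feature = static_cast<int>(f);
                best.threshold = 0.5 * (X[node[k]][f] + X[node[k + 1]][f]);
                best.gain = gain;
            }
        }
        return best;
    }

    // Enumerate every feature and keep the best split found for this node.
    Split find_best_split(const std::vector<std::vector<double>>& X,
                          const std::vector<double>& g,
                          const std::vector<std::size_t>& node) {
        Split best;
        for (std::size_t f = 0; f < X[0].size(); ++f) {
            Split s = best_split_for_feature(X, g, node, f);
            if (s.gain > best.gain) best = s;
        }
        return best;
    }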

Method 1: Parallelize Node Building at Each Level

A simple idea for parallel decision tree building is to parallelize node building at each level. However, this method has a serious workload imbalance problem. The reason is that a decision tree tends to purify its nodes to obtain high prediction accuracy, so many nodes end up containing only a small group of training instances with purified results, while other nodes contain large groups of training instances. The figure below shows an example of the workload imbalance problem. Suppose we are going to build the nodes in the red box in parallel. The first and third nodes contain far fewer training instances than the second and fourth nodes, which causes the workload imbalance.
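To see where the imbalance comes from in code, here is a minimal sketch of Method 1, assuming an OpenMP parallel loop and reusing Split and find_best_split from the sequential sketch above. Each leaf node at the current level becomes one parallel task, and the tasks cover very different numbers of instances.

    // Sketch of Method 1: one parallel task per leaf node at the current level.
    #include <cstddef>
    #include <vector>

    void split_level_method1(const std::vector<std::vector<double>>& X,
                             const std::vector<double>& g,
                             const std::vector<std::vector<std::size_t>>& nodes,  // instances per leaf
                             std::vector<Split>& best_per_leaf) {
        best_per_leaf.assign(nodes.size(), Split{});
        // One task per node; a few large nodes dominate the running time.
        #pragma omp parallel for schedule(dynamic)
        for (long n = 0; n < static_cast<long>(nodes.size()); ++n)
            best_per_leaf[n] = find_best_split(X, g, nodes[n]);
    }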

The figure below shows the speedup of the node parallelization method. We can see that we only gain a very small speedup from node parallelization due to the workload imbalance problem.

Method 2: Parallelize Split Finding on Each Node

Recall that in the split finding process on a node (the process is shown below), we need to enumerate each feature to find the split. The idea of this method is to parallelize that process so that, within each node, the algorithm finds splits for different features in parallel; a sketch of this scheme is also given below.
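A hedged sketch of this per-node feature parallelism, again assuming OpenMP and reusing Split and best_split_for_feature from the sequential sketch above: the per-feature sort-and-scan runs in parallel inside a single node, followed by a cheap sequential reduction over features.

    // Sketch of Method 2: parallelize over features inside one node.
    #include <cstddef>
    #include <vector>

    Split find_best_split_parallel(const std::vector<std::vector<double>>& X,
                                   const std::vector<double>& g,
                                   const std::vector<std::size_t>& node) {
        const long num_features = static_cast<long>(X[0].size());
        std::vector<Split> per_feature(num_features);
        // Each call sorts its own copy of the node's instance list, so threads do not interfere.
        #pragma omp parallel for
        for (long f = 0; f < num_features; ++f)
            per_feature[f] = best_split_for_feature(X, g, node, static_cast<std::size_t>(f));
        Split best;
        for (const Split& s : per_feature)   // sequential reduction over features
            if (s.gain > best.gain) best = s;
        return best;
    }

For a node with only a handful of instances, spawning and joining this parallel region costs more than the scan itself, which is exactly the overhead problem discussed next.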

The speedup figure below shows that this method performs better than node parallelization. However, it still fails to achieve half of the peak speedup. The main problem of this method is that it has too much overhead for small nodes. When a decision tree grows deeper, most of the nodes contain only a small number of training instances. In this case, the computation cost for each node is very small, and the benefit brought by parallel computing cannot cover the overhead of context switching, thread joining, and so on, which prevents the method from achieving a good speedup. However, this method does point us in the right direction, and our final method is based on parallel split finding by features.

Method 3: Parallelize Split Finding at Each Level by Features

As shown above, at each level the sequential decision tree building process has two loops: an outer loop that enumerates the leaf nodes, and an inner loop that enumerates the features. The idea of this method is to swap the order of these two loops, so that we can parallelize split finding for different features at the same level. Pseudocode for the algorithm is shown below. By changing the order of the loops, we also avoid sorting the instances in each node: we can sort the instances once at the start of the whole building process and reuse the same sorting result at each level. On the other hand, note that to keep the algorithm correct, each thread needs to carefully maintain its scanning status for each leaf node during the linear scan, which significantly increases the coding complexity of the algorithm.
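In the same hedged style as the earlier sketches (OpenMP, squared-error loss, the Split type from above), the level-wise, feature-parallel scan might look like the following. Here order[f] holds all instance indices pre-sorted by feature f once before the tree is grown, leaf_of[i] maps an instance to its current leaf at this level, and each thread scans one feature over the whole level while maintaining per-leaf running sums; these names are illustrative, not taken from the actual code.

    // Sketch of Method 3: parallelize split finding over features for a whole level.
    #include <cstddef>
    #include <vector>

    void split_level_method3(const std::vector<std::vector<double>>& X,
                             const std::vector<double>& g,
                             const std::vector<std::vector<std::size_t>>& order,  // pre-sorted per feature
                             const std::vector<int>& leaf_of,                     // instance -> current leaf
                             int num_leaves,
                             std::vector<Split>& best_per_leaf) {
        const long num_features = static_cast<long>(X[0].size());
        // Per-leaf totals, computed once and read by all threads.
        std::vector<double> total(num_leaves, 0.0), count(num_leaves, 0.0);
        for (std::size_t i = 0; i < g.size(); ++i) {
            total[leaf_of[i]] += g[i];
            count[leaf_of[i]] += 1.0;
        }
        std::vector<std::vector<Split>> best(num_features, std::vector<Split>(num_leaves));
        #pragma omp parallel for
        for (long f = 0; f < num_features; ++f) {
            // Scanning status of every leaf, owned by the thread handling feature f.
            std::vector<double> left_sum(num_leaves, 0.0), left_cnt(num_leaves, 0.0);
            std::vector<double> last_value(num_leaves, 0.0);
            for (std::size_t i : order[f]) {
                const int leaf = leaf_of[i];
                // Evaluate the split placing everything scanned so far in this leaf on the left.
                if (left_cnt[leaf] > 0.0 && X[i][f] != last_value[leaf]) {
                    const double nL = left_cnt[leaf], nR = count[leaf] - nL;
                    const double sL = left_sum[leaf], sR = total[leaf] - sL;
                    const double gain = sL * sL / nL + sR * sR / nR
                                        - total[leaf] * total[leaf] / count[leaf];
                    if (gain > best[f][leaf].gain) {
                        best[f][leaf].feature = static_cast<int>(f);
                        best[f][leaf].threshold = 0.5 * (last_value[leaf] + X[i][f]);
                        best[f][leaf].gain = gain;
                    }
                }
                left_sum[leaf] += g[i];
                left_cnt[leaf] += 1.0;
                last_value[leaf] = X[i][f];
            }
        }
        // Sequential reduction: pick the best feature for every leaf of the level.
        best_per_leaf.assign(num_leaves, Split{});
        for (long f = 0; f < num_features; ++f)
            for (int n = 0; n < num_leaves; ++n)
                if (best[f][n].gain > best_per_leaf[n].gain)
                    best_per_leaf[n] = best[f][n];
    }

Each thread makes one full pass over the level per feature regardless of how the level's nodes are shaped, which is why the workload is naturally balanced.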

The advantages of the method are:

  • The workload is fully balanced. Since every feature covers the same number of instances, each parallel job does the same amount of work, so we avoid the workload imbalance problem of Method 1.
  • The parallelization overhead is small. Since we parallelize split finding across a whole level rather than within a single node, the benefit from parallel computing easily covers its overhead.

The figures below show the speedup of the method and the running time comparison between this method and Method 2. In the left figure, we can see that the method achieves an almost perfect speedup with two threads and a 4.8x speedup with eight threads. In the right figure, we can see that the method is much faster than Method 2, owing to the algorithm's lower time complexity and the smaller multi-threading overhead.

 

Comparison with OpenCV and XGBoost

In this section, I show the effectiveness of my method and implementation by comparing against two strong baselines: OpenCV and XGBoost. The two baselines are as follows:

  • OpenCV: a sequential GBDT implementation in the OpenCV library. The algorithm performs early-stopping pruning while growing a tree to reduce computation.
  • XGBoost: a parallel GBDT library with 750+ stars on GitHub.

My algorithm and implementation are competitive with (and in many cases better than) the implementations in OpenCV and XGBoost. The figures below show the running times of the three implementations on the two datasets. On the small dataset, my implementation is slightly slower than OpenCV in the single-thread case, but becomes faster than OpenCV when more than one thread is used, since OpenCV does not have a multi-threaded implementation. Compared with XGBoost on the small dataset, my implementation is faster in every tested thread configuration. On the large dataset, OpenCV performs significantly better in the single-thread case thanks to its early pruning technique, but worse than my method in the multi-thread cases. Compared with XGBoost, my implementation in general performs slightly better, but the performance difference between the two is not large.

 

[Code Download]

Last updated by Zhanpeng Fang, May 11th, 2015.
