In the previous post, we talked about a very popular Boosting algorithm - Gradient Boosting Decision Tree. The key to GBM is using Gradient Descent to optimize the loss function. But why Gradient Descent, and not some other numerical optimization method? Is it the fastest optimization method? Are there any problems with Gradient Descent?

If the target function were a simple convex quadratic, we would not need to worry about any of these questions: the optimum has a closed-form solution that can be reached in a single step. However, in the real world the target function is rarely that well behaved. In that case, Gradient Descent is not necessarily the fastest method, and it has some other problems, including:

  • The step length is hard to choose: a small step converges slowly, while a big step may lead to zigzagging
  • Convergence can be slow when close to the optimum

All of the above issues arise because Gradient Descent only considers the first-order behaviour of the target function. In other words, it uses a linear approximation of the target function to find the direction in which the loss decreases fastest. Following this logic, if we use a second-order polynomial to approximate the target function, we should get a better estimate of the update. This leads to the Newton Raphson method.

We define L as the loss function, g as the gradient, and h as the second-order derivative. Below is a second-order Taylor expansion of the loss function.

\[L(x_n+\epsilon) \approx L(x_n) + g \epsilon + 0.5 h \epsilon^2
\]

We want to find the \(\epsilon\) that minimizes this approximation. Taking the first-order derivative of the approximation with respect to \(\epsilon\) and setting it to zero (assuming h > 0 so that it is a minimum), we get:

\[\epsilon = - g/h
\]

This is the numerical optimization method used in XGBoost - Newton Raphson.
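
To make the contrast concrete, below is a minimal sketch (not XGBoost code) comparing a Gradient Descent update with a Newton Raphson update on a toy one-dimensional loss; the loss function and the step length are made up purely for illustration.

def L(x): return x**4 + x**2        # toy loss
def g(x): return 4 * x**3 + 2 * x   # gradient
def h(x): return 12 * x**2 + 2      # second derivative

x_gd, x_nr, step = 3.0, 3.0, 0.01
for _ in range(10):
    x_gd -= step * g(x_gd)          # gradient descent: fixed step length
    x_nr -= g(x_nr) / h(x_nr)       # Newton Raphson: step length set by the curvature

print(L(x_gd), L(x_nr))             # the Newton iterate ends up much closer to the minimum at x = 0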

Compared with other boosting algorithms, XGBoost introduces innovations in the following areas:

  • A second-order numerical optimization method
  • A regularized model formulation
  • Algorithmic acceleration

Next, let's dig deeper into these features of XGBoost.

Regularization

XGBoost, as a member of the Boosting family, is similar to GBM in many ways. First, like GBM, XGBoost fits an additive model of multiple base learners:

\[\hat{y} = F_M(x) = \sum_{k=1}^M f_k(x)
\]

L2 Regularization

However, XGBoost adds regularization to the model formulation (the loss function), which directly impacts the training of each base learner.

\[L(\hat{y}) = \sum_i L(y_i,\hat{y_i}) + \sum_k\Omega(f_k)
\]

\(\hat{y} = \sum_k f_k(x)\) is the current prediction.

\(\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2\), where T is the number of leaves of the base learner and w is the vector of leaf weights.

Shrinkage

L2 regularization prevents over-fitting by shrinking the parameters [more detail is discussed in this post]. A more straightforward shrinkage method is to directly scale the output of each base learner by a shrinkage factor, in order to reduce the impact of any single base learner. Shrinkage plays the same role as the learning rate in a neural network.
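
Concretely, with shrinkage factor \(\eta\) (the eta / learning_rate parameter), each boosting update becomes

\[F_m(x) = F_{m-1}(x) + \eta f_m(x)
\]

so every base learner only contributes a fraction of its fitted output.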

Column sampling

XGBoost also brings in a technique widely used in random forests - column sampling. This is similar to dropout in neural networks: it spreads the weight across features and also acts as a form of bagging.

Relevant parameters

from xgboost import XGBRegressor

my_model = XGBRegressor(
    reg_lambda=1,         ## default=1, L2 regularization (lambda in the native API)
    reg_alpha=0,          ## default=0, L1 regularization (alpha in the native API)
    learning_rate=0.3,    ## default=0.3, shrinkage rate (eta in the native API)
    colsample_bytree=1,   ## default=1, fraction of columns sampled for each base learner
    colsample_bylevel=1,  ## default=1, fraction of columns sampled at each tree depth level
)
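
As a quick usage sketch (the dataset here is synthetic and the parameter values are arbitrary, just to show how the regularization knobs are passed in):

from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)

my_model = XGBRegressor(reg_lambda=5, learning_rate=0.1, colsample_bytree=0.8)
my_model.fit(X, y)
preds = my_model.predict(X[:5])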

Boosting

Compared with GBM, XGBoost uses a second-order Taylor expansion to approximate the loss function. To make this easier to compare with Tree - Gradient Boosting Machine with sklearn source code, let's follow the same order: loss function, linear base learner, and tree base learner.

Loss approximation - Newton Raphson

To approximate the loss function at the t-th iteration, we take a second-order Taylor expansion at the current prediction \(\hat{y}\), see below:

\[\begin{align}
L(y,\hat{y}) \approx \sum_{i=1}^N \left[ l(y_i,\hat{y_i}) + g(\hat{y_i})f_t(x_i) + \frac{1}{2}h(\hat{y_i})f_t(x_i)^2 \right]
\end{align}
\]

where \(g(\hat{y_i})\) is the gradient at the current prediction (the quantity each base learner of GBM is fitted to), and \(h(\hat{y_i})\) is the second-order derivative, known as the Hessian matrix in higher dimensions.
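
For example, with squared error loss \(l = \frac{1}{2}(y - \hat{y})^2\) we get \(g = \hat{y} - y\) and \(h = 1\); with logistic loss we get \(g = p - y\) and \(h = p(1-p)\), where \(p = \sigma(\hat{y})\) and \(\hat{y}\) is the raw score. A minimal sketch (the function names here are mine, not XGBoost's API):

import numpy as np

def grad_hess_squared(y, y_hat):
    # per-sample g and h for squared error loss 0.5 * (y - y_hat)**2
    return y_hat - y, np.ones_like(y_hat)

def grad_hess_logistic(y, y_hat):
    # per-sample g and h for logistic loss, y_hat being the raw (pre-sigmoid) score
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return p - y, p * (1.0 - p)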

*Hessian Matrix

The Hessian matrix is a square matrix of second-order partial derivatives, where \(H_{ij} = \frac{\partial^2f}{\partial{x_i} \partial{x_j}}\):

\(H = \begin{bmatrix}
\frac{\partial^2f}{\partial{x_1} \partial{x_1}}
& \frac{\partial^2f}{\partial{x_1} \partial{x_2}}
& \frac{\partial^2f}{\partial{x_1} \partial{x_3}} \\[0.3em] \frac{\partial^2f}{\partial{x_2} \partial{x_1}}
& \frac{\partial^2f}{\partial{x_2} \partial{x_2}}
& \frac{\partial^2f}{\partial{x_2} \partial{x_3}} \\[0.3em]
\frac{\partial^2f}{\partial{x_3} \partial{x_1}}
& \frac{\partial^2f}{\partial{x_3} \partial{x_2}}
& \frac{\partial^2f}{\partial{x_3} \partial{x_3}} \\[0.3em]
\end{bmatrix}\)

From the example above we can tell the Hessian is symmetric (when the second-order partial derivatives are continuous). And when f is convex, the Hessian is positive semi-definite.

Linear base learner

When a linear base learner is used to optimize the loss function at each iteration, we can further simplify the above expression:

\[L(y,\hat{y}) = constant + \frac{1}{2}\sum_{i=1}^Nh(\hat{y_i}) [ f_t(x_i) + \frac{g(\hat{y_i})}{h(\hat{y_i})} ]^2
\]

Therefore Newton Raphson leads to a weighted least squares regression against \(-\frac{g(\hat{y_i})}{h(\hat{y_i})}\) at each iteration.

\[\hat{f_t}(x) = \arg\min_{f_t}\sum_{i=1}^N h(\hat{y_i}) \left[ \left(-\frac{g(\hat{y_i})}{h(\hat{y_i})}\right) - f_t(x_i) \right]^2
\]

In comparison, Gradient Descent solves a plain least squares regression against \(-g(\hat{y_i})\) at each iteration.

\[\hat{f_t}(x) = \arg\min_{f_t}\sum_{i=1}^N \left[ \left(-g(\hat{y_i})\right) - f_t(x_i) \right]^2
\]

For a linear base learner, Gradient Descent still needs a line search to choose the step length, while Newton Raphson solves for the direction and the step length at the same time: when \(h(\hat{y_i})\) is larger, meaning \(g(\hat{y_i})\) changes faster, the step \(\lvert-\frac{g(\hat{y_i})}{h(\hat{y_i})}\rvert\) becomes smaller.
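
Below is a minimal numpy sketch of the two regressions above for a single boosting step with a linear base learner; the data, gradients, and hessians are synthetic, purely to show the difference in the fitted target and weighting.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # features
g = rng.normal(size=100)                 # per-sample gradients at the current prediction
h = rng.uniform(0.1, 1.0, size=100)      # per-sample hessians (assumed positive)

# Newton Raphson step: weighted least squares against -g/h with weights h
sw = np.sqrt(h)
beta_newton, *_ = np.linalg.lstsq(X * sw[:, None], (-g / h) * sw, rcond=None)

# Gradient descent step: plain least squares against -g (step length still to be chosen)
beta_gd, *_ = np.linalg.lstsq(X, -g, rcond=None)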

Tree base learner

Let's add the regularization term to get the full representation of the XGBoost loss function, and solve it with a tree base learner.

With a tree as the base learner, all samples that end up in the same leaf share the same prediction, and the leaves are disjoint. Therefore we can further simplify the loss function into:

\[\begin{align}
L(y,\hat{y}) & \approx \sum_{i=1}^N \left[ l(y_i,\hat{y_i}) + g(\hat{y_i})f_t(x_i) + \frac{1}{2}h(\hat{y_i})f_t(x_i)^2 \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^J w_j^2 \\
& = constant + \sum_{j=1}^J\left[\sum_{x_i \in j} g(\hat{y_i})w_j +\frac{1}{2}\left(\sum_{x_i \in j}h(\hat{y_i}) + \lambda\right)w_j^2\right] + \gamma T
\end{align}
\]

The closed-form solution for the weight of each leaf is:

\[w_j^* = -\frac{\sum_{x_i \in j}g(\hat{y_i})}{\sum_{x_i \in j}h(\hat{y_i}) + \lambda }
\]

And plugging it back in gives the following loss score for the tree structure:

\[L_t = -\frac{1}{2}\sum_{j=1}^J\frac{(\sum_{x_i \in j}g(\hat{y_i}))^2}{\sum_{x_i \in j}h(\hat{y_i}) + \lambda } + \gamma T
\]
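
A minimal sketch of these two formulas, given the per-leaf sums of gradients and hessians (the variable and function names are mine):

def leaf_weight(G, H, lam):
    # optimal leaf weight w* = -G / (H + lambda) for one leaf
    return -G / (H + lam)

def tree_score(G_sums, H_sums, lam, gamma):
    # loss score of a tree structure with per-leaf gradient/hessian sums
    T = len(G_sums)
    return -0.5 * sum(G**2 / (H + lam) for G, H in zip(G_sums, H_sums)) + gamma * T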

However, it is impossible to enumerate all possible tree structures to minimize the above loss. Therefore the greedy search used in GBM is also applied here: we grow the tree from the root and search for the best split at each step.

The best split is selected by maximizing the loss reduction implied by the above loss score, similar to Information Gain and the Gini Index.

\[Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda}\right] - \gamma
\]

where \(G_L = \sum_{x_i \in left}g(\hat{y_i})\) and \(H_L = \sum_{x_i \in left}h(\hat{y_i})\) are the sums of gradients and hessians in the left child (similarly \(G_R\), \(H_R\) for the right child), and \(G = G_L + G_R\), \(H = H_L + H_R\) are the sums over the parent node.
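
A minimal sketch of the exact greedy split search on a single feature, scoring every candidate threshold with the gain above (missing values and the minimum-child-weight check are ignored for brevity):

import numpy as np

def best_split(x, g, h, lam, gamma):
    # scan the sorted values of one feature and return (best_gain, best_threshold)
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    G_L = H_L = 0.0
    best_gain, best_thr = -np.inf, None
    for i in range(len(x) - 1):
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[i + 1]:
            continue                      # cannot split between identical values
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, (x[i] + x[i + 1]) / 2
    return best_gain, best_thr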

Pros and Cons

So what are the advantages of XGBoost over GBM?

  1. Newton Raphson estimates the loss-reduction direction more accurately than Gradient Descent, so in theory it should converge in fewer iterations.

  2. The regularization term \(\lambda\) helps prevent over-fitting by shrinking the weight of each individual base learner.

  3. The regularization term \(\gamma\) can be viewed as a threshold for early stopping: if the loss reduction of a split is smaller than \(\gamma\), the tree stops growing. This prevents the base learner from becoming too complicated.

However, if Newton Raphson has so many advantages, why is Gradient Descent more widely used in machine learning? Newton Raphson has its own limitations:

  1. Computing the Hessian matrix can be very time-consuming

  2. The loss function must have a second-order derivative

Relevant parameters

my_model = XGBRegressor(
    booster='gbtree',  ## gbtree or gblinear
    max_depth=6,       ## default=6, max depth of each base learner; 0 indicates no limit
    gamma=0,           ## default=0, minimum loss reduction (threshold for early stopping)
)

Acceleration

One of the most time-consuming parts of a Boosting algorithm is split finding, which consists of two parts: sorting the features and searching through all values of each feature. Let's see how XGBoost optimizes these steps.

Histogram binning - Approximate Algo

The key idea behind Boosting is to combine biased, simple base learners into an (approximately) unbiased final prediction. In other words, accuracy is not the top concern for each individual base learner, so we really don't need to search through every value to find the optimal split.

Therefore an approximate approach can be used: each feature is split into buckets, and at each split only the aggregate statistics of each bucket are searched.

This raises the next question: how many bins should we use, and how should we bin each feature?

The number of bins is a hyper-parameter of the algorithm. It needs to be relatively small compared with the number of unique feature values to speed things up, while too small a number will hurt performance.

As for how to bin a feature, quantiles are usually used so that the data are evenly distributed across bins. XGBoost proposes the Weighted Quantile Sketch for binning, which is a weighted quantile with \(h(\hat{y_i})\) as the weight. This is in line with our analysis of Newton Raphson - it solves a weighted least squares problem for the linear base learner.
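
A rough sketch of hessian-weighted quantile binning for one feature (this is a plain weighted quantile, not the full Weighted Quantile Sketch data structure from the paper):

import numpy as np

def weighted_quantile_bins(x, h, n_bins):
    # choose candidate split points so each bin holds roughly 1/n_bins of the total hessian weight
    order = np.argsort(x)
    cum_w = np.cumsum(h[order]) / h.sum()          # cumulative hessian weight in sorted order
    levels = np.arange(1, n_bins) / n_bins         # interior quantile levels
    idx = np.searchsorted(cum_w, levels)
    return x[order][np.minimum(idx, len(x) - 1)]   # candidate split points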

You can also choose to bin all the features once at the very beginning (the global method), or to re-bin at each split (the local method). The global method needs more bins but only one computation, while the local method needs fewer bins but more computation, which may be a better fit for deep trees.

Column Block

With the histogram approximation, we no longer need to iterate over all values. What about feature sorting?

XGBoost uses a column block structure to solve this problem: the data are presorted per column (feature) and stored in multiple blocks, which also allows split finding to be computed in parallel.
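
A rough illustration of the idea (not the actual block data structure): each column is sorted once up front, and the stored order is reused at every split instead of re-sorting per node.

import numpy as np

X = np.random.rand(1000, 20)
sorted_idx = np.argsort(X, axis=0)           # presort once: row indices per column in ascending order
j = 3
column_in_order = X[sorted_idx[:, j], j]     # scanned during split finding without re-sorting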

Parallel Computing

XGBoost is well known for its parallel computing ability. So which parts of the computation are parallelized?

  1. Base learners are still trained one after another, in series
  2. Split finding across features runs in parallel (see the parameter sketch below)
  3. Split finding within a single feature also runs in parallel
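
At the API level, the number of threads used for this is exposed through the n_jobs parameter of the scikit-learn wrapper (nthread in the native API), e.g.:

my_model = XGBRegressor(
    n_jobs=4,   ## number of parallel threads used for training / split finding
)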

Sparsity aware

XGBoost provides a unified way to handle missing values, whether they are artificially created (e.g. the zeros from one-hot encoding) or naturally missing. The mechanism has two benefits:

  • Speed up training on sparse data

    When searching for the optimal split of each feature, only non-missing values are considered: \(I_k = \{i \in I \mid x_{ik} \notin \{0,NA\} \}\).
  • Improve predictions for missing values

    Most algorithms need data cleaning to handle missing values before training: we either remove them or replace them with some aggregate statistic. Tree building offers more options; for example, a missing value can follow the majority path, i.e. the child with more observations.

    In XGBoost, missing values are treated as a separate group. We compute two candidate splits, one sending the missing values to the left child and one sending them to the right, and choose the direction with the larger loss reduction, as sketched below.
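
A rough sketch of this default-direction search on one feature, simplified from the enumeration described in the XGBoost paper (the missing block's gradient/hessian sums are added to one side or the other for every candidate threshold):

import numpy as np

def best_split_with_missing(x, g, h, lam, gamma):
    # return (gain, threshold, default direction) with missing values tried on both sides
    miss = np.isnan(x)
    G_m, H_m = g[miss].sum(), h[miss].sum()
    order = np.argsort(x[~miss])
    xs, gs, hs = x[~miss][order], g[~miss][order], h[~miss][order]
    G, H = gs.sum() + G_m, hs.sum() + H_m            # totals include the missing block
    best = (-np.inf, None, None)
    G_L = H_L = 0.0
    for i in range(len(xs) - 1):
        G_L += gs[i]
        H_L += hs[i]
        if xs[i] == xs[i + 1]:
            continue
        thr = (xs[i] + xs[i + 1]) / 2
        for direction, (GL, HL) in (("left", (G_L + G_m, H_L + H_m)), ("right", (G_L, H_L))):
            GR, HR = G - GL, H - HL
            gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam) - G**2 / (H + lam)) - gamma
            if gain > best[0]:
                best = (gain, thr, direction)
    return best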

Others

XGBoost has a few other designs to further speed up the algorithm.

Cache-aware access prefetches the gradient statistics into a per-thread buffer for the exact algorithm, and tunes the block size so that the statistics fit into the CPU cache for the approximate algorithm.

Out-of-core computation speeds up disk reading through block sharding (splitting the data across multiple disks) and block compression (compressing the data column-wise and decompressing it on the fly), so that computation and disk reading can happen concurrently.

Relevant parameters

my_model = XGBRegressor(
    tree_method='auto',  ## exact, approx, hist (approx optimized with bin caching), gpu_exact, gpu_hist
    sketch_eps=0.03,     ## default=0.03, lower eps leads to more bins (roughly 1 / sketch_eps); used with tree_method='approx'
    max_bin=256,         ## default=256, number of bins; used with tree_method='hist'
)

Reference

  1. https://xgboost.readthedocs.io/en/latest/index.html
  2. Chen, T. and Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY, USA. ACM.
  3. T. Hastie, R. Tibshirani and J. Friedman. Elements of Statistical Learning Ed. 2, Springer, 2009.
