Machine Learning Methods (6): Random Forest and Bagging

Reposting is welcome; when reposting, please credit the source: this article comes from Bin's column at blog.csdn.net/xbinworld.

Technical discussion QQ group: 433250724. Anyone interested in algorithms and technology is welcome to join.

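The earlier post, Machine Learning Methods (4): Decision Trees, covered the classic decision tree algorithms. We noted there that a decision tree overfits easily, because it chooses each attribute split greedily, taking whatever split looks best on the training data. The result often performs well on the train data but poorly on the test data. The random forest algorithm is essentially an ensemble method that can effectively reduce overfitting; this post explains it in detail.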

Background

Decision trees are a popular method for various machine learning tasks. Tree learning “comes closest to meeting the requirements for serving as an off-the-shelf procedure for data mining”, say Hastie et al.[1], because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate.

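First, the advantages of a decision tree [2]: (1) it is invariant to scaling of the feature values; (2) it is robust to irrelevant features; (3) it produces a model that can be inspected.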

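However, a decision tree is often not accurate enough because it overfits easily: a very deep tree tends to have low bias but high variance. Random Forest averages many decision trees, which can substantially lower the variance and so reduce overfitting. The price of RF is a slight increase in bias and some loss of interpretability; the payoff is generally a clear gain in accuracy.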

bagging

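Bagging [4], short for bootstrap aggregating, is a very simple and general ensemble learning technique. RF is built on bagging, but bagging can be applied to other classification or regression algorithms as well, to reduce overfitting (that is, to lower the model's variance).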

Given a standard training set D of size n, bagging generates m new training sets D_i, each of size n′, by sampling from D uniformly and with replacement. This kind of sample is known as a bootstrap sample. The m models are fitted using the above m bootstrap samples and combined by averaging the output (for regression) or voting (for classification).

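Put simply: draw several sets from the original training set by sampling with replacement, train a model on each set, and then average the outputs of all models (regression) or take a vote among them (classification).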
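The procedure can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names (`bootstrap_sample`, `bagging_fit`, `predict_average`, `predict_vote`) and the two trivial base learners are made up for this example, and any real learner could be plugged in through the `fit` argument.

```python
import random
from collections import Counter

def bootstrap_sample(data, size, rng):
    """Draw `size` items from `data` uniformly and with replacement."""
    return [rng.choice(data) for _ in range(size)]

def bagging_fit(data, fit, m, size=None, seed=0):
    """Fit m models, each on its own bootstrap sample of `data`."""
    rng = random.Random(seed)
    size = len(data) if size is None else size
    return [fit(bootstrap_sample(data, size, rng)) for _ in range(m)]

def predict_average(models, x):
    """Regression: average the outputs of all models."""
    outputs = [model(x) for model in models]
    return sum(outputs) / len(outputs)

def predict_vote(models, x):
    """Classification: return the most common predicted label."""
    return Counter(model(x) for model in models).most_common(1)[0][0]

# Hypothetical base learner: a constant model predicting the mean target.
def fit_mean(sample):
    avg = sum(y for _, y in sample) / len(sample)
    return lambda x: avg

# Hypothetical base learner: a constant model predicting the majority label.
def fit_majority(sample):
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: label

data_reg = [(x, 2.0 * x) for x in range(10)]      # (feature, target) pairs
models = bagging_fit(data_reg, fit_mean, m=50)
print(predict_average(models, 3))                 # roughly the mean target, 9.0

data_clf = [(x, "a") for x in range(8)] + [(x, "b") for x in range(2)]
voters = bagging_fit(data_clf, fit_majority, m=25)
print(predict_vote(voters, 0))
```

The constant base learners keep the sketch short; they show the mechanics of resampling and aggregation, not the accuracy gain that bagging gives with high-variance learners such as deep trees.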

The expected number of distinct samples in each bagging set satisfies the following property [3]:

when drawing with replacement n′ values out of a set of n (different and equally likely), the expected number of unique draws is

$$n\left(1 - e^{-n'/n}\right).$$
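A quick numerical check makes this concrete. Each of the n items is missed by all n′ draws with probability (1 − 1/n)^n′, so the exact expectation is n(1 − (1 − 1/n)^n′); the exponential form quoted above is its large-n approximation. The function names below are made up for this sketch.

```python
import math
import random

def expected_unique_exact(n, n_prime):
    """Exact expectation: an item is missed by every draw w.p. (1 - 1/n)^n'."""
    return n * (1.0 - (1.0 - 1.0 / n) ** n_prime)

def expected_unique_approx(n, n_prime):
    """Large-n approximation quoted in the text: n * (1 - exp(-n'/n))."""
    return n * (1.0 - math.exp(-n_prime / n))

def simulated_unique(n, n_prime, trials=200, seed=0):
    """Average number of distinct indices over repeated bootstrap draws."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += len({rng.randrange(n) for _ in range(n_prime)})
    return total / trials

n = n_prime = 1000
print(expected_unique_exact(n, n_prime))
print(expected_unique_approx(n, n_prime))    # about 0.632 * n
print(simulated_unique(n, n_prime))
```

With n′ = n, roughly 63% of the original samples appear in each bootstrap set, and the remaining 37% or so are left out of it.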

Back to the random forest algorithm. Given a training set {X, Y} with n samples:

for b=1,…,B:

1. Sample n examples from {X, Y} with replacement to form the set {X_b, Y_b};

2. Train a decision tree (classification or regression tree) on {X_b, Y_b};

end

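After training, the prediction for an unseen sample x′ is the average of the outputs of all B trees (or, for classification, the majority vote):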

$$\hat{f} = \frac{1}{B}\sum_{b=1}^{B} \hat{f}_b(x')$$
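The loop and the averaging step can be tried out on toy data. The sketch below makes some simplifying assumptions that are not part of the algorithm itself: each "tree" is only a depth-1 regression stump on a single feature, and the data is a made-up noisy step function. The names `fit_stump`, `fit_bagged_trees`, and `predict` are hypothetical.

```python
import random

def _avg(values):
    return sum(values) / len(values)

def fit_stump(points):
    """Fit a depth-1 regression tree on (x, y) pairs by least squares."""
    xs = sorted({x for x, _ in points})
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2.0                       # candidate threshold
        left = [y for x, y in points if x <= t]
        right = [y for x, y in points if x > t]
        ml, mr = _avg(left), _avg(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:                              # one distinct x: constant model
        c = _avg([y for _, y in points])
        return lambda x: c
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def fit_bagged_trees(points, B, seed=0):
    """Steps 1 and 2 of the loop: B bootstrap samples of size n, one tree each."""
    rng = random.Random(seed)
    n = len(points)
    return [fit_stump([rng.choice(points) for _ in range(n)]) for _ in range(B)]

def predict(trees, x_new):
    """f_hat(x') = (1/B) * sum over b of f_hat_b(x')."""
    return sum(tree(x_new) for tree in trees) / len(trees)

# Toy data (assumed for illustration): a noisy step function.
rng = random.Random(1)
points = [(i / 50, (1.0 if i >= 25 else 0.0) + rng.gauss(0, 0.1)) for i in range(50)]
trees = fit_bagged_trees(points, B=100)
print(predict(trees, 0.1), predict(trees, 0.9))   # near 0 and near 1
```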

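A single decision tree is easily thrown off by noise in the training data, whereas an ensemble of trees is much less so. But if many trees are trained on the same data, they will be strongly correlated (or even identical), and averaging them gains little. The bootstrap sampling above addresses this by showing each model a different training set.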
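Why correlation matters can be made precise with a standard identity for the variance of an average of correlated variables. As an idealization (an assumption of this note, not a claim about real forests), suppose the B tree predictions are identically distributed with variance σ² and non-negative pairwise correlation ρ. Then:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x')\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

The second term shrinks as B grows, but the first does not, so adding trees only lowers the variance down to the floor ρσ². Making the trees less correlated lowers that floor.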

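B is a tunable parameter. Typically a few hundred to several thousand trees are used, or the best B is chosen by cross-validation. Alternatively, one can monitor the out-of-bag error: the mean prediction error over all training samples x_i, where each x_i is predicted using only those trees that did not have x_i in their bootstrap sample. Tracking the train error and the test error (as estimated by the out-of-bag error) against the number of trees shows that both tend to level off once enough trees have been fit.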
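The out-of-bag error only requires remembering which samples each tree saw. The sketch below is one way to compute it, under the same simplifying assumptions as a toy setup would need: depth-1 regression stumps as trees, squared error as the loss, and a made-up noisy step function as data. `fit_stump` and `oob_error` are hypothetical names.

```python
import random

def _avg(values):
    return sum(values) / len(values)

def fit_stump(points):
    """Depth-1 regression tree on (x, y) pairs (least-squares threshold)."""
    xs = sorted({x for x, _ in points})
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2.0
        left = [y for x, y in points if x <= t]
        right = [y for x, y in points if x > t]
        ml, mr = _avg(left), _avg(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:                          # one distinct x: constant model
        c = _avg([y for _, y in points])
        return lambda x: c
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def oob_error(points, B, seed=0):
    """Mean squared out-of-bag error of B bagged stumps."""
    rng = random.Random(seed)
    n = len(points)
    trees, bags = [], []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]    # bootstrap indices
        trees.append(fit_stump([points[i] for i in idx]))
        bags.append(set(idx))
    errors = []
    for i, (x, y) in enumerate(points):
        # Use only the trees whose bootstrap sample did not contain sample i.
        preds = [tree(x) for tree, bag in zip(trees, bags) if i not in bag]
        if preds:                                     # skip samples never out of bag
            errors.append((_avg(preds) - y) ** 2)
    return _avg(errors)

rng = random.Random(1)
points = [(i / 50, (1.0 if i >= 25 else 0.0) + rng.gauss(0, 0.1)) for i in range(50)]
for B in (5, 20, 80):
    print(B, round(oob_error(points, B), 4))
```

For small B the estimate is noisy, because some samples are out of bag for only a few trees, or for none at all.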

random subspace selection

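On top of the tree bagging above, random forests make one more change: at each candidate split during tree training, only a random subset of the features is considered. The purpose is again to keep the learned trees from being too similar. If a few features are strongly correlated with the output, they will be picked in many of the trees (decision trees are often not grown very deep, so not every feature ends up being used), or will sit close to the root, and those trees will then look very much alike.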

Typically, for a classification task with p features in total, √p features are considered at each split. For regression tasks the recommendation is p/3 features, together with a minimum node size of 5.
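These rules of thumb are easy to encode. In the sketch below, rounding down and the floor of one feature are assumptions of this example, and `n_split_features` and `draw_split_features` are hypothetical helper names.

```python
import math
import random

def n_split_features(p, task):
    """Rule-of-thumb number of features considered at each split."""
    if task == "classification":
        return max(1, math.isqrt(p))       # floor(sqrt(p)), at least 1 (assumed)
    if task == "regression":
        return max(1, p // 3)              # floor(p / 3), at least 1 (assumed)
    raise ValueError(f"unknown task: {task!r}")

def draw_split_features(p, task, rng):
    """Pick the feature indices considered at one split, without replacement."""
    return sorted(rng.sample(range(p), n_split_features(p, task)))

rng = random.Random(0)
print(n_split_features(100, "classification"))    # 10
print(n_split_features(100, "regression"))        # 33
print(draw_split_features(16, "classification", rng))
```

A fresh subset is drawn at every split, so a feature excluded at one node can still be used deeper in the same tree.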

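That is the whole of the random forest algorithm: starting from decision trees, it randomizes both the samples and the features, which makes it a typical ensemble method.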

Extensions

(1) ExtraTrees

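A further extension of RF is the algorithm known as extremely randomized trees (ExtraTrees). On top of the randomization already used in RF, it adds one more step: when each tree is grown, the split value is generated at random rather than optimized. (As described here the trees are trained with bagging and random subspaces, as in RF; the original ExtraTrees formulation instead trains each tree on the whole training set.)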

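In an ordinary decision tree, a continuous-valued feature is handled by computing locally optimal split values: a single feature may yield several candidate split values, all of them are evaluated, and the tree splits on whichever feature and split value scores best. In ExtraTrees, a single split value is instead drawn at random for each feature, within the range of values that feature takes, and the tree then evaluates which feature to split on using those random values (growing the tree by one more level).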
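The difference between the two split rules can be seen side by side. The sketch below is an illustration under assumptions: it scores splits by the sum of squared errors of the two children, draws the random threshold uniformly between the minimum and maximum of the feature, and uses made-up data. `best_split_classic` and `best_split_extra` are hypothetical names.

```python
import random

def sse_after_split(rows, targets, feature, threshold):
    """Sum of squared errors of the two children produced by a split."""
    left = [y for r, y in zip(rows, targets) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, targets) if r[feature] > threshold]
    total = 0.0
    for side in (left, right):
        if side:
            m = sum(side) / len(side)
            total += sum((y - m) ** 2 for y in side)
    return total

def best_split_classic(rows, targets, features):
    """Classic tree: try every midpoint between sorted values of every feature."""
    candidates = []
    for f in features:
        vals = sorted({r[f] for r in rows})
        candidates += [(f, (a + b) / 2.0) for a, b in zip(vals, vals[1:])]
    return min(candidates, key=lambda c: sse_after_split(rows, targets, *c))

def best_split_extra(rows, targets, features, rng):
    """ExtraTrees-style: one random threshold per feature, keep the best one."""
    candidates = []
    for f in features:
        lo = min(r[f] for r in rows)
        hi = max(r[f] for r in rows)
        if lo < hi:                        # skip features that are constant here
            candidates.append((f, rng.uniform(lo, hi)))
    return min(candidates, key=lambda c: sse_after_split(rows, targets, *c))

# Toy data (assumed): the target depends only on feature 0.
rng = random.Random(0)
rows = [[rng.random(), rng.random()] for _ in range(200)]
targets = [1.0 if r[0] > 0.5 else 0.0 for r in rows]
print(best_split_classic(rows, targets, features=[0, 1]))
print(best_split_extra(rows, targets, features=[0, 1], rng=rng))
```

The random variant evaluates one candidate per feature instead of one per distinct value, which is cheaper per node and makes the individual trees more varied.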

(2) Relationship to nearest neighbors

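RF and the KNN algorithm are formally similar. Both can be written in the following form.

Given a training set {(x_i, y_i)}, i = 1, …, n, the predicted label for a new point x′ is:

$$\hat{y} = \sum_{i=1}^{n} W(x_i, x')\, y_i.$$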

In k-NN, the weights are W(x_i, x′) = 1/k if x_i is one of the k points closest to x′, and zero otherwise.

In a tree, W(x_i, x′) = 1/k′ if x_i is one of the k′ points in the same leaf as x′, and zero otherwise.

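RF averages over m trees, each with its own weight function W_j:

$$\hat{y} = \frac{1}{m}\sum_{j=1}^{m}\sum_{i=1}^{n} W_j(x_i, x')\, y_i = \sum_{i=1}^{n}\left(\frac{1}{m}\sum_{j=1}^{m} W_j(x_i, x')\right) y_i.$$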

So RF as a whole is also a weighted neighborhood scheme. The neighbors of x′ in this interpretation are the points x_i which fall in the same leaf as x′ in at least one tree of the forest.
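A small one-dimensional example shows the weights explicitly. As a simplification for this sketch, each tree is represented only by its sorted cut points, so a leaf is just an interval; the function names and the toy data are made up.

```python
from bisect import bisect_right

def knn_weights(xs, x_new, k):
    """W(x_i, x') = 1/k for the k nearest training points, else 0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x_new))[:k]
    return [1.0 / k if i in nearest else 0.0 for i in range(len(xs))]

def tree_weights(xs, x_new, cuts):
    """W(x_i, x') = 1/k' for the k' training points in the same leaf as x'.

    The leaf of x is the interval index bisect_right(cuts, x). This assumes
    the leaf of x' holds at least one training point, as in a fitted tree.
    """
    leaf = bisect_right(cuts, x_new)
    same = [bisect_right(cuts, x) == leaf for x in xs]
    k_prime = sum(same)
    return [1.0 / k_prime if s else 0.0 for s in same]

def forest_weights(xs, x_new, forest):
    """Average the per-tree weight functions W_j over the m trees."""
    per_tree = [tree_weights(xs, x_new, cuts) for cuts in forest]
    return [sum(col) / len(forest) for col in zip(*per_tree)]

def predict(weights, ys):
    """y_hat = sum over i of W(x_i, x') * y_i."""
    return sum(w * y for w, y in zip(weights, ys))

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 1.0, 1.0, 4.0, 4.0]
forest = [[1.5, 3.5], [2.5], [0.5, 4.5]]      # three hypothetical trees
w = forest_weights(xs, 2.2, forest)
print(w, sum(w))                               # the weights sum to 1
print(predict(w, ys))                          # about 0.944
print(predict(knn_weights(xs, 2.2, k=2), ys))  # 1.0
```

Here the forest gives non-zero weight to five of the six training points, each of which shares a leaf with x′ = 2.2 in at least one tree, while 2-NN uses only the two closest points.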

References

[1] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2008). The Elements of Statistical Learning (2nd ed.).

[2] https://en.wikipedia.org/wiki/Random_forest

[3] Aslam, Javed A.; Popa, Raluca A.; Rivest, Ronald L. (2007). On Estimating the Size and Confidence of a Statistical Audit.

[4] https://en.wikipedia.org/wiki/Bootstrap_aggregating