Obtaining feature importance in GBDT and XGBoost

Source: Stack Overflow. The idea is to compute how much each feature contributes to reducing impurity; the larger the reduction, the more important the feature.

I'll use the sklearn code, as it is generally much cleaner than the R code.

Here's the implementation of the feature_importances_ property of the GradientBoostingClassifier (I removed some lines of code that get in the way of the conceptual stuff):

def feature_importances_(self):
    # One accumulator slot per input feature.
    total_sum = np.zeros((self.n_features, ), dtype=np.float64)
    for stage in self.estimators_:
        # Average the importances of all trees fitted in this stage.
        stage_sum = sum(tree.feature_importances_
                        for tree in stage) / len(stage)
        total_sum += stage_sum
    # Average over all boosting stages.
    importances = total_sum / len(self.estimators_)
    return importances

This is pretty easy to understand. self.estimators_ is an array containing the individual trees in the booster, grouped by boosting stage, so the for loop is iterating over the stages. There's one hiccup with the line

stage_sum = sum(tree.feature_importances_
                for tree in stage) / len(stage)

which takes care of the non-binary response case. Here we fit multiple trees in each stage in a one-vs-all way. It's simplest conceptually to focus on the binary case, where the sum has one summand, and this is just tree.feature_importances_. So in the binary case, we can rewrite this all as

def feature_importances_(self):
    total_sum = np.zeros((self.n_features, ), dtype=np.float64)
    for tree in self.estimators_:
        total_sum += tree.feature_importances_
    importances = total_sum / len(self.estimators_)
    return importances

So, in words, sum up the feature importances of the individual trees, then divide by the total number of trees. It remains to see how to calculate the feature importances for a single tree.
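The averaging described above can be sketched in plain Python without any library. This is only an illustration: the names `mean_vectors` and `ensemble_importances` and the per-tree importance vectors are made up for the example, and newer sklearn releases may aggregate the per-tree values differently from the excerpt quoted here.

```python
from typing import List, Sequence


def mean_vectors(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def ensemble_importances(
    stages: Sequence[Sequence[Sequence[float]]],
) -> List[float]:
    """Average tree importances within each stage, then across stages."""
    stage_means = [mean_vectors(stage) for stage in stages]
    return mean_vectors(stage_means)


# Hypothetical per-tree importance vectors for 3 features.
# Binary case: every stage holds exactly one tree.
binary_stages = [
    [[0.5, 0.5, 0.0]],
    [[1.0, 0.0, 0.0]],
]
# Three-class case: every stage holds one tree per class.
multiclass_stages = [
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [1.0, 0.0, 0.0]],
    [[0.5, 0.0, 0.5], [0.5, 0.0, 0.5], [0.5, 0.0, 0.5]],
]

binary_result = ensemble_importances(binary_stages)
multiclass_result = ensemble_importances(multiclass_stages)
print(binary_result)
print(multiclass_result)
```

In the binary case the inner average is a no-op, which is why the simplified loop above can add tree.feature_importances_ directly.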

The importance calculation of a tree is implemented at the Cython level, but it's still followable. Here's a cleaned-up version of the code:

cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    # Walk over every node stored in the tree's node array.
    while node != end_node:
        # Only internal (non-leaf) nodes carry a split.
        if node.left_child != _TREE_LEAF:
            # ... and node.right_child != _TREE_LEAF:
            left = &nodes[node.left_child]
            right = &nodes[node.right_child]
            # Weighted impurity decrease, credited to the split feature.
            importance_data[node.feature] += (
                node.weighted_n_node_samples * node.impurity -
                left.weighted_n_node_samples * left.impurity -
                right.weighted_n_node_samples * right.impurity)
        node += 1
    # Scale by the total sample weight at the root.
    importances /= nodes[0].weighted_n_node_samples
    return importances

This is pretty simple. Iterate through the nodes of the tree. As long as you are not at a leaf node, calculate the weighted reduction in node impurity from the split at this node, and attribute it to the feature that was split on:

importance_data[node.feature] += (
    node.weighted_n_node_samples * node.impurity -
    left.weighted_n_node_samples * left.impurity -
    right.weighted_n_node_samples * right.impurity)

Then, when done, divide it all by the total weight of the data (in most cases, the number of observations):

importances /= nodes[0].weighted_n_node_samples

It's worth recalling that the impurity is a common metric to use when determining what split to make when growing a tree. In that light, we are simply summing up how much splitting on each feature allowed us to reduce the impurity across all the splits in the tree.
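To make the single-tree calculation concrete, here is a minimal pure-Python sketch on a hand-built tree. The `Node` class, the `LEAF` marker, the `tree_importances` and `normalize` helpers, and the sample counts and Gini values are all assumptions made for this example, not sklearn's actual data structures; only the arithmetic follows the excerpt above.

```python
from dataclasses import dataclass
from typing import List

LEAF = -1  # hypothetical "no child" marker, playing the role of _TREE_LEAF


@dataclass
class Node:
    feature: int      # index of the split feature (ignored for leaves)
    n_samples: float  # weighted number of samples reaching this node
    impurity: float   # impurity of the node, here Gini
    left: int = LEAF  # index of the left child in the node list
    right: int = LEAF # index of the right child in the node list


def tree_importances(nodes: List[Node], n_features: int) -> List[float]:
    """Sum the weighted impurity decrease of every split, per feature."""
    importances = [0.0] * n_features
    for node in nodes:
        if node.left == LEAF:
            continue  # leaves contribute nothing
        left, right = nodes[node.left], nodes[node.right]
        importances[node.feature] += (
            node.n_samples * node.impurity
            - left.n_samples * left.impurity
            - right.n_samples * right.impurity
        )
    root_weight = nodes[0].n_samples
    return [value / root_weight for value in importances]


def normalize(values: List[float]) -> List[float]:
    """Rescale so the importances sum to 1 (skipped if all are zero)."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(values)


# Toy tree: 8 samples, 4 per class, so the root has Gini impurity 0.5.
nodes = [
    Node(feature=0, n_samples=8, impurity=0.5, left=1, right=2),
    Node(feature=1, n_samples=4, impurity=0.375, left=3, right=4),
    Node(feature=0, n_samples=4, impurity=0.375),  # leaf: 1 vs 3
    Node(feature=0, n_samples=3, impurity=0.0),    # pure leaf
    Node(feature=0, n_samples=1, impurity=0.0),    # pure leaf
]

raw = tree_importances(nodes, n_features=3)
normalized = normalize(raw)
print(raw)
print(normalized)
```

The root split on feature 0 removes 8 * 0.5 - 4 * 0.375 - 4 * 0.375 = 1.0 units of weighted impurity, and the second split on feature 1 removes 4 * 0.375 = 1.5, so feature 1 ends up more important even though it was used lower in the tree. Feature 2 is never split on and gets zero.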
