From Stack Overflow. In essence, this computes how much each feature contributes to reducing node impurity; the more a feature reduces impurity, the more important it is.

I'll use the sklearn code, as it is generally much cleaner than the R code.

Here's the implementation of the feature_importances_ property of the GradientBoostingClassifier (I removed some lines of code that get in the way of the conceptual stuff):

def feature_importances_(self):
    total_sum = np.zeros((self.n_features, ), dtype=np.float64)
    for stage in self.estimators_:
        stage_sum = sum(tree.feature_importances_
                        for tree in stage) / len(stage)
        total_sum += stage_sum
    importances = total_sum / len(self.estimators_)
    return importances

This is pretty easy to understand. self.estimators_ is an array containing the individual trees in the booster, so the for loop is iterating over the individual trees. There's one hiccup with the line

stage_sum = sum(tree.feature_importances_
                for tree in stage) / len(stage)

This is taking care of the non-binary response case. Here we fit multiple trees in each stage in a one-vs-all way. It's simplest conceptually to focus on the binary case, where the sum has one summand, and this is just tree.feature_importances_. So in the binary case, we can rewrite this all as

def feature_importances_(self):
    total_sum = np.zeros((self.n_features, ), dtype=np.float64)
    for tree in self.estimators_:
        total_sum += tree.feature_importances_
    importances = total_sum / len(self.estimators_)
    return importances

So, in words: sum up the feature importances of the individual trees, then divide by the total number of trees.
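As a quick sanity check, here is a minimal sketch of this averaging done by hand. One assumption worth flagging: recent scikit-learn versions average the unnormalized per-tree importances and normalize the result once at the end, rather than averaging the already-normalized per-tree vectors, so that is what we reproduce here.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
clf = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)

# estimators_ has one row per boosting stage and one column per
# class: (10, 3) for three-class iris, (n_estimators, 1) if binary.
print(clf.estimators_.shape)

# Average the unnormalized per-tree importances over every tree in
# every stage, then normalize once at the end.
manual = np.mean([tree.tree_.compute_feature_importances(normalize=False)
                  for stage in clf.estimators_
                  for tree in stage], axis=0)
manual /= manual.sum()

print(np.allclose(manual, clf.feature_importances_))  # True on recent versions

It remains to see how to calculate the feature importances for a single tree.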

The importance calculation of a single tree is implemented at the Cython level, but it's still followable. Here's a cleaned-up version of the code:

cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    while node != end_node:
        if node.left_child != _TREE_LEAF:
            # ... and node.right_child != _TREE_LEAF:
            left = &nodes[node.left_child]
            right = &nodes[node.right_child]
            importance_data[node.feature] += (
                node.weighted_n_node_samples * node.impurity -
                left.weighted_n_node_samples * left.impurity -
                right.weighted_n_node_samples * right.impurity)
        node += 1
    importances /= nodes[0].weighted_n_node_samples
    return importances

This is pretty simple. Iterate through the nodes of the tree. As long as you are not at a leaf node, calculate the weighted reduction in node impurity from the split at this node, and attribute it to the feature that was split on:

importance_data[node.feature] += (
    node.weighted_n_node_samples * node.impurity -
    left.weighted_n_node_samples * left.impurity -
    right.weighted_n_node_samples * right.impurity)

Then, when done, divide it all by the total weight of the data (in most cases, the number of observations):

importances /= nodes[0].weighted_n_node_samples

It's worth recalling that the impurity is a common metric to use when determining what split to make when growing a tree. In that light, we are simply summing up how much splitting on each feature allowed us to reduce the impurity across all the splits in the tree.
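To tie it all together, here is a pure-Python sketch of the same single-tree calculation, written against the public arrays of a fitted tree's tree_ object. This mirrors, rather than reproduces, the Cython code above; the -1 leaf test stands in for the _TREE_LEAF sentinel.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
dt = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

t = dt.tree_
importances = np.zeros(t.n_features)
for node in range(t.node_count):
    left = t.children_left[node]
    right = t.children_right[node]
    if left == -1:  # leaf node; -1 is the _TREE_LEAF sentinel
        continue
    # Weighted impurity reduction from the split at this node,
    # attributed to the feature that was split on.
    importances[t.feature[node]] += (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    )

# Divide by the total weight of the data (the root's weight), then
# normalize so the importances sum to one (the normalize=True step).
importances /= t.weighted_n_node_samples[0]
importances /= importances.sum()

print(np.allclose(importances, dt.feature_importances_))  # True

Averaging this quantity over every tree in the ensemble, as in the first code block, gives back the gradient boosting importances.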
