From Stack Overflow. In essence, this computes how much each feature contributes to reducing node impurity; the more a feature reduces impurity, the more important it is.

I'll use the sklearn code, as it is generally much cleaner than the R code.

Here's the implementation of the feature_importances_ property of the GradientBoostingClassifier (I removed some lines of code that get in the way of the conceptual stuff):

def feature_importances_(self):
    total_sum = np.zeros((self.n_features, ), dtype=np.float64)
    for stage in self.estimators_:
        stage_sum = sum(tree.feature_importances_
                        for tree in stage) / len(stage)
        total_sum += stage_sum
    importances = total_sum / len(self.estimators_)
    return importances

This is pretty easy to understand. self.estimators_ is an array containing the individual trees in the booster, so the for loop is iterating over the individual trees. There's one hiccup with the line

stage_sum = sum(tree.feature_importances_
                for tree in stage) / len(stage)

This is taking care of the non-binary response case. Here we fit multiple trees in each stage in a one-vs-all way. It's simplest conceptually to focus on the binary case, where the sum has one summand, and this is just tree.feature_importances_. So in the binary case, we can rewrite this all as

def feature_importances_(self):
    total_sum = np.zeros((self.n_features, ), dtype=np.float64)
    for tree in self.estimators_:
        total_sum += tree.feature_importances_
    importances = total_sum / len(self.estimators_)
    return importances

So, in words: sum up the feature importances of the individual trees, then divide by the total number of trees.
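As a quick sanity check, here is a minimal sketch of this averaging done by hand. One assumption worth flagging: recent scikit-learn versions average the unnormalized per-tree importances and normalize the result once at the end, rather than averaging the already-normalized per-tree vectors, so that is what we reproduce here.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
clf = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)

# estimators_ has one row per boosting stage and one column per
# class: (10, 3) for three-class iris, (n_estimators, 1) if binary.
print(clf.estimators_.shape)

# Average the unnormalized per-tree importances over every tree in
# every stage, then normalize once at the end.
manual = np.mean([tree.tree_.compute_feature_importances(normalize=False)
                  for stage in clf.estimators_
                  for tree in stage], axis=0)
manual /= manual.sum()

print(np.allclose(manual, clf.feature_importances_))  # True on recent versions

It remains to see how to calculate the feature importances for a single tree.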

The importance calculation of a single tree is implemented at the Cython level, but it's still followable. Here's a cleaned-up version of the code:

cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    while node != end_node:
        if node.left_child != _TREE_LEAF:
            # ... and node.right_child != _TREE_LEAF:
            left = &nodes[node.left_child]
            right = &nodes[node.right_child]
            importance_data[node.feature] += (
                node.weighted_n_node_samples * node.impurity -
                left.weighted_n_node_samples * left.impurity -
                right.weighted_n_node_samples * right.impurity)
        node += 1
    importances /= nodes[0].weighted_n_node_samples
    return importances

This is pretty simple. Iterate through the nodes of the tree. As long as you are not at a leaf node, calculate the weighted reduction in node impurity from the split at this node, and attribute it to the feature that was split on:

importance_data[node.feature] += (
    node.weighted_n_node_samples * node.impurity -
    left.weighted_n_node_samples * left.impurity -
    right.weighted_n_node_samples * right.impurity)

Then, when done, divide it all by the total weight of the data (in most cases, the number of observations):

importances /= nodes[0].weighted_n_node_samples

It's worth recalling that the impurity is a common metric to use when determining what split to make when growing a tree. In that light, we are simply summing up how much splitting on each feature allowed us to reduce the impurity across all the splits in the tree.
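To tie it all together, here is a pure-Python sketch of the same single-tree calculation, written against the public arrays of a fitted tree's tree_ object. This mirrors, rather than reproduces, the Cython code above; the -1 leaf test stands in for the _TREE_LEAF sentinel.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
dt = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

t = dt.tree_
importances = np.zeros(t.n_features)
for node in range(t.node_count):
    left = t.children_left[node]
    right = t.children_right[node]
    if left == -1:  # leaf node; -1 is the _TREE_LEAF sentinel
        continue
    # Weighted impurity reduction from the split at this node,
    # attributed to the feature that was split on.
    importances[t.feature[node]] += (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    )

# Divide by the total weight of the data (the root's weight), then
# normalize so the importances sum to one (the normalize=True step).
importances /= t.weighted_n_node_samples[0]
importances /= importances.sum()

print(np.allclose(importances, dt.feature_importances_))  # True

Averaging this quantity over every tree in the ensemble, as in the first code block, gives back the gradient boosting importances.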
