Taken from Stack Overflow: essentially, the score measures how much each feature contributes to reducing impurity; the more a feature reduces impurity, the more important it is.

I'll use the sklearn code, as it is generally much cleaner than the R code.

Here's the implementation of the feature_importances_ property of the GradientBoostingClassifier (I removed some lines of code that get in the way of the conceptual stuff)

def feature_importances_(self):
    total_sum = np.zeros((self.n_features, ), dtype=np.float64)
    for stage in self.estimators_:
        stage_sum = sum(tree.feature_importances_
                        for tree in stage) / len(stage)
        total_sum += stage_sum
    importances = total_sum / len(self.estimators_)
    return importances
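
For context, here is a minimal usage sketch of where this property surfaces in practice; the dataset and hyperparameters are purely illustrative, while feature_importances_ itself is scikit-learn's public attribute:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative setup: fit a booster on a built-in binary dataset and read
# off the per-feature importance scores discussed above.
X, y = load_breast_cancer(return_X_y=True)
clf = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)

print(clf.feature_importances_.shape)   # one score per input feature
print(clf.feature_importances_.sum())   # the scores are normalized to sum to 1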

This is pretty easy to understand. self.estimators_ is an array containing the individual trees in the booster, so the for loop is iterating over the individual trees. There's one hiccup with the

stage_sum = sum(tree.feature_importances_
                for tree in stage) / len(stage)

this is taking care of the non-binary response case. Here we fit multiple trees in each stage in a one-vs-all way; a short shape check after the next snippet illustrates this. It's simplest conceptually to focus on the binary case, where the sum has one summand, and this is just tree.feature_importances_. So in the binary case, we can rewrite this all as

def feature_importances_(self):
    total_sum = np.zeros((self.n_features, ), dtype=np.float64)
    for tree in self.estimators_:
        total_sum += tree.feature_importances_
    importances = total_sum / len(self.estimators_)
    return importances
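
To make the one-vs-all structure and the averaging concrete, here is an illustrative check; the multiclass fit below is an assumption made for the example, and clf is the illustrative binary model from the earlier sketch:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

# One-vs-all structure: on a 3-class problem each boosting stage fits one
# regression tree per class, so estimators_ is a 2D array of trees.
X_iris, y_iris = load_iris(return_X_y=True)
clf_multi = GradientBoostingClassifier(n_estimators=10).fit(X_iris, y_iris)
print(clf_multi.estimators_.shape)   # (10, 3): 10 stages x 3 per-class trees

# Averaging by hand: flatten the array of trees and average their per-tree
# importances, mirroring the loop in feature_importances_ above.
manual = np.mean(
    [tree.feature_importances_ for tree in clf.estimators_.ravel()], axis=0)
# In the scikit-learn version quoted in this answer the result matches
# clf.feature_importances_; newer releases average unnormalized per-tree
# scores before renormalizing, so the values can differ slightly.
print(np.allclose(manual, clf.feature_importances_))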

So, in words, sum up the feature importances of the individual trees, then divide by the total number of trees. It remains to see how to calculate the feature importances for a single tree.

The importance calculation of a tree is implemented at the Cython level, but it's still followable. Here's a cleaned-up version of the code

cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    while node != end_node:
        if node.left_child != _TREE_LEAF:
            # ... and node.right_child != _TREE_LEAF:
            left = &nodes[node.left_child]
            right = &nodes[node.right_child]
            importance_data[node.feature] += (
                node.weighted_n_node_samples * node.impurity -
                left.weighted_n_node_samples * left.impurity -
                right.weighted_n_node_samples * right.impurity)
        node += 1
    importances /= nodes[0].weighted_n_node_samples
    return importances

This is pretty simple. Iterate through the nodes of the tree. As long as you are not at a leaf node, calculate the weighted reduction in node impurity from the split at this node, and attribute it to the feature that was split on

importance_data[node.feature] += (
    node.weighted_n_node_samples * node.impurity -
    left.weighted_n_node_samples * left.impurity -
    right.weighted_n_node_samples * right.impurity)

Then, when done, divide it all by the total weight of the data (in most cases, the number of observations; a quick check of this follows the snippet below)

importances /= nodes[0].weighted_n_node_samples
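
A small sanity check of that parenthetical, reusing the illustrative clf and X from the earlier sketch and assuming the default subsample=1.0 with no sample weights:

# With uniform sample weights, the root's weighted sample count is simply
# the number of training rows, so the division above is just an average.
t = clf.estimators_[0, 0].tree_
print(t.weighted_n_node_samples[0], X.shape[0])   # both equal len(X)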

It's worth recalling that the impurity is a common metric to use when determining what split to make when growing a tree. In that light, we are simply summing up how much splitting on each feature allowed us to reduce the impurity across all the splits in the tree.
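
To make the Cython loop concrete, here is a pure-Python sketch of the same per-tree computation, written against the arrays a fitted Tree object exposes (children_left, children_right, feature, impurity, weighted_n_node_samples); the function name is made up for illustration, and the real per-tree property additionally normalizes the scores to sum to 1, which the normalize flag mimics here:

import numpy as np

def single_tree_importances(tree_, normalize=True):
    """Weighted impurity decrease per feature for one fitted tree.

    tree_ is the low-level Tree object, e.g. clf.estimators_[0, 0].tree_.
    """
    importances = np.zeros(tree_.n_features, dtype=np.float64)
    left, right = tree_.children_left, tree_.children_right
    weight, impurity = tree_.weighted_n_node_samples, tree_.impurity
    for node in range(tree_.node_count):
        if left[node] == -1:          # leaf node (TREE_LEAF == -1): no split here
            continue
        l, r = left[node], right[node]
        importances[tree_.feature[node]] += (
            weight[node] * impurity[node]
            - weight[l] * impurity[l]
            - weight[r] * impurity[r])
    importances /= weight[0]          # divide by the total weight at the root
    if normalize and importances.sum() > 0:
        importances /= importances.sum()
    return importances

Calling this on clf.estimators_[0, 0].tree_ should agree with clf.estimators_[0, 0].feature_importances_, since the per-tree property normalizes by default.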
