来源于stack overflow,其实就是计算每个特征对于降低特征不纯度的贡献了多少,降低越多的,说明feature越重要

I'll use the sklearn code, as it is generally much cleaner than the R code.

Here's the implementation of the feature_importances property of the GradientBoostingClassifier (I removed some lines of code that get in the way of the conceptual stuff)

def feature_importances_(self):
total_sum = np.zeros((self.n_features, ), dtype=np.float64)
for stage in self.estimators_:
stage_sum = sum(tree.feature_importances_
for tree in stage) / len(stage)
total_sum += stage_sum importances = total_sum / len(self.estimators_)
return importances

This is pretty easy to understand. self.estimators_ is an array containing the individual trees in the booster, so the for loop is iterating over the individual trees. There's one hickup with the

stage_sum = sum(tree.feature_importances_
for tree in stage) / len(stage)

this is taking care of the non-binary response case. Here we fit multiple trees in each stage in a one-vs-all way. Its simplest conceptually to focus on the binary case, where the sum has one summand, and this is just tree.feature_importances_. So in the binary case, we can rewrite this all as

def feature_importances_(self):
total_sum = np.zeros((self.n_features, ), dtype=np.float64)
for tree in self.estimators_:
total_sum += tree.feature_importances_
importances = total_sum / len(self.estimators_)
return importances

So, in words, sum up the feature importances of the individual trees, then divide by the total number of trees. It remains to see how to calculate the feature importances for a single tree.

The importance calculation of a tree is implemented at the cython level, but it's still followable. Here's a cleaned up version of the code

cpdef compute_feature_importances(self, normalize=True):
"""Computes the importance of each feature (aka variable).""" while node != end_node:
if node.left_child != _TREE_LEAF:
# ... and node.right_child != _TREE_LEAF:
left = &nodes[node.left_child]
right = &nodes[node.right_child] importance_data[node.feature] += (
node.weighted_n_node_samples * node.impurity -
left.weighted_n_node_samples * left.impurity -
right.weighted_n_node_samples * right.impurity)
node += 1 importances /= nodes[0].weighted_n_node_samples return importances

This is pretty simple. Iterate through the nodes of the tree. As long as you are not at a leaf node, calculate the weighted reduction in node purity from the split at this node, and attribute it to the feature that was split on

importance_data[node.feature] += (
node.weighted_n_node_samples * node.impurity -
left.weighted_n_node_samples * left.impurity -
right.weighted_n_node_samples * right.impurity)

Then, when done, divide it all by the total weight of the data (in most cases, the number of observations)

importances /= nodes[0].weighted_n_node_samples

It's worth recalling that the impurity is a common metric to use when determining what split to make when growing a tree. In that light, we are simply summing up how much splitting on each feature allowed us to reduce the impurity across all the splits in the tree.

gbdt和xgboost中feature importance的获取的更多相关文章

  1. arcgisJs之featureLayer中feature的获取

    arcgisJs之featureLayer中feature的获取 在featureLayer中source可以获取到一个Graphic数组,但是这个数组属于原数据数组.当使用 applyEdits修改 ...

  2. XGBoost中参数调整的完整指南(包含Python中的代码)

    (搬运)XGBoost中参数调整的完整指南(包含Python中的代码) AARSHAY JAIN, 2016年3月1日     介绍 如果事情不适合预测建模,请使用XGboost.XGBoost算法已 ...

  3. 一步一步理解GB、GBDT、xgboost

    GBDT和xgboost在竞赛和工业界使用都非常频繁,能有效的应用到分类.回归.排序问题,虽然使用起来不难,但是要能完整的理解还是有一点麻烦的.本文尝试一步一步梳理GB.GBDT.xgboost,它们 ...

  4. GBDT,Adaboosting概念区分 GBDT与xgboost区别

    http://blog.csdn.net/w28971023/article/details/8240756 ============================================= ...

  5. 机器学习(八)—GBDT 与 XGBOOST

    RF.GBDT和XGBoost都属于集成学习(Ensemble Learning),集成学习的目的是通过结合多个基学习器的预测结果来改善单个学习器的泛化能力和鲁棒性.  根据个体学习器的生成方式,目前 ...

  6. GB、GBDT、XGboost理解

    GBDT和xgboost在竞赛和工业界使用都非常频繁,能有效的应用到分类.回归.排序问题,虽然使用起来不难,但是要能完整的理解还是有一点麻烦的.本文尝试一步一步梳理GB.GBDT.xgboost,它们 ...

  7. 提升学习算法简述:AdaBoost, GBDT和XGBoost

    1. 历史及演进 提升学习算法,又常常被称为Boosting,其主要思想是集成多个弱分类器,然后线性组合成为强分类器.为什么弱分类算法可以通过线性组合形成强分类算法?其实这是有一定的理论基础的.198 ...

  8. 机器学习总结(一) Adaboost,GBDT和XGboost算法

    一: 提升方法概述 提升方法是一种常用的统计学习方法,其实就是将多个弱学习器提升(boost)为一个强学习器的算法.其工作机制是通过一个弱学习算法,从初始训练集中训练出一个弱学习器,再根据弱学习器的表 ...

  9. 机器学习算法总结(四)——GBDT与XGBOOST

    Boosting方法实际上是采用加法模型与前向分布算法.在上一篇提到的Adaboost算法也可以用加法模型和前向分布算法来表示.以决策树为基学习器的提升方法称为提升树(Boosting Tree).对 ...

随机推荐

  1. ACID和CAP, BASE

      ACID:关系型数据库中事务的4个属性:   Atomicity,原子性,整个事务的所有操作,要么全部完成,要么全部不完成,不可能停滞在中间的某个环节.事务在执行过程中出错,会回滚到事务开始前的状 ...

  2. 【Learning】一步步地解释Link-cut Tree

    简介 Link-cut Tree,简称LCT. 干什么的?它是树链剖分的升级版,可以看做是动态的树剖. 树剖专攻静态树问题:LCT专攻动态树问题,因为此时的树剖面对动态树问题已经无能为力了(动态树问题 ...

  3. Hive(四)hive函数与hive shell

    一.hive函数 1.hive内置函数 (1)内容较多,见< Hive 官方文档>            https://cwiki.apache.org/confluence/displ ...

  4. bzoj3192: [JLOI2013]删除物品(树状数组)

    既然要从一个堆的堆顶按顺序拿出来放到第二个堆的堆顶,那么我们就可以把两个堆顶怼在一起,这样从一个堆拿到另一个堆只需要移动指针就好了. 换句话说,把1~n倒着,n+1到n+m正着,用一个指针把两个序列分 ...

  5. openresty--centos7下开发环境安装

    1. 安装依赖的软件包 yum install readline-devel pcre-devel openssl-devel gcc 2. 安装openresty -- 1. 下载openresty ...

  6. laravel 5.1 Eloquent常见问题

    1.新增一条记录以及判断是否新增成功 $instance = XxxModel::create(['a' => 1, 'b' => 2]); if ($instance->exist ...

  7. python基础7--集合

    集合set Python的set集合是一个无序不重复元素集.基本功能包括关系测试和消除重复元素.集合对象还支持union(并集).intersection(交集).difference(差集) 和 s ...

  8. 深入分析Java中的 == 和equals

    关于Java中的 == 和equals的解释请看这位博主的文章 :http://www.cnblogs.com/dolphin0520/p/3592500.html 以下是我对这篇文章的一些扩展. 对 ...

  9. [DeeplearningAI笔记]序列模型1.10-1.12LSTM/BRNN/DeepRNN

    5.1循环序列模型 觉得有用的话,欢迎一起讨论相互学习~Follow Me 1.10长短期记忆网络(Long short term memory)LSTM Hochreiter S, Schmidhu ...

  10. Excel 报表导入导出

    使用 Excel 进行报表的导入导出,首先下载相关的 jar 和 excel util. Excel Util 下载地址 引入依赖: <!-- poi office --> <dep ...