From Stack Overflow. In short: feature importance measures how much each feature contributed to reducing node impurity across the trees; the larger the total reduction, the more important the feature.

I'll use the sklearn code, as it is generally much cleaner than the R code.

Here's the implementation of the feature_importances_ property of the GradientBoostingClassifier (I removed some lines of code that get in the way of the conceptual stuff):

def feature_importances_(self):
    total_sum = np.zeros((self.n_features, ), dtype=np.float64)
    for stage in self.estimators_:
        stage_sum = sum(tree.feature_importances_
                        for tree in stage) / len(stage)
        total_sum += stage_sum
    importances = total_sum / len(self.estimators_)
    return importances

This is pretty easy to understand. self.estimators_ is an array containing the individual trees in the booster (one row per boosting stage), so the for loop is iterating over the boosting stages. There's one hiccup with the

stage_sum = sum(tree.feature_importances_
                for tree in stage) / len(stage)

this is taking care of the non-binary response case. Here we fit multiple trees in each stage in a one-vs-all way. It's simplest conceptually to focus on the binary case, where the sum has one summand, and this is just tree.feature_importances_. So in the binary case, we can rewrite this all as

def feature_importances_(self):
    total_sum = np.zeros((self.n_features, ), dtype=np.float64)
    for tree in self.estimators_:
        total_sum += tree.feature_importances_
    importances = total_sum / len(self.estimators_)
    return importances

So, in words, sum up the feature importances of the individual trees, then divide by the total number of trees. It remains to see how to calculate the feature importances for a single tree.
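To see this in action, here is a minimal sketch (mine, not from the answer) that redoes the averaging by hand on a binary problem and compares it against the library's property. The make_classification dataset and the stage[0] indexing are assumptions for illustration; note also that newer sklearn releases average unnormalized per-tree importances and normalize at the end, so the two vectors can differ slightly there.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy binary problem; any dataset works here.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# In the binary case each stage holds exactly one regression tree,
# so clf.estimators_ has shape (n_estimators, 1).
total_sum = np.zeros(X.shape[1])
for stage in clf.estimators_:
    total_sum += stage[0].feature_importances_
manual = total_sum / len(clf.estimators_)

print("manual :", np.round(manual, 4))
print("sklearn:", np.round(clf.feature_importances_, 4))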

The importance calculation for a single tree is implemented at the Cython level, but it's still followable. Here's a cleaned-up version of the code:

cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    while node != end_node:
        if node.left_child != _TREE_LEAF:
            # ... and node.right_child != _TREE_LEAF:
            left = &nodes[node.left_child]
            right = &nodes[node.right_child]
            importance_data[node.feature] += (
                node.weighted_n_node_samples * node.impurity -
                left.weighted_n_node_samples * left.impurity -
                right.weighted_n_node_samples * right.impurity)
        node += 1

    importances /= nodes[0].weighted_n_node_samples
    return importances

This is pretty simple. Iterate through the nodes of the tree. As long as you are not at a leaf node, calculate the weighted reduction in node impurity from the split at this node, and attribute it to the feature that was split on:

importance_data[node.feature] += (
    node.weighted_n_node_samples * node.impurity -
    left.weighted_n_node_samples * left.impurity -
    right.weighted_n_node_samples * right.impurity)

Then, when done, divide it all by the total weight of the data (in most cases, the number of observations)

importances /= nodes[0].weighted_n_node_samples

It's worth recalling that the impurity is a common metric to use when determining what split to make when growing a tree. In that light, we are simply summing up how much splitting on each feature allowed us to reduce the impurity across all the splits in the tree.
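As a sanity check, the same accumulation can be replayed in plain Python against a single fitted sklearn tree, using the public arrays on tree_ (children_left, children_right, feature, impurity, weighted_n_node_samples). This is a sketch of the loop above, not the library code itself; the one extra step is the final normalization, which the cleaned-up listing omits but normalize=True performs.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
t = clf.tree_

importances = np.zeros(t.n_features)
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left != -1:  # -1 is _TREE_LEAF, so this node is a split node
        importances[t.feature[node]] += (
            t.weighted_n_node_samples[node] * t.impurity[node] -
            t.weighted_n_node_samples[left] * t.impurity[left] -
            t.weighted_n_node_samples[right] * t.impurity[right])

importances /= t.weighted_n_node_samples[0]  # divide by total weight at the root
importances /= importances.sum()             # the normalize=True step

print(np.allclose(importances, clf.feature_importances_))  # True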
