gbdt和xgboost中feature importance的获取
来源于stack overflow,其实就是计算每个特征对于降低特征不纯度的贡献了多少,降低越多的,说明feature越重要
I'll use the sklearn code, as it is generally much cleaner than the R code.
Here's the implementation of the feature_importances property of the GradientBoostingClassifier (I removed some lines of code that get in the way of the conceptual stuff)
def feature_importances_(self):
total_sum = np.zeros((self.n_features, ), dtype=np.float64)
for stage in self.estimators_:
stage_sum = sum(tree.feature_importances_
for tree in stage) / len(stage)
total_sum += stage_sum
importances = total_sum / len(self.estimators_)
return importances
This is pretty easy to understand. self.estimators_ is an array containing the individual trees in the booster, so the for loop is iterating over the individual trees. There's one hickup with the
stage_sum = sum(tree.feature_importances_
for tree in stage) / len(stage)
this is taking care of the non-binary response case. Here we fit multiple trees in each stage in a one-vs-all way. Its simplest conceptually to focus on the binary case, where the sum has one summand, and this is just tree.feature_importances_. So in the binary case, we can rewrite this all as
def feature_importances_(self):
total_sum = np.zeros((self.n_features, ), dtype=np.float64)
for tree in self.estimators_:
total_sum += tree.feature_importances_
importances = total_sum / len(self.estimators_)
return importances
So, in words, sum up the feature importances of the individual trees, then divide by the total number of trees. It remains to see how to calculate the feature importances for a single tree.
The importance calculation of a tree is implemented at the cython level, but it's still followable. Here's a cleaned up version of the code
cpdef compute_feature_importances(self, normalize=True):
"""Computes the importance of each feature (aka variable)."""
while node != end_node:
if node.left_child != _TREE_LEAF:
# ... and node.right_child != _TREE_LEAF:
left = &nodes[node.left_child]
right = &nodes[node.right_child]
importance_data[node.feature] += (
node.weighted_n_node_samples * node.impurity -
left.weighted_n_node_samples * left.impurity -
right.weighted_n_node_samples * right.impurity)
node += 1
importances /= nodes[0].weighted_n_node_samples
return importances
This is pretty simple. Iterate through the nodes of the tree. As long as you are not at a leaf node, calculate the weighted reduction in node purity from the split at this node, and attribute it to the feature that was split on
importance_data[node.feature] += (
node.weighted_n_node_samples * node.impurity -
left.weighted_n_node_samples * left.impurity -
right.weighted_n_node_samples * right.impurity)
Then, when done, divide it all by the total weight of the data (in most cases, the number of observations)
importances /= nodes[0].weighted_n_node_samples
It's worth recalling that the impurity is a common metric to use when determining what split to make when growing a tree. In that light, we are simply summing up how much splitting on each feature allowed us to reduce the impurity across all the splits in the tree.
gbdt和xgboost中feature importance的获取的更多相关文章
- arcgisJs之featureLayer中feature的获取
arcgisJs之featureLayer中feature的获取 在featureLayer中source可以获取到一个Graphic数组,但是这个数组属于原数据数组.当使用 applyEdits修改 ...
- XGBoost中参数调整的完整指南(包含Python中的代码)
(搬运)XGBoost中参数调整的完整指南(包含Python中的代码) AARSHAY JAIN, 2016年3月1日 介绍 如果事情不适合预测建模,请使用XGboost.XGBoost算法已 ...
- 一步一步理解GB、GBDT、xgboost
GBDT和xgboost在竞赛和工业界使用都非常频繁,能有效的应用到分类.回归.排序问题,虽然使用起来不难,但是要能完整的理解还是有一点麻烦的.本文尝试一步一步梳理GB.GBDT.xgboost,它们 ...
- GBDT,Adaboosting概念区分 GBDT与xgboost区别
http://blog.csdn.net/w28971023/article/details/8240756 ============================================= ...
- 机器学习(八)—GBDT 与 XGBOOST
RF.GBDT和XGBoost都属于集成学习(Ensemble Learning),集成学习的目的是通过结合多个基学习器的预测结果来改善单个学习器的泛化能力和鲁棒性. 根据个体学习器的生成方式,目前 ...
- GB、GBDT、XGboost理解
GBDT和xgboost在竞赛和工业界使用都非常频繁,能有效的应用到分类.回归.排序问题,虽然使用起来不难,但是要能完整的理解还是有一点麻烦的.本文尝试一步一步梳理GB.GBDT.xgboost,它们 ...
- 提升学习算法简述:AdaBoost, GBDT和XGBoost
1. 历史及演进 提升学习算法,又常常被称为Boosting,其主要思想是集成多个弱分类器,然后线性组合成为强分类器.为什么弱分类算法可以通过线性组合形成强分类算法?其实这是有一定的理论基础的.198 ...
- 机器学习总结(一) Adaboost,GBDT和XGboost算法
一: 提升方法概述 提升方法是一种常用的统计学习方法,其实就是将多个弱学习器提升(boost)为一个强学习器的算法.其工作机制是通过一个弱学习算法,从初始训练集中训练出一个弱学习器,再根据弱学习器的表 ...
- 机器学习算法总结(四)——GBDT与XGBOOST
Boosting方法实际上是采用加法模型与前向分布算法.在上一篇提到的Adaboost算法也可以用加法模型和前向分布算法来表示.以决策树为基学习器的提升方法称为提升树(Boosting Tree).对 ...
随机推荐
- MySQL事务提交过程
一.MySQL事务提交过程(一) MySQL作为一种关系型数据库,已被广泛应用到互联网中的诸多项目中.今天我们来讨论下事务的提交过程. 由于mysql插件式存储架构,导致开启binlog后,事务提交实 ...
- hive 导入数据
1.load data load data local inpath "/home/hadoop/userinfo.txt" into table userinfo; " ...
- DataTable 转换 DataSet
DataTable dt = resuylt.Copy(); var dsR = new DataSet(); ds.Tables.Add(dt);
- 【刷题】BZOJ 4199 [Noi2015]品酒大会
Description 一年一度的"幻影阁夏日品酒大会"隆重开幕了.大会包含品尝和趣味挑战两个环节,分别向优胜者颁发"首席品酒家"和"首席猎手&quo ...
- Codeforces Round #424 (Div. 2, rated, based on VK Cup Finals) A 水 B stl C stl D 暴力 E 树状数组
A. Unimodal Array time limit per test 1 second memory limit per test 256 megabytes input standard in ...
- UESTC--1682
原题链接: 分析:求最小周期的应用. #include<cstdio> #include<cstring> #include<cmath> #include< ...
- [NOI2010] 能量采集 (数学)
[NOI2010] 能量采集 题目描述 栋栋有一块长方形的地,他在地上种了一种能量植物,这种植物可以采集太阳光的能量.在这些植物采集能量后,栋栋再使用一个能量汇集机器把这些植物采集到的能量汇集到一起. ...
- crontab 自动执行脚本
crontab -e ================>自动执行某脚本!!!!!!! 1001 ls 1002 cd /home/wwwroot/default/ 1003 ls 1004 cr ...
- Android开发大纲
知识模块 Java Android 计算机网络 算法 书本知识 深入理解Java虚拟机 Java并发编程实战 疯狂Android讲义 Android开发艺术探索 网络知识 面试经验 面试心得与总结-- ...
- nginx 初探 之反向代理
首先要解释的是什么叫做反向代理? 平时我们浏览网页可以输入网址直接访问, 但如果访问国外的网站, 可能就没那么简单('中国特色'), 这时候我们需要配置一个代理服务器, 然后通过此服务器中转来访 ...