GBDT and Random Forests
author:yangjing
time:2018-10-22
Gradient-boosted decision trees (GBDT)
1. Main idea
The main idea behind GBDT is to combine many simple models (also known as weak learners), like shallow trees. Each tree can only provide good predictions on part of the data, so more and more trees are added to iteratively improve performance.
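To make the iterative idea concrete, here is a minimal sketch of boosting for regression on synthetic data: each new shallow tree is fit to the residuals left by the trees before it, and the ensemble prediction is the running sum of all trees. This illustrates the principle only; it is not scikit-learn's exact implementation, which adds a proper loss function and more.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

prediction = np.zeros_like(y)  # start from a trivial model that predicts 0
for _ in range(100):
    residual = y - prediction                            # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    prediction += 0.1 * tree.predict(X)                  # 0.1 plays the role of the learning rate

print("training MSE:", np.mean((y - prediction) ** 2))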
2. Parameter settings
The algorithm is a bit more sensitive to parameter settings than random forests, but can provide better accuracy if the parameters are set correctly.
- n_estimators (number of trees)
Increasing n_estimators also increases the model complexity, since the model has more chances to correct mistakes on the training set.
- learning_rate
Controls how strongly each tree tries to correct the mistakes of the previous trees. A higher learning rate means each tree can make stronger corrections, allowing for more complex models.
- max_depth (or alternatively max_leaf_nodes)
Usually max_depth is set very low for gradient-boosted models, often not deeper than five splits.
A sketch varying these three parameters is shown right after this list.
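A small sketch of how these parameters trade off, using the breast cancer data that the next section works with. The settings below are illustrative only, not tuned:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# shallower trees and a lower learning rate both reduce model
# complexity, which can help when the default model overfits
for params in [{}, {"max_depth": 1}, {"learning_rate": 0.01}]:
    gbrt = GradientBoostingClassifier(random_state=0, **params)
    gbrt.fit(X_train, y_train)
    print(params,
          "train:", gbrt.score(X_train, y_train),
          "test:", gbrt.score(X_test, y_test))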
3. Code
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# load the breast cancer dataset and split it into train and test sets
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

# fit a gradient-boosted ensemble with the default parameters,
# then evaluate it on the test set
gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train)
gbrt.score(X_test, y_test)

Running these lines in an IPython session gives:

In [261]: gbrt.score(X_test, y_test)
Out[261]: 0.958041958041958
In [262]: gbrt.feature_importances_
Out[262]:
array([0.01337291, 0.04201687, 0.0208666 , 0.01889077, 0.01028091,
0.03215986, 0.02074619, 0.11678956, 0.00820024, 0.00074312,
0.02042134, 0.00680047, 0.02023052, 0.03907398, 0.05406751,
0.04795741, 0.02358101, 0.00934718, 0.00593481, 0.0239241 ,
0.05354265, 0.06160083, 0.10961728, 0.07395201, 0.01867851,
0.03842953, 0.01915824, 0.07128703, 0.01773659, 0.00059199])
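The raw array is hard to read by itself. A short continuation of the session above (reusing cancer and gbrt) that pairs the largest importances with their feature names:

import numpy as np

# rank the features by importance and print the top five
order = np.argsort(gbrt.feature_importances_)[::-1]
for i in order[:5]:
    print(cancer.feature_names[i], round(gbrt.feature_importances_[i], 3))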
In [263]: gbrt.learning_rate
Out[263]: 0.1
In [264]: gbrt.max_depth
Out[264]: 3
In [265]: len(gbrt.estimators_)
Out[265]: 100
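The fitted ensemble holds 100 trees, added one at a time. staged_predict can show how test accuracy evolved as trees were added; a short sketch repeating the setup above:

from sklearn.metrics import accuracy_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
gbrt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# test-set accuracy after each additional tree joins the ensemble
staged = [accuracy_score(y_test, y_pred) for y_pred in gbrt.staged_predict(X_test)]
print("after 10 trees:", staged[9])
print("after 100 trees:", staged[-1])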
In [272]: gbrt.get_params()
Out[272]:
{'criterion': 'friedman_mse',
'init': None,
'learning_rate': 0.1,
'loss': 'deviance',
'max_depth': 3,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'presort': 'auto',
'random_state': 0,
'subsample': 1.0,
'verbose': 0,
'warm_start': False}
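Any of these defaults can be overridden in the constructor or afterwards via set_params. As a hedged sketch (the values are illustrative only): setting subsample below 1.0 makes each tree see only a random fraction of the training rows, which is known as stochastic gradient boosting.

from sklearn.ensemble import GradientBoostingClassifier

gbrt = GradientBoostingClassifier(random_state=0, n_estimators=200)
# subsample=0.8 fits each tree on a random 80% of the training rows
gbrt.set_params(learning_rate=0.05, subsample=0.8)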
Random forest
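The transcript below picks up mid-session, so X and y are never defined in the text; from the outputs they are a 100-sample two-class dataset. A minimal sketch of a comparable setup, assuming the two-moons toy data (an assumption, since the original data loading is not shown), including the RandomForestClassifier import the session relies on:

# assumption: the original session does not show how X, y were built;
# a 100-sample two-class toy set such as make_moons matches the shapes below
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier  # used at In [235] below

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)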
In [230]: y
Out[230]:
array([1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0,
0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0], dtype=int64)
(In [231], axes.ravel() showed a grid of six matplotlib subplots, presumably used to plot the decision boundaries of the five individual trees and of the forest; the plotting code and figure are not reproduced here.)
In [232]: from sklearn.model_selection import train_test_split
In [233]: X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
In [234]: len(X_train)
Out[234]: 75
In [235]: forest = RandomForestClassifier(n_estimators=5, random_state=2)
In [236]: forest.fit(X_train, y_train)
Out[236]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=1,
oob_score=False, random_state=2, verbose=0, warm_start=False)
In [237]: forest.score(X_test, y_test)
Out[237]: 0.92
In [238]: forest.estimators_
Out[238]:
[DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=1872583848, splitter='best'),
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=794921487, splitter='best'),
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=111352301, splitter='best'),
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=1853453896, splitter='best'),
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=213298710, splitter='best')]
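Each entry in estimators_ is a fully grown decision tree; the forest averages their predicted class probabilities (soft voting). A short continuation (reusing forest, X_test, and y_test from the session above) comparing each tree to the ensemble:

# each tree alone is typically weaker and more overfit than the ensemble
for i, tree in enumerate(forest.estimators_):
    print("tree", i, "test accuracy:", tree.score(X_test, y_test))
print("forest test accuracy:", forest.score(X_test, y_test))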