An Attempt to Understand Boosting Algorithm(s)

June 26, 2015

By Arthur Charpentier

 
(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

Tuesday, at the annual meeting of the French Economic Association, I was having lunch with Alfred, and while we were chatting about modeling issues (econometric models against machine learning prediction), he asked me what boosting was. Since I could not be very specific, we’ve been looking at the Wikipedia page.

Boosting is a machine learning ensemble meta-algorithm for reducing bias primarily, and also variance, in supervised learning, and a family of machine learning algorithms which convert weak learners to strong ones.

One should admit that it is not very informative. At least, there is the idea that ‘weak learners’ can be used to get a good predictor. Now, to be honest, I guess I understand the concept. But I still can’t reproduce what I got with standard ‘boosting’ packages.

There are a lot of publications about the concept of ‘boosting’. In 1988, Michael Kearns published Thoughts on Hypothesis Boosting, which is probably the oldest one. About the algorithms, it is possible to find some references. Consider for instance Improving Regressors using Boosting Techniques, by Harris Drucker, or The Boosting Approach to Machine Learning: An Overview, by Robert Schapire, among many others. In order to illustrate the use of boosting in the context of regression (and not classification, since I believe it provides a better visualisation), consider the section in Dong-Sheng Cao’s The boosting: A new idea of building models.

In a very general context, consider a model like

$$y_i = m(x_i) + \varepsilon_i$$

The idea is to write it as

$$m(x) = \sum_{k=1}^{M} h_k(x)$$

or, as we will see soon,

$$m(x) = \sum_{k=1}^{M} \alpha_k h_k(x)$$

(where the $\alpha_k$'s will be some shrinkage parameters). To get all the components of that sum, we will use an iterative procedure. Define the partial sum (that will be our prediction at step $k$)

$$m_k(x) = \sum_{j=1}^{k} \alpha_j h_j(x)$$

Since we consider some regression function here, use the $\ell_2$ loss function: to get the $h_k$ function, we solve

$$h_k = \underset{h}{\operatorname{argmin}} \left\{ \sum_{i=1}^{n} \left[ y_i - m_{k-1}(x_i) - h(x_i) \right]^2 \right\}$$

(we can imagine that the loss function can be changed, for instance in the context of classification).

The concept is simple, but from a practical perspective, it is actually a difficult problem, since the optimization is performed here over a very large set (a functional space, actually). One of the tricks will be to use a base of learners. Weak learners. And to make sure that we don’t use learners that are too strong, consider also some shrinkage parameters, as discussed previously. The iterative algorithm is

  • start with some regression model $m_1(x) = \alpha h_1(x)$, where the weak learner $h_1$ is fitted on the observations $(x_i, y_i)$
  • compute the residuals, including some shrinkage parameter, $r_i = y_i - \alpha h_1(x_i)$

then the strategy is to model those residuals

  • at step $k$, consider the regression of the current residuals $r_i$ on the $x_i$'s, which gives a new weak learner $h_k$
  • update the residuals, $r_i \leftarrow r_i - \alpha h_k(x_i)$

and to loop. Then set

$$\widehat{m}(x) = \sum_{k=1}^{M} \alpha\, h_k(x)$$

So far, I guess I understand the concept. The next step is then to write the code to see how it works, for real. One can easily get the intuition that, indeed, it should work, and that we should end up with a decent model. But we have to try it, and play with it, to check whether it performs better than other algorithms.
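As a starting point, the iterative procedure above can be written as a short generic R function. This is just a sketch of the loop as I described it (the names boost and weak_fit are mine, not taken from any package), with the weak learner passed as an argument,

# A minimal sketch of the boosting loop described above (my own illustration):
# 'weak_fit' is any function that fits a weak learner to the current residuals
# and returns an object that predict() understands; 'alpha' is the shrinkage.
boost=function(data,weak_fit,alpha=0.05,M=100){
  r=data$y                      # residuals, initialised at the observations
  pred=rep(0,nrow(data))        # running prediction, the partial sum m_k(x_i)
  for(k in 1:M){
    fit=weak_fit(r,data)        # fit a weak learner to the current residuals
    h=predict(fit,newdata=data)
    pred=pred+alpha*h           # update the partial sum
    r=r-alpha*h                 # update the residuals
  }
  pred
}
# e.g., with a piecewise-linear spline as the weak learner:
# library(splines)
# yhat=boost(df,function(r,d) lm(r~bs(x,degree=1,df=3),data=d))

The concrete implementations below simply unroll this loop, with splines and then regression trees as weak learners.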

Consider the following dataset,

n=300
set.seed(1)
u=sort(runif(n)*2*pi)     # x's drawn uniformly on (0,2*pi)
y=sin(u)+rnorm(n)/4       # true model: sin(x), plus Gaussian noise
df=data.frame(x=u,y=y)

If we visualize it, we get

plot(df)

Consider some linear-by-parts regression models. It could make sense here. At each iteration, there are 7 parameters to ‘estimate’: the slopes and the knots. Here, consider some constant shrinkage parameter (there is no need to start with something too complicated, I guess).
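Before running the loop, it might help to have a quick look at the spline basis itself (this small check is mine, not in the original post); with degree 1 and df=3, bs() uses two interior knots and three piecewise-linear basis functions,

library(splines)
B=bs(df$x,degree=1,df=3)     # the linear B-spline basis used below
attr(B,"knots")              # the two interior knots
matplot(df$x,B,type="l",lty=1,xlab="x",ylab="basis")   # the three basis functions

With that basis, the first (shrunk) step of the boosting procedure is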

v=.05                                   # shrinkage parameter
library(splines)
fit=lm(y~bs(x,degree=1,df=3),data=df)   # first weak learner: piecewise-linear spline
yp=predict(fit,newdata=df)
df$yr=df$y - v*yp                       # residuals after the (shrunk) first step
YP=v*yp                                 # keep track of all the shrunk predictions

I store the residuals in the original dataset (they will be updated at each step), and I keep track of all the predictions. Consider now the following loop

for(t in 1:100){
fit=lm(yr~bs(x,degree=1,df=3),data=df)  # fit a weak learner to the current residuals
yp=predict(fit,newdata=df)
df$yr=df$yr - v*yp                      # update the residuals
YP=cbind(YP,v*yp)                       # store the shrunk prediction of this step
}

This is the implementation of the algorithm described above, right? To visualise it, at some early stage, use

nd=data.frame(x=seq(0,2*pi,by=.01))
viz=function(M){
if(M==1) y=YP[,1]
if(M>1) y=apply(YP[,1:M],1,sum)
plot(df$x,df$y,ylab="",xlab="")
lines(df$x,y,type="l",col="red",lwd=3)
fit=lm(y~bs(x,degree=1,df=3),data=df)
yp=predict(fit,newdata=nd)
lines(nd$x,yp,type="l",col="blue",lwd=3)
lines(nd$x,sin(nd$x),lty=2)}

The red line is the initial guess we have, without boosting, using a simple call to the regression function. The blue one is the one obtained using boosting. The dotted line is the true model.

viz(50)
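It may also help to look at a few other stages of the learning process (these extra calls are mine),

viz(1)     # after the first (shrunk) step only
viz(10)    # after ten iterations
viz(100)   # after one hundred iterations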

Somehow, boosting is working. But even though the knots could differ at each step, it looks like we do not take advantage of it. And we cannot perform better than a simple regression function.

What if we use quadratic splines instead of linear splines?

v=.05
fit=lm(y~bs(x,degree=2,df=3),data=df)
yp=predict(fit,newdata=df)
df$yr=df$y - v*yp
YP=v*yp
library(splines)
for(t in 1:100){
fit=lm(yr~bs(x,degree=2,df=3),data=df)
yp=predict(fit,newdata=df)
df$yr=df$yr - v*yp
YP=cbind(YP,v*yp)
}

Again, boosting is not improving anything here. We will discuss later on the impact of the shrinkage parameter, but here, it will not change the output much (it might be faster or slower to reach the final prediction, but it will always be the same predictive model).

In order to get something different at each step, I tried to add a bootstrap procedure. I don’t know if that is still ‘boosting’, but why not try it.

v=.1
idx=sample(1:n,size=n,replace=TRUE)
fit=lm(y~bs(x,degree=1,df=3),data=df[idx,])
yp=predict(fit,newdata=df)
df$yr=df$y - v*yp
YP=v*yp
 
for(t in 1:100){
idx=sample(1:n,size=n,replace=TRUE)
fit=lm(yr~bs(x,degree=1,df=3),data=df[idx,])
yp=predict(fit,newdata=df)
df$yr=df$yr - v*yp
YP=cbind(YP,v*yp)
}

At each step, I sample (with replacement) from my dataset, and get a linear-by-parts regression. And again, I use a shrinkage parameter so as not to learn too fast.

It is slightly different (if you look very carefully). But actually, an algorithm with the same cost is a ‘bagging’ one, where we bootstrap many samples, get different models and predictions, and then average all the predictions. The (computational) cost is exactly the same here:

YP=NULL
library(splines)
for(t in 1:100){
idx=sample(1:n,size=n,replace=TRUE)
fit=lm(y~bs(x,degree=1,df=3),data=df[idx,])
yp=predict(fit,newdata=nd)
YP=cbind(YP,yp)
}
y=apply(YP[,1:100],1,mean)
plot(df$x,df$y,ylab="",xlab="")
lines(nd$x,y,type="l",col="purple",lwd=3)

It is very close to what we got with the boosting procedure.

Let us try something else. What if we consider, at each step, a regression tree instead of a linear-by-parts regression?

library(rpart)
v=.1
fit=rpart(y~x,data=df)
yp=predict(fit)
df$yr=df$y - v*yp
YP=v*yp
for(t in 1:100){
fit=rpart(yr~x,data=df)
yp=predict(fit,newdata=df)
df$yr=df$yr - v*yp
YP=cbind(YP,v*yp)
}

Again, to visualise the learning process, use

viz=function(M){
y=apply(YP[,1:M],1,sum)
plot(df$x,df$y,ylab="",xlab="")
lines(df$x,y,type="s",col="red",lwd=3)
fit=rpart(y~x,data=df)
yp=predict(fit,newdata=nd)
lines(nd$x,yp,type="s",col="blue",lwd=3)
lines(nd$x,sin(nd$x),lty=2)}
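For instance (these particular calls are mine), at an early and at a late stage of the learning process,

viz(10)    # after ten iterations
viz(100)   # after one hundred iterations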

This time, with those trees, it looks like not only do we have a good model, but also a model different from the one we would get using a single regression tree.

What if we change the shrinkage parameter?

viz=function(v=0.05){
fit=rpart(y~x,data=df)
yp=predict(fit)
df$yr=df$y - v*yp
YP=v*yp
for(t in 1:100){
fit=rpart(yr~x,data=df)
yp=predict(fit,newdata=df)
df$yr=df$yr - v*yp
YP=cbind(YP,v*yp)
}
y=apply(YP,1,sum)
plot(df$x,df$y,xlab="",ylab="")
lines(df$x,y,type="s",col="red",lwd=3)
fit=rpart(y~x,data=df)
yp=predict(fit,newdata=nd)
lines(nd$x,yp,type="s",col="blue",lwd=3)
lines(nd$x,sin(nd$x),lty=2)
}
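For instance (again, these calls are mine), comparing a small and a large shrinkage parameter,

viz(v=0.05)   # small shrinkage parameter, slow learning
viz(v=0.5)    # large shrinkage parameter, fast learning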

There is clearly an impact of that parameter. It has to be small to get a good model. This is the idea of using ‘weak learners’ to get a good prediction.

If we add a bootstrap sample selection, we also get a good predictive model here,

v=.1
idx=sample(1:n,size=n,replace=TRUE)
fit=rpart(y~x,data=df[idx,])
yp=predict(fit,newdata=df)
df$yr=df$y - v*yp
YP=v*yp
for(t in 1:100){
idx=sample(1:n,size=n,replace=TRUE)
fit=rpart(yr~x,data=df[idx,])
yp=predict(fit,newdata=df)
df$yr=df$yr - v*yp
YP=cbind(YP,v*yp)
}

It looks like, using a small shrinkage parameter and a regression tree at each step, we get some ‘weak learners’, and the procedure performs well. But so far, I do not see how it could perform better than standard econometric models. I am still working on it.
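A natural next step, not done here, would be to compare this hand-rolled procedure with a standard implementation. A possible sketch, assuming the gbm package is installed and using parameter values that mimic the loop above, could be

library(gbm)
fit_gbm=gbm(y~x,data=df,distribution="gaussian",
            n.trees=100,shrinkage=.1,
            interaction.depth=1,bag.fraction=.5)   # trees + shrinkage + subsampling
yp_gbm=predict(fit_gbm,newdata=nd,n.trees=100)
plot(df$x,df$y,xlab="",ylab="")
lines(nd$x,yp_gbm,col="red",lwd=3)                 # gbm prediction
lines(nd$x,sin(nd$x),lty=2)                        # true model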
