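A Simple Proof for the AdaBoost Algorithm
This article gives a simple, elementary proof of the training-error bound for the AdaBoost algorithm in machine learning. The only mathematical tool required is Calculus I.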
AdaBoost is a powerful algorithm for building predictive models. A major disadvantage, however, is that AdaBoost may over-fit in the presence of noise. Freund, Y. & Schapire, R. E. (1997) proved that the training error of the ensemble is bounded by the following expression: \begin{equation}\label{ada1}e_{ensemble}\le \prod_{t}2\cdot\sqrt{\epsilon_t\cdot(1-\epsilon_t)} \end{equation} where $\epsilon_t$ is the weighted error rate of the base classifier chosen in round $t$. If the error rate is less than 0.5, we can write $\epsilon_t=0.5-\gamma_t$, where $\gamma_t>0$ measures how much better the classifier is than random guessing (on binary problems). The bound on the training error of the ensemble becomes \begin{equation}\label{ada2} e_{ensemble}\le \prod_{t}\sqrt{1-4{\gamma_t}^2}\le e^{-2\sum_{t}{\gamma_t}^2} \end{equation} Thus if each base classifier is slightly better than random, so that $\gamma_t\ge\gamma$ for some $\gamma>0$, the training error drops exponentially fast in the number of rounds. Nevertheless, because it tends to focus on the training examples that are misclassified, AdaBoost can be quite susceptible to over-fitting. We give a new, simple proof of \ref{ada1} and \ref{ada2}; in addition, we explain why the parameter is chosen as $\alpha_t=\frac{1}{2}\cdot\log\frac{1-\epsilon_t}{\epsilon_t}$ in the boosting algorithm.
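As a quick numerical illustration of \ref{ada1} and \ref{ada2}, the following Python sketch evaluates both bounds for a given list of error rates. The function name `training_error_bounds` and the example error rates are assumptions made for illustration only; they do not come from Freund and Schapire.

```python
import math

def training_error_bounds(epsilons):
    """Return (product bound, exponential bound) for a list of error rates eps_t."""
    product_bound = 1.0
    gamma_sq_sum = 0.0
    for eps in epsilons:
        # Factor 2*sqrt(eps_t*(1-eps_t)) from the first bound.
        product_bound *= 2.0 * math.sqrt(eps * (1.0 - eps))
        # gamma_t = 0.5 - eps_t is the edge over random guessing.
        gamma_sq_sum += (0.5 - eps) ** 2
    return product_bound, math.exp(-2.0 * gamma_sq_sum)

# Hypothetical example: 50 rounds, every weak classifier has error 0.4 (gamma = 0.1).
prod_bound, exp_bound = training_error_bounds([0.4] * 50)
print(prod_bound, exp_bound)
```

In this example the product bound is the tighter of the two, and both shrink as more rounds are added.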
AdaBoost Algorithm:
Recall the boosting algorithm:
Given $(x_1, y_1), (x_2, y_2), \cdots, (x_m, y_m)$, where $x_i\in X, y_i\in Y=\{-1, +1\}$.
Initialize $$D_1(i)=\frac{1}{m}$$
For $t=1, 2, \ldots, T$:
Train a weak learner using distribution $D_t$.
Get a weak hypothesis $h_t: X\rightarrow \{-1, +1\}$ with error \[\epsilon_t=\Pr_{i\sim D_t}[h_t (x_i)\ne y_i]\] If $\epsilon_t >0.5$, the weights $D_t (i)$ are reset to their original uniform values $\frac{1}{m}$.
Choose \begin{equation}\label{boost3} \alpha_t=\frac{1}{2}\cdot \log\frac{1-\epsilon_t}{\epsilon_t} \end{equation}
Update \begin{equation}\label{boost4} D_{t+1}(i)=\frac{D_{t}(i)}{Z_t}\times \left\{\begin{array}{c c} e^{-\alpha_t} & \quad \textrm{if $h_t(x_i)=y_i$}\\ e^{\alpha_t} & \quad \textrm{if $h_t(x_i)\ne y_i$} \end{array} \right. \end{equation} where $Z_t$ is a normalization factor.
Output the final hypothesis: \[H(x)=\text{sign}\left(\sum_{t=1}^{T}\alpha_t\cdot h_t(x)\right)\]
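The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation. It assumes one-dimensional inputs and uses threshold "stumps" as weak learners. The names `best_stump`, `adaboost` and `predict` and the toy data set are hypothetical. As a simplification, the loop stops when $\epsilon_t=0$ or $\epsilon_t\ge0.5$ instead of resetting the weights. The sketch also tracks the product of the $Z_t$, which the proof below shows to be an upper bound on the training error.

```python
import math

def best_stump(xs, ys, weights):
    """Exhaustively pick the 1-D threshold stump with the lowest weighted error."""
    best = None
    points = sorted(set(xs))
    # Candidate thresholds: below the smallest point and midway between neighbours.
    thresholds = [points[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(points, points[1:])]
    for thr in thresholds:
        for sign in (+1, -1):
            preds = [sign if x > thr else -sign for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(xs, ys, rounds):
    m = len(xs)
    weights = [1.0 / m] * m          # D_1(i) = 1/m
    ensemble = []                    # list of (alpha_t, threshold, sign)
    z_product = 1.0                  # running product of the Z_t
    for _ in range(rounds):
        eps, thr, sign = best_stump(xs, ys, weights)
        if eps <= 0.0 or eps >= 0.5:
            break                    # simplification: stop instead of resetting weights
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        preds = [sign if x > thr else -sign for x in xs]
        # Unnormalised update: factor e^{-alpha} if correct, e^{alpha} if wrong.
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        z = sum(weights)             # normalisation factor Z_t
        weights = [w / z for w in weights]
        z_product *= z
        ensemble.append((alpha, thr, sign))
    return ensemble, z_product

def predict(ensemble, x):
    """Final hypothesis H(x) = sign(sum_t alpha_t * h_t(x)); ties map to -1 here."""
    score = sum(a * (s if x > thr else -s) for a, thr, s in ensemble)
    return 1 if score > 0 else -1

# Hypothetical toy data: ten points on a line that no single stump separates.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble, z_product = adaboost(xs, ys, rounds=3)
train_error = sum(predict(ensemble, x) != y for x, y in zip(xs, ys)) / len(xs)
print(train_error, z_product)
```

The printed training error should never exceed the printed product of the $Z_t$.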
Proof:
First we prove \ref{ada1}. Note that $D_{t+1}$ is a distribution, so its sum $\sum_{i}D_{t+1}(i)$ equals 1, hence \[Z_t=\sum_{i}D_{t+1}(i)\cdot Z_t=\sum_{i}D_t(i)\times \left\{\begin{array}{c c} e^{-\alpha_t} & \quad \textrm{if $h_t(x_i)=y_i$}\\ e^{\alpha_t} & \quad \textrm{if $h_t(x_i)\ne y_i$} \end{array} \right.\] \[=\sum_{i:\ h_t(x_i)=y_i}D_t(i)\cdot e^{-\alpha_t}+\sum_{i:\ h_t(x_i)\ne y_i}D_t(i)\cdot e^{\alpha_t}\] \[=e^{-\alpha_t}\cdot \sum_{i:\ h_t(x_i)=y_i}D_t(i)+e^{\alpha_t}\cdot \sum_{i:\ h_t(x_i)\ne y_i}D_t(i)\] \begin{equation}\label{boost5} =e^{-\alpha_t}\cdot (1-\epsilon_t)+e^{\alpha_t}\cdot \epsilon_t \end{equation}
To choose $\alpha_t$, we minimize $Z_t$ by setting its first derivative with respect to $\alpha_t$ equal to 0: \[{[e^{-\alpha_t}\cdot (1-\epsilon_t)+e^{\alpha_t}\cdot \epsilon_t]}^{'}=-e^{-\alpha_t}\cdot (1-\epsilon_t)+e^{\alpha_t}\cdot \epsilon_t=0\] \[\Rightarrow \alpha_t=\frac{1}{2}\cdot \log\frac{1-\epsilon_t}{\epsilon_t}\] which is \ref{boost3} in the boosting algorithm. The second derivative, $e^{-\alpha_t}\cdot (1-\epsilon_t)+e^{\alpha_t}\cdot \epsilon_t$, is positive, so this stationary point is indeed the minimum. Substituting $\alpha_t$ back into \ref{boost5} gives \[Z_t=e^{-\alpha_t}\cdot (1-\epsilon_t)+e^{\alpha_t}\cdot \epsilon_t=e^{-\frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}}\cdot (1-\epsilon_t)+e^{\frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}}\cdot\epsilon_t\] \[=\sqrt{\epsilon_t\cdot(1-\epsilon_t)}+\sqrt{\epsilon_t\cdot(1-\epsilon_t)}\] \begin{equation}\label{boost6} =2\sqrt{\epsilon_t\cdot(1-\epsilon_t)} \end{equation}
On the other hand, \ref{boost4} can be rewritten as \[D_{t+1}(i)=\frac{D_t(i)\cdot e^{-\alpha_t\cdot y_i\cdot h_t(x_i)}}{Z_t}=\frac{D_t(i)\cdot e^{K_t}}{Z_t},\quad K_t=-\alpha_t\cdot y_i\cdot h_t(x_i),\] because the product $y_i\cdot h_t(x_i)$ is $1$ if $h_t (x_i )=y_i$ and $-1$ if $h_t (x_i )\ne y_i$. Writing out this equation for every round gives \[D_1(i)=\frac{1}{m}\] \[D_2(i)=\frac{D_1(i)\cdot e^{K_1}}{Z_1}\] \[D_3(i)=\frac{D_2(i)\cdot e^{K_2}}{Z_2}\] \[\ldots\ldots\ldots\] \[D_{t+1}(i)=\frac{D_t(i)\cdot e^{K_t}}{Z_t}\] Multiplying all the equalities above, the intermediate distributions cancel and we obtain \[D_{t+1}(i)=\frac{1}{m}\cdot\frac{e^{-y_i\cdot f(x_i)}}{\prod_{t}Z_t}\] where $f(x_i)=\sum_{t}\alpha_t\cdot h_t(x_i)$, and the sum and the product run over all rounds performed.
Thus, summing over $i$ and using $\sum_{i}D_{t+1}(i)=1$, \begin{equation}\label{boost7} \frac{1}{m}\cdot \sum_{i}e^{-y_i\cdot f(x_i)}=\sum_{i}D_{t+1}(i)\cdot\prod_{t}Z_t=\prod_{t}Z_t \end{equation} Note that if $\epsilon_t>0.5$ the data set is re-sampled until $\epsilon_t\le0.5$. In other words, the parameter $\alpha_t\ge0$ in every valid iteration. The training error of the ensemble can be expressed as \[e_{ensemble}=\frac{1}{m}\cdot\sum_{i}\left\{\begin{array}{c c} 1 & \quad \textrm{if $y_i\ne H(x_i)$}\\ 0 & \quad \textrm{if $y_i=H(x_i)$} \end{array} \right. \le\frac{1}{m}\cdot \sum_{i}\left\{\begin{array}{c c} 1 & \quad \textrm{if $y_i\cdot f(x_i)\le0$}\\ 0 & \quad \textrm{if $y_i\cdot f(x_i)>0$} \end{array} \right.\] \begin{equation}\label{boost8} \le\frac{1}{m}\cdot\sum_{i}e^{-y_i\cdot f(x_i)}=\prod_{t}Z_t \end{equation} The first inequality holds because $H(x_i)\ne y_i$ implies $y_i\cdot f(x_i)\le0$. The second holds because $e^{-z}\ge1$ when $z\le0$ and $e^{-z}>0$ otherwise. The last step follows from \ref{boost7}. Combining \ref{boost6} and \ref{boost8}, we have proved \ref{ada1}: \begin{equation}\label{boost9} e_{ensemble}\le \prod_{t}2\cdot\sqrt{\epsilon_t\cdot(1-\epsilon_t)} \end{equation}
To prove \ref{ada2}, we first prove the following inequality: \begin{equation}\label{boost10} 1+x\le e^x \end{equation} or, equivalently, $e^x-x-1\ge0$. Let $g(x)=e^x-x-1$, then \[g^{'}(x)=e^x-1=0\Rightarrow x=0\] Since $g^{''}(x)=e^x>0$, this stationary point is the global minimum, so \[{g(x)}_{min}=g(0)=0\Rightarrow e^x-x-1\ge0\] as desired.
Now we go back to \ref{boost9} and let \[\epsilon_t=\frac{1}{2}-\gamma_t\] where $\gamma_t$ measures how much better the classifier is than random guessing (on binary problems). Then $\epsilon_t\cdot(1-\epsilon_t)=\frac{1}{4}-\gamma_t^2$. Applying \ref{boost10} with $x=-4\gamma_t^2$, and noting that $1-4\gamma_t^2\ge0$ so that taking square roots preserves the inequality, we have \[e_{ensemble}\le\prod_{t}2\cdot\sqrt{\epsilon_t\cdot(1-\epsilon_t)}\] \[=\prod_{t}\sqrt{1-4\gamma_t^2}\] \[=\prod_{t}[1+(-4\gamma_t^2)]^{\frac{1}{2}}\] \[\le\prod_{t}(e^{-4\gamma_t^2})^\frac{1}{2}=\prod_{t}e^{-2\gamma_t^2}\] \[=e^{-2\cdot\sum_{t}\gamma_t^2}\] as desired.
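The two calculus steps in the proof can also be checked numerically. The sketch below is only a sanity check under assumed sample values of $\epsilon_t$, and the helper names `z` and `check` are hypothetical. It verifies that $\alpha_t=\frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$ gives $Z_t=2\sqrt{\epsilon_t(1-\epsilon_t)}$, that nearby values of $\alpha_t$ do not give a smaller $Z_t$, and that $\sqrt{1-4\gamma_t^2}\le e^{-2\gamma_t^2}$.

```python
import math

def z(alpha, eps):
    """Normalisation factor Z_t as a function of alpha for a fixed error eps."""
    return math.exp(-alpha) * (1.0 - eps) + math.exp(alpha) * eps

def check(eps, step=1e-3):
    """Check the closed form, local minimality and the exponential bound for one eps."""
    alpha_star = 0.5 * math.log((1.0 - eps) / eps)
    z_star = z(alpha_star, eps)
    gamma = 0.5 - eps
    return {
        "closed_form_ok": math.isclose(z_star, 2.0 * math.sqrt(eps * (1.0 - eps))),
        "is_local_min": z_star <= z(alpha_star - step, eps)
                        and z_star <= z(alpha_star + step, eps),
        "exp_bound_ok": math.sqrt(1.0 - 4.0 * gamma ** 2) <= math.exp(-2.0 * gamma ** 2),
    }

# Assumed sample error rates, all strictly between 0 and 0.5.
for eps in (0.05, 0.2, 0.35, 0.49):
    print(eps, check(eps))
```

A numerical check of this kind does not replace the proof; it only guards against algebra slips.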