Naive Bayes Theorem and Application - Theorem
Naive Bayes model:
1. Naive Bayes model: discrete attributes with a finite number of values
2. Parameter density estimation
3. Naive Bayes classification algorithm
4. AutoClass clustering algorithm
\(\textbf{1. Naive Bayes model}\)
In this model we want to estimate \(P(X_1,...,X_n)\) under the assumption that all attributes are independent of each other. This is the same assumption as in k-means, except that this is a discrete model:\[P(X_1,...,X_n)=\prod_{i=1}^n P(X_i)\]Each \(P(X_i)\) can be any distribution you like, e.g. \(\{0.5 : red, 0.2 : blue, 0.3 : yellow\}\).
To simplify the problem, we assume all attributes are Boolean. With no independence assumptions, the model has \(2^n\) states of \(X_1,...,X_n\) and \(2^n-1\) independent parameters. With the independence assumption, the number of parameters drops to \(n\), one \(P(X_i=T)\) per attribute; adding a Boolean class variable on which every attribute depends gives \(2n+1\) parameters in total.
For example, in a classification problem, let \(\theta_C\) be the prior probability of the class, \(P(C=T)\). Then we have \(2n+1\) parameters:
\[\begin{align}
&P(C=T)=\theta_C \notag\\
&P(C=F)=1 - \theta_C \notag\\
&P(X_i=T|C=T)=\theta_{i}^T \notag\\
&P(X_i=F|C=T)=1 - \theta_{i}^T \notag\\
&P(X_i=T|C=F)=\theta_{i}^F \notag\\
&P(X_i=F|C=F)=1 - \theta_{i}^F \notag\\
& \mathbf{\theta}=\langle {\theta_C, \theta_{1}^T,...,\theta_{n}^T,\theta_{1}^F,...,\theta_{n}^F} \rangle \notag\\
\end{align}\]
As you can see above, this yields an enormous saving in the number of parameters. Representing \(P(X_1,..., X_n)\) explicitly suffers from the curse of dimensionality, while \(\prod_{i=1}^n P(X_i)\) does not. The saving results from very strong independence assumptions. The Naive Bayes model performs very well when the assumptions hold, but can perform very badly when the variables are dependent. For example, an NB model performs well on English text but badly on Chinese, where context is more important for understanding the text correctly. So we should be cautious about applying this strong assumption to models whose variables are substantially related.
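To make the saving concrete, here is a small sketch that tabulates both parameter counts for Boolean attributes. The function names are illustrative and not part of the original lecture.

```python
def full_joint_params(n: int) -> int:
    """Independent parameters of an unrestricted joint over n Boolean attributes."""
    return 2 ** n - 1


def naive_bayes_params(n: int) -> int:
    """Parameters of a Boolean Naive Bayes model: theta_C plus theta_i^T and theta_i^F."""
    return 2 * n + 1


# Compare the two counts as the number of attributes grows.
for n in (5, 10, 20, 30):
    print(n, full_joint_params(n), naive_bayes_params(n))
```

The first count grows exponentially in n while the second grows linearly.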
Naive Bayes classifier
For an NB classification problem we need to learn:
1. \(P(X_1,...,X_n|C)=\prod_{i=1}^nP(X_i|C)\) for each class, which assumes that \(X_i\) and \(X_j\) are conditionally independent of each other given C.
2. P(C)
To classify: given \(\bf{x}\), choose c that maximizes:
\[P(c|\mathbf{x})~\propto~P(c)\prod_{i=1}^nP(x_i|c)\]
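The decision rule can be sketched in a few lines for the Boolean case. This is a minimal illustration, not a reference implementation: the parameter values are made up, and the scores are computed in log space (an implementation choice, not something the text prescribes) to avoid underflow when n is large.

```python
import math


def nb_log_score(x, theta_c, theta_given_c):
    """Unnormalized log posterior: log P(c) + sum_i log P(x_i | c)."""
    score = math.log(theta_c)
    for x_i, theta_i in zip(x, theta_given_c):
        # theta_i is P(X_i = T | c); use 1 - theta_i when x_i is False.
        score += math.log(theta_i if x_i else 1.0 - theta_i)
    return score


def nb_classify(x, theta_C, theta_T, theta_F):
    """Return True if class T has the larger score for instance x."""
    score_true = nb_log_score(x, theta_C, theta_T)
    score_false = nb_log_score(x, 1.0 - theta_C, theta_F)
    return score_true >= score_false


# Hypothetical parameters for n = 3 Boolean attributes.
theta_C = 0.4
theta_T = [0.9, 0.8, 0.3]  # P(X_i = T | C = T)
theta_F = [0.2, 0.3, 0.6]  # P(X_i = T | C = F)

print(nb_classify([True, True, False], theta_C, theta_T, theta_F))
print(nb_classify([False, False, True], theta_C, theta_T, theta_F))
```

The normalizing constant \(P(\mathbf{x})\) is never computed, because it is the same for every class and does not affect the argmax.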
The NB classifier is a linear separator. Attributes act independently to produce the classification and do not interact, so, just like perceptrons, NB classifiers cannot capture concepts such as XOR.
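The XOR limitation can be checked directly. The sketch below fits the NB parameters by simple counting on the four XOR examples (helper names are illustrative) and shows that both classes receive the same score for every input, so the classifier has no basis for a decision.

```python
from fractions import Fraction

# The four XOR examples: label is 1 exactly when the two inputs differ.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def ml_estimates(data):
    """Estimate P(c) and P(X_i = 1 | c) as observed fractions."""
    classes = sorted({c for _, c in data})
    n = len(data[0][0])
    prior, cond = {}, {}
    for c in classes:
        rows = [x for x, label in data if label == c]
        prior[c] = Fraction(len(rows), len(data))
        cond[c] = [Fraction(sum(x[i] for x in rows), len(rows)) for i in range(n)]
    return prior, cond


def score(x, c, prior, cond):
    """Unnormalized posterior P(c) * prod_i P(x_i | c)."""
    s = prior[c]
    for x_i, p in zip(x, cond[c]):
        s *= p if x_i == 1 else 1 - p
    return s


prior, cond = ml_estimates(data)
for x, _ in data:
    print(x, score(x, 0, prior, cond), score(x, 1, prior, cond))
```

Each attribute on its own is uninformative about the class here; only the interaction between the two carries information, and NB cannot represent it.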
There is an important point about linearly separable problems. Many real-world domains are not linearly separable, yet even for those domains there may be a pretty good linearly separable hypothesis. We may be better off learning a linearly separable hypothesis than learning a richer hypothesis. This comes from a strong inductive bias, which makes the hypothesis \(\textbf{easier to learn}\).
In the following discussion we assume that the attributes and the class are Boolean, but only to keep the notation simple. Everything generalizes to the case where they have many possible values.
A simple problem:
There is a soccer team, and we have observed a sequence of its games. Based on this, we want to estimate the probability that the team will win a future game. The problem is formulated as follows:
* Variable X has states {f, t} (t = win)
* Parameter \(\theta=P(X=t)\)
* Observations \(X^1=t, X^2=f, X^3=f\)
* These comprise the data \(\bf{D}\)
* Task: estimate \(\theta\)
* Use \(\theta\) to estimate \(P(X^4=t)\)
First we introduce the \(\textbf{maximum likelihood (ML)}\) approach. The likelihood is defined as follows:
* \(\textbf{Likelihood:}\) \(\mathbf{L}(\theta)=P(\mathbf{D}|\theta)=P(X^1,X^2,X^3|\theta)\)
* \(\textbf{ML Principle:}\) Choose \(\theta\) so as to maximize \(\mathbf{L}(\theta)\)
* Assuming the observations are independent given \(\theta\): \(L(\theta)=P(X^1|\theta)P(X^2|\theta)P(X^3|\theta)\)
* \(\textbf{Log likelihood:}\) \[\mathbf{LL}(\theta)=\log P(X^1|\theta)+\log P(X^2|\theta)+\log P(X^3|\theta)\]
* Equivalent ML principle: Choose \(\theta\) so as to maximize \(LL(\theta)\)
In this example:
\(P(X^i=t|\theta)=\theta\)
\(L(\theta)=P(X^1=t, X^2=f, X^3=f|\theta)=\theta(1-\theta)(1-\theta)\)
\(LL(\theta)=\log\theta+2\log(1-\theta)\)
Set the derivative to 0: \(\frac{1}{\theta}~-~\frac{2}{1-\theta} = 0\)
Solve to find: \(\theta=1/3\)
In this example, \(\theta=1/3\) is exactly the fraction of observed games that the team won. This is no coincidence: the ML estimate for the probability of an event is always the fraction of times the event happened. In other words, the ML estimate is exactly the one most suggested by the data. More generally, given observations \(X^1,X^2,...,X^N\), let \(N_t\) be the number of instances with value t and \(N_f\) the number of instances with value f. Then the maximum likelihood estimate for \(\theta\) is: \[\hat{\theta}=\frac{N_t}{N_t+N_f}=\frac{N_t}{N}\]
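As a sanity check on the derivation, the following sketch compares the counting estimate with a brute-force maximization of the log likelihood over a grid of candidate values. The grid search is only for illustration; in practice the closed form is all that is needed.

```python
import math


def log_likelihood(theta, n_true, n_false):
    """LL(theta) = N_t * log(theta) + N_f * log(1 - theta)."""
    return n_true * math.log(theta) + n_false * math.log(1.0 - theta)


def ml_estimate(observations):
    """Closed-form ML estimate: the fraction of observations that are True."""
    return sum(observations) / len(observations)


# The three observed games from the example: one win, two non-wins.
games = [True, False, False]
n_t = sum(games)
n_f = len(games) - n_t

# Brute-force search over theta in (0, 1), in steps of 0.001.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda th: log_likelihood(th, n_t, n_f))

print(ml_estimate(games), best)
```

The grid maximizer lands on the grid point nearest to 1/3, in agreement with the closed form.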
\(\textbf{Problem with this approach}\)
\(\color{red}{Overfits}:\) it pays too much attention to noise in the data. For example, if the team happened to play against the \(\color{red}{\textbf{Chinese national soccer team}}\) recently, we will overestimate the team's performance.
\(\color{red}{Ignores\,prior\,experience}\): if experts tell you that this is a minor team, you should not be too confident in it even if it has just beaten the Chinese national soccer team.
Events that do not occur in the data are deemed impossible, for example a match that ends in a 1-1 draw.
\(\textbf{Incorporating a prior}\)
* \(\textbf{Prior}:\) \(P(\theta)\) before seeing any data
* \(\textbf{Posterior:}\) \(P(\theta|\mathbf{D})\)
* \(\textbf{Maximum a Posteriori principle (MAP):}\) Choose \(\theta\) to maximize \(P(\theta|\mathbf{D})\), which is proportional to \(P(\theta)L(\theta)\)
For learning the parameter of a Boolean random variable, an appropriate prior over \(\theta\) is the \(\textbf{beta}\) distribution. The beta distribution has two \(\textbf{parameters}\), \(\alpha\) and \(\beta\), which control the shape of the prior. Their relative sizes control how likely true and false outcomes are: if \(\alpha\) is large relative to \(\beta\), \(\theta\) is more likely to be large.
[Figure: beta densities for different relative sizes of \(\alpha\) and \(\beta\)]
The \(\color{red}{magnitude}\) of \(\alpha\) and \(\beta\) controls how peaked the beta distribution is: if \(\alpha\) and \(\beta\) are both large, the density is sharply peaked.
[Figure: beta densities for different magnitudes of \(\alpha\) and \(\beta\)]
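One way to quantify "peakedness" without a plot is the variance of the beta distribution, \(\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\). This is a standard formula that the text above does not state. The sketch below keeps the ratio of \(\alpha\) to \(\beta\) fixed and scales both up.

```python
def beta_variance(alpha, beta):
    """Variance of Beta(alpha, beta): alpha*beta / ((alpha+beta)^2 * (alpha+beta+1))."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))


# Same ratio alpha : beta = 1 : 1, increasing magnitude.
for scale in (1, 5, 25):
    a, b = 2 * scale, 2 * scale
    print(a, b, beta_variance(a, b))
```

The mean stays at 1/2 in all three cases while the variance shrinks, which is what a more sharply peaked density looks like.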
Updating the prior
To get the hyperparameters of the posterior, we take the hyperparameters of the prior and add to them the actual observations that we get.
For example, if the prior is Beta(4, 7) and we observe 1 "+" and 4 "-", then the posterior is Beta(5, 11).
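The reason the update is a simple addition can be sketched as follows, assuming the observations are independent given \(\theta\) and writing m for the number of positive observations out of N (up to normalizing constants):

\[\begin{align}
P(\theta|\mathbf{D}) &\propto P(\theta)L(\theta) \notag\\
&\propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\cdot\theta^{m}(1-\theta)^{N-m} \notag\\
&= \theta^{(\alpha+m)-1}(1-\theta)^{(\beta+N-m)-1} \notag\\
\end{align}\]

The last line is the unnormalized density of Beta(\(\alpha+m\), \(\beta+N-m\)). With prior Beta(4, 7), \(m=1\) and \(N=5\), this gives Beta(5, 11).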
Understanding the hyperparameters
Hyperparameter \(\alpha\) represents the number of previous positive observations that we have had, plus 1; similarly, \(\beta\) represents the number of previous "-" observations, plus 1. The hyperparameters of the prior thus represent imaginary observations from our prior experience. \(\textbf{The more we trust our prior experience, the larger the hyperparameters in the prior.}\)
Mode and mean of the beta distribution
The mode of \(Beta(\alpha,\beta)\) is \(\frac{\alpha-1}{\alpha+\beta-2}\) (for \(\alpha,\beta>1\)), e.g. the mode of Beta(2,3) is 1/3. The mean of \(Beta(\alpha,\beta)\) is \(\frac{\alpha}{\alpha+\beta}\), e.g. the mean of Beta(2,3) is 2/5.
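A minimal sketch of the two formulas, using exact fractions so the results can be compared with the worked values. The function names are illustrative, and the mode formula is only applied when both hyperparameters exceed 1.

```python
from fractions import Fraction


def beta_mode(alpha, beta):
    """Mode of Beta(alpha, beta); the interior formula needs alpha > 1 and beta > 1."""
    if alpha <= 1 or beta <= 1:
        raise ValueError("the interior mode formula needs alpha > 1 and beta > 1")
    return Fraction(alpha - 1, alpha + beta - 2)


def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta)."""
    return Fraction(alpha, alpha + beta)


print(beta_mode(2, 3), beta_mean(2, 3))
```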
MAP estimate
As the picture above shows, the MAP estimate is the mode of the posterior. It is the fraction of all observations, real and imaginary, that are true. E.g. for m positive instances out of N total: \[\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}\]
Returning to the soccer example, suppose the prior is Beta(5,3) and the observations are \(X^1=t, X^2=f, X^3=f\), so \(m=1\) and \(N=3\). The posterior is then Beta(6,5), and the MAP estimate is: \[\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}=\frac{1+5-1}{3+5+3-2}=\frac{5}{9}\]
ML vs MAP
Maximum likelihood estimate: \(\hat{\theta}_{ML}=\frac{m}{N}\); MAP estimate: \(\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}\). Maximum likelihood is equivalent to MAP with a uniform prior, Beta(1,1), meaning that there are no imaginary observations.
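As a quick check of this equivalence, substituting \(\alpha=\beta=1\) into the MAP formula recovers the ML estimate:

\[\hat{\theta}_{MAP}=\frac{m+1-1}{N+1+1-2}=\frac{m}{N}=\hat{\theta}_{ML}\]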
Drawback of MAP
MAP does not fully consider the range of possible values of \(\theta\); it only picks the single most probable value, which may not be representative. So we introduce another approach, the \(\textbf{Bayesian approach}\), which does not commit to a point estimate of \(\theta\). Instead, the posterior distribution over the values of \(\theta\) is maintained. E.g. given \(X^1,X^2,X^3\), we want to predict \(X^4\) using the entire distribution over \(\theta\):
\[\begin{align}
P(X^4=t|X^1,X^2,X^3)&=\int_{0}^1{P(X^4=t|\theta)P(\theta|X^1,X^2,X^3)}d\theta\notag\\
& = \int_{0}^1\theta{P(\theta|X^1,X^2,X^3)}d\theta\notag\\
& = E\left[ {\theta |X^1, X^2, X^3} \right] \notag\\
& = \text{mean of the posterior} \notag\\
\end{align}\]
For a \(Beta(\alpha,\beta)\) prior and m positive instances out of N, \(E\left[\theta|\mathbf{D}\right]=\frac{m+\alpha}{N+\alpha+\beta}\).
This expectation is the estimate of the probability that a new instance is true, and it is obtained by integrating over the posterior distribution.
In the problem above, with prior Beta(5, 3), the MAP estimate is \(\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}=\frac{5}{9}\), while the Bayesian estimate is the posterior expectation \(\hat{\theta}_{Bayes}=\frac{m+\alpha}{N+\alpha+\beta}=\frac{6}{11}\). The latter is closer to 1/2.
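The three estimators can be compared side by side on the soccer data. The sketch below also approximates the posterior mean by numerically integrating the unnormalized posterior with a midpoint rule, as a check that the closed form matches the integral. All function names are illustrative.

```python
from fractions import Fraction


def ml_estimate(m, n):
    """Maximum likelihood: m / N."""
    return Fraction(m, n)


def map_estimate(m, n, alpha, beta):
    """Mode of the posterior Beta(alpha + m, beta + N - m)."""
    return Fraction(m + alpha - 1, n + alpha + beta - 2)


def bayes_estimate(m, n, alpha, beta):
    """Mean of the posterior Beta(alpha + m, beta + N - m)."""
    return Fraction(m + alpha, n + alpha + beta)


def posterior_mean_numeric(m, n, alpha, beta, steps=10_000):
    """Midpoint-rule approximation of E[theta | D] from the unnormalized posterior."""
    a, b = alpha + m, beta + (n - m)
    num = den = 0.0
    for k in range(steps):
        theta = (k + 0.5) / steps
        weight = theta ** (a - 1) * (1.0 - theta) ** (b - 1)
        num += theta * weight
        den += weight
    return num / den


# Soccer example: one win in three games, prior Beta(5, 3).
m, n, alpha, beta = 1, 3, 5, 3
print(ml_estimate(m, n), map_estimate(m, n, alpha, beta), bayes_estimate(m, n, alpha, beta))
print(posterior_mean_numeric(m, n, alpha, beta))
```

With only three observations, the prior dominates both the MAP and the Bayesian estimate, which is why they sit far from the ML value of 1/3.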
\(\textbf{Multi-valued class and attributes}\)
In the first part of this essay we restricted the attributes and the class to Boolean values; now we generalize to multi-valued classes and attributes.
Suppose \(|C| = k\) and \(|X_i|=m\) for each of the n attributes. The parameters are:
\[\begin{align}
&P(C=c)=\theta_c \notag\\
&P(X_i=x|C=c)=\theta_{i,x}^c \notag\\
\end{align}\]
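So the number of parameters grows to \(kmn+k-1\): \(k-1\) free class probabilities plus \(kmn\) conditional probabilities \(\theta_{i,x}^c\) (of which \(kn(m-1)\) are independent, since for each attribute and class they sum to 1 over x). The instance counts are:
\[\begin{align}
&- N=\text{total number of instances} \notag\\
&- N_c=\text{number of instances with class } c \notag\\
&- N_{i,x}^c=\text{number of instances with class } c \text{ and } X_i=x \notag\\
&- kmn+k-1 \text{ parameters in total} \notag\\
\end{align}\]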
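These counts lead directly to ML estimates by the same "fraction of occurrences" argument as in the Boolean case: \(\hat{\theta}_c=N_c/N\) and \(\hat{\theta}_{i,x}^c=N_{i,x}^c/N_c\). The sketch below is an illustration under that assumption; the toy data and the function name are hypothetical.

```python
from collections import Counter


def fit_counts(instances, labels):
    """ML estimates theta_c = N_c / N and theta_{i,x}^c = N_{i,x}^c / N_c."""
    n_total = len(labels)
    n_c = Counter(labels)
    n_ixc = Counter()
    for x, c in zip(instances, labels):
        for i, value in enumerate(x):
            n_ixc[(i, value, c)] += 1
    theta_c = {c: n_c[c] / n_total for c in n_c}
    theta_ixc = {key: count / n_c[key[2]] for key, count in n_ixc.items()}
    return theta_c, theta_ixc


# Hypothetical toy data: two attributes (colour, size) and two classes.
instances = [("red", "small"), ("red", "large"), ("blue", "small"), ("blue", "large")]
labels = ["a", "a", "b", "a"]

theta_c, theta_ixc = fit_counts(instances, labels)
print(theta_c["a"])
print(theta_ixc[(0, "red", "a")])
print((0, "red", "b") in theta_ixc)
```

Combinations that never occur in the data, such as a red instance of class b here, get no entry at all, which amounts to probability 0. This is the "unseen events are deemed impossible" weakness of ML noted earlier.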
From the analysis above, we can make a brief conclusion about Naive Bayes:
The advantages are that NB does not suffer from the curse of dimensionality, it is very easy to implement, and it learns fast. In each dimension, it makes no assumption about the form of the distribution. On the other hand, Naive Bayes makes a strong independence assumption, so it may perform poorly if this does not hold (Chinese text, for example), and finding the maximum likelihood estimate can overfit the data.
\(\textbf{reference}\)
Note: Most of the content of this essay comes from a CMU lecture, but I have lost the link to the website.
CMU statistical learning