Naive Bayes Theorem and Application - Theorem

Naive Bayes model:
1. Naive Bayes model
2. Model: discrete attributes with a finite number of values
3. Parameter density estimation
4. Naive Bayes classification algorithm
5. AutoClass clustering algorithm

\(\textbf{1. Naive Bayes model}\)
In this model we want to estimate \(P(X_1,...,X_n)\) under the assumption that all attributes are independent of each other. This is the same kind of assumption that k-means makes, except that here the model is discrete.\[P(X_1,...,X_n)=\prod_{i=1}^n P(X_i)\]Each \(P(X_i)\) can be any distribution you like, e.g. \(\{0.5 : red, 0.2 : blue, 0.3 : yellow\}\).
To simplify the problem, we assume all attributes are Boolean. With no independence assumptions, the model has \(2^n\) joint states of \(X_1,...,X_n\) and \(2^n-1\) independent parameters. With the independence assumption, the model needs only \(n\) parameters in total, one per attribute; adding a Boolean class variable on which the attributes depend raises this to \(2n+1\).
For example, in a classification problem, let \(\theta_C\) be the probability that the class is true. Then we have \(2n+1\) parameters:
\[\begin{align}
&P(C=T)=\theta_C \notag\\
&P(C=F)=1 - \theta_C \notag\\
&P(X_i=T|C=T)=\theta_{i}^T \notag\\
&P(X_i=F|C=T)=1 - \theta_{i}^T \notag\\
&P(X_i=T|C=F)=\theta_{i}^F \notag\\
&P(X_i=F|C=F)=1 - \theta_{i}^F \notag\\
& \mathbf{\theta}=\langle {\theta_C, \theta_{1}^T,...,\theta_{n}^T,\theta_{1}^F,...,\theta_{n}^F} \rangle \notag\\
\end{align}\]
As you can see above, this is an enormous saving in the number of parameters. Representing \(P(X_1,..., X_n)\) explicitly suffers from the curse of dimensionality, while \(\prod_{i=1}^n P(X_i)\) does not. The saving results from very strong independence assumptions. The Naive Bayes model performs very well when the assumptions hold, but can perform very badly when the variables are dependent. For example, an NB model tends to perform well on English text but worse on Chinese text, where context matters more for understanding the language correctly. So we should be cautious about applying this strong assumption to models whose variables are substantially related.
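To make the saving concrete, here is a small sketch that tabulates the three parameter counts discussed above for Boolean attributes. The function names are illustrative only and do not come from the original lecture.

```python
def full_joint_params(n: int) -> int:
    """Independent parameters of an unrestricted joint over n Boolean attributes."""
    return 2 ** n - 1


def independent_params(n: int) -> int:
    """Independent parameters when all n Boolean attributes are independent."""
    return n


def naive_bayes_params(n: int) -> int:
    """Boolean class plus n Boolean attributes: theta_C and two thetas per attribute."""
    return 2 * n + 1


# Print n, full joint, fully independent, Naive Bayes side by side.
for n in (5, 10, 20, 30):
    print(n, full_joint_params(n), independent_params(n), naive_bayes_params(n))
```

The full joint grows exponentially in \(n\), while both factored models grow only linearly.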

Naive Bayes classifier
For an NB classification problem we should learn:
1. \(P(X_1,...,X_n|C)=\prod_{i=1}^n P(X_i|C)\) for each class, which assumes that \(X_i\) and \(X_j\) are conditionally independent of each other given \(C\).
2. \(P(C)\)
To classify: given \(\mathbf{x}\), choose the \(c\) that maximizes:
\[P(c|\mathbf{x})~\propto~P(c)\prod_{i=1}^n P(x_i|c)\]
With Boolean attributes, the NB classifier is a linear separator. Attributes act independently to produce the classification and do not interact, so, just like perceptrons, the classifier cannot capture concepts such as XOR.
There is an important point about linear separability. Many real-world domains are not linearly separable, yet even for those domains there may be a pretty good linearly separable hypothesis. We may be better off learning a linearly separable hypothesis than learning a richer hypothesis. This comes from a strong inductive bias: it is \(\textbf{easier to learn}\).
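The classification rule can be sketched in a few lines of Python. This is only a minimal illustration for a Boolean class and Boolean attributes. The parameter values are made up for the example, every probability is assumed to lie strictly between 0 and 1, and logarithms are used so that the product becomes a sum of per-attribute terms (which is also why the decision boundary is linear).

```python
import math


def nb_log_score(x, prior, theta_given_class):
    """Return log P(c) + sum_i log P(x_i | c) for one class."""
    score = math.log(prior)
    for x_i, theta_i in zip(x, theta_given_class):
        # theta_i is P(X_i = True | c); use its complement when x_i is False.
        score += math.log(theta_i if x_i else 1.0 - theta_i)
    return score


def nb_classify(x, theta_C, theta_T, theta_F):
    """Return True if class C=T has the larger posterior score, else False."""
    score_true = nb_log_score(x, theta_C, theta_T)
    score_false = nb_log_score(x, 1.0 - theta_C, theta_F)
    return score_true > score_false


# Hypothetical parameters for n = 3 attributes (not taken from any data set).
theta_C = 0.4                 # P(C = T)
theta_T = [0.9, 0.2, 0.7]     # P(X_i = T | C = T)
theta_F = [0.3, 0.6, 0.5]     # P(X_i = T | C = F)

print(nb_classify([True, False, True], theta_C, theta_T, theta_F))
print(nb_classify([False, True, False], theta_C, theta_T, theta_F))
```

Each attribute contributes its own additive term to the score, independently of the others, which is exactly the "attributes do not interact" property described above.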

In the following discussion we assume that the attributes and the class are Boolean, only to keep the notation simple. Everything generalizes to the case where they have many possible values.

A simple problem:
Suppose we have observed a sequence of games played by a soccer team. Based on this, we want to estimate the probability that the team will win a future game. The problem is formulated as follows:
* Variable X has states {f, t} (t = win)
* Parameter \(\theta=P(X=t)\)
* Observations \(X^1=t, X^2=f, X^3=f\)
* These comprise the data \(\bf{D}\)
* Task: estimate \(\theta\)
* Use \(\theta\) to estimate \(P(X^4=t)\)
First we introduce the \(\textbf{maximum likelihood (ML)}\) approach. The likelihood is defined as follows:
* \(\textbf{Likelihood:}\) \(\mathbf{L}(\theta)=P(\mathbf{D}|\theta)=P(X^1,X^2,X^3|\theta)\)
* \(\textbf{ML Principle:}\) Choose \(\theta\) so as to maximize \(\mathbf{L}(\theta)\)
* Because the observations are independent given \(\theta\): \(\mathbf{L}(\theta)=P(X^1|\theta)P(X^2|\theta)P(X^3|\theta)\)
* \(\textbf{Log likelihood:}\) \[\mathbf{LL}(\theta)=\log P(X^1|\theta)+\log P(X^2|\theta)+\log P(X^3|\theta)\]
* Equivalent ML principle: choose \(\theta\) so as to maximize \(\mathbf{LL}(\theta)\)
In this example:
\(P(X^i=t|\theta)=\theta\)
\(\mathbf{L}(\theta)=P(X^1=t, X^2=f, X^3=f|\theta)=\theta(1-\theta)(1-\theta)\)
\(\mathbf{LL}(\theta)=\log\theta+2\log(1-\theta)\)
Set the derivative to 0: \(\frac{1}{\theta}~-~\frac{2}{1-\theta} = 0\)
Solve to find: \(\theta=1/3\)
In this example, \(\theta=1/3\) is exactly the fraction of observed games that the team won. This is no coincidence: the ML estimate for the probability of an event is always the fraction of the time in which the event happened. In other words, the ML estimate is exactly the one most suggested by the data. More generally, given observations \(X^1,X^2,...,X^N\), let \(N_t\) be the number of instances with value t and \(N_f\) be the number of instances with value f. Then the maximum likelihood estimate for \(\theta\) is: \[\hat{\theta}=\frac{N_t}{N_t+N_f}=\frac{N_t}{N}\]
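The derivation can be checked numerically. The sketch below computes the closed-form estimate \(N_t/N\) for the three observed games and compares it with a brute-force search over a grid of \(\theta\) values. The grid search is only a sanity check added here, not part of the method.

```python
import math


def ml_estimate(observations):
    """Closed-form ML estimate: fraction of observations that are True."""
    n_t = sum(1 for x in observations if x)
    return n_t / len(observations)


def log_likelihood(theta, observations):
    """LL(theta) = sum of log P(x | theta) over the observations."""
    return sum(math.log(theta if x else 1.0 - theta) for x in observations)


data = [True, False, False]          # X^1 = t, X^2 = f, X^3 = f
theta_hat = ml_estimate(data)

# Brute-force check: evaluate LL on a grid strictly inside (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda th: log_likelihood(th, data))

print(theta_hat, theta_grid)
```

The grid maximizer lands on the grid point closest to \(1/3\), in agreement with the closed form.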
\(\textbf{Problems with this approach}\)
* \(\color{red}{Overfits}:\) it pays too much attention to noise in the data. For example, if the team has recently played mostly against the \(\color{red}{\textbf{Chinese national soccer team}}\), we will overestimate how well it performs.
* \(\color{red}{Ignores\,prior\,experience}\): if experts tell you that the team is a weak one, you should not be too confident even if it has just beaten the Chinese national team.
* Events that do not occur in the data are deemed impossible, for example a match that ends in a 1-1 draw.

\(\textbf{Incorporating a prior}\)
* \(\textbf{Prior}:\) \(P(\theta)\) before seeing any data
* \(\textbf{Posterior:}\) \(P(\theta|\mathbf{D})\)
* \(\textbf{Maximum a posteriori (MAP) principle:}\) Choose \(\theta\) to maximize \(P(\theta|\mathbf{D})\), which is proportional to \(P(\theta)\mathbf{L}(\theta)\)
For learning the parameter of a Boolean random variable, an appropriate prior over \(\theta\) is the \(\textbf{beta}\) distribution. The beta distribution has 2 \(\textbf{parameters}\), \(\alpha\) and \(\beta\), which control the shape of the prior. Their ratio controls how relatively likely true and false outcomes are: if \(\alpha\) is large relative to \(\beta\), \(\theta\) is more likely to be large.
(Figure: beta densities for different ratios of \(\alpha\) to \(\beta\).)

The \(\color{red}{magnitude}\) of \(\alpha\) and \(\beta\) controls how peaked the beta distribution is: if \(\alpha\) and \(\beta\) are both large, the distribution is sharply peaked.
(Figure: beta densities for different magnitudes of \(\alpha\) and \(\beta\).)

Updating the prior
To get the hyperparameters of the posterior, we take the hyperparameters of the prior and add to them the actual observations that we get.
For example, if the prior is Beta(4, 7) and we observe 1 "+" and 4 "-", then the posterior is Beta(5, 11).
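The update rule is simple enough to write down directly. The following sketch (the helper name `update_beta` is just an illustrative choice) reproduces the Beta(4, 7) example and also shows that updating one observation at a time gives the same posterior as updating with all counts at once.

```python
def update_beta(alpha, beta, n_pos, n_neg):
    """Add observed positive/negative counts to the prior hyperparameters."""
    return alpha + n_pos, beta + n_neg


# Batch update: prior Beta(4, 7), then 1 "+" and 4 "-".
posterior = update_beta(4, 7, n_pos=1, n_neg=4)
print(posterior)  # (5, 11)

# Sequential update: feed the same observations one by one.
a, b = 4, 7
for outcome in ["+", "-", "-", "-", "-"]:
    a, b = update_beta(a, b, int(outcome == "+"), int(outcome == "-"))
sequential = (a, b)
print(sequential)  # (5, 11)
```

Because the update only adds counts, the order in which the observations arrive does not matter.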

Understanding the hyperparameters
Hyperparameter \(\alpha\) represents the number of previous positive observations that we have had, plus 1; similarly, \(\beta\) represents the number of previous "-" observations that we have had, plus 1. The hyperparameters in the prior thus represent imaginary observations in our prior experience. \(\textbf{The more we trust our prior experience, the larger the hyperparameters in the prior.}\)

Mode and mean of the beta distribution
The mode of \(Beta(\alpha,\beta)\) is \(\frac{\alpha-1}{\alpha+\beta-2}\) (for \(\alpha,\beta>1\)), e.g. the mode of Beta(2,3) is 1/3. The mean of \(Beta(\alpha,\beta)\) is \(\frac{\alpha}{\alpha+\beta}\), e.g. the mean of Beta(2,3) is 2/5.

MAP estimate
The MAP estimate is the mode of the posterior. It is the fraction of all observations, real and imaginary, that are true. E.g. for m positive instances out of N total: \[\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}\]

In the soccer example, suppose the prior is Beta(5,3) and the observations are \(X^1=t, X^2=f,X^3=f\), so \(m=1\), \(N=3\) and the posterior is Beta(6,5). The MAP estimate is: \[\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}=\frac{1+5-1}{3+5+3-2}=\frac{5}{9}\]
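As a check on the arithmetic, here is a minimal sketch that computes the MAP estimate as the mode of the posterior. It assumes integer hyperparameters so that Python's `fractions.Fraction` can give exact results. The function names are illustrative.

```python
from fractions import Fraction


def beta_mode(alpha, beta):
    """Mode of Beta(alpha, beta); only defined this way for alpha > 1 and beta > 1."""
    if alpha <= 1 or beta <= 1:
        raise ValueError("mode formula requires alpha > 1 and beta > 1")
    return Fraction(alpha - 1, alpha + beta - 2)


def map_estimate(m, n, alpha, beta):
    """MAP estimate: mode of the posterior Beta(alpha + m, beta + n - m)."""
    return beta_mode(alpha + m, beta + (n - m))


print(beta_mode(2, 3))            # 1/3
print(map_estimate(1, 3, 5, 3))   # 5/9
```

The second line reproduces the soccer example: one win in three games with a Beta(5, 3) prior.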
ML vs MAP
Maximum likelihood estimate: \(\hat{\theta}_{ML}=\frac{m}{N}\); MAP estimate: \(\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}\). Maximum likelihood is equivalent to MAP with a uniform prior, Beta(1,1), meaning that there are no imaginary observations.
Drawback of MAP
MAP does not fully take into account the range of possible values for \(\theta\); it only picks the single most probable value, which may not be representative. So we introduce another approach, the \(\textbf{Bayesian approach}\), which does not commit to a point estimate of \(\theta\). Instead, a posterior distribution is maintained over the values of \(\theta\). E.g. given \(X^1,X^2,X^3\), we predict \(X^4\) using the entire distribution over \(\theta\):
\[\begin{align}
P(X^4=t|X^1,X^2,X^3)&=\int_{0}^1{P(X^4=t|\theta)P(\theta|X^1,X^2,X^3)}d\theta\notag\\
& = \int_{0}^1\theta{P(\theta|X^1,X^2,X^3)}d\theta\notag\\
& = E\left[ {\theta |X^1, X^2, X^3} \right] \notag\\
& = \text{mean of the posterior} \notag\\
\end{align}\]
For the beta distribution, \(E\left[\theta|\mathbf{D}\right]=\frac{m+\alpha}{N+\alpha+\beta}\).
\(E\left[\theta|\mathbf{D}\right]\) is the estimate of the probability that a new instance is true. It is obtained by integrating over the posterior distribution.
In the problem above, with prior Beta(5, 3), the MAP estimate is \(\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}=\frac{5}{9}\), while the Bayesian expectation is \(E\left[\theta|\mathbf{D}\right]=\frac{m+\alpha}{N+\alpha+\beta}=\frac{6}{11}\). We can see that the latter is closer to 1/2.
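The claim that the predictive probability equals the posterior mean can be verified numerically. The sketch below approximates the integral \(\int_0^1 \theta\,P(\theta|\mathbf{D})\,d\theta\) for the Beta(6, 5) posterior with a simple midpoint rule and compares it with the closed form. The midpoint rule and the step count are arbitrary choices made for this illustration.

```python
import math


def beta_pdf(theta, alpha, beta):
    """Density of Beta(alpha, beta) at theta, using the gamma-function normalizer."""
    norm = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return norm * theta ** (alpha - 1) * (1.0 - theta) ** (beta - 1)


def predictive_prob_true(alpha_post, beta_post, steps=10_000):
    """Midpoint-rule approximation of the integral of theta * p(theta | D) over [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        theta = (k + 0.5) * h
        total += theta * beta_pdf(theta, alpha_post, beta_post)
    return total * h


alpha_post, beta_post = 6, 5   # Beta(5, 3) prior plus observations t, f, f
numeric = predictive_prob_true(alpha_post, beta_post)
closed_form = alpha_post / (alpha_post + beta_post)           # (m + alpha) / (N + alpha + beta)
map_value = (alpha_post - 1) / (alpha_post + beta_post - 2)   # mode of the posterior

print(numeric, closed_form, map_value)
```

The numerical integral agrees with \(6/11\) to many decimal places, and both are slightly closer to \(1/2\) than the MAP value \(5/9\).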

\(\textbf{Multi-valued class and attributes}\)
In the first part of this essay, we simplified the attributes and the class to Boolean. We now generalize to a multi-valued class and multi-valued attributes.
Given that \(|C| = k\) and \(|X_i| = m\) for each of the \(n\) attributes, the parameters are listed below:
\[\begin{align}
&P(C=c)=\theta_c \notag\\
&P(X_i=x|C=c)=\theta_{i,x}^c \notag\\
\end{align}\]
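Counting every \(\theta_c\) and \(\theta_{i,x}^c\), there are \(kmn+k\) parameters, of which \(kn(m-1)+k-1\) are independent because each distribution must sum to 1 (for \(k=m=2\) this is the \(2n+1\) of the Boolean case). The instance counts used to estimate them are listed below:
\[\begin{align}
&- N=total\,number\,of\,instances \notag\\
&- N_c=total\,number\,of\,instances\,with\,class\,c \notag\\
&- N_{i,x}^c=total\,number\,of\,instances\,with\,class\,c\,and\,X_i=x \notag\\
&- kn(m-1)+k-1\,independent\,parameters\,in\,total. \notag\\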
\end{align}\]
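The counts above translate directly into parameter estimates. The following is a minimal sketch, not the lecture's own code: it estimates \(\theta_c = N_c/N\) and \(\theta_{i,x}^c\) from a toy data set. The data, the function name `fit_naive_bayes` and the `pseudo_count` argument are all assumptions made for this illustration. A pseudo-count of 0 gives the plain ML estimate \(N_{i,x}^c/N_c\), while a positive value plays the role of the imaginary observations discussed earlier.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(rows, labels, values_per_attribute, pseudo_count=1.0):
    """Estimate theta_c and theta_{i,x}^c from counts.

    rows: list of attribute tuples; labels: list of class values;
    values_per_attribute: the possible values of each attribute X_i.
    """
    n_total = len(labels)                  # N
    class_counts = Counter(labels)         # N_c
    joint_counts = defaultdict(Counter)    # (c, i) -> counts of x, i.e. N_{i,x}^c
    for row, c in zip(rows, labels):
        for i, x in enumerate(row):
            joint_counts[(c, i)][x] += 1

    theta_c = {c: class_counts[c] / n_total for c in class_counts}
    theta_x = {}
    for c in class_counts:
        for i, values in enumerate(values_per_attribute):
            denom = class_counts[c] + pseudo_count * len(values)
            for x in values:
                theta_x[(c, i, x)] = (joint_counts[(c, i)][x] + pseudo_count) / denom
    return theta_c, theta_x


# Toy data: two attributes (color, size) and two classes ("a", "b").
rows = [("red", "small"), ("red", "large"), ("blue", "small"), ("blue", "large")]
labels = ["a", "a", "b", "a"]
values = [("red", "blue"), ("small", "large")]

theta_c, theta_x = fit_naive_bayes(rows, labels, values, pseudo_count=1.0)
print(theta_c)
print(theta_x[("a", 0, "red")], theta_x[("b", 0, "red")])
```

With the pseudo-count, class "b" still assigns non-zero probability to "red" even though it never saw a red instance, which avoids the "unseen events are impossible" problem of plain maximum likelihood.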
From the analysis above, we can draw a brief conclusion about Naive Bayes:
The advantages are that NB does not suffer from the curse of dimensionality, is very easy to implement, and learns fast. In each dimension, it makes no assumption about the form of the distribution. On the other hand, Naive Bayes makes a strong independence assumption and may perform poorly if it does not hold (Chinese text, for example), and maximum likelihood estimation can overfit the data.
\(\textbf{reference}\)
Note: most of the content of this essay comes from a CMU lecture on statistical learning; I have lost the link to the original website.
