Imagination is an outcome of what you learned. If you can imagine the world, that means you have learned what the world is about.

In fact, we do not know how we see, or at least it is very hard to find out, so we cannot simply write a program that tells a machine how to see.

One of the most important parts of machine learning is introspecting how our brain learns subconsciously. If we cannot introspect, it is fairly hard to replicate a brain.

Linear Models

Supervised learning of linear models can be divided into 2 phases:

  • Training:

    1. Read training data points with labels \(\left\{\mathbf{x}_{1:n},y_{1:n}\right\}\), where \(\mathbf{x}_i \in \mathbb{R}^{1 \times d}, \ y_i \in \mathbb{R}^{1 \times c}\);
    2. Estimate the model parameters \(\hat{\theta}\) with some learning algorithm.
      Note: The parameters are the information the model learned from data.
  • Prediction:
    1. Read a new data point without a label, \(\mathbf{x}_{n+1}\) (typically one the model has never seen before);
    2. Using the parameters \(\hat{\theta}\), estimate the unknown label \(\hat{y}_{n+1}\).

1-D example:
First of all, we create a linear model:
\[
\hat{y}_i = \theta_0 + \theta_1 x_{i}
\]
Both \(x\) and \(y\) are scalars in this case.

Then we take, for example, the SSE (Sum of Squared Errors) as our objective / loss / cost / energy / error function (see footnote 1):

\[
J(\theta)=\sum_{i=1}^n \left( \hat{y}_i - y_i\right)^2
\]
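As a minimal sketch of the 1-D model and its SSE, the following NumPy snippet evaluates both on toy data. The data values and the helper names `predict` and `sse` are made up for illustration and are not part of the original notes.

```python
import numpy as np

# Hypothetical toy data lying exactly on y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

def predict(theta0, theta1, x):
    """1-D linear model: y_hat_i = theta0 + theta1 * x_i."""
    return theta0 + theta1 * x

def sse(theta0, theta1, x, y):
    """Sum of squared errors J(theta) between predictions and labels."""
    residuals = predict(theta0, theta1, x) - y
    return float(np.sum(residuals ** 2))

sse_true = sse(1.0, 2.0, x, y)  # parameters that generated the data
sse_off = sse(0.0, 2.0, x, y)   # intercept off by one
print(sse_true, sse_off)
```

Parameters that fit the data better give a smaller SSE, which is what the optimization below exploits.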

Linear Prediction Model

In general, each data point \(\mathbf{x}_i\) has \(d\) dimensions, and the model has \(d+1\) parameters (one per dimension plus the intercept \(\theta_0\)).

The scalar form of the linear model, with the convention \(x_{i0} = 1\), is:
\[
\hat{y}_i = \sum_{j=0}^{d} \theta_jx_{ij}
\]

The matrix form of the linear model is:
\[
\begin{bmatrix}
\hat{y}_1 \\
\hat{y}_2 \\
\vdots \\
\hat{y}_n
\end{bmatrix}=
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1d} \\
1 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nd}
\end{bmatrix}
\begin{bmatrix}
\theta_0 \\
\theta_1 \\
\theta_2 \\
\vdots \\
\theta_d
\end{bmatrix}
\]
Or in a more compact way:
\[
\mathbf{\hat{y}} = \mathbf{X\theta}
\]
Note that the matrix form is widely used not only because it is a concise way to represent the model, but also because it translates directly into code in MATLAB or Python (NumPy).
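To illustrate how the matrix form maps to NumPy, here is a small sketch. The helper `design_matrix` and the numbers are hypothetical; the only idea taken from the text is prepending a column of ones so that \(\theta_0\) acts as the intercept.

```python
import numpy as np

def design_matrix(features):
    """Prepend a column of ones so theta_0 acts as the intercept."""
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    return np.hstack([np.ones((n, 1)), features])

# Hypothetical data: n = 3 points, d = 2 features.
features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])
theta = np.array([0.5, 1.0, -1.0])  # (d + 1) parameters

X = design_matrix(features)  # shape (n, d + 1)
y_hat = X @ theta            # y_hat = X theta, one prediction per row
print(X.shape, y_hat)
```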

Optimization Approach

In order to optimize the model prediction, we need to minimize the quadratic cost:
\[
\begin{align*}\notag
J(\mathbf{\theta}) &= \sum_{i=1}^n \left( \hat{y}_i - y_i\right)^2 \\
&= \left( \mathbf{y-X\theta} \right)^\mathtt T\left( \mathbf{y-X\theta} \right)
\end{align*}
\]

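We do so by setting the derivative w.r.t. the vector \(\mathbf{\theta}\) to zero. This yields the global minimum because the cost function is convex (strictly convex when \(\mathbf{X}^\mathtt T\mathbf{X}\) is invertible) and the domain of \(\theta\) is convex (see footnote 2).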

\[
\begin{align*}\notag
\frac{\partial J(\mathbf{\theta})}{\partial \mathbf{\theta}} &= \frac{\partial}{ \partial \mathbf{\theta} } \left( \mathbf{y-X\theta} \right)^\mathtt T\left( \mathbf{y-X\theta} \right) \\
&=\frac{\partial}{ \partial \mathbf{\theta} } \left( \mathbf{y}^\mathtt T\mathbf{y} + \mathbf{\theta}^\mathtt T \mathbf{X}^\mathtt T\mathbf{X\theta} -2\mathbf{y}^\mathtt T\mathbf{X\theta} \right) \\
&=\mathbf{0}+2 \left( \mathbf{X}^\mathtt T\mathbf{X} \right)^\mathtt T \mathbf{\theta} - 2 \left( \mathbf{y}^\mathtt T\mathbf{X} \right)^\mathtt T \\
&=2 \left( \mathbf{X}^\mathtt T\mathbf{X} \right) \mathbf{\theta} - 2 \left( \mathbf{X}^\mathtt T\mathbf{y} \right) \\
&\triangleq\mathbf{0}
\end{align*}
\]

So we get \(\mathbf{\hat{\theta}}\) as an analytical solution:
\[
\mathbf{\hat{\theta}} = \left( \mathbf{X}^\mathtt T\mathbf{X} \right)^{-1} \left( \mathbf{X}^\mathtt T\mathbf{y} \right)
\]
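The analytical solution can be sketched in NumPy as follows. The synthetic data and variable names are assumptions for illustration. The sketch solves the linear system \(\mathbf{X}^\mathtt T\mathbf{X}\theta = \mathbf{X}^\mathtt T\mathbf{y}\) instead of forming the inverse explicitly, which is usually the numerically safer choice, and cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: y = 1 + 2*x1 - 3*x2 + small noise.
n, d = 200, 2
features = rng.normal(size=(n, d))
X = np.hstack([np.ones((n, 1)), features])
theta_true = np.array([1.0, 2.0, -3.0])
y = X @ theta_true + 0.01 * rng.normal(size=n)

# Solve (X^T X) theta = X^T y rather than inverting X^T X.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient 2 X^T (X theta - y) should vanish at the solution.
gradient = 2 * X.T @ (X @ theta_hat - y)
print(theta_hat)
```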

Having gone through this procedure, we can see that learning is just a matter of adjusting the model parameters so as to minimize the objective function.
Thus, the prediction function can be rewritten as:
\[
\begin{align*}\notag
\mathbf{\hat{y}} &= \mathbf{X\hat{\theta}}\\
&=\mathbf{X}\left( \mathbf{X}^\mathtt T\mathbf{X} \right)^{-1} \mathbf{X}^\mathtt T\mathbf{y}
\triangleq \mathbf{Hy}
\end{align*}
\]
where \(\mathbf{H}\) is called the hat matrix because it puts the hat on \(\mathbf{y}\).
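A quick numerical sketch of the hat matrix, on random data chosen only for illustration. It also checks three standard properties of \(\mathbf{H}\) that are not stated above: it is symmetric, idempotent, and its trace equals the number of parameters \(d+1\) when \(\mathbf{X}\) has full column rank.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical random data: n = 20 points, d = 2 features plus intercept.
n, d = 20, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = rng.normal(size=n)

# H = X (X^T X)^{-1} X^T, computed with a linear solve.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

# Same predictions via the estimated parameters.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.trace(H))
```

Forming \(\mathbf{H}\) explicitly costs \(O(n^2)\) memory, so in practice one would normally compute \(\mathbf{X}\hat\theta\) directly.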

Multidimensional Label \(\mathbf{y_i}\)

So far we have assumed \(y_i\) to be a scalar. But what if the model has multiple outputs (e.g. \(c\) outputs)? Simply use one column of parameters per output, \(c\) columns in total:
\[
\begin{bmatrix}
y_{11} & \cdots & y_{1c} \\
y_{21} & \cdots & y_{2c} \\
\vdots & \ddots & \vdots \\
y_{n1} & \cdots & y_{nc}
\end{bmatrix}=
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1d} \\
1 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nd}
\end{bmatrix}
\begin{bmatrix}
\theta_{01} & \cdots & \theta_{0c}\\
\theta_{11} & \cdots & \theta_{1c}\\
\theta_{21} & \cdots & \theta_{2c}\\
\vdots & \ddots & \vdots \\
\theta_{d1}& \cdots & \theta_{dc}
\end{bmatrix}
\]
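A sketch of the multi-output case, assuming the same least-squares solution is applied to a label matrix \(\mathbf{Y}\) of shape \(n \times c\). This works because the squared error decouples across outputs, so each column of the parameter matrix is the single-output solution for the corresponding column of \(\mathbf{Y}\). The data here are synthetic and noise-free.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sizes: n points, d features, c outputs.
n, d, c = 100, 3, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
Theta_true = rng.normal(size=(d + 1, c))
Y = X @ Theta_true  # noise-free labels, shape (n, c)

# One linear solve handles all c outputs at once.
Theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # shape (d + 1, c)

# Solving for the first output alone gives the first column.
theta_col0 = np.linalg.solve(X.T @ X, X.T @ Y[:, 0])
print(Theta_hat.shape)
```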

Linear Regression with Maximum Likelihood

Assume that each label \(y_i\) is Gaussian distributed with mean \(x_i^{\mathtt{T}} \theta\) and variance \(\sigma^2\):
\[
y_i \sim N(x_i^{\mathtt{T}}\theta, \sigma^2), \qquad p(y_i|\mathbf{x}_i,\theta,\sigma^2) = \left( 2\pi\sigma^2 \right)^{-1/2} e^{ -\frac{\left( y_i-x_i^{\mathtt{T}}\theta \right)^2}{2\sigma^2} }
\]

Likelihood

Assuming the labels \(y_i\) are conditionally independent given the inputs (the usual i.i.d. noise assumption), we can factorize the joint likelihood:
\[
\begin{align*}\notag
p( \mathbf{y}|\mathbf{X},\theta,\sigma^2 ) &= \prod_{i=1}^n p(y_i|\mathbf{x}_i,\theta,\sigma^2 ) \\
&=\prod_{i=1}^n \left( 2\pi\sigma^2 \right)^{-1/2} e^{ -\frac{\left( y_i-x_i^{\mathtt{T}}\theta \right)^2}{2\sigma^2} } \\
&=\left( 2\pi\sigma^2 \right)^{-n/2} e^{-\frac{\sum_{i=1}^n \left( y_i-x_i^{\mathtt{T}}\theta \right)^2}{2\sigma^2}} \\
&= \left( 2\pi\sigma^2 \right)^{-n/2} e^{-\frac{ (\mathbf{y-X\theta})^{\mathtt{T}} (\mathbf{y-X\theta}) } {2\sigma^2}}
\end{align*}
\]

Maximum Likelihood Estimation

Then our goal is to maximize the probability of the labels under our Gaussian linear regression model w.r.t. \(\theta\) and \(\sigma\).

Instead of minimizing the SSE cost (the lengths of the blue lines), this time we maximize the likelihood (the lengths of the green lines) to optimize the model parameters.

Since the \(\log\) function is monotonically increasing and turns the exponential into a sum, we work with the log-likelihood:
\[
\log p( \mathbf{y}|\mathbf{X,\theta}, \sigma^2 ) = -\frac{n}{2} \log \left( 2\pi\sigma^2 \right) -\frac{ (\mathbf{y-X\theta})^{\mathtt{T}} (\mathbf{y-X\theta}) } {2\sigma^2}
\]
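The log-likelihood above can be written as a short function. This is a sketch with hypothetical toy data and a made-up function name, `gaussian_log_likelihood`. It also checks the closed form against the sum of per-point Gaussian log densities, and shows that for a fixed \(\sigma^2\) the parameters with the smaller squared error have the larger log-likelihood.

```python
import numpy as np

def gaussian_log_likelihood(theta, sigma2, X, y):
    """log p(y | X, theta, sigma^2) for the Gaussian linear model."""
    n = y.shape[0]
    residual = y - X @ theta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - (residual @ residual) / (2 * sigma2)

# Hypothetical toy data (values chosen only for illustration).
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.1, 6.9])
sigma2 = 0.5  # an arbitrary fixed noise variance

theta_ls = np.linalg.solve(X.T @ X, X.T @ y)    # least-squares parameters
theta_other = theta_ls + np.array([0.3, -0.2])  # an arbitrary perturbation

ll_ls = gaussian_log_likelihood(theta_ls, sigma2, X, y)
ll_other = gaussian_log_likelihood(theta_other, sigma2, X, y)

# Cross-check: per-point Gaussian log densities, to be summed.
per_point = -0.5 * np.log(2 * np.pi * sigma2) - (y - X @ theta_ls) ** 2 / (2 * sigma2)
print(ll_ls, ll_other)
```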

MLE of \(\theta\):
\[
\begin{align*}\notag
\frac{\partial {\log p( \mathbf{y}|\mathbf{X,\theta,\sigma^2} )} }{\partial {\theta}} &= \frac{\partial}{\partial \theta} \left[ -\frac{n}{2} \log \left( 2\pi\sigma^2 \right) -\frac{ (\mathbf{y-X\theta})^{\mathtt{T}} (\mathbf{y-X\theta}) } {2\sigma^2} \right] \\
&= 0 - \frac{1}{2\sigma^2} \frac{\partial{(\mathbf{y-X\theta})^{\mathtt{T}} (\mathbf{y-X\theta})}}{\partial{\theta}} \\
&= -\frac{1}{2\sigma^2} \frac{ \partial{ \left( \mathbf{y}^{\mathtt{T}}\mathbf{y} + \theta^{\mathtt{T}} \mathbf{X}^{\mathtt{T}} \mathbf{X\theta} - 2\mathbf{y}^{\mathtt{T}}\mathbf{X\theta} \right) } }{\partial{\theta}} \\
&= -\frac{1}{2\sigma^2} \left[ 0+ 2\left( \mathbf{X^{\mathtt{T}}X} \right)^{\mathtt{T}}\theta - 2\left( \mathbf{y}^{\mathtt{T}}\mathbf{X} \right)^{\mathtt{T}} \right] \\
&= -\frac{1}{2\sigma^2} \left[ 2\mathbf{X^{\mathtt{T}}X\theta} - 2\mathbf{X}^{\mathtt{T}}\mathbf{y} \right] \triangleq 0
\end{align*}
\]
Unsurprisingly, the maximum likelihood estimate is identical to the least-squares estimate.
\[
\hat\theta_{MLE} = \left( \mathbf{X}^{\mathtt{T}}\mathbf{X} \right)^{-1} \mathbf{X}^{\mathtt{T}} \mathbf{y}
\]

Besides telling us where the "line" is, MLE with a Gaussian model also gives us the uncertainty (or confidence) of the prediction \(\mathbf{\hat y}\) as another parameter.
MLE of \(\sigma^2\):
\[
\begin{align*}\notag
\frac{\partial {\log p( \mathbf{y}|\mathbf{X,\theta}, \sigma^2 )} }{\partial {\sigma}} &= \frac{\partial}{\partial \sigma} \left[ -\frac{n}{2} \log \left( 2\pi\sigma^2 \right) -\frac{ (\mathbf{y-X\theta})^{\mathtt{T}} (\mathbf{y-X\theta}) } {2\sigma^2} \right] \\
&= -\frac{n}{2} \frac{1}{2\pi\sigma^2} 4\pi\sigma + 2 \frac{ (\mathbf{y-X\theta})^{\mathtt{T}} (\mathbf{y-X\theta}) }{2\sigma^3} \\
&= -\frac{n}{\sigma} + \frac{ (\mathbf{y-X\theta})^{\mathtt{T}} (\mathbf{y-X\theta}) }{\sigma^3} \triangleq 0
\end{align*}
\]
Thus, we get:
\[
\begin{align*}\notag
\hat\sigma_{MLE}^2 &= \frac1n (\mathbf{y}-\mathbf{X}\hat\theta_{MLE})^{\mathtt{T}} (\mathbf{y}-\mathbf{X}\hat\theta_{MLE}) \\
&= \frac1n \sum_{i=1}^n \left(y_i-\mathbf{x}_i^\mathtt{T}\hat\theta_{MLE} \right)^2
\end{align*}
\]
which is the mean squared error (MSE) of the residuals, i.e. the usual (biased, \(1/n\)) estimate of the noise variance.
However, this uncertainty estimator does not work very well. We'll see another, much more powerful uncertainty estimator later.
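The variance estimate can be checked numerically. The sketch below uses synthetic data with a known noise standard deviation of 0.5 (an assumption made for illustration). It computes \(\hat\sigma^2_{MLE}\) as the mean squared residual and confirms that the derivative of the log-likelihood w.r.t. \(\sigma\) vanishes there.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical synthetic data with known noise variance 0.25.
n = 5000
sigma_true = 0.5
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
theta_true = np.array([1.0, 2.0])
y = X @ theta_true + sigma_true * rng.normal(size=n)

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ theta_hat
sse = residual @ residual

sigma2_mle = sse / n  # MLE of the noise variance (mean squared residual)
sigma_hat = np.sqrt(sigma2_mle)

# Derivative of the log-likelihood w.r.t. sigma, evaluated at the MLE.
score = -n / sigma_hat + sse / sigma_hat ** 3
print(sigma2_mle, score)
```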

Again, we analytically obtain the optimal parameters for the model to describe labeled data points.

Prediction

Now that we have the optimal parameters \(\left(\theta_{MLE},\sigma_{MLE}^2\right)\) of our linear regression model, making a prediction simply means taking the mean of the Gaussian at a given test data point \(\mathbf x_*\):
\[
\hat y_* = \mathbf x_*^{\mathtt T}\theta_{MLE}
\]
with uncertainty \(\sigma_{MLE}^2\).
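Putting training and prediction together, a minimal sketch might look like this. The function names `fit_mle` and `predict` and the toy data are hypothetical. Note that the test point must carry the leading 1 for the intercept, and that the reported variance is the same constant for every test point.

```python
import numpy as np

def fit_mle(X, y):
    """Return (theta_MLE, sigma2_MLE) for the Gaussian linear model."""
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    residual = y - X @ theta
    sigma2 = residual @ residual / y.shape[0]
    return theta, sigma2

def predict(x_star, theta, sigma2):
    """Predictive mean x_*^T theta and its (constant) variance sigma2."""
    return x_star @ theta, sigma2

# Hypothetical toy data; the first column of X is the intercept term.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.1, 6.9])

theta_mle, sigma2_mle = fit_mle(X, y)
x_star = np.array([1.0, 4.0])  # test point x = 4 with leading 1
y_star, var_star = predict(x_star, theta_mle, sigma2_mle)
print(y_star, var_star)
```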

Frequentist Learning

Maximum Likelihood Learning is part of frequentist learning.

Frequentist learning assumes there is a true model with parameter \(\theta_{true}\), and that with enough data we would be able to recover it. The core of learning in this case is to guess / estimate / learn the parameter \(\hat \theta\) of the true model from a finite number of training data points.

Maximum likelihood essentially tries to approximate the model parameter \(\theta_{true}\) by maximizing the likelihood (the joint probability of the data given the parameter), i.e.

Given \(n\) data points \(\mathbf X = [\mathbf x_1, \cdots,\mathbf x_n]\) with corresponding labels \(\mathbf y = [y_1, \cdots, y_n]\), we choose the value of the model parameter \(\theta\) that is most likely to have generated these data points.

Also note that frequentist learning relies on the Law of Large Numbers.

KL Divergence and MLE

Under the i.i.d. assumption, with data \(\mathbf X\) drawn from the distribution \(p(\mathbf X|\theta_{true})\):
\[
\begin{align*}
p(\mathbf X|\theta_{true}) &= \prod_{i=1}^n p(\mathbf x_i|\theta_{true}) \\
\theta_{MLE} &= \arg \underset {\theta}{\max} \prod_{i=1}^n p(\mathbf x_i|\theta) \\
&= \arg \underset {\theta}{\max}\sum_{i=1}^n \log p(\mathbf x_i|\theta)
\end{align*}
\]
Then we subtract the constant \(\sum_{i=1}^n \log p(\mathbf x_i|\theta_{true})\), which does not depend on \(\theta\), and divide by the constant \(n\); neither operation changes the maximizer:
\[
\begin{align*}
\theta_{MLE} &= \arg \underset {\theta}{\max} \frac1 n\sum_{i=1}^n \log p(\mathbf x_i|\theta) -\frac1 n\sum_{i=1}^n \log p(\mathbf x_i|\theta_{true})\\
&= \arg \underset {\theta} {\max} \frac 1 n \sum_{i=1}^n \log \frac{p(\mathbf x_i|\theta)}{p(\mathbf x_i|\theta_{true})}
\end{align*}
\]

Recall the Law of Large Numbers: as \(n\rightarrow \infty\),
\[
\frac 1 n\sum_{i=1}^nx_i\rightarrow\int xp(x)\mathrm dx=\mathbb E[x]
\]
where the \(x_i\) are sampled from \(p(x)\).

Again, frequentist learning assumes that the data points are drawn from the true model, \(\mathbf x_i\sim p(\mathbf x|\theta_{true})\). Hence, as \(n\rightarrow\infty\), the MLE of \(\theta\) becomes
\[
\begin{align*}
\theta_{MLE}&=\arg \underset{\theta}{\max} \int_{\mathbf x} \log \frac{p(\mathbf x|\theta)}{p(\mathbf x|\theta_{true})} p(\mathbf x|\theta_{true}) \mathrm dx \\
&=\arg \underset{\theta}{\min} \int_{\mathbf x} \log \frac{p(\mathbf x|\theta_{true})}{p(\mathbf x|\theta)} p(\mathbf x|\theta_{true}) \mathrm dx \\
&=\arg \underset{\theta}{\min}\ \mathbb E_{p(\mathbf x|\theta_{true})} \left[ \log \frac{p(\mathbf x|\theta_{true})}{p(\mathbf x|\theta)} \right] \\
&=\arg \underset{\theta}{\min}\ \mathrm {KL} \left[ p(\mathbf x|\theta_{true})\ ||\ p(\mathbf x|\theta) \right]
\end{align*}
\]
Therefore, maximizing likelihood is equivalent to minimizing KL divergence.
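This equivalence can be illustrated numerically. The sketch below assumes a unit-variance Gaussian with unknown mean, a model family chosen here only because its KL divergence has the closed form \((\mu_{true}-\mu)^2/2\). For a large sample, the average log-likelihood ratio approaches the negative KL divergence, so both criteria select the same \(\mu\) on a grid.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical true model: N(mu_true, 1).
mu_true = 1.0
x = rng.normal(loc=mu_true, scale=1.0, size=200_000)

def log_pdf(x, mu):
    """Log density of N(mu, 1)."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2

def avg_log_ratio(mu):
    """(1/n) sum_i log[ p(x_i | mu) / p(x_i | mu_true) ]."""
    return np.mean(log_pdf(x, mu) - log_pdf(x, mu_true))

def kl_gauss(mu):
    """Closed-form KL[ N(mu_true, 1) || N(mu, 1) ]."""
    return 0.5 * (mu_true - mu) ** 2

grid = np.linspace(-1.0, 3.0, 41)  # candidate values of mu, step 0.1
ratios = np.array([avg_log_ratio(m) for m in grid])
kls = kl_gauss(grid)

mu_best_likelihood = grid[np.argmax(ratios)]  # maximizes the likelihood
mu_best_kl = grid[np.argmin(kls)]             # minimizes the KL divergence
print(mu_best_likelihood, mu_best_kl)
```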

Entropy and MLE

In the last part, we get
\[
\begin{align*}
\theta_{MLE}&=\arg \underset{\theta}{\min} \int_{\mathbf x} \log \frac{p(\mathbf x|\theta_{true})}{p(\mathbf x|\theta)} p(\mathbf x|\theta_{true}) \mathrm dx \\
&=\arg \underset{\theta}{\min} \int_{\mathbf x} \log p(\mathbf x|\theta_{true}) p(\mathbf x|\theta_{true}) \mathrm dx - \int_{\mathbf x} \log p(\mathbf x|\theta) p(\mathbf x|\theta_{true}) \mathrm dx
\end{align*}
\]
The first integral in the equation above is the negative entropy of the true distribution with parameter \(\theta_{true}\), i.e. the information in the world, while the second integral is the negative cross entropy between the true distribution and the model with parameter \(\theta\), i.e. the information from the model. The equation says that if the information in the world matches the information from the model, then the model has learned!
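The decomposition can be checked on a small discrete example, where the integrals become sums. The two probability vectors below are made up for illustration. The sketch verifies that the KL divergence equals cross entropy minus entropy, and that it is zero when the model matches the true distribution.

```python
import numpy as np

# Hypothetical discrete distributions over three outcomes.
p_true = np.array([0.5, 0.3, 0.2])
p_model = np.array([0.4, 0.4, 0.2])

def entropy(p):
    """H(p) = -sum_k p_k log p_k."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log q_k."""
    return -np.sum(p * np.log(q))

def kl(p, q):
    """KL(p || q) = sum_k p_k log(p_k / q_k)."""
    return np.sum(p * np.log(p / q))

kl_value = kl(p_true, p_model)
gap = cross_entropy(p_true, p_model) - entropy(p_true)
print(kl_value, gap)
```

Since the entropy of the true distribution does not depend on \(\theta\), minimizing the KL divergence over \(\theta\) amounts to minimizing the cross entropy.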

Statistical Quantities of Frequentist Learning

There are 2 quantities that frequentists often estimate:

  • bias
  • variance

Reference: CPSC540, UBC


  1. SSE is widely known but works poorly under certain circumstances, e.g. if the training data contains outliers, the model will be seriously distorted by them.

  2. See one of some interesting explanations here
