This is a series of machine learning summary notes. I will combine the Deep Learning book with the deeplearning.ai open course. Any feedback is welcome!

First, let's go through some basic NN concepts, using the Bernoulli (binary) classification problem as an example.

Activation Function

1. Bernoulli Output

1. Definition

When dealing with a binary classification problem, what activation function should we use in the output layer?
Basically, given \(x \in R^n\), how do we get \(P(y=1|x)\)?

2. Loss function

Let \(\hat{y} = P(y=1|x)\). We would expect the following output:
\[P(y|x) = \begin{cases}
\hat{y} & \text{when } y = 1\\
1-\hat{y} & \text{when } y = 0
\end{cases}
\]
The above can be written compactly as

\[P(y|x)= \hat{y}^{y}(1-\hat{y})^{1-y}\]

Therefore, the maximum likelihood estimate over \(m\) training samples is

\[\theta_{ML} = \arg\max_{\theta}\prod^{m}_{i=1}P(y^i|x^i)\]

As usual, we take the log of the above function and get the following. Actually, the log has other advantages for gradient descent, which we will discuss later.

\[\theta_{ML} = \arg\max_{\theta}\sum^{m}_{i=1} y^i\log(\hat{y}^i)+(1-y^i)\log(1-\hat{y}^i) \]

And the cost function for optimization is the following:
\[J(w,b) = \sum ^{m}_{i=1}L(y^i,\hat{y}^i)= -\sum^{m}_{i=1}\left[y^i\log(\hat{y}^i)+(1-y^i)\log(1-\hat{y}^i)\right]\]

The cost function is the sum of the loss over the \(m\) training samples, and it measures the performance of the classification algorithm.
And yes, here it is exactly the negative log likelihood. The cost function can differ from the negative log likelihood, for example when we apply regularization, but let's start with the simple version.
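As a quick sanity check, here is a minimal NumPy sketch of this cost. The name `binary_cross_entropy` and the small `eps` clipping (to avoid `log(0)`) are my own additions, not part of the formula above.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Negative log likelihood of m Bernoulli samples (summed, as in J(w,b) above).

    y     : true labels in {0, 1}, shape (m,)
    y_hat : predicted P(y=1|x),    shape (m,)
    eps   : tiny constant to avoid log(0); a numerical safeguard only
    """
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# confident, correct predictions give a small cost; confident, wrong ones a large cost
print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.1])))  # ~0.21
print(binary_cross_entropy(np.array([1, 0]), np.array([0.1, 0.9])))  # ~4.61
```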

So here comes our next problem: how can we get the 1-dimensional \(\log(\hat{y})\), given the input \(x\), which is an n-dimensional vector?

3. Activation function - Sigmoid

Let \(h\) denote the output from the previous hidden layer that goes into the final output layer. A linear transformation is applied to \(h\) before the activation function:
let \(z = w^Th +b\).

The assumption here is
\[\log(\hat{y}) = \begin{cases}
z & \text{when } y = 1\\
0 & \text{when } y = 0
\end{cases}
\]
The above can be simplified as
\[\log(\hat{y}) = yz\quad \to \quad \hat{y} = \exp(yz)\]

This is an unnormalized distribution. Because \(\hat{y}\) denotes a probability, we need to normalize it so that the probabilities over \(y \in \{0,1\}\) sum to 1 and \(\hat{y} \in [0,1]\):
\[\hat{y} = \frac{\exp(z)} {\sum^1_{y=0}\exp(yz)}
=\frac{\exp(z)}{1+\exp(z)}
= \frac{1}{1+\exp(-z)} = \sigma(z)
\]

Bingo! Here we go, the sigmoid function: \(\sigma(z) = \frac{1}{1+\exp(-z)}\)

\[p(y|x) = \begin{cases}
\sigma(z) & \text{when } y = 1\\
1-\sigma(z) & \text{when } y = 0
\end{cases}
\]
The sigmoid function has many pretty cool properties, such as the following:
\[ 1- \sigma(x) = \sigma(-x) \\
\frac{d}{dx} \sigma(x) = \sigma(x)(1-\sigma(x)) \\
\quad \quad = \sigma(x)\sigma(-x)
\]

Using the first property above, we can further simplify the Bernoulli output into the following:
\[p(y|x) = \sigma((2y-1)z)\]
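Here is a minimal NumPy sketch to verify these properties numerically; the `sigmoid` helper and the finite-difference check are my own additions, for illustration only.

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z))"""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)

# property 1: 1 - sigma(z) = sigma(-z)
assert np.allclose(1 - sigmoid(z), sigmoid(-z))

# property 2: d/dz sigma(z) = sigma(z)(1 - sigma(z)), checked by central finite differences
h = 1e-6
numeric_grad = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
assert np.allclose(numeric_grad, sigmoid(z) * (1 - sigmoid(z)), atol=1e-6)

# the compact Bernoulli output: p(y|x) = sigma((2y - 1) z)
for y in (0, 1):
    expected = sigmoid(z) if y == 1 else 1 - sigmoid(z)
    assert np.allclose(sigmoid((2 * y - 1) * z), expected)
```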

4. Gradient descent and back propagation

Now we have a target cost function to optimize. How does the NN learn from training data? The answer is back propagation.

Actually, back propagation is not some fancy method designed only for neural networks. When the training set is big, we can use it to train linear regression too.

Back propagation iteratively uses the partial derivatives of the cost function to update the parameters, in order to reach a local optimum.

\[
\text{Loop over the } m \text{ samples:} \\
w = w - \frac{\partial J(w,b)}{\partial w} \\
b = b - \frac{\partial J(w,b)}{\partial b}
\]

Basically, for each training sample \((x,y)\), we compare \(y\) with the \(\hat{y}\) from the output layer, get the difference, and compute how much of that difference comes from each parameter (via partial derivatives). Then we update the parameters accordingly.

And the gradient of the loss with respect to \(w\) can be calculated using the chain rule.
For each training sample, let \(\hat{y}=a = \sigma(z)\):
\[ \frac{\partial L(a,y)}{\partial w} =
\frac{\partial L(a,y)}{\partial a} \cdot
\frac{\partial a}{\partial z} \cdot
\frac{\partial z}{\partial w}\]
Where
1. \(\frac{\partial L(a,y)}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}\),
given that the loss function is \(L(a,y) = -(y\log(a) + (1-y)\log(1-a))\).

2. \(\frac{\partial a}{\partial z} = \sigma(z)(1-\sigma(z)) = a(1-a)\);
see the sigmoid properties above.

3. \(\frac{\partial z}{\partial w} = x\), since for this 1-layer network the input to the output unit is \(h = x\).

Putting them together, we get:
\[ \frac{\partial L(a,y)}{\partial w} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \cdot a(1-a) \cdot x = \big(-y(1-a) + (1-y)a\big)x = (a-y)x\]

This is exactly the gradient contribution of each training sample \((x,y)\) to the update of the parameter \(w\).
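As a sanity check, here is a small NumPy sketch that compares the analytic per-sample gradient \((a-y)x\) against a numerical gradient of the loss; all names (`loss`, the random toy sample, etc.) are my own, illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Per-sample loss L(a, y) with a = sigma(w^T x + b)."""
    a = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1          # one toy training sample
w, b = rng.normal(size=3), 0.5        # current parameters

a = sigmoid(np.dot(w, x) + b)
analytic_dw = (a - y) * x             # dL/dw from the chain rule above
analytic_db = a - y                   # dL/db follows the same chain, with dz/db = 1

# central finite differences along each coordinate of w
h = 1e-6
numeric_dw = np.array([
    (loss(w + h * e, b, x, y) - loss(w - h * e, b, x, y)) / (2 * h)
    for e in np.eye(3)
])
assert np.allclose(analytic_dw, numeric_dw, atol=1e-6)
```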

5. Entire workflow

Summarizing everything, a 1-layer binary classification neural network is trained as follows:

  • Forward propagation: from \(x\), calculate \(\hat{y}= \sigma(z)\)
  • Calculate the cost function \(J(w,b)\)
  • Back propagation: update the parameters \((w,b)\) using gradient descent
  • Keep doing the above until the cost function stops improving (improvement < a certain threshold); a minimal sketch of this loop is shown below
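Below is a minimal NumPy sketch of this whole loop for the 1-layer network (logistic regression), assuming batch gradient descent; the learning rate `lr`, the stopping tolerance `tol`, and the toy data are my own additions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lr=0.01, tol=1e-6, max_iter=10000):
    """X: (m, n) inputs; y: (m,) labels in {0, 1}. Returns learned (w, b)."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    prev_cost = np.inf
    for _ in range(max_iter):
        # forward propagation: y_hat = sigma(w^T x + b) for every sample
        a = sigmoid(X @ w + b)
        # cost J(w, b): negative log likelihood, summed over the m samples
        a_safe = np.clip(a, 1e-12, 1 - 1e-12)      # numerical safeguard against log(0)
        cost = -np.sum(y * np.log(a_safe) + (1 - y) * np.log(1 - a_safe))
        # back propagation: dJ/dw = sum_i (a_i - y_i) x_i, dJ/db = sum_i (a_i - y_i)
        w -= lr * (X.T @ (a - y))
        b -= lr * np.sum(a - y)
        # stop once the cost stops improving by more than the threshold
        if prev_cost - cost < tol:
            break
        prev_cost = cost
    return w, b

# toy usage: labels depend only on the sign of the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)
w, b = train(X, y)
print("training accuracy:", ((sigmoid(X @ w + b) > 0.5) == y).mean())
```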

6. What's next?

When the NN has more than 1 layer, there will be hidden layers in between. And to get non-linear transformations of \(x\), we also need activation functions for the hidden layers.

However, sigmoid is rarely used as a hidden-layer activation function, for the following reasons:

  • Vanishing gradient
    The reason we can't use the piecewise-linear unit \(p(y=1|x)= \max\{0,\min\{1,z\}\}\) as an activation function is that its gradient is 0 whenever \(z>1\) or \(z<0\).
    Sigmoid, \(p(y=1|x)= \sigma(z)\), only solves this problem partially, because its gradient also approaches 0 when \(|z|\) is large.
  • Non-zero-centered output

To be continued


Reference

  1. Ian Goodfellow, Yoshua Bengio, Aaron Courville, "Deep Learning"
  2. Deeplearning.ai https://www.deeplearning.ai/
