DeepLearning Intro - sigmoid and shallow NN
This is part of a series of machine learning summary notes. I will combine the Deep Learning book with the deeplearning.ai open course. Any feedback is welcome!
First, let's go through some basic NN concepts, using a Bernoulli (binary) classification problem as an example.
Activation Function
1. Bernoulli Output
1. Definition
When dealing with a binary classification problem, what activation function should we use in the output layer?
Basically, given \(x \in R^n\), how do we get \(P(y=1|x)\)?
2. Loss function
Let \(\hat{y} = P(y=1|x)\). We would expect the following output:
\[P(y|x) = \begin{cases}
\hat{y} & \text{when } y = 1\\
1-\hat{y} & \text{when } y = 0
\end{cases}\]
The above can be simplified as
\[P(y|x)= \hat{y}^{y}(1-\hat{y})^{1-y}\]
Therefore, the maximum likelihood estimate over m training samples will be
\[\theta_{ML} = \arg\max\prod^{m}_{i=1}P(y^i|x^i)\]
As usual, we take the log of the above and get the following. (For gradient descent the log actually has other advantages, which we will discuss later.)
\[\theta_{ML} = \arg\max\sum^{m}_{i=1} y^i\log(\hat{y}^i)+(1-y^i)\log(1-\hat{y}^i) \]
And the cost function for optimization is the following:
\[J(w,b) = \sum^{m}_{i=1}L(y^i,\hat{y}^i)= -\sum^{m}_{i=1}\left[y^i\log(\hat{y}^i)+(1-y^i)\log(1-\hat{y}^i)\right]\]
The cost function is the sum of the losses over the m training samples, and it measures the performance of the classification algorithm.
And yes, here it is exactly the negative log likelihood. The cost function can differ from the negative log likelihood, for example when we apply regularization, but let's start with the simple version.
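As a quick illustration, here is a minimal NumPy sketch (my own, not from the book or the course; the function and variable names are just illustrative) that computes \(J\) for a small batch:

```python
import numpy as np

def cross_entropy_cost(y_hat, y):
    """Negative log likelihood summed over the m samples.

    y_hat: predicted P(y=1|x) for each sample, shape (m,)
    y:     true labels in {0, 1}, shape (m,)
    """
    eps = 1e-12                              # avoid log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# quick example with 3 samples
y = np.array([1, 0, 1])
y_hat = np.array([0.9, 0.2, 0.7])
print(cross_entropy_cost(y_hat, y))          # ~0.685
```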
So here comes our next problem: how can we get the 1-dimensional \(\log(\hat{y})\), given the input \(x\), which is an n-dimensional vector?
3. Activation function - Sigmoid
Let \(h\) denote the output from the previous hidden layer that goes into the final output layer. A linear transformation is applied to \(h\) before the activation function.
Let \(z = w^Th +b\)
The assumption here is
\[\log(\hat{y}) = \begin{cases}
z & \text{when } y = 1\\
0 & \text{when } y = 0
\end{cases}\]
The above can be simplified as
\[\log(\hat{y}) = yz \quad \to \quad \hat{y} = \exp(yz)\]
This is an unnormalized distribution of \(\hat{y}\). Because \(\hat{y}\) denotes a probability, we need to further normalize it to \([0,1]\):
\[\hat{y} = \frac{\exp(yz)}{\sum^{1}_{y=0}\exp(yz)}
= \frac{\exp(z)}{1+\exp(z)}
= \frac{1}{1+\exp(-z)} = \sigma(z)
\]
Bingo! Here we go, the sigmoid function: \(\sigma(z) = \frac{1}{1+\exp(-z)}\)
\[p(y|x) = \begin{cases}
\sigma(z) & \text{when } y = 1\\
1-\sigma(z) & \text{when } y = 0
\end{cases}\]
The sigmoid function has some pretty cool features, like the following:
\[1- \sigma(x) = \sigma(-x)\]
\[\frac{d}{dx} \sigma(x) = \sigma(x)(1-\sigma(x)) = \sigma(x)\sigma(-x)\]
Using the first feature above, we can further simplify the Bernoulli output into the following:
\[p(y|x) = \sigma((2y-1)z)\]
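Here is a tiny sketch (my own, just to verify the algebra numerically) of the sigmoid function, its two features above, and the compact Bernoulli form:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)

# feature 1: 1 - sigmoid(z) == sigmoid(-z)
assert np.allclose(1 - sigmoid(z), sigmoid(-z))

# feature 2: d/dz sigmoid(z) == sigmoid(z) * (1 - sigmoid(z)),
# checked against a central finite difference
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
assert np.allclose(numeric, sigmoid(z) * (1 - sigmoid(z)))

# compact Bernoulli form: p(y|x) = sigmoid((2y - 1) z)
for y in (0, 1):
    expected = sigmoid(z) if y == 1 else 1 - sigmoid(z)
    assert np.allclose(sigmoid((2 * y - 1) * z), expected)
```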
4. Gradient descent and back propagation
Now we have the target cost function to optimize. How does the NN learn from the training data? The answer is back propagation.
Actually, back propagation is not some fancy method designed only for neural networks. When the training sample is big, we can use the same gradient-based training for linear regression too.
Back propagation iteratively uses the partial derivatives of the cost function to update the parameters, in order to reach a local optimum.

\[
\text{Repeat over the } m \text{ samples:} \quad
w = w - \alpha\frac{\partial J(w,b)}{\partial w}, \quad
b = b - \alpha\frac{\partial J(w,b)}{\partial b}
\]
where \(\alpha\) is the learning rate.
Basically, for each training sample \((x,y)\), we compare \(y\) with the \(\hat{y}\) from the output layer, get the difference, and compute which part of the difference comes from which parameter (via partial derivatives). We then update the parameters accordingly.

And the derivative of the loss can be calculated using the chain rule.
For each training sample, let \(\hat{y} = a = \sigma(z)\).
\[ \frac{\partial L(a,y)}{\partial w} =
\frac{\partial L(a,y)}{\partial a} \cdot
\frac{\partial a}{\partial z} \cdot
\frac{\partial z}{\partial w}\]
Where
1. \(\frac{\partial L(a,y)}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}\),
given the loss function \(L(a,y) = -(y\log(a) + (1-y)\log(1-a))\)
2. \(\frac{\partial a}{\partial z} = \sigma(z)(1-\sigma(z)) = a(1-a)\). See the sigmoid features above.
3. \(\frac{\partial z}{\partial w} = x\), since for this 1-layer network the input to the output layer is \(x\) itself.
Putting them together, we get:
\[ \frac{\partial L(a,y)}{\partial w} = (a-y)x\]
This is exactly the gradient that each training sample \((x,y)\) contributes to the update of the parameter \(w\).
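To double-check the result, here is a small sketch (my own, not from the course) that compares \((a-y)x\) against a numerical gradient of the loss for a single training sample:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    a = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

rng = np.random.default_rng(0)
x = rng.normal(size=3)
y = 1.0
w = rng.normal(size=3)
b = 0.1

a = sigmoid(np.dot(w, x) + b)
analytic = (a - y) * x          # dL/dw from the chain rule above

# finite-difference check, one coordinate at a time
h = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += h
    w_minus[i] -= h
    numeric[i] = (loss(w_plus, b, x, y) - loss(w_minus, b, x, y)) / (2 * h)

assert np.allclose(analytic, numeric, atol=1e-5)
```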
5. Entire workflow
Summarizing everything, a 1-layer binary classification neural network is trained as follows (a minimal code sketch follows the list):
- Forward propagation: From \(x\), we calculate \(\hat{y}= \sigma(z)\)
- Calculate the cost function \(J(w,b)\)
- Back propagation: update parameter \((w,b)\) using gradient descent.
- Keep doing the above until the cost function stops improving (improvement < a certain threshold)
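Below is a minimal NumPy sketch of this workflow for logistic regression (i.e., a 1-layer network). It is my own illustrative code, not from the course; the function names, learning rate, and stopping threshold are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lr=0.1, tol=1e-7, max_iter=10000):
    """X: (m, n) inputs; y: (m,) labels in {0, 1}."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    prev_cost = np.inf
    for _ in range(max_iter):
        # forward propagation
        y_hat = sigmoid(X @ w + b)
        # cost: negative log likelihood, averaged over the batch
        # (the notes above sum instead; averaging only rescales the learning rate)
        cost = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
        # back propagation: dL/dw = (y_hat - y) x per sample, averaged over the batch
        dw = X.T @ (y_hat - y) / m
        db = np.mean(y_hat - y)
        # gradient descent update
        w -= lr * dw
        b -= lr * db
        # stop when the improvement falls below the threshold
        if prev_cost - cost < tol:
            break
        prev_cost = cost
    return w, b

# toy usage: a linearly separable AND-like problem
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 0., 0., 1.])
w, b = train(X, y)
print(sigmoid(X @ w + b).round(2))  # predicted P(y=1|x) for each row
```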
6. What's next?
When the NN has more than 1 layer, there will be hidden layers in between. And to get a non-linear transformation of \(x\), we also need different types of activation functions for the hidden layers.
However, sigmoid is rarely used as a hidden-layer activation function, for the following reasons:
- Vanishing gradient
The reason we can't use the clipped linear function on the left of the table below as an activation function is that its gradient is 0 when \(z>1\) or \(z<0\).
Sigmoid only solves this problem partially, because its gradient \(\to 0\) when \(|z|\) is large (see the short sketch after this list).
| \(p(y=1\|x)= \max\{0,\min\{1,z\}\}\) | \(p(y=1\|x)= \sigma(z)\) |
|---|---|
| (plot of the clipped linear function) | (plot of the sigmoid function) |
- Non-zero-centered output
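As mentioned under the vanishing-gradient point above, the saturation is easy to see numerically; a tiny sketch of my own prints \(\sigma'(z) = \sigma(z)(1-\sigma(z))\) for a few values of \(z\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in (0.0, 2.0, 5.0, 10.0):
    grad = sigmoid(z) * (1 - sigmoid(z))
    print(f"z={z:5.1f}  sigmoid'(z)={grad:.6f}")
# z=  0.0  sigmoid'(z)=0.250000
# z=  2.0  sigmoid'(z)=0.104994
# z=  5.0  sigmoid'(z)=0.006648
# z= 10.0  sigmoid'(z)=0.000045
```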
To be continued
References
- Ian Goodfellow, Yoshua Bengio, Aaron Courville, "Deep Learning"
- Deeplearning.ai https://www.deeplearning.ai/