Deep Learning--week1~week3
week1
Take an image of 64*64 pixels with three color channels (red, green, blue); it corresponds to three 64*64 matrices of real numbers.
To feed these matrices to the algorithm as a single input, unroll their pixel values into one feature vector x.
Reading the elements row by row, first from the red channel, then green, then blue, x becomes a \(64*64*3 \times 1\) column vector, i.e. a vector of dimension 64*64*3.
Let \(n_x = 64*64*3 = 12288\) denote the dimension of the feature vector x.
All the training examples are then represented as: \(X = \begin{bmatrix}\mid & \mid &\mid &&\mid \\ x^{(1)}& x^{(2)}& x^{(3)}& \cdots & x^{(m)}\\ \mid & \mid &\mid &&\mid \end{bmatrix}\) (an \(n_x \times m\) matrix)
(Note that this is not \(X = \begin{bmatrix} (x^{(1)})^T\\ \vdots \\ (x^{(m)})^T \end{bmatrix}\); stacking the examples as columns, as above, makes the computations simpler.)
\(Y=\begin{bmatrix}y^{(1)} & y^{(2)} & \cdots & y^{(m)}\end{bmatrix}\)
The \(\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_{n_x} \\ \end{bmatrix}\) form from the earlier machine-learning course is no longer used; instead use \(\large b = \theta_0, \; w = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_{n_x} \\ \end{bmatrix}\) ( it will be easier to just keep \(b\) and \(w\) as separate parameters )
Then the output: \(\large \hat{y}^{(i)} = \sigma(w^Tx^{(i)}+b),{\rm where\;}\sigma(z^{(i)}) = \frac{1}{1+e^{-z^{(i)}}}\)
\(\text{Given \{}(x^{(1)}, y^{(1)}),\dots,(x^{(m)},y^{(m)})\text{\}, want } \hat{y}^{(i)} \approx y^{(i)}\)
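As a concrete illustration, here is a minimal numpy sketch of building the \(n_x \times m\) matrix \(X\) described above from a batch of images. The image array and labels are made-up assumptions, and the exact element order inside each column depends on how you reshape; it only needs to be consistent across examples.
import numpy as np

m = 10                                    # assumed number of training examples
images = np.random.rand(m, 64, 64, 3)     # hypothetical batch of 64x64 RGB images

# Unroll each image into a length-12288 vector and stack the vectors as columns
X = images.reshape(m, -1).T               # shape (n_x, m) = (12288, 10)
Y = np.random.randint(0, 2, (1, m))       # labels as a 1 x m row vector

print(X.shape, Y.shape)                   # (12288, 10) (1, 10)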
week2
Loss Function/Error Function
Loss Function / Error Function: used to measure how well our algorithm is doing on a single training example. For logistic regression (consistent with the derivation in the Logistic Regression Cost Function section below):
\[
\mathcal{L}(\hat{y},y) = -\big(y\log\hat{y} + (1-y)\log(1-\hat{y})\big)
\]
Cost Function
\[
J(w,b) = \frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)}) = -\frac{1}{m}\sum_{i=1}^m\big[y^{(i)}\log\hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)})\big]
\]
Gradient Descent
See the notes from the ML course; it is essentially the same idea.
Vectorization:
# Non-vectorized (slow)
import numpy as np

n_x = 1000                    # example dimension (assumption)
w = np.random.randn(n_x)
x = np.random.randn(n_x)
b = 0.5

z = 0
for i in range(n_x):
    z += w[i] * x[i]
z += b

# Vectorized (fast)
z = np.dot(w, x) + b
Whenever possible, avoid explicit for-loops (Python is an interpreted language, so they are slow); numpy's built-in functions are both concise and efficient.
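A quick way to see the difference (a rough sketch; the array size is arbitrary and the timings will vary by machine):
import time
import numpy as np

n = 1_000_000
w = np.random.randn(n)
x = np.random.randn(n)

tic = time.time()
z = 0.0
for i in range(n):                 # explicit for-loop
    z += w[i] * x[i]
toc = time.time()
print("for-loop: %.1f ms" % ((toc - tic) * 1000))

tic = time.time()
z = np.dot(w, x)                   # vectorized
toc = time.time()
print("np.dot:   %.1f ms" % ((toc - tic) * 1000))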
Vectorizing Logistic Regression
\(X = \begin{bmatrix} \lvert & \lvert & \cdots & \lvert \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ \lvert & \lvert & \cdots & \lvert \end{bmatrix}, \mathbb{R}^{n_x \times m}\)
\(Z = \begin{bmatrix}z^{(1)} & z^{(2)} & \cdots & z^{(m)} \end{bmatrix} = w^TX + \begin{bmatrix}b &b & \cdots & b \end{bmatrix}\)
\(z^{(i)}\) is the input to the sigmoid function.
\(A = \begin{bmatrix}a^{(1)} & a^{(2)} & \cdots & a^{(m)} \end{bmatrix} = \sigma(Z)\)
(Note that the elements with different superscripts here all lie in the same layer, unlike in the ML course. In \(a^{[j](i)}\), the square brackets denote the layer number and the parentheses denote the \(i\)-th training example.)
import numpy as np
Z = np.dot(w.T, X) + b   # w has shape (n_x, 1), X has shape (n_x, m)
# Python automatically broadcasts the real number b into a 1*m row vector
Gradient Output
\({\rm d}z^{(i)} = a^{(i)} - y^{(i)}\)
\(\begin{align}{\rm d}Z &= \begin{bmatrix}{\rm d}z^{(1)} & {\rm d}z^{(2)} & \cdots & {\rm d}z^{(m)} \end{bmatrix} \\&= A-Y = \begin{bmatrix}a^{(1)} - y^{(1)} & a^{(2)} - y^{(2)} & \cdots & a^{(m)} - y^{(m)} \end{bmatrix} \end{align}\)
\({\rm d}b = \frac{1}{m}\,\text{np.sum(d}Z\text{)}\)
\({\rm d}w = \frac{1}{m}X{\rm d}Z^T\)
Single-iteration, for-loop-free (vectorized) version:
\[
\begin{align}
&\begin{cases}
Z &= w^TX+b = {\rm np.dot(}w{\rm .T,\,}X{\rm )}+b\\
A &= \sigma(Z)\\
{\rm d}Z &= A-Y \\
{\rm d}w &= \frac{1}{m}X\,{\rm d}Z^T\\
{\rm d}b &= \frac{1}{m}{\rm np.sum(d}Z{\rm )}
\end{cases}\\\\
w &:= w - \alpha\,{\rm d}w\\
b &:= b - \alpha\,{\rm d}b
\end{align}
\]
If multiple iterations of gradient descent are needed, the outermost explicit for-loop over iterations is unavoidable.
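Putting the pieces together, a minimal sketch of the full training loop for vectorized logistic regression (the synthetic data, iteration count, and learning rate are assumptions for illustration):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, m = 20, 500                      # assumed feature dimension and sample count
X = np.random.randn(n_x, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)

w = np.zeros((n_x, 1))
b = 0.0
alpha = 0.1

for it in range(1000):                # the outer iteration loop cannot be vectorized away
    Z = np.dot(w.T, X) + b            # (1, m)
    A = sigmoid(Z)                    # (1, m)
    dZ = A - Y
    dw = np.dot(X, dZ.T) / m          # (n_x, 1)
    db = np.sum(dZ) / m
    w = w - alpha * dw
    b = b - alpha * db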
Broadcasting
Use reshape() to make sure matrices have the shape you expect.
An example of numpy's broadcasting mechanism:
>>> import numpy as np
>>> a = np.arange(0,6).reshape(6,1)
>>> a
array([[0],
       [1],
       [2],
       [3],
       [4],
       [5]])
>>> b = np.arange(0,5)
>>> b
array([0, 1, 2, 3, 4])
>>> a * b
array([[ 0,  0,  0,  0,  0],
       [ 0,  1,  2,  3,  4],
       [ 0,  2,  4,  6,  8],
       [ 0,  3,  6,  9, 12],
       [ 0,  4,  8, 12, 16],
       [ 0,  5, 10, 15, 20]])
>>> a + b
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6],
       [3, 4, 5, 6, 7],
       [4, 5, 6, 7, 8],
       [5, 6, 7, 8, 9]])
In other words, when a matrix is added to / subtracted by / multiplied by / divided by a number or a vector, numpy expands the number/vector (by copying it) into a matrix of compatible shape.
Note that this can produce strange bugs in places where you would expect an exception to be thrown:
for example, sometimes I want adding a row vector to a column vector to raise an exception, but numpy silently "computes" it via broadcasting...
numpy pitfalls
>>> import numpy as np
>>> a = np.random.randn(5)
>>> a
array([-0.19837642, -0.16758652,  1.57705505,  0.13033745, -0.81073889])
>>> a.shape
(5,)
# a shape like (5,) is called a "rank 1 array"; it is neither a row vector nor a column vector
>>> a.T
array([-0.19837642, -0.16758652,  1.57705505,  0.13033745, -0.81073889])
# the transpose is the same as 'a' itself
>>> np.dot(a, a.T)
3.2288264718632416
# a plain number, not the 1x1 matrix (e.g. array([[55]])) you might expect
Do not use "rank 1 arrays" with shape (5,) or (n,); instead, explicitly declare an \(m \times n\) matrix:
>>> a = np.random.randn(5,1)
>>> a
array([[ 0.7643396 ],
       [-1.66945103],
       [ 1.66235712],
       [-0.06892102],
       [-1.61347409]])
>>> a.T
array([[ 0.7643396 , -1.66945103,  1.66235712, -0.06892102, -1.61347409]])
Note the difference between array([-0.19837642, -0.16758652, 1.57705505, 0.13033745, -0.81073889]) and array([[ 0.7643396 , -1.66945103, 1.66235712, -0.06892102, -1.61347409]]) (the latter has two square brackets): the former is a rank-1 array while the latter is a genuine \(1 \times 5\) matrix (just as in C, where a matrix is represented by a two-dimensional array). (I also think "rank 1 array" is more accurately rendered as "one-dimensional array".)
You can use an assert statement to check that a vector has the dimensions you expect, e.g. assert(a.shape == (5, 1)).
When you do end up with a rank-1 array, use a.reshape to turn it into an (n,1) or (1,n) array, as in the sketch below.
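A short sketch of both fixes, reshape plus assert (the shapes are just for illustration):
import numpy as np

a = np.random.randn(5)            # rank-1 array, shape (5,)
a = a.reshape(5, 1)               # now an explicit column vector
assert a.shape == (5, 1)          # fail fast if the shape is not what we expect

outer = np.dot(a, a.T)            # (5, 5) outer product, as intended
inner = np.dot(a.T, a)            # (1, 1) matrix such as array([[2.73]])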
Logistic Regression Cost Function
\[
\left.
\begin{array}{l}
\text{If y=1:}\quad p(y|x)=\hat{y}\\
\text{If y=0:}\quad p(y|x)=1-\hat{y}
\end{array}
\right\}
\implies
p(y|x) = \hat{y}^y\cdot (1-\hat{y})^{1-y}\\
\,\\
\begin{align}
\therefore \log(p(y|x)) &= y\cdot \log\,\hat{y} + (1-y)\cdot \log\,(1-\hat{y}) \\
&= -\mathcal{L}(\hat{y},y)
\end{align}
\]
Therefore:
\[
\begin{align}
\log [p(\text{labels in training set})] &= \log \prod_{i=1}^m p(y^{(i)}|x^{(i)})\\
&=\sum_{i=1}^m \log p(y^{(i)}|x^{(i)})\\
&=\sum_{i=1}^m-\mathcal{L}(\hat{y}^{(i)},y^{(i)})\\
&=-\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})
\end{align}\\
\text{Cost: }J(w,b) = \frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})
\]
Minimizing the cost \(J(w,b)\) is therefore equivalent to maximum likelihood estimation on the training set (assuming the training examples are i.i.d.).
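For reference, a minimal numpy sketch of computing this cost from the predictions A and labels Y; the small epsilon guarding against log(0) is my own addition, not part of the course formula:
import numpy as np

def cross_entropy_cost(A, Y, eps=1e-12):
    # A and Y have shape (1, m); eps avoids log(0)
    m = Y.shape[1]
    return -np.sum(Y * np.log(A + eps) + (1 - Y) * np.log(1 - A + eps)) / m

A = np.array([[0.9, 0.2, 0.7]])
Y = np.array([[1.0, 0.0, 1.0]])
print(cross_entropy_cost(A, Y))   # about 0.228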
week3
\(Z^{[j]} = W^{[j]}A^{[j-1]} + b^{[j]} = W^{[j]}\begin{bmatrix} | & | & | & \\ a^{[j-1](1)} & a^{[j-1](2)} & a^{[j-1](3)} & \cdots \\ | & | & | & \end{bmatrix} + b^{[j]} = \begin{bmatrix} | & | & | & \\ z^{[j](1)} & z^{[j](2)} & z^{[j](3)} & \cdots \\ | & | & | & \end{bmatrix}\)
where \((i) \in \{(1),\dots,(m)\}\) indexes the training examples, \([j] \in \{[1],\dots,[n]\}\) indexes the layers, and \(X = A^{[0]}\).
Other Activation Functions
①\(tanh(z)\) function:
\[
\tanh(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}
\]
\(tanh(z)\) centers the data around 0 (the sigmoid function centers it around 0.5).
From now on, the sigmoid function is used only when \(0 \le \hat{y} \le 1\) is required (i.e. for binary-classification output), because \(tanh\) is almost strictly better than sigmoid...
② Rectified Linear Unit (ReLU): \(a = max(0,z)\)
When you are not sure what to use for your hidden layer, you can use the ReLU function.
Disadvantage of ReLU: when \(z\) is negative, the output (and hence the gradient) is 0.
The so-called Leaky ReLU can be used to overcome this disadvantage.
Leaky ReLU: \(a = max(0.01z, z)\)
ReLU keeps the slope from vanishing (the slopes of sigmoid and \(tanh(z)\) tend to 0 as \(z\rightarrow \infty\), which slows down learning).
ReLU is the most commonly used activation function.
③ Linear Activation Function (\(g(z)=z\))
Use a linear activation function (\(g(z)=z\)) at the output layer if and only if you are solving a regression problem. For example, when predicting house prices, y is not limited to 0 and 1 (\(y \in \mathbb{R}\)), so the output can use \(g(z)=z\). Hidden units should not use a linear activation function; use tanh/ReLU/Leaky ReLU instead.
Derivatives of Activation Functions
- Sigmoid: \(g'(z) = \frac{{\rm d}}{{\rm d}z}g(z) = g(z)(1-g(z))\)
- \(tanh(z)\): \(g'(z) = 1-(tanh(z))^2\)
- ReLU: \(g'(z) = \begin{cases}1, &\text{if }z\ge0 \\0, &\text{if }z\lt0 \end{cases}\) (the derivative at \(z=0\) is technically undefined; in practice the slope there is simply taken as 1 or 0)
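A small numpy sketch of these activations and their derivatives, vectorized over an input array z (treating the slope at z = 0 as 1 for ReLU, as noted above):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    g = sigmoid(z)
    return g * (1 - g)

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z >= 0).astype(float)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)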
Gradient Descent for Neural Networks
Parameters : \(w^{[1]},b^{[1]},w^{[2]},b^{[2]}\)
Cost Function : \(J(w^{[1]},b^{[1]},w^{[2]},b^{[2]})= \frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})\)
Gradient Function:
\[
\begin{align}
&\text{Repeat \{}\\
&\quad \text{compute predictions } (\hat{y}^{(i)},\; i = 1,\dots,m) \\
&\quad {\rm d}w^{[1]} = \frac{\partial J}{\partial w^{[1]}},\; {\rm d}b^{[1]} = \frac{\partial J}{\partial b^{[1]}},\;\dots\\
&\quad w^{[1]} = w^{[1]} - \alpha\, {\rm d}w^{[1]}\\
&\quad b^{[1]} = b^{[1]} - \alpha\, {\rm d}b^{[1]}\\
&\quad w^{[2]} = w^{[2]} - \alpha\, {\rm d}w^{[2]}\\
&\quad b^{[2]} = b^{[2]} - \alpha\, {\rm d}b^{[2]}\\
&\text{\}}
\end{align}
\]
Forward Propagation :
\[
\begin{align}
Z^{[1]} &= w^{[1]}X + b^{[1]}\\
A^{[1]} &= g^{[1]}(Z^{[1]})\\
Z^{[2]} &= w^{[2]}A^{[1]} + b^{[2]}\\
A^{[2]} &= g^{[2]}(Z^{[2]}) = \sigma(Z^{[2]})
\end{align}
\]
Backward Propagation :
\[
\begin{align}
{\rm d}Z^{[2]} &= A^{[2]} - Y, \quad Y = \begin{bmatrix}y^{(1)} & y^{(2)} & \dots & y^{(m)}\end{bmatrix}\\
{\rm d}w^{[2]} &= \frac{1}{m}\, {\rm d}Z^{[2]} A^{[1]T}\\
{\rm d}b^{[2]} &= \frac{1}{m}\,\text{np.sum(d}Z^{[2]}\text{, axis=1, keepdims=True)}\\
{\rm d}Z^{[1]} &= w^{[2]T}{\rm d}Z^{[2]}\; * \; g^{[1]\prime}(Z^{[1]}) \quad (\text{element-wise product})\\
{\rm d}w^{[1]} &= \frac{1}{m}\, {\rm d}Z^{[1]}X^T\\
{\rm d}b^{[1]} &= \frac{1}{m}\,\text{np.sum(d}Z^{[1]}\text{, axis=1, keepdims=True)}
\end{align}
\]
Note: axis=1 means summing horizontally, and keepdims=True prevents numpy from outputting a rank-1 array. Alternatively, you can call the reshape function explicitly rather than relying on these parameters.
Another note: since \(A^{[1]} = g^{[1]}(Z^{[1]})\) and, for tanh, \(g^{[1]\prime}(z) = 1-a^2\), we have \(g^{[1]\prime}(Z^{[1]}) = 1-(A^{[1]})^2\), i.e. \({\rm d}Z^{[1]} = w^{[2]T}{\rm d}Z^{[2]}\; * \; (1-(A^{[1]})^2)\).
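A compact numpy sketch of one forward/backward pass of this two-layer network, following the equations above; the layer sizes and data are made-up assumptions, the hidden layer uses tanh, and the output uses sigmoid:
import numpy as np

n_x, n_h, m = 4, 3, 50                       # assumed layer sizes and batch size
X = np.random.randn(n_x, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)

W1 = np.random.randn(n_h, n_x) * 0.01
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(1, n_h) * 0.01
b2 = np.zeros((1, 1))

# Forward propagation
Z1 = np.dot(W1, X) + b1
A1 = np.tanh(Z1)
Z2 = np.dot(W2, A1) + b2
A2 = 1.0 / (1.0 + np.exp(-Z2))               # sigmoid output

# Backward propagation
dZ2 = A2 - Y
dW2 = np.dot(dZ2, A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)      # element-wise product with tanh'(Z1)
dW1 = np.dot(dZ1, X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m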
Random Initialization
For a neural network, if you initialize all the weights to zero and then apply gradient descent, it won't work: every hidden unit in a layer computes the same function and receives the same update, so the units stay identical (the symmetry is never broken). Initialize the weights to small random values instead; the biases can be initialized to zero.
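A minimal initialization sketch consistent with that advice; the layer sizes are assumptions, and the 0.01 scaling factor is the small constant suggested in the course for shallow networks:
import numpy as np

n_x, n_h, n_y = 4, 3, 1                  # assumed input, hidden, and output sizes

W1 = np.random.randn(n_h, n_x) * 0.01    # small random values break the symmetry
b1 = np.zeros((n_h, 1))                  # biases can start at zero
W2 = np.random.randn(n_y, n_h) * 0.01
b2 = np.zeros((n_y, 1))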