week1

Consider an image of 64*64 pixels with three color channels (red, green, blue); it corresponds to three 64*64 matrices of real pixel-intensity values.

To feed these matrices to the algorithm as a vector, unroll their pixel values into a single input vector x.

Reading the elements one by one, row by row, from the red channel to the green and then the blue, gives x as a \(64*64*3 \times 1\) matrix, i.e. a 64*64*3-dimensional column vector.

Use \(n_x = 64*64*3\) to denote the dimension of the feature vector x.
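A minimal sketch of this unrolling (assuming a hypothetical img array of shape (64, 64, 3); the exact element ordering depends on the memory layout, but the resulting shape is what matters here):

import numpy as np

img = np.random.rand(64, 64, 3)   # placeholder for a real 64*64 RGB image
x = img.reshape(-1, 1)            # unroll into a (64*64*3, 1) column vector
n_x = x.shape[0]                  # 12288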

All the training examples are then stacked as columns: \(X = \begin{bmatrix}\mid & \mid &\mid &&\mid \\ x^{(1)}& x^{(2)}& x^{(3)}& \cdots & x^{(m)}\\ \mid & \mid &\mid &&\mid \end{bmatrix}\) (an \(n_x \times m\) matrix)

Note that X is not \(\begin{bmatrix} (x^{(1)})^T\\ \vdots \\ (x^{(m)})^T \end{bmatrix}\); the column-stacking convention above makes the later computations simpler.

The labels are stacked the same way: \(Y=\begin{bmatrix}y^{(1)} & y^{(2)} & \cdots & y^{(m)}\end{bmatrix}\) (a \(1 \times m\) matrix)

The \(\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_{n_x} \\ \end{bmatrix}\) convention from the earlier Machine Learning course is no longer used; instead write \(\large b = \theta_0, \; w = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_{n_x} \\ \end{bmatrix}\) (it will be easier to just keep \(b\) and \(w\) as separate parameters).

The output is then: \(\large \hat{y}^{(i)} = \sigma(w^Tx^{(i)}+b),{\rm\ where\ }\sigma(z^{(i)}) = \frac{1}{1+e^{-z^{(i)}}}\)
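A minimal sketch of this prediction for a single example, with placeholder parameters (the shapes are the point here):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x = 12288
w = np.zeros((n_x, 1))               # placeholder parameters
b = 0.0
x = np.random.rand(n_x, 1)           # placeholder input example
y_hat = sigmoid(np.dot(w.T, x) + b)  # shape (1, 1)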

Given \(\{(x^{(1)}, y^{(1)}),\dots,(x^{(m)},y^{(m)})\}\), want \(\hat{y}^{(i)} \approx y^{(i)}\).


week2

Loss Function/Error Function

Loss (error) function: used to measure how well the algorithm is doing on a single training example

\[\mathcal{L}(\hat{y},y) = -y\log(\hat{y})-(1-y)\log(1-\hat{y})
\]

Cost Function

\[J(w,b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})\right]
\]
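A vectorized sketch of this cost, assuming A holds the predictions \(\hat{y}^{(i)}\) as a (1, m) array and Y holds the labels with the same shape:

import numpy as np

def cost(A, Y):
    m = Y.shape[1]
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

# e.g. cost(np.array([[0.9, 0.2]]), np.array([[1, 0]])) is approximately 0.164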

Gradient Descent

See the notes from the ML course; it is essentially the same thing.

Vectorization:

# Non-vectorized (slow)
z = 0
for i in range(n_x):
    z += w[i] * x[i]
z += b

# Vectorized
import numpy as np
z = np.dot(w, x) + b

Whenever possible, avoid explicit for-loops (Python is an interpreted language, so explicit loops are slow); numpy's built-in functions do the same thing concisely and efficiently.

Vectorizing Logistic Regression

\(X = \begin{bmatrix} \lvert & \lvert & \cdots & \lvert \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ \lvert & \lvert & \cdots & \lvert \end{bmatrix}, \mathbb{R}^{n_x \times m}\)

\(Z = \begin{bmatrix}z^{(1)} & z^{(2)} & \cdots & z^{(m)} \end{bmatrix} = w^TX + \begin{bmatrix}b &b & \cdots & b \end{bmatrix}\)

\(z^{(i)}\) is the input to the sigmoid function.

\(A = \begin{bmatrix}a^{(1)} & a^{(2)} & \cdots & a^{(m)} \end{bmatrix} = \sigma(Z)\)

(The entries with different superscripts here are all in the same layer, unlike the notation in the ML course. In \(a^{[j](i)}\), the square brackets give the layer number and the parentheses give the \(i\)-th training example.)

import numpy as np
Z = np.dot(w.T, X) + b
# Python automatically takes this real number b and expands it out (broadcasts it) to a 1*m row vector

Gradient Output

\({\rm d}z^{(i)} = a^{(i)} - y^{(i)}\)

\(\begin{align}{\rm d}Z &= \begin{bmatrix}{\rm d}z^{(1)} & {\rm d}z^{(2)} & \cdots & {\rm d}z^{(m)} \end{bmatrix} \\&= A-Y = \begin{bmatrix}a^{(1)} - y^{(1)} & a^{(2)} - y^{(2)} & \cdots & a^{(m)} - y^{(m)} \end{bmatrix} \end{align}\)

\({\rm d}b = \frac{1}{m}\text{np.sum(}{\rm d}Z\text{)}\)

\({\rm d}w = \frac{1}{m}X{\rm d}Z^T\)

A single iteration without any for-loops (vectorized):

\[\begin{align}
\downarrow&\begin{cases}
Z & = w^TX+b\\
& = {\rm np.dot(}w{\rm .T, }X{\rm)}+b\\
A & = \sigma(Z)\\
{\rm d}Z &= A-Y \\
{\rm d}w &= \frac{1}{m}X{\rm d}Z^T\\
{\rm d}b &= \frac{1}{m}\text{np.sum(}{\rm d}Z\text{)}\\
\end{cases}\\\\
w& := w - \alpha{\rm d}w\\
b &:= b - \alpha{\rm d}b
\end{align}
\]

For multiple iterations, the outermost explicit for-loop over the iterations cannot be avoided.
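Putting the pieces together, a minimal training-loop sketch (hypothetical helper/variable names; X is \(n_x \times m\), Y is \(1 \times m\)):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, Y, alpha=0.01, num_iterations=1000):
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(num_iterations):   # the outer loop over iterations stays explicit
        Z = np.dot(w.T, X) + b        # (1, m)
        A = sigmoid(Z)                # (1, m)
        dZ = A - Y                    # (1, m)
        dw = np.dot(X, dZ.T) / m      # (n_x, 1)
        db = np.sum(dZ) / m           # scalar
        w = w - alpha * dw
        b = b - alpha * db
    return w, b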

Broadcasting

Use reshape() to make sure a matrix has the dimensions you expect.

An example illustrating numpy's broadcasting mechanism:

>>> import numpy as np
>>> a = np.arange(0,6).reshape(6,1)
>>> a
array([[0],
       [1],
       [2],
       [3],
       [4],
       [5]])
>>> b = np.arange(0,5)
>>> b
array([0, 1, 2, 3, 4])
>>> a * b
array([[ 0,  0,  0,  0,  0],
       [ 0,  1,  2,  3,  4],
       [ 0,  2,  4,  6,  8],
       [ 0,  3,  6,  9, 12],
       [ 0,  4,  8, 12, 16],
       [ 0,  5, 10, 15, 20]])
>>> a + b
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6],
       [3, 4, 5, 6, 7],
       [4, 5, 6, 7, 8],
       [5, 6, 7, 8, 9]])

In other words, when a matrix is combined with a number or a vector via +, -, *, or /, numpy expands the number/vector (by replicating it) into a matrix of compatible shape.

Note that this can produce strange bugs in places where you would expect an exception to be thrown:

For example, sometimes I want adding a row vector to a column vector to raise an exception, but numpy just computes a result via broadcasting...

numpy pitfalls

>>> import numpy as np
>>> a = np.random.randn(5)
>>> a
array([-0.19837642, -0.16758652,  1.57705505,  0.13033745, -0.81073889])
>>> a.shape
(5,)
# which is called a "rank 1 array" in Python and is neither a row vector nor a column vector
>>> a.T
array([-0.19837642, -0.16758652,  1.57705505,  0.13033745, -0.81073889])
# which is the same as 'a' itself
>>> np.dot(a, a.T)
3.2288264718632416
# this is a number, not the 1x1 matrix (like array([[55]])) you might expect

Don't use "rank 1 arrays" with shapes like (5,) or (n,); instead, state explicitly that you want an \(m \times n\) matrix:

>>> a = np.random.randn(5,1)
>>> a
array([[ 0.7643396 ],
       [-1.66945103],
       [ 1.66235712],
       [-0.06892102],
       [-1.61347409]])
>>> a.T
array([[ 0.7643396 , -1.66945103,  1.66235712, -0.06892102, -1.61347409]])

Note the difference between array([-0.19837642, -0.16758652, 1.57705505, 0.13033745, -0.81073889]) and array([[ 0.7643396 , -1.66945103, 1.66235712, -0.06892102, -1.61347409]]) (the latter has two pairs of square brackets): the former is a rank 1 array while the latter is a genuine \(1 \times 5\) matrix (just as in C, where a matrix is represented by a two-dimensional array). (Incidentally, "rank 1 array" here really just means a one-dimensional array.)

You can use assert statements to make sure a vector or matrix has the expected dimensions.

When you do end up with a rank 1 array, you can use a.reshape to turn it into an (n,1) array or a (1,n) array, as sketched below.
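A short sketch combining both fixes (assert and reshape):

import numpy as np

a = np.random.randn(5)       # rank 1 array, shape (5,)
a = a.reshape(5, 1)          # now an explicit column vector
assert a.shape == (5, 1)     # fails loudly if the shape is not what we expect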

Logistic Regression Cost Function

\[\left.
\begin{array}{l}
\text{If } y=1:\quad p(y|x)=\hat{y}\\
\text{If } y=0:\quad p(y|x)=1-\hat{y}
\end{array}
\right\}
p(y|x) = \hat{y}^y\cdot (1-\hat{y})^{1-y}\\
\,\\
\begin{align}
\therefore \log(p(y|x)) &= y\log\hat{y} + (1-y)\log(1-\hat{y}) \\
&= -\mathcal{L}(\hat{y},y)
\end{align}
\]

Therefore:

\[\begin{align}
\log[p(\text{labels in training set})] &= \log \prod_{i=1}^m p(y^{(i)}|x^{(i)})\\
&=\sum_{i=1}^m \log p(y^{(i)}|x^{(i)})\\
&=\sum_{i=1}^m-\mathcal{L}(\hat{y}^{(i)},y^{(i)})\\
&=-\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})
\end{align}\\
\text{Cost: }J(w,b) = \frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})
\]

Minimizing the cost J is therefore equivalent to maximum likelihood estimation (assuming the training examples are i.i.d.).


week3

\(Z^{[j]} = W^{[j]}A^{[j-1]} + b^{[j]} = W^{[j]}\begin{bmatrix} | & | & | & \\ a^{[j-1](1)} & a^{[j-1](2)} & a^{[j-1](3)} & \cdots \\ | & | & | & \end{bmatrix} + b^{[j]} = \begin{bmatrix} | & | & | & \\ z^{[j](1)} & z^{[j](2)} & z^{[j](3)} & \cdots \\ | & | & | & \end{bmatrix}\)

where \((i)\) runs over the training examples \((1),\dots,(m)\), \([j]\) runs over the layers \([1],\dots,[n]\), and \(X = A^{[0]}\).
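A generic sketch of one such layer step (hypothetical names: W_j is \(n_j \times n_{j-1}\), A_prev is \(n_{j-1} \times m\), b_j is \(n_j \times 1\), and g is the layer's activation function):

import numpy as np

def layer_forward(W_j, A_prev, b_j, g):
    Z_j = np.dot(W_j, A_prev) + b_j   # b_j is broadcast across the m columns
    A_j = g(Z_j)                      # apply the layer's activation function
    return Z_j, A_j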

Other Activation Functions

①\(tanh(z)\) function:

\[a= tanh(z)=\frac{e^z -e^{-z}}{e^z +e^{-z}},\quad tanh(z) \in (-1,1),\; tanh(0)=0
\]

\(tanh(z)\) centers the data around 0 (the sigmoid function centers it around 0.5).

From now on, use the sigmoid function only when the output must satisfy \(0 \le \hat{y} \le 1\) (i.e., the output layer of binary classification), because \(tanh\) is almost strictly better than sigmoid elsewhere...

②Rectified Linear Unit (ReLU): \(a = max(0,z)\)

When you are not sure what to use for your hidden layers, use the ReLU function.

Disadvantage of ReLU: when \(z\) is negative, the output (and hence the slope) is 0, so those units learn slowly.

A variant called Leaky ReLU can be used to overcome the disadvantage above.

​ Leaky ReLU: \(a = max(0.01z, z)\)

ReLU keeps the slope from shrinking (for Sigmoid and \(tanh(z)\), the slope tends to 0 as \(z\rightarrow \infty\), which slows learning down).

ReLU is the most commonly used activation function.

③Linear Activation Function (\(g(z)=z\)):

Use a linear activation function (\(g(z)=z\)) only in the output layer, and only when solving a regression problem. For example, when predicting house prices, y is not restricted to 0 and 1 (\(y \in \mathbb{R}\)), so \(g(z)=z\) can be used for the output. Hidden units should not use a linear activation function; use tanh/ReLU/Leaky ReLU instead.

Derivatives of Activation Functions

  • Sigmoid: \(g'(z) = \frac{{\rm d}}{{\rm d}z}g(z) = g(z)(1-g(z))\)

  • \(tanh(z)\): \(g'(z) = 1-(tanh(z))^2\)

  • ReLU: \(g'(z) = \begin{cases}1, \text{if }z\ge0 \\0, \text{if }z\lt0 \end{cases}\)
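A hedged numpy sketch of these activations and their derivatives (the ReLU derivative at \(z=0\) is taken to be 1 here, as in the formula above):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    g = sigmoid(z)
    return g * (1 - g)

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z >= 0).astype(float)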

Gradient Descent for Neural Networks

Parameters : \(w^{[1]},b^{[1]},w^{[2]},b^{[2]}\)

Cost Function : \(J(w^{[1]},b^{[1]},w^{[2]},b^{[2]})= \frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})\)

Gradient Function:

\[\begin{align}
&\text{Repeat \{}\\
&\quad \text{compute predictions } (\hat{y}^{(i)}, i = 1,\dots,m) \\
&\quad {\rm d}w^{[1]} = \frac{\partial J}{\partial w^{[1]}}, {\rm d}b^{[1]} = \frac{\partial J}{\partial b^{[1]}},\dots\\
&\quad w^{[1]} = w^{[1]} - \alpha {\rm d}w^{[1]}\\
&\quad b^{[1]} = b^{[1]} - \alpha {\rm d}b^{[1]}\\
&\quad w^{[2]} = w^{[2]} - \alpha {\rm d}w^{[2]}\\
&\quad b^{[2]} = b^{[2]} - \alpha {\rm d}b^{[2]}\\
\text{\}}
\end{align}
\]

Forward Propagation :

\[\begin{align}
Z^{[1]} &= w^{[1]}X + b^{[1]}\\
A^{[1]} &= g^{[1]}(Z^{[1]})\\
Z^{[2]} &= w^{[2]}A^{[1]} + b^{[2]}\\
A^{[2]} &= g^{[2]}(Z^{[2]}) = \sigma(Z^{[2]})
\end{align}
\]
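A minimal sketch of this forward pass, assuming a tanh hidden layer and parameters w1 (\(n_h \times n_x\)), b1 (\(n_h \times 1\)), w2 (\(1 \times n_h\)), b2 (\(1 \times 1\)):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, w1, b1, w2, b2):
    Z1 = np.dot(w1, X) + b1   # (n_h, m), b1 is broadcast across the m columns
    A1 = np.tanh(Z1)          # hidden-layer activations
    Z2 = np.dot(w2, A1) + b2  # (1, m)
    A2 = sigmoid(Z2)          # output-layer predictions
    return Z1, A1, Z2, A2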

Backward Propagation :

\[\begin{align}
{\rm d}Z^{[2]} &= A^{[2]} - Y, \quad Y = \begin{bmatrix}y^{(1)} & y^{(2)} & \dots & y^{(m)}\end{bmatrix}\\
{\rm d}w^{[2]} &= \frac{1}{m} {\rm d}Z^{[2]} A^{[1]T}\\
{\rm d}b^{[2]} &= \frac{1}{m}\text{np.sum(d}Z^{[2]}\text{,axis=1,keepdims=True)}\\
{\rm d}Z^{[1]} &= w^{[2]T}{\rm d}Z^{[2]}\; .* \; g^{[1]\prime}(Z^{[1]})\\
{\rm d}w^{[1]} &= \frac{1}{m} {\rm d}Z^{[1]}X^T\\
{\rm d}b^{[1]} &= \frac{1}{m}\text{np.sum(d}Z^{[1]}\text{,axis=1,keepdims=True)}\\
\end{align}
\]

Note: axis=1 means summing horizontally, and keepdims=True prevents numpy from producing a rank 1 array. You can also call reshape explicitly instead of relying on these parameters.

Another note: since \(A^{[1]} = g^{[1]}(Z^{[1]})\) and, for tanh, \(g^{[1]\prime}(z) = 1-a^2\), we get \(g^{[1]\prime}(Z^{[1]}) = 1-(A^{[1]})^2\), i.e. \({\rm d}Z^{[1]} = w^{[2]T}{\rm d}Z^{[2]}\; .* \; (1-(A^{[1]})^2)\).
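Correspondingly, a sketch of the backward pass above (tanh hidden layer, so \(g^{[1]\prime}(Z^{[1]}) = 1-(A^{[1]})^2\); variable names mirror the forward sketch):

import numpy as np

def backward(X, Y, A1, A2, w2):
    m = X.shape[1]
    dZ2 = A2 - Y                                   # (1, m)
    dw2 = np.dot(dZ2, A1.T) / m                    # (1, n_h)
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m   # (1, 1)
    dZ1 = np.dot(w2.T, dZ2) * (1 - A1 ** 2)        # (n_h, m), element-wise product
    dw1 = np.dot(dZ1, X.T) / m                     # (n_h, n_x)
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m   # (n_h, 1)
    return dw1, db1, dw2, db2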

Random Initialization

For a neural network, if you initialize all the weights to zero and then apply gradient descent, it won't work: every hidden unit in a layer computes the same function and receives the same update, so the symmetry is never broken. Initialize the weights to small random values instead (the biases may start at zero).
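A minimal initialization sketch (small random weights break the symmetry; the biases can start at zero), with hypothetical layer sizes n_x, n_h, n_y:

import numpy as np

n_x, n_h, n_y = 2, 4, 1                 # hypothetical layer sizes
w1 = np.random.randn(n_h, n_x) * 0.01   # small random values break the symmetry
b1 = np.zeros((n_h, 1))                 # biases may be initialized to zero
w2 = np.random.randn(n_y, n_h) * 0.01
b2 = np.zeros((n_y, 1))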
