week1

Suppose an image is 64*64 pixels with red, green, and blue color channels; it corresponds to three 64*64 matrices of real numbers.

To feed these matrices to the algorithm, unroll their pixel values into a single vector x that serves as the input.

Reading the entries row by row, first from the red channel, then green, then blue, into the vector x, x becomes a \(64*64*3 \times 1\) column vector, i.e. a 64*64*3-dimensional vector.

Let \(n_x = 64*64*3\) denote the dimension of the feature vector x.

All the training examples are stacked into \(X = \begin{bmatrix}\mid & \mid &\mid &&\mid \\ x^{(1)}& x^{(2)}& x^{(3)}& \cdots & x^{(m)}\\ \mid & \mid &\mid &&\mid \end{bmatrix}\) (an \(n_x \times m\) matrix).

Note that this is not \(X = \begin{bmatrix} (x^{(1)})^T\\ \vdots \\ (x^{(m)})^T \end{bmatrix}\); the column-stacking convention above makes the later computations simpler.

\(Y=\begin{bmatrix}y^{(1)} & y^{(2)} & \cdots & y^{(m)}\end{bmatrix}\) (a \(1 \times m\) matrix of labels)
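
For concreteness, here is a minimal sketch of how \(X\) and \(Y\) could be built with NumPy (assuming imgs is a list of m images stored as (64, 64, 3) arrays and labels is a list of m 0/1 labels; both names are hypothetical):

import numpy as np

# imgs: list of m arrays of shape (64, 64, 3); labels: list of m 0/1 ints (hypothetical names)
m = len(imgs)
n_x = 64 * 64 * 3
# flatten each image into a length-n_x vector and stack them as columns
X = np.stack([img.reshape(-1) for img in imgs], axis=1)   # shape (n_x, m)
Y = np.array(labels).reshape(1, m)                        # shape (1, m)
assert X.shape == (n_x, m) and Y.shape == (1, m)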

The \(\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_{n_x} \\ \end{bmatrix}\) notation from the earlier machine learning course is no longer used; instead write \(\large b = \theta_0, \; w = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_{n_x} \\ \end{bmatrix}\) (it will be easier to just keep \(b\) and \(w\) as separate parameters).

Then the output: \(\large \hat{y}^{(i)} = \sigma(w^Tx^{(i)}+b),\ {\rm where\;}\sigma(z^{(i)}) = \frac{1}{1+e^{-z^{(i)}}}\)

\(\text{Given \{}(x^{(1)}, y^{(1)}),\dots,(x^{(m)},y^{(m)})\text{\}, want } \hat{y}^{(i)} \approx y^{(i)}\)


week2

Loss Function/Error Function

Loss Function / Error Function: used to measure how well our algorithm is doing on a single training example.

\[{\cal L}(\hat{y},y) = -y\cdot log(\hat{y})-(1-y)\cdot log(1-\hat{y})
\]

Cost Function

\[J(w,b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\, log\,\hat{y}^{(i)}+(1-y^{(i)})\, log\,(1-\hat{y}^{(i)})\right]
\]
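
A minimal NumPy sketch of this cost function (assuming y_hat and y are (1, m) arrays of predictions and labels; the small eps is my addition to keep the log finite and is not part of the formula):

import numpy as np

def cost(y_hat, y, eps=1e-12):
    # average cross-entropy loss over the m training examples
    m = y.shape[1]
    losses = -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    return np.sum(losses) / m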

Gradient Descent

See the notes from the ML course; it is essentially the same.

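As a quick recap of those notes, each iteration of gradient descent moves the parameters a small step against the gradient of the cost:

\[w := w - \alpha \frac{\partial J(w,b)}{\partial w}, \qquad b := b - \alpha \frac{\partial J(w,b)}{\partial b}
\]

where \(\alpha\) is the learning rate.
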
Vectorization:

# Non-vectorized (slow)
z = 0
for i in range(n_x):
    z += w[i] * x[i]
z += b

# Vectorized
import numpy as np
z = np.dot(w, x) + b

Whenever possible, avoid explicit for-loops (Python is an interpreted language, so they are slow); NumPy's built-in functions give the same result concisely and efficiently.

Vectorizing Logistic Regression

\(X = \begin{bmatrix} \lvert & \lvert & \cdots & \lvert \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ \lvert & \lvert & \cdots & \lvert \end{bmatrix}, \mathbb{R}^{n_x \times m}\)

\(Z = \begin{bmatrix}z^{(1)} & z^{(2)} & \cdots & z^{(m)} \end{bmatrix} = w^TX + \begin{bmatrix}b &b & \cdots & b \end{bmatrix}\)

\(z^{(i)}\) is the input to the sigmoid function.

\(A = \begin{bmatrix}a^{(1)} & a^{(2)} & \cdots & a^{(m)} \end{bmatrix} = \sigma(Z)\)

(Note that the elements with different superscripts here all belong to the same layer, which differs a bit from the ML course. In \(a^{[j](i)}\), the square brackets denote the layer number and the parentheses denote the \(i\)-th training example.)

import numpy as np
Z = np.dot(w.T, X) + b
# Python automatically takes this real number b and expands it out to a 1*m row vector

Gradient Output

\({\rm d}z^{(i)} = a^{(i)} - y^{(i)}\)

\(\begin{align}{\rm d}Z &= \begin{bmatrix}{\rm d}z^{(1)} & {\rm d}z^{(2)} & \cdots & {\rm d}z^{(m)} \end{bmatrix} \\&= A-Y = \begin{bmatrix}a^{(1)} - y^{(1)} & a^{(2)} - y^{(2)} & \cdots & a^{(m)} - y^{(m)} \end{bmatrix} \end{align}\)

\({\rm d}b = \frac{1}{m}{\rm np.sum(d}Z{\rm )}\)

\({\rm d}w = \frac{1}{m}X{\rm d}Z^T\)

A for-loop-free (vectorized) single iteration:

\[\begin{align}
\downarrow&\begin{cases}
Z & = w^TX+b\\
& = {\rm np.dot(}w{\rm .T, }X{\rm )}+b\\
A & = \sigma(Z)\\
{\rm d}Z &= A-Y \\
{\rm d}w &= \frac{1}{m}X{\rm d}Z^T\\
{\rm d}b &= \frac{1}{m}{\rm np.sum(d}Z{\rm )}\\
\end{cases}\\\\
w& := w - \alpha{\rm d}w\\
b &:= b - \alpha{\rm d}b
\end{align}
\]

To run multiple iterations, the outermost explicit for-loop is unavoidable.
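
Putting the pieces together, here is a sketch of a complete vectorized training loop (the function and variable names are my own, not from the course):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, Y, alpha=0.01, num_iters=1000):
    # X: (n_x, m) data matrix, Y: (1, m) labels
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(num_iters):       # this outer loop cannot be vectorized away
        Z = np.dot(w.T, X) + b       # (1, m)
        A = sigmoid(Z)               # (1, m)
        dZ = A - Y                   # (1, m)
        dw = np.dot(X, dZ.T) / m     # (n_x, 1)
        db = np.sum(dZ) / m          # scalar
        w -= alpha * dw
        b -= alpha * db
    return w, b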

Broadcasting

Use reshape() to make sure a matrix has the dimensions you expect.

An example to illustrate NumPy's broadcasting mechanism:

>>> import numpy as np
>>> a = np.arange(0,6).reshape(6,1)
>>> a
array([[0],
       [1],
       [2],
       [3],
       [4],
       [5]])
>>> b = np.arange(0,5)
>>> b
array([0, 1, 2, 3, 4])
>>> a * b
array([[ 0,  0,  0,  0,  0],
       [ 0,  1,  2,  3,  4],
       [ 0,  2,  4,  6,  8],
       [ 0,  3,  6,  9, 12],
       [ 0,  4,  8, 12, 16],
       [ 0,  5, 10, 15, 20]])
>>> a + b
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6],
       [3, 4, 5, 6, 7],
       [4, 5, 6, 7, 8],
       [5, 6, 7, 8, 9]])

In other words, when a matrix is added to / subtracted from / multiplied by / divided by a number or a vector, NumPy expands the number/vector into a matrix of compatible shape by replicating it.

Note that this can lead to strange bugs in places where you expect an exception to be thrown:

For example, sometimes I want adding a row vector to a column vector to raise an exception, but NumPy silently computes a result via broadcasting...

NumPy pitfalls

>>> import numpy as np
>>> a = np.random.randn(5)
>>> a
array([-0.19837642, -0.16758652,  1.57705505,  0.13033745, -0.81073889])
>>> a.shape
(5,)
# this is called a rank 1 array in Python and is neither a row vector nor a column vector
>>> a.T
array([-0.19837642, -0.16758652,  1.57705505,  0.13033745, -0.81073889])
# which is the same as 'a' itself
>>> np.dot(a, a.T)
3.2288264718632416
# this is a plain number, not the 1x1 matrix (like array([[55]])) you might expect

Do not use "rank 1 arrays" of shape (5,) or (n,); instead state explicitly that something is an \(m \times n\) matrix:

>>> a = np.random.randn(5,1)
>>> a
array([[ 0.7643396 ],
       [-1.66945103],
       [ 1.66235712],
       [-0.06892102],
       [-1.61347409]])
>>> a.T
array([[ 0.7643396 , -1.66945103,  1.66235712, -0.06892102, -1.61347409]])

Note the difference between array([-0.19837642, -0.16758652, 1.57705505, 0.13033745, -0.81073889]) and array([[ 0.7643396 , -1.66945103, 1.66235712, -0.06892102, -1.61347409]]) (the latter has two pairs of square brackets): the former is a rank 1 array, while the latter is a true \(1 \times 5\) matrix (just as in C, a matrix is represented by a two-dimensional array). (Also, I think "rank 1 array" is more accurately rendered as "one-dimensional array".)

You can use an assert statement to make sure a vector has the dimensions you expect.

When you get a rank 1 array, you can use a.reshape to transform it into an (n,1) or (1,n) array.
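
A small sketch of both suggestions:

import numpy as np

a = np.random.randn(5)        # rank 1 array, shape (5,)
a = a.reshape(5, 1)           # make it an explicit column vector
assert a.shape == (5, 1)      # fail fast if the shape is not what we expect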

Logistic Regression Cost Function

\[\left.
\begin{array}{l}
\text{If y=1:}\quad p(y|x)=\hat{y}\\
\text{If y=0:}\quad p(y|x)=1-\hat{y}
\end{array}
\right\}
p(y|x) = \hat{y}^y\cdot (1-\hat{y})^{1-y}\\
\,\\
\begin{align}
\therefore {\rm log}(p(y|x)) &= y\cdot log\,\hat{y} + (1-y)\cdot log\, (1-\hat{y}) \\
&= -\mathcal{L}(\hat{y},y)
\end{align}
\]

Therefore:

\[\begin{align}
{\rm log }[p(\text{labels in training set})] &= {\rm log } \prod_{i=1}^mp(y^{(i)}|x^{(i)})\\
&=\sum_{i=1}^m {\rm log\,}p(y^{(i)}|x^{(i)})\\
&=\sum_{i=1}^m-\mathcal{L}(\hat{y}^{(i)},y^{(i)})\\
&=-\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})
\end{align}\\
\text{Cost: }J(w,b) = \frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})
\]

So minimizing the cost \(J\) is equivalent to maximum likelihood estimation (MLE), under the assumption that the training examples are i.i.d.


week3

\(Z^{[j]} = W^{[j]}A^{[j-1]} + b^{[j]} = W^{[j]}\begin{bmatrix} | & | & | & \\ a^{[j-1](1)} & a^{[j-1](2)} & a^{[j-1](3)} & \cdots \\ | & | & | & \end{bmatrix} + b^{[j]} = \begin{bmatrix} | & | & | & \\ z^{[j](1)} & z^{[j](2)} & z^{[j](3)} & \cdots \\ | & | & | & \end{bmatrix}\)

where \((i) \in [(1),(m)],\quad [j] \in [[1],[n]],\quad X = A^{[0]}\)

Other Activation Functions

①\(tanh(z)\) function:

\[a= tanh(z)=\frac{e^z -e^{-z}}{e^z +e^{-z}}\text{ , where } tanh(z) \in (-1,1),\; tanh(0)=0
\]

\(tanh(z)\) centers the data around 0 (the sigmoid function centers it around 0.5).

After this, use the sigmoid function only when the output must satisfy \(0 \le \hat{y} \le 1\) (i.e. binary classification), because \(tanh\) is almost strictly better than sigmoid...

②Rectified Linear Unit (ReLU): \(a = max(0,z)\)

When you are not sure what to use for your hidden layers, you can use the ReLU function.

Disadvantage of ReLU: when \(z\) is negative, the output (and hence the slope) is 0.

The so-called Leaky ReLU can be used to overcome the disadvantage above.

Leaky ReLU: \(a = max(0.01z, z)\)

ReLU keeps the slope from shrinking toward zero (for sigmoid and \(tanh(z)\), the slope approaches 0 as \(z\rightarrow \infty\), which slows down learning).

It is the most commonly used activation function.

③Linear Activation Function (\(g(z)=z\)):

Use a linear activation function at the output layer only when solving a regression problem; for example, when predicting housing prices, y is not limited to 0 and 1 (\(y \in \mathbb{R}\)), so \(g(z)=z\) can be used for the output layer. Hidden units should not use a linear activation function; use tanh/ReLU/Leaky ReLU instead.

Derivatives of Activation Functions

  • Sigmoid:

    • \(\frac{{\rm d}}{{\rm d}z}g(z) = g(z)(1-g(z))\)

  • \(tanh(z)\):

    • \(g\prime(z) = 1-(tanh(z))^2\)

  • ReLU:

    • \(g\prime(z) = \begin{cases}1, \text{if }z\ge0 \\0, \text{if }z\lt0 \end{cases}\)
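
A NumPy sketch of the three derivatives above (the function names are mine):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    g = sigmoid(z)
    return g * (1 - g)

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2

def relu_prime(z):
    return (z >= 0).astype(float)   # 1 where z >= 0, else 0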

Gradient Descent for Neural Networks

Parameters : \(w^{[1]},b^{[1]},w^{[2]},b^{[2]}\)

Cost Function : \(J(w^{[1]},b^{[1]},w^{[2]},b^{[2]})= \frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})\)

Gradient Function:

\[\begin{align}
&\text{Repeat \{}\\
&\quad \text{compute predictions } (\hat{y}^{(i)}, i = 1,\dots,m) \\
&\quad {\rm d}w^{[1]} = \frac{\partial J}{\partial w^{[1]}}, {\rm d}b^{[1]} = \frac{\partial J}{\partial b^{[1]}},\dots\\
&\quad w^{[1]} = w^{[1]} - \alpha {\rm d}w^{[1]}\\
&\quad b^{[1]} = b^{[1]} - \alpha {\rm d}b^{[1]}\\
&\quad w^{[2]} = w^{[2]} - \alpha {\rm d}w^{[2]}\\
&\quad b^{[2]} = b^{[2]} - \alpha {\rm d}b^{[2]}\\
\text{\}}
\end{align}
\]

Forward Propagation :

\[\begin{align}
Z^{[1]} &= w^{[1]}X + b^{[1]}\\
A^{[1]} &= g^{[1]}(Z^{[1]})\\
Z^{[2]} &= w^{[2]}A^{[1]} + b^{[2]}\\
A^{[2]} &= g^{[2]}(Z^{[2]}) = \sigma(Z^{[2]})
\end{align}
\]

Backward Propagation :

\[\begin{align}
{\rm d}Z^{[2]} &= A^{[2]} - Y, \quad Y = \begin{bmatrix}y^{(1)} & y^{(2)} & \dots & y^{(m)}\end{bmatrix}\\
{\rm d}w^{[2]} &= \frac{1}{m} {\rm d}Z^{[2]} A^{[1]T}\\
{\rm d}b^{[2]} &= \frac{1}{m}\text{np.sum(d}Z^{[2]}\text{, axis=1, keepdims=True)}\\
{\rm d}Z^{[1]} &= w^{[2]T}{\rm d}Z^{[2]}\; .* \; g^{[1]\prime}(Z^{[1]})\\
{\rm d}w^{[1]} &= \frac{1}{m} {\rm d}Z^{[1]}X^T\\
{\rm d}b^{[1]} &= \frac{1}{m}\text{np.sum(d}Z^{[1]}\text{, axis=1, keepdims=True)}\\
\end{align}
\]

Note: axis=1 means summing horizontally, and keepdims=True prevents NumPy from outputting a rank 1 array. Alternatively, you can call reshape explicitly instead of relying on these parameters.

Another note: since \(A^{[1]} = g^{[1]}(Z^{[1]})\) and, for \(tanh\), \(g^{[1]\prime}(z) = 1-a^2\), it follows that \(g^{[1]\prime}(Z^{[1]}) = 1-(A^{[1]})^2\), i.e. \({\rm d}Z^{[1]} = w^{[2]T}{\rm d}Z^{[2]}\; .* \; (1-(A^{[1]})^2)\).
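
A compact sketch of one forward/backward pass for this two-layer network (tanh hidden layer, sigmoid output; the variable names follow the formulas above, everything else is my own arrangement):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_backward(X, Y, W1, b1, W2, b2):
    m = X.shape[1]
    # forward propagation
    Z1 = np.dot(W1, X) + b1                    # (n_h, m)
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2, A1) + b2                   # (1, m)
    A2 = sigmoid(Z2)
    # backward propagation
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)    # element-wise product
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2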

Random Initialization

For a neural network, if you initialize all the weights to zero and then apply gradient descent, it won't work: every hidden unit computes the same function, and they remain identical (symmetric) after every update.
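
One common fix, sketched here with made-up layer sizes, is to initialize the weights with small random values (the 0.01 scale keeps \(z\) small so that tanh/sigmoid do not start out saturated), while the biases can stay zero:

import numpy as np

n_x, n_h, n_y = 3, 4, 1                  # hypothetical layer sizes

W1 = np.random.randn(n_h, n_x) * 0.01    # small random values break the symmetry
b1 = np.zeros((n_h, 1))                  # biases can safely start at zero
W2 = np.random.randn(n_y, n_h) * 0.01
b2 = np.zeros((n_y, 1))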
