1. numpy中的几种矩阵相乘:


# x1: axn, x2:nxb np.dot(x1, x2): axn * nxb np.outer(x1, x2): nx1*1xn # 实质为: np.ravel(x1)*np.ravel(x2) np.multiply(x1, x2): [[x1[0][0]*x2[0][0], x1[0][1]*x2[0][1], ...]

2. Bugs' hometown

Many software bugs in deep learning come from having matrix/vector dimensions that don't fit. If you can keep your matrix/vector dimensions straight you will go a long way toward eliminating many bugs.

3. Common steps for pre-processing a new dataset are:

- Figure out the dimensions and shapes of the problem (m_train, m_test, num_px, ...)
- Reshape the datasets such that each example is now a vector of size (num_px \* num_px \* 3, 1)
- "Standardize" the data

4. Unstructured data:

Unstructured data is a generic label for describing data that is not contained in a database or some other type of data structure . Unstructured data can be textual or non-textual. Textual unstructured data is generated in media like email messages, PowerPoint presentations, Word documents, collaboration software and instant messages. Non-textual unstructured data is generated in media like JPEG images, MP3 audio files and Flash video files

5. Chapter of Activation Function:

  • Choice of activation function:

    1. If output is either 0 or 1 -- sigmoid for the output layer and the other units on ReLU.
    2. Except for the output layer, tanh does better than sigmoid.
    3. ReLU ---level up--> leaky ReLU.
  • Why are ReLU and leaky ReLU often superior to sigmoid and tanh?

    -- The derivatives of the former ones is much bigger than 0, so the learning would be much faster.

  • A linear hidden layer is more or less useless, yet the activation function is a exception.

6. Regularization:

Initially, \(J(w, b) = \frac{1}{m} * \sum_{i=1}^{m}{L({\hat{Y}^(i), y^{(i)}}) + \frac{\lambda}{2*m}||w||_2^2}\)

  1. L2 regularization: \(\frac{\lambda}{2*m}\sum_{j=1}^{n_x}||w_j||^2 = \frac{\lambda}{2*m}||w||_2\)

    One aspect that tanh is better thatn sigmoid(in terms of regularization) -- When x is very close to 0, the derivative of tanh(x) is almost linear, while that of the sigmoid(x) is alomst 0.

  2. Dropout:

     Method: Make certain values of weights be zeros randomly, just like -- W= np.multiply(W, C), where C is a 0-1 array.

     Matters need attention: Don't use dropout in test procedure -- Time costly, result randomly.

     Work principle:

       Intuition: Can't rely on any one feature, so have to spread out weights(shrinking weights).

       Besides, you can set different rates of "Dropout", like lower ones on more complex layer, which are called "key prop".

  1. Data augmentation:

     Do some operation on your data images, such as flipping, rotation, zooming, etc, without changing their labels, in order to prevent from over-fitting on some aspects, such as the direction of faces, the size of cats.

  1. Early stopping.

7. Solution to "gradient vanishing or exploding":

     Set WL = np.random.randn(shape) * np.sqrt(\(\frac{2}{n^{[L-1]}}\)) if activation_function == "ReLU"

       else: np.random.randn(shape) * np.sqrt(\(\frac{1}{n^{[L-1]}}\)) or np.sqrt(\(\sqrt{\frac{2}{n^{[L-1]}+n^{[L]}}}\))(Xavier initialization)

8. Gradient Checking:

  1. for i in range(len(\(\theta\))):

    to check if (d\(\theta_{approx}[i] = \frac{J(\theta_1, \theta_2, ..., \theta_i+\epsilon, ...) - J(\theta_1, \theta_2, ..., \theta_i-\epsilon, ...)}{2\epsilon}\)) ?= \(d\theta[i] = \frac{\partial{J}}{\partial{\theta_i}}\)

             <==> \(d\theta_{approx} ?= d\theta\)

             <==> \(\frac{||d\theta_{approx} - d\theta||_2}{||d\theta_{approx}||_2+||d\theta||_2}\) in an accent range: \(10^{-7}\) is great, and \(10^{-3}\) is wrong.

  2. Tips:

    • Only to debug, instead of training.
    • If algorithm fails grad check, look at components(\(db^{[L]}, dw^{[L]}\)) to try to identify bug.
    • Remember regularization.
    • Doesn't work together with dropout.
    • Run at random initialization; perhaps again after some training.

9. Exponentially weighted averages:

  Definition: let \(V_{t} = {\beta}V_{t-1} + (1 - \beta)\theta_t\) (_V_s are the averages, and the _\(\theta\)_s are the initial discrete data).

and \(V_{t} = \frac{V_{t}}{1 - {\beta}^t}\) (To correct initial bias).

  Usage: when it comes to this situation:

Since the average of the distance vertical movement is almost zeros, you can use EWA to average it, prevent it from divergence.

  On iteration t:

    Compute dW on the current mini-batch

    \(v_{dW} = {\beta}v_{dW} + (1 - \beta)dW\)

    \(v_{db} = {\beta}v_{db} + (1 - \beta)db\)

    \(W = W - {\alpha}v_{dW}, b = b - {\alpha}v_{db}\)

    Hyperparameters: \(\alpha\), \({\beta}(=0.9)\)

Deep Learning Specialization 笔记的更多相关文章

  1. Deep Learning论文笔记之(四)CNN卷积神经网络推导和实现(转)

    Deep Learning论文笔记之(四)CNN卷积神经网络推导和实现 zouxy09@qq.com http://blog.csdn.net/zouxy09          自己平时看了一些论文, ...

  2. Deep Learning论文笔记之(八)Deep Learning最新综述

    Deep Learning论文笔记之(八)Deep Learning最新综述 zouxy09@qq.com http://blog.csdn.net/zouxy09 自己平时看了一些论文,但老感觉看完 ...

  3. Deep Learning论文笔记之(六)Multi-Stage多级架构分析

    Deep Learning论文笔记之(六)Multi-Stage多级架构分析 zouxy09@qq.com http://blog.csdn.net/zouxy09          自己平时看了一些 ...

  4. 【deep learning学习笔记】注释yusugomori的DA代码 --- dA.h

    DA就是“Denoising Autoencoders”的缩写.继续给yusugomori做注释,边注释边学习.看了一些DA的材料,基本上都在前面“转载”了.学习中间总有个疑问:DA和RBM到底啥区别 ...

  5. Deep Learning论文笔记之(一)K-means特征学习

    Deep Learning论文笔记之(一)K-means特征学习 zouxy09@qq.com http://blog.csdn.net/zouxy09          自己平时看了一些论文,但老感 ...

  6. Deep Learning论文笔记之(三)单层非监督学习网络分析

    Deep Learning论文笔记之(三)单层非监督学习网络分析 zouxy09@qq.com http://blog.csdn.net/zouxy09          自己平时看了一些论文,但老感 ...

  7. Spectral Norm Regularization for Improving the Generalizability of Deep Learning论文笔记

    Spectral Norm Regularization for Improving the Generalizability of Deep Learning论文笔记 2018年12月03日 00: ...

  8. Deep Learning论文笔记之(四)CNN卷积神经网络推导和实现

    https://blog.csdn.net/zouxy09/article/details/9993371 自己平时看了一些论文,但老感觉看完过后就会慢慢的淡忘,某一天重新拾起来的时候又好像没有看过一 ...

  9. [置顶] Deep Learning 学习笔记

    一.文章来由 好久没写原创博客了,一直处于学习新知识的阶段.来新加坡也有一个星期,搞定签证.入学等杂事之后,今天上午与导师确定了接下来的研究任务,我平时基本也是把博客当作联机版的云笔记~~如果有写的不 ...

随机推荐

  1. 【中文】【deplearning.ai】【吴恩达课后作业目录】

    [目录][吴恩达课后作业目录] 吴恩达深度学习相关资源下载地址(蓝奏云) 课程 周数 名称 类型 语言 地址 课程1 - 神经网络和深度学习 第1周 深度学习简介 测验 中英 传送门 无编程作业 编程 ...

  2. docker 运行时常见错误

    docker 运行时常见错误 (1) Cannot connect to the Docker daemon at unix:///var/run/docker.sock. [root@localho ...

  3. token的分层图如下

    基于 token 的多平台身份认证架构设

  4. Smarty 3.1.34 反序列化POP链(任意文件删除)

    Smarty <= 3.1.34,存在任意文件删除的POP链. Exp: <?php class Smarty_Internal_Template { public $smarty = n ...

  5. 分布式缓存 — kafka

    Kafka是一个分布式.支持分区的(partition).多副本的(replica),基于zookeeper协调的分布式消息系统,它的最大的特性就是可以实时的处理大量数据以满足各种需求场景:比如基于h ...

  6. spark SQL(三)数据源 Data Source----通用的数据 加载/保存功能

    Spark SQL 的数据源------通用的数据 加载/保存功能 Spark SQL支持通过DataFrame接口在各种数据源上进行操作.DataFrame可以使用关系变换进行操作,也可以用来创建临 ...

  7. hadoop知识点总结(三)YARN设计理念及基本架构

    YARN设计理念与基本架构 1,MRv1的局限性:扩展性差,可靠性差,资源利用率低,无法支持多种计算框架 2,YARN基本设计思想 1)基本框架对比 Hadoop1.0中,JobTracker由资源管 ...

  8. 深复制VS浅复制(MemberwiseClone方法介绍)

    MemberwiseClone方法,属于命名空间System,存在于程序集 mscorlib.dll中.返回值是System.Object.其含义是:创建一个当前object对象的浅表副本. MSDN ...

  9. PHP-文件、目录相关操作

    PHP-文件.目录相关操作 一  目录操作(Directory 函数允许获得关于目录及其内容的信息) 相关函数: 函数 描述 chdir() 改变当前的目录. chroot() 改变根目录. clos ...

  10. 我用了半年的时间,把python学到了能出书的程度

    Python难学吗?不难,我边做项目边学,过了半年就通过了出版社编辑的面试,接到了一本Python选题,并成功出版. 有同学会说,你有编程基础外带项目实践机会,所以学得快.这话不假,我之前的基础确实加 ...