Deep Networks: Overview
Overview
In the previous sections, you constructed a 3-layer neural network comprising an input layer, a hidden layer, and an output layer. While fairly effective for MNIST, this 3-layer model is a shallow network: its features (the hidden layer activations a(2)) are computed using only "one layer" of computation (the hidden layer).
In this section, we begin to discuss deep neural networks, meaning ones in which we have multiple hidden layers; this will allow us to compute much more complex features of the input. Because each hidden layer computes a non-linear transformation of the previous layer, a deep network can have significantly greater representational power (i.e., can learn significantly more complex functions) than a shallow one.
Note that when training a deep network, it is important to use a non-linear activation function in each hidden layer. This is because multiple layers of linear functions would themselves compute only a linear function of the input (i.e., composing multiple linear functions together results in just another linear function), and thus would be no more expressive than a single layer of hidden units.
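The collapse of stacked linear layers into a single linear layer can be checked numerically. The following is a minimal sketch using NumPy; the layer sizes and the names `W1`, `b1`, `W2`, `b2` are illustrative assumptions, not part of the original tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with a linear (identity) activation; the sizes are arbitrary.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def two_linear_layers(x):
    # Layer 2 applied to the output of layer 1, with no non-linearity between.
    return W2 @ (W1 @ x + b1) + b2

# The same map written as one linear layer.
W = W2 @ W1
b = W2 @ b1 + b2

def one_linear_layer(x):
    return W @ x + b

x = rng.normal(size=3)
max_diff = np.max(np.abs(two_linear_layers(x) - one_linear_layer(x)))
print(max_diff)  # differs from zero only by floating-point rounding
```

Inserting a non-linearity such as a sigmoid between the two layers breaks this identity, which is what gives additional hidden layers their extra expressive power.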
Advantages of deep networks
The primary advantage of deep networks is that they can compactly represent a significantly larger set of functions than shallow networks. Formally, one can show that there are functions which a k-layer network can represent compactly (with a number of hidden units that is polynomial in the number of inputs), but which a (k − 1)-layer network cannot represent unless it has an exponentially large number of hidden units.
By using a deep network, in the case of images, one can also start to learn part-whole decompositions. For example, the first layer might learn to group together pixels in an image in order to detect edges (as seen in the earlier exercises). The second layer might then group together edges to detect longer contours, or perhaps detect simple "parts of objects." An even deeper layer might then group together these contours or detect even more complex features.
Finally, cortical computations (in the brain) also have multiple layers of processing. For example, visual images are processed in multiple stages by the brain, by cortical area "V1", followed by cortical area "V2" (a different part of the brain), and so on.
Difficulty of training deep architectures
The main learning algorithm that researchers used was to randomly initialize the weights of a deep network and then train it on a labeled training set using a supervised learning objective, for example by applying gradient descent to try to drive down the training error. However, this usually did not work well. There were several reasons for this.
Availability of data
Local optima
Diffusion of gradients
When using backpropagation to compute the derivatives, the gradients that are propagated backwards (from the output layer to the earlier layers of the network) rapidly diminish in magnitude as the depth of the network increases. As a result, the derivative of the overall cost with respect to the weights in the earlier layers (that is, the first few layers of the deep network) is very small. Thus, when using gradient descent, the weights of the earlier layers change slowly, and the earlier layers fail to learn much. This problem is often called the "diffusion of gradients."
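The effect can be seen in a small experiment. The sketch below is an illustration under assumed settings (8 sigmoid layers of width 20, small random weights, a squared-error cost, a single random example), none of which come from the original text. It runs one forward and one backward pass and reports the norm of the cost gradient with respect to each layer's weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_gradient_norms(n_layers=8, width=20, seed=0):
    """Return the norm of dCost/dW for each layer, first layer first."""
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(scale=0.1, size=(width, width)) for _ in range(n_layers)]
    x = rng.normal(size=width)   # one random input (placeholder data)
    y = rng.normal(size=width)   # one random target (placeholder data)

    # Forward pass, keeping every activation; activations[l] feeds layer l.
    activations = [x]
    for W in Ws:
        activations.append(sigmoid(W @ activations[-1]))

    # Backward pass for the cost 0.5 * ||a_out - y||^2.
    a_out = activations[-1]
    delta = (a_out - y) * a_out * (1.0 - a_out)
    norms = [0.0] * n_layers
    for l in range(n_layers - 1, -1, -1):
        # Gradient w.r.t. Ws[l] is the outer product of delta and the layer input.
        norms[l] = np.linalg.norm(np.outer(delta, activations[l]))
        if l > 0:
            a_prev = activations[l]
            delta = (Ws[l].T @ delta) * a_prev * (1.0 - a_prev)
    return norms

grad_norms = layer_gradient_norms()
for i, g in enumerate(grad_norms, start=1):
    print(f"layer {i}: {g:.3e}")
```

Under these assumptions the gradient norm for the first layer comes out several orders of magnitude smaller than for the last layer, since each backward step multiplies by the sigmoid derivative (at most 0.25) and by small weights.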
A closely related problem to the diffusion of gradients is that if the last few layers in a neural network have a large enough number of neurons, it may be possible for them to model the labeled data alone without the help of the earlier layers. Hence, training the entire network at once with all the layers randomly initialized ends up giving similar performance to training a shallow network (the last few layers) on corrupted input (the result of the processing done by the earlier layers).
Greedy layer-wise training
The main idea is to train the layers of the network one at a time, so that we first train a network with 1 hidden layer, and only after that is done, train a network with 2 hidden layers, and so on. At each step, we take the old network with k − 1 hidden layers and add a k-th hidden layer (which takes as input the hidden layer k − 1 that we have just trained). Training can be supervised (say, with classification error as the objective function at each step), but more frequently it is unsupervised (as in an autoencoder; details to be provided later). The weights from training the layers individually are then used to initialize the weights in the final/overall deep network, and only then is the entire architecture "fine-tuned" (i.e., trained together to optimize the labeled training set error).
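The unsupervised variant of this procedure can be sketched as a loop over layers, where each layer is trained as a small autoencoder on the output of the layers before it. The code below is a hedged illustration in NumPy, not the tutorial's implementation: the helper names `train_autoencoder` and `greedy_pretrain`, the plain squared-error objective (no sparsity or weight-decay terms), the hyperparameters, and the random placeholder data are all assumptions. The final supervised fine-tuning step is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, rng, lr=0.5, epochs=200):
    """Fit a one-hidden-layer sigmoid autoencoder to X by batch gradient descent.

    Returns the encoder weights and bias; the decoder is discarded.
    """
    n, d = X.shape
    W = rng.normal(scale=0.1, size=(d, n_hidden))   # encoder weights
    b = np.zeros(n_hidden)
    V = rng.normal(scale=0.1, size=(n_hidden, d))   # decoder weights
    c = np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W + b)            # encode
        R = sigmoid(H @ V + c)            # decode (reconstruction of X)
        dR = (R - X) * R * (1 - R) / n    # gradient at decoder pre-activation
        dH = (dR @ V.T) * H * (1 - H)     # gradient at encoder pre-activation
        V -= lr * (H.T @ dR)
        c -= lr * dR.sum(axis=0)
        W -= lr * (X.T @ dH)
        b -= lr * dH.sum(axis=0)
    return W, b

def greedy_pretrain(X, hidden_sizes, seed=0):
    """Train one layer at a time; each layer sees the previous layer's features."""
    rng = np.random.default_rng(seed)
    params, features = [], X
    for n_hidden in hidden_sizes:
        W, b = train_autoencoder(features, n_hidden, rng)
        params.append((W, b))
        features = sigmoid(features @ W + b)   # becomes the input to the next layer
    return params, features

# Placeholder "unlabeled data" in [0, 1]; real inputs would be e.g. image patches.
X = np.random.default_rng(1).random((100, 16))
params, top_features = greedy_pretrain(X, hidden_sizes=[8, 4])
print([W.shape for W, _ in params], top_features.shape)
```

The returned `params` would then serve as the initial weights of the hidden layers in the full deep network, with a classification layer added on top before fine-tuning on the labeled data.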
Availability of data
While labeled data can be expensive to obtain, unlabeled data is cheap and plentiful. The promise of self-taught learning is that by exploiting the massive amount of unlabeled data, we can learn much better models. By using unlabeled data to learn a good initial value for the weights in all the layers (except for the final classification layer that maps to the outputs/predictions), our algorithm is able to learn and discover patterns from vastly more data than purely supervised approaches. This often results in much better classifiers being learned.
Better local optima
After having trained the network on the unlabeled data, the weights are now starting at a better location in parameter space than if they had been randomly initialized. We can then further fine-tune the weights starting from this location. Empirically, it turns out that gradient descent from this location is much more likely to lead to a good local minimum, because the unlabeled data has already provided a significant amount of "prior" information about what patterns there are in the input data.