https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf

Deep neural nets with a large number of parameters are very powerful machine learning
systems. However, overfitting is a serious problem in such networks. Large networks are also
slow to use, making it difficult to deal with overfitting by combining the predictions of many
different large neural nets at test time. Dropout is a technique for addressing this problem.
The key idea is to randomly drop units (along with their connections) from the neural
network during training. This prevents units from co-adapting too much. During training,
dropout samples from an exponential number of different “thinned” networks. At test time,
it is easy to approximate the effect of averaging the predictions of all these thinned networks
by simply using a single unthinned network that has smaller weights. This significantly
reduces overfitting and gives major improvements over other regularization methods. We
show that dropout improves the performance of neural networks on supervised learning
tasks in vision, speech recognition, document classification and computational biology,
obtaining state-of-the-art results on many benchmark data sets.
 
 
Deep neural networks contain multiple non-linear hidden layers and this makes them very
expressive models that can learn very complicated relationships between their inputs and
outputs. With limited training data, however, many of these complicated relationships
will be the result of sampling noise, so they will exist in the training set but not in real
test data even if it is drawn from the same distribution. This leads to overfitting and many
methods have been developed for reducing it. These include stopping the training as soon as
performance on a validation set starts to get worse, introducing weight penalties of various
kinds such as L1 and L2 regularization and soft weight sharing (Nowlan and Hinton, 1992).
With unlimited computation, the best way to “regularize” a fixed-sized model is to
average the predictions of all possible settings of the parameters, weighting each setting by
 
 
 
 
Figure 1: Dropout Neural Net Model. Left: A standard neural net with 2 hidden layers.
Right: An example of a thinned net produced by applying dropout to the network on the left.
Crossed units have been dropped.
 
 
its posterior probability given the training data. This can sometimes be approximated quite
well for simple or small models (Xiong et al., 2011; Salakhutdinov and Mnih, 2008), but we
would like to approach the performance of the Bayesian gold standard using considerably
less computation. We propose to do this by approximating an equally weighted geometric
mean of the predictions of an exponential number of learned models that share parameters.
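In formula form (the notation here is ours, not the paper's: theta for the parameters, D for the training data, and M for the number of combined models), the two kinds of averaging being contrasted are, roughly:

% Bayesian model averaging: weight every parameter setting by its posterior probability.
p(y \mid x, \mathcal{D}) = \int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta

% The cheaper target pursued here: an equally weighted geometric mean of M models that
% share parameters, renormalized so that it is again a probability distribution.
p(y \mid x) \propto \Big( \prod_{m=1}^{M} p_m(y \mid x) \Big)^{1/M}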
 
Model combination nearly always improves the performance of machine learning meth-
ods. With large neural networks, however, the obvious idea of averaging the outputs of
many separately trained nets is prohibitively expensive. Combining several models is most
helpful when the individual models are different from each other and in order to make
neural net models different, they should either have different architectures or be trained
on different data. Training many different architectures is hard because finding optimal
hyperparameters for each architecture is a daunting task and training each large network
requires a lot of computation. Moreover, large networks normally require large amounts of
training data and there may not be enough data available to train different networks on
different subsets of the data. Even if one was able to train many different large networks,
using them all at test time is infeasible in applications where it is important to respond
quickly.
 
Dropout is a technique that addresses both these issues. It prevents overfitting and
provides a way of approximately combining exponentially many different neural network
architectures efficiently. The term “dropout” refers to dropping out units (hidden and
visible) in a neural network. By dropping a unit out, we mean temporarily removing it from
the network, along with all its incoming and outgoing connections, as shown in Figure 1.
The choice of which units to drop is random. In the simplest case, each unit is retained with
a fixed probability p independent of other units, where p can be chosen using a validation
set or can simply be set at 0.5, which seems to be close to optimal for a wide range of
networks and tasks. For the input units, however, the optimal probability of retention is
usually closer to 1 than to 0.5.
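As a concrete sketch of the training-time rule just described (the layer sizes, seed and p = 0.5 are illustrative choices of ours, not values taken from the paper), each unit is kept with probability p by sampling an independent Bernoulli mask over its activations:

import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p):
    """Keep each unit with probability p during training; dropped units output zero."""
    mask = rng.random(h.shape) < p      # independent Bernoulli(p) retention decisions
    return h * mask                     # zeroed units send nothing to the next layer

h = rng.standard_normal((4, 8))         # activations of one hidden layer (batch of 4, 8 units)
h_thinned = dropout_train(h, p=0.5)     # a randomly "thinned" version of the layer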
 
Figure 2: Left: A unit at training time that is present with probability p and is connected
to units in the next layer with weights w. Right: At test time, the unit is always present
and the weights are multiplied by p. The output at test time is the same as the expected
output at training time.
 
 
Applying dropout to a neural network amounts to sampling a "thinned" network from
it. The thinned network consists of all the units that survived dropout (Figure 1b). A
neural net with n units can be seen as a collection of 2^n possible thinned neural networks.
These networks all share weights so that the total number of parameters is still O(n^2), or
less. For each presentation of each training case, a new thinned network is sampled and
trained. So training a neural network with dropout can be seen as training a collection of 2^n
thinned networks with extensive weight sharing, where each thinned network gets trained
very rarely, if at all.
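To make the counting argument concrete, a toy sketch (the sizes are ours): a layer with n units admits 2^n retention masks, every sampled mask reuses the same shared weights, and dropping a unit removes its contribution without adding any parameters:

import numpy as np

rng = np.random.default_rng(1)

n = 5                                    # units in a toy hidden layer
W = rng.standard_normal((n, n))          # weights shared by every thinned sub-network
print(2 ** n, "possible thinned networks share", W.size, "parameters")

# Each presentation of a training case samples one of those thinned networks:
x = rng.standard_normal(n)
mask = rng.random(n) < 0.5               # which units survive this presentation
y = W @ (x * mask)                       # dropped units and their connections contribute nothing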
At test time, it is not feasible to explicitly average the predictions from exponentially
many thinned models. However, a very simple approximate averaging method works well in
practice. The idea is to use a single neural net at test time without dropout. The weights
of this network are scaled-down versions of the trained weights. If a unit is retained with
probability p during training, the outgoing weights of that unit are multiplied by p at test
time as shown in Figure 2. This ensures that for any hidden unit the expected output (under
the distribution used to drop units at training time) is the same as the actual output at
test time. By doing this scaling, 2^n networks with shared weights can be combined into
a single neural network to be used at test time. We found that training a network with
dropout and using this approximate averaging method at test time leads to significantly
lower generalization error on a wide variety of classification problems compared to training
with other regularization methods.
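A small numerical check of the weight-scaling rule (the sizes, p = 0.5 and the Monte Carlo loop are our illustrative choices): averaging a unit's masked contribution over many training-time dropout draws approaches what a single unthinned net with p-scaled outgoing weights computes at test time:

import numpy as np

rng = np.random.default_rng(2)

p = 0.5
h = rng.standard_normal(8)               # activations of one hidden layer
w = rng.standard_normal((8, 3))          # its outgoing weights

# Training time: a unit's output reaches the next layer only when it is retained.
samples = [(h * (rng.random(8) < p)) @ w for _ in range(100_000)]
avg_train_output = np.mean(samples, axis=0)

# Test time: keep every unit but multiply the outgoing weights by p.
test_output = h @ (w * p)

print(np.max(np.abs(avg_train_output - test_output)))   # small, and shrinks with more samples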
The idea of dropout is not limited to feed-forward neural nets. It can be more generally
applied to graphical models such as Boltzmann Machines. In this paper, we introduce
the dropout Restricted Boltzmann Machine model and compare it to standard Restricted
Boltzmann Machines (RBM). Our experiments show that dropout RBMs are better than
standard RBMs in certain respects.