Initialization of deep networks

24 Feb 2015 · Gustav Larsson

As we all know, the solution found by a non-convex optimization method (like stochastic gradient descent) depends on the initial values of the parameters. This post is about choosing initial parameter values for deep networks and how that choice affects convergence. We will also discuss the related topic of vanishing gradients.

First, let's go back to the time of sigmoidal activation functions and initialization of parameters using IID Gaussian or uniform distributions with fairly arbitrarily set variances. Building deep networks was difficult because of exploding or vanishing activations and gradients. Let's take activations first: If all your parameters are too small, the variance of your activations will drop in each layer. This is a problem if your activation function is sigmoidal, since it is approximately linear close to 0. That is, you gradually lose your non-linearity, which means there is no benefit to having multiple layers. If, on the other hand, your activations become larger and larger, then your activations will saturate and become meaningless, with gradients approaching 0.
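To make the two failure modes concrete, here is a minimal numerical sketch (mine, not from the original analysis): push a random batch through a stack of fully connected layers with a zero-centered sigmoidal nonlinearity (tanh) and watch the activation scale. The layer width, depth, and weight scales below are arbitrary choices for illustration.

import numpy as np

rng = np.random.RandomState(0)
n, depth = 256, 10
x = rng.randn(1000, n)                    # zero-mean, unit-variance inputs

for scale in [0.01, 1.0 / np.sqrt(n), 0.2]:
    h = x
    for _ in range(depth):
        W = rng.randn(n, n) * scale       # IID Gaussian weights, std = scale
        h = np.tanh(h.dot(W))
    print("weight std %.4f -> activation std after %d layers: %.2e"
          % (scale, depth, h.std()))

With too small a scale the signal dies out layer by layer; with too large a scale the units saturate near +/-1 and the gradients go to zero.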

Let us consider one layer and forget about the bias. Note that the following analysis and conclusion are taken from Glorot and Bengio [1]. Consider a weight matrix W ∈ R^(m×n), where each element is drawn from an IID Gaussian with variance Var(W). Note that we are being a bit abusive with notation, letting W denote both a matrix and a univariate random variable. We also assume that the input and the weights are uncorrelated and both are zero-mean. If we consider one filter (row) in W, say w (a random vector), then the ratio of the output variance to the input variance is:

Var(wᵀx) / Var(X) = (Σᵢ₌₁ⁿ Var(wᵢxᵢ)) / Var(X) = n Var(W) Var(X) / Var(X) = n Var(W)

As we build a deep network, we want the variance of the signal going forward in the network to remain the same, thus it would be advantageous if n Var(W) = 1. The same argument can be made for the gradients, the signal going backward in the network, and the conclusion is that we would also like m Var(W) = 1. Unless n = m, it is impossible to satisfy both of these conditions exactly; the compromise proposed by Glorot and Bengio is to split the difference and use Var(W) = 2/(n + m). In practice, it works well if both conditions are approximately satisfied. One thing that has never been clear to me is why it is only necessary to satisfy these conditions when picking the initialization values of W. It would seem that we have no guarantee that the conditions will remain true as the network is trained.
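As a rough sketch of this compromise (the widths, depth, and the use of purely linear layers below are my own choices for illustration), drawing W with Var(W) = 2/(n + m) keeps the forward signal variance roughly constant even when the layer widths differ:

import numpy as np

rng = np.random.RandomState(0)

def xavier(n_in, n_out):
    # Gaussian variant of the Glorot/Bengio compromise: Var(W) = 2 / (n_in + n_out)
    return rng.randn(n_in, n_out) * np.sqrt(2.0 / (n_in + n_out))

h = rng.randn(1000, 300)                          # unit-variance input
for n_in, n_out in [(300, 400), (400, 400), (400, 300)]:
    h = h.dot(xavier(n_in, n_out))                # no nonlinearity, to isolate the variance bookkeeping
    print("layer %d -> %d: output variance %.3f" % (n_in, n_out, h.var()))

The variance stays close to 1 going forward, and because the formula is symmetric in n and m, the same holds approximately for the gradients flowing backward.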

Nevertheless, this Xavier initialization (after Glorot's first name) is a neat trick that works well in practice. However, along came rectified linear units (ReLU), a non-linearity that is scale-invariant around 0 and does not saturate at large input values. This seemingly solved both of the problems the sigmoid function had; or were they just alleviated? I am unsure how widely used Xavier initialization is, but if it is not widespread, perhaps that is because ReLU seemingly eliminated the problem.

However, take one of the most competitive recent networks, VGG [2]. They do not use this kind of initialization, and they report that it was tricky to get their networks to converge. They say that they first trained their shallowest architecture and then used its weights to help initialize the next one, and so forth. They presented six networks, so it seems like an awfully complicated training process to get to the deepest one.

A recent paper by He et al. [3] presents a pretty straightforward generalization of ReLU and Leaky ReLU. What is more interesting is their emphasis on the benefits of Xavier initialization even for ReLU. They re-did the derivations for ReLUs and found that the conditions were the same up to a factor of 2. The difficulty Simonyan and Zisserman had training VGG is apparently avoidable, simply by using Xavier initialization (or better yet, the ReLU-adjusted version). Using this technique, He et al. reportedly trained a whopping 30-layer network to convergence in one go.
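A small sketch of that factor-of-2 adjustment (with arbitrary widths and depth of my choosing): since ReLU zeroes roughly half of a zero-mean signal, the forward condition becomes (n/2) Var(W) = 1, i.e. Var(W) = 2/n.

import numpy as np

rng = np.random.RandomState(0)
n, depth = 512, 20
x = rng.randn(1000, n)

for std in [np.sqrt(1.0 / n), np.sqrt(2.0 / n)]:  # Xavier-style vs. ReLU-adjusted
    h = x
    for _ in range(depth):
        W = rng.randn(n, n) * std
        h = np.maximum(h.dot(W), 0.0)             # ReLU
    print("weight std %.4f -> activation std after %d ReLU layers: %.3e"
          % (std, depth, h.std()))

With Var(W) = 1/n the ReLU signal loses roughly a factor of sqrt(2) in magnitude at every layer; with Var(W) = 2/n it stays roughly constant, which is what makes very deep plain ReLU stacks trainable in one go.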

Another recent paper tackling the signal scaling problem is by Ioffe and Szegedy [4]. They call the change in scale internal covariate shift and claim it forces learning rates to be unnecessarily small. They suggest that if all layers have the same scale and keep it throughout training, a much higher learning rate becomes practically viable. You cannot just standardize the signals, since you would lose expressive power (the bias disappears, and in the case of sigmoids we would be constrained to the linear regime). They solve this by re-introducing two learned parameters per activation, a scale and a bias, applied again after the standardization. Training reportedly becomes about 6 times faster, and they present state-of-the-art results on ImageNet. However, I'm not certain this is the solution that will stick.
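A rough sketch of the idea (not the paper's full recipe, which also tracks running statistics for use at test time; this is simplified to plain numpy): standardize each unit over the mini-batch, then restore expressive power with a learned scale (gamma) and shift (beta).

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                         # per-unit mean over the batch
    var = x.var(axis=0)                           # per-unit variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)       # standardized activations
    return gamma * x_hat + beta                   # learned scale and bias restore expressiveness

rng = np.random.RandomState(0)
x = rng.randn(128, 64) * 5.0 + 3.0                # badly scaled activations
gamma, beta = np.ones(64), np.zeros(64)           # initialized so this starts as a plain standardization
y = batch_norm(x, gamma, beta)
print(y.mean(), y.std())                          # approximately 0 and 1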

I reckon we will see a lot more work on this frontier in the next few years, especially since it also relates to the -- right now wildly popular -- Recurrent Neural Network (RNN), which connects output signals back as inputs. The way you train such a network is to unroll the time axis, treating the result as an extremely deep feedforward network. This greatly exacerbates the vanishing gradient problem. A popular solution, called Long Short-Term Memory (LSTM), is to introduce memory cells, which act as a kind of teleport that allows a signal to jump ahead many time steps. This means that the gradient is retained across all those time steps and can be propagated back to a much earlier time without vanishing.

This area is far from solved, and until it is, I think I will stick with Xavier initialization. If you are using Caffe, the one take-away of this post is to use the following on all your layers:

weight_filler {
  type: "xavier"
}

References

  1. X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in International conference on artificial intelligence and statistics, 2010, pp. 249–256.

  2. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

  3. K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” arXiv:1502.01852 [cs], Feb. 2015.

  4. S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” arXiv:1502.03167 [cs], Feb. 2015.
