Stochastic Optimization Techniques
Neural networks are often trained stochastically, i.e. using a method where the objective function changes at each iteration. This stochastic variation arises because the model is trained on a different subset of the data (a minibatch) at each iteration. This is motivated by (at least) two factors: First, the training dataset is often too large to fit in memory and/or to be optimized over efficiently as a whole. Second, the objective function is typically nonconvex, so using different data at each iteration can help prevent the model from settling in a local minimum. Furthermore, neural networks are usually trained using only first-order information, i.e. the gradient of the loss function with respect to the parameters. This is because the large number of parameters in a neural network makes computing the Hessian matrix impractical. Because vanilla gradient descent can diverge or converge extremely slowly if its learning rate hyperparameter is set inappropriately, many alternative methods have been proposed which are intended to produce desirable convergence with less dependence on hyperparameter settings. These methods often effectively compute and apply a preconditioner to the gradient, adaptively change the learning rate over time, or approximate the Hessian matrix.
In the following, we will use $\theta_t$ to denote a generic model parameter at iteration $t$, which is optimized to minimize some loss function $\mathcal{L}$.
Stochastic Gradient Descent
Stochastic gradient descent (SGD) updates each parameter by subtracting the gradient of the loss with respect to that parameter, scaled by the learning rate $\eta$, a hyperparameter. If $\eta$ is too large, SGD will diverge; if it is too small, it will converge slowly. The update rule is $$ \theta_{t + 1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t) $$
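As a minimal sketch of this update rule, the following pure-Python snippet fits a single weight on toy data, drawing one example per iteration so that the objective changes at every step. The names `sgd_step`, `sample_grad` and `run_sgd`, the toy data, and the hyperparameter values are all assumptions made for illustration.

```python
import random


def sgd_step(theta, grad, eta):
    """Apply theta_{t+1} = theta_t - eta * grad elementwise."""
    return [p - eta * g for p, g in zip(theta, grad)]


def sample_grad(theta, x, y):
    """Gradient of the per-example loss 0.5 * (w * x - y)**2 with respect to w."""
    w = theta[0]
    return [(w * x - y) * x]


def run_sgd(steps=200, eta=0.1, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]  # toy data, true w = 3
    theta = [0.0]
    for _ in range(steps):
        x, y = rng.choice(data)  # a different example at each iteration
        theta = sgd_step(theta, sample_grad(theta, x, y), eta)
    return theta


print(run_sgd())  # the single weight should end up close to 3
```

With this toy data, raising `eta` above roughly 0.5 makes the steps taken on the largest example overshoot, which is the divergence behaviour described above.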
Momentum
In SGD, the gradient $\nabla \mathcal{L}(\theta_t)$ often changes rapidly at each iteration $t$ because the loss is computed over different data. This is often partially mitigated by re-using the update from the previous iteration (which itself accumulates past gradients), scaled by a momentum hyperparameter $\mu$. Starting from $v_0 = 0$, the updates are as follows:
\begin{align*} v_{t + 1} &= \mu v_t - \eta \nabla \mathcal{L}(\theta_t) \\ \theta_{t + 1} &= \theta_t + v_{t+1} \end{align*}
It has been argued that including the previous update step has the effect of approximating some second-order information about the loss.
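The momentum update can be sketched as follows on a toy quadratic loss whose two coordinates have different curvature. The loss, the function names and the values of `eta` and `mu` are assumptions chosen for illustration.

```python
def grad_quadratic(theta):
    """Gradient of the toy loss 0.5 * (10 * a**2 + b**2) for theta = [a, b]."""
    return [10.0 * theta[0], 1.0 * theta[1]]


def momentum_step(theta, v, grad, eta, mu):
    """v_{t+1} = mu * v_t - eta * grad;  theta_{t+1} = theta_t + v_{t+1}."""
    v_new = [mu * vi - eta * gi for vi, gi in zip(v, grad)]
    theta_new = [p + vi for p, vi in zip(theta, v_new)]
    return theta_new, v_new


def run_momentum(steps=500, eta=0.05, mu=0.9):
    theta, v = [1.0, 1.0], [0.0, 0.0]  # v_0 = 0
    for _ in range(steps):
        theta, v = momentum_step(theta, v, grad_quadratic(theta), eta, mu)
    return theta


print(run_momentum())  # both coordinates should be close to 0
```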
Nesterov's Accelerated Gradient
In Nesterov's Accelerated Gradient (NAG), the gradient of the loss at each step is computed at $\theta_t + \mu v_t$ instead of $\theta_t$. In momentum, the parameter update can be written $\theta_{t + 1} = \theta_t + \mu v_t - \eta \nabla \mathcal{L}(\theta_t)$, so NAG effectively computes the gradient at the point the parameters would reach if only the momentum term $\mu v_t$ were applied, i.e. the new parameter location without the gradient term. In practice, this causes NAG to behave more stably than regular momentum in many situations. A more thorough analysis can be found in 1). The update rules are then as follows:
\begin{align*} v_{t + 1} &= \mu v_t - \eta \nabla\mathcal{L}(\theta_t + \mu v_t) \\ \theta_{t + 1} &= \theta_t + v_{t+1} \end{align*}
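A small sketch of the NAG update follows. The only change from plain momentum is that the gradient function is evaluated at the look-ahead point $\theta_t + \mu v_t$, so the step takes a gradient function rather than a precomputed gradient. The toy quadratic loss, the names and the hyperparameter values are assumptions for illustration.

```python
def grad_quadratic(theta):
    """Gradient of the toy loss 0.5 * (10 * a**2 + b**2) for theta = [a, b]."""
    return [10.0 * theta[0], 1.0 * theta[1]]


def nag_step(theta, v, grad_fn, eta, mu):
    """Evaluate the gradient at theta + mu * v, then update v and theta."""
    lookahead = [p + mu * vi for p, vi in zip(theta, v)]
    grad = grad_fn(lookahead)
    v_new = [mu * vi - eta * gi for vi, gi in zip(v, grad)]
    theta_new = [p + vi for p, vi in zip(theta, v_new)]
    return theta_new, v_new


def run_nag(steps=500, eta=0.05, mu=0.9):
    theta, v = [1.0, 1.0], [0.0, 0.0]  # v_0 = 0
    for _ in range(steps):
        theta, v = nag_step(theta, v, grad_quadratic, eta, mu)
    return theta


print(run_nag())  # both coordinates should be close to 0
```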
Adagrad
Adagrad effectively rescales the learning rate for each parameter according to the history of the gradients for that parameter. This is done by dividing each term in $\nabla \mathcal{L}$ by the square root of the sum of squares of its historical gradients. Rescaling in this way effectively lowers the learning rate for parameters which consistently have large gradient values. It also effectively decreases the learning rate over time, because the sum of squares continues to grow with each iteration. After initializing the rescaling term to $g_0 = 0$, the updates are as follows: \begin{align*} g_{t + 1} &= g_t + \nabla \mathcal{L}(\theta_t)^2 \\ \theta_{t + 1} &= \theta_t - \frac{\eta\nabla \mathcal{L}(\theta_t)}{\sqrt{g_{t + 1}} + \epsilon} \end{align*} where the square and the division are elementwise and $\epsilon$ is a small constant included for numerical stability. It has nice theoretical guarantees and empirical results 2) 3).
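The per-parameter rescaling can be sketched as below. On the assumed toy quadratic, the two coordinates have gradients that differ by a factor of ten, yet they follow nearly the same trajectory because each gradient is divided by its own accumulated magnitude. Names and hyperparameter values are illustrative assumptions.

```python
import math


def grad_quadratic(theta):
    """Gradient of the toy loss 0.5 * (10 * a**2 + b**2) for theta = [a, b]."""
    return [10.0 * theta[0], 1.0 * theta[1]]


def adagrad_step(theta, g, grad, eta, eps=1e-8):
    """Accumulate squared gradients in g, then take the rescaled step."""
    g_new = [gi + d * d for gi, d in zip(g, grad)]
    theta_new = [
        p - eta * d / (math.sqrt(gi) + eps)
        for p, d, gi in zip(theta, grad, g_new)
    ]
    return theta_new, g_new


def run_adagrad(steps=200, eta=0.5):
    theta, g = [1.0, 1.0], [0.0, 0.0]  # g_0 = 0
    for _ in range(steps):
        theta, g = adagrad_step(theta, g, grad_quadratic(theta), eta)
    return theta


print(run_adagrad())  # both coordinates should be close to 0
```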
RMSProp
In its originally proposed form 4), RMSProp is very similar to Adagrad. The only difference is that the $g_t$ term is computed as an exponentially decaying average instead of an accumulated sum. This makes $g_t$ an estimate of the second moment of $\nabla \mathcal{L}$ and avoids the effective shrinking of the learning rate over time that occurs in Adagrad. The name “RMSProp” comes from the fact that the update step is normalized by a decaying RMS of recent gradients. The update is as follows:
\begin{align*} g_{t + 1} &= \gamma g_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t)^2 \\ \theta_{t + 1} &= \theta_t - \frac{\eta\nabla \mathcal{L}(\theta_t)}{\sqrt{g_{t + 1}} + \epsilon} \end{align*}
In the original lecture slides where it was proposed, $\gamma$ is set to $0.9$. In 5), it is shown that the $\sqrt{g_{t + 1}}$ term approximates (in expectation) the diagonal of the absolute value of the Hessian matrix (assuming the update steps are $\mathcal{N}(0, 1)$ distributed). It is also argued that the absolute value of the Hessian is better to use for non-convex problems which may have many saddle points.
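A minimal sketch of the RMSProp update given above, using $\gamma = 0.9$. The toy quadratic loss, the function names and the learning rate are assumptions for illustration. Because the step is normalized by the recent gradient RMS, its size stays on the order of $\eta$, so with a fixed $\eta$ the parameters settle in a small neighbourhood of the minimum rather than converging exactly.

```python
import math


def grad_quadratic(theta):
    """Gradient of the toy loss 0.5 * (10 * a**2 + b**2) for theta = [a, b]."""
    return [10.0 * theta[0], 1.0 * theta[1]]


def rmsprop_step(theta, g, grad, eta, gamma=0.9, eps=1e-8):
    """Update the decaying average of squared gradients, then rescale the step."""
    g_new = [gamma * gi + (1.0 - gamma) * d * d for gi, d in zip(g, grad)]
    theta_new = [
        p - eta * d / (math.sqrt(gi) + eps)
        for p, d, gi in zip(theta, grad, g_new)
    ]
    return theta_new, g_new


def run_rmsprop(steps=500, eta=0.01):
    theta, g = [1.0, 1.0], [0.0, 0.0]  # g_0 = 0
    for _ in range(steps):
        theta, g = rmsprop_step(theta, g, grad_quadratic(theta), eta)
    return theta


print(run_rmsprop())  # both coordinates should be near 0, within a few multiples of eta
```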
Alternatively, in 6), a first-order moment approximator $m_t$ is added. Its square is subtracted from $g_t$ in the denominator of the preconditioner, so that the learning rate is effectively normalized by the standard deviation of $\nabla \mathcal{L}$. There is also a $v_t$ term included for momentum. This gives
\begin{align*} m_{t + 1} &= \gamma m_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t) \\ g_{t + 1} &= \gamma g_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t)^2 \\ v_{t + 1} &= \mu v_t - \frac{\eta \nabla \mathcal{L}(\theta_t)}{\sqrt{g_{t+1} - m_{t+1}^2 + \epsilon}} \\ \theta_{t + 1} &= \theta_t + v_{t + 1} \end{align*}
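A sketch of this variant for a single step follows. The function name `rmsprop_momentum_step` and the default hyperparameter values are assumptions chosen for illustration, not values taken from the text above. Note that $\epsilon$ sits inside the square root here, as in the equations.

```python
import math


def rmsprop_momentum_step(theta, m, g, v, grad,
                          eta=1e-4, gamma=0.95, mu=0.9, eps=1e-4):
    """One step of the variance-normalized RMSProp variant with momentum."""
    m_new = [gamma * mi + (1.0 - gamma) * d for mi, d in zip(m, grad)]
    g_new = [gamma * gi + (1.0 - gamma) * d * d for gi, d in zip(g, grad)]
    # g - m**2 estimates the gradient variance; eps keeps the root well defined
    v_new = [
        mu * vi - eta * d / math.sqrt(gi - mi * mi + eps)
        for vi, d, gi, mi in zip(v, grad, g_new, m_new)
    ]
    theta_new = [p + vi for p, vi in zip(theta, v_new)]
    return theta_new, m_new, g_new, v_new


# One step from theta = 1 with a gradient of 2 and all state initialized to 0.
theta, m, g, v = rmsprop_momentum_step([1.0], [0.0], [0.0], [0.0], [2.0])
print(theta, m, g, v)
```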
Adadelta
Adadelta 7) uses the same exponentially decaying moving average estimate of the gradient second moment $g_t$ as RMSProp. It also computes a moving average $x_t$ of the updates $v_t$ similar to momentum, but when updating this quantity it squares the current step, which I don't have any intuition for.
\begin{align*} g_{t + 1} &= \gamma g_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t)^2 \\ v_{t + 1} &= -\frac{\sqrt{x_t + \epsilon} \nabla \mathcal{L}(\theta_t)}{\sqrt{g_{t+1} + \epsilon}} \\ x_{t + 1} &= \gamma x_t + (1 - \gamma) v_{t + 1}^2 \\ \theta_{t + 1} &= \theta_t + v_{t + 1} \end{align*}
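The Adadelta update can be sketched as below. There is no learning rate $\eta$: the numerator $\sqrt{x_t + \epsilon}$ takes its place. The function name, the initialization $x_0 = g_0 = 0$ and the values of `gamma` and `eps` are assumptions made for illustration.

```python
import math


def adadelta_step(theta, g, x, grad, gamma=0.95, eps=1e-6):
    """One Adadelta step: scale the gradient by RMS(past updates) / RMS(gradients)."""
    g_new = [gamma * gi + (1.0 - gamma) * d * d for gi, d in zip(g, grad)]
    v_new = [
        -math.sqrt(xi + eps) * d / math.sqrt(gi + eps)
        for xi, d, gi in zip(x, grad, g_new)
    ]
    # x tracks a decaying average of the squared updates
    x_new = [gamma * xi + (1.0 - gamma) * vi * vi for xi, vi in zip(x, v_new)]
    theta_new = [p + vi for p, vi in zip(theta, v_new)]
    return theta_new, g_new, x_new


# One step from theta = 1 with a gradient of 2 and all state initialized to 0.
theta, g, x = adadelta_step([1.0], [0.0], [0.0], [2.0])
print(theta, g, x)
```

With $x_0 = 0$ the first steps are small, on the order of $\sqrt{\epsilon}$, and grow as $x_t$ accumulates the squared updates.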
Adam
Adam 8) is somewhat similar to Adagrad/Adadelta/RMSProp in that it computes a decayed moving average of the gradient and squared gradient (first and second moment estimates) at each time step. It differs mainly in two ways: First, the first-order moment moving average coefficient is decayed over time. Second, because the first and second order moment estimates are initialized to zero, some bias-correction is used to counteract the resulting bias towards zero. In most cases, the use of the first and second order moments ensures that the gradient descent step size is $\approx \pm \eta$ and that its magnitude is less than $\eta$. However, as $\theta_t$ approaches a true minimum, the uncertainty of the gradient will increase and the step size will decrease. The step size is also invariant to the scale of the gradients. Given hyperparameters $\gamma_1$, $\gamma_2$, $\lambda$, and $\eta$ (where $\lambda$ is the decay applied to $\gamma_1$, which the simplified update below leaves out), and setting $m_0 = 0$ and $g_0 = 0$ (note that the paper denotes $\gamma_1$ as $\beta_1$, $\gamma_2$ as $\beta_2$, $\eta$ as $\alpha$ and $g_t$ as $v_t$), the update rule is as follows:
\begin{align*} m_{t + 1} &= \gamma_1 m_t + (1 - \gamma_1) \nabla \mathcal{L}(\theta_t) \\ g_{t + 1} &= \gamma_2 g_t + (1 - \gamma_2) \nabla \mathcal{L}(\theta_t)^2 \\ \hat{m}_{t + 1} &= \frac{m_{t + 1}}{1 - \gamma_1^{t + 1}} \\ \hat{g}_{t + 1} &= \frac{g_{t + 1}}{1 - \gamma_2^{t + 1}} \\ \theta_{t + 1} &= \theta_t - \frac{\eta \hat{m}_{t + 1}}{\sqrt{\hat{g}_{t + 1}} + \epsilon} \end{align*}
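The following sketch implements the update above for one step and illustrates two of the stated properties: the first bias-corrected step has magnitude close to $\eta$, and it does not depend on the scale of the gradient. The function name is an assumption, the argument `t` is the 1-based step count (i.e. $t + 1$ in the notation above), and the default values are commonly used settings rather than anything prescribed by the text.

```python
import math


def adam_step(theta, m, g, grad, t,
              eta=0.001, gamma1=0.9, gamma2=0.999, eps=1e-8):
    """One Adam step; t is the 1-based count of steps taken so far."""
    m_new = [gamma1 * mi + (1.0 - gamma1) * d for mi, d in zip(m, grad)]
    g_new = [gamma2 * gi + (1.0 - gamma2) * d * d for gi, d in zip(g, grad)]
    # Bias correction compensates for initializing m and g at zero.
    m_hat = [mi / (1.0 - gamma1 ** t) for mi in m_new]
    g_hat = [gi / (1.0 - gamma2 ** t) for gi in g_new]
    theta_new = [
        p - eta * mh / (math.sqrt(gh) + eps)
        for p, mh, gh in zip(theta, m_hat, g_hat)
    ]
    return theta_new, m_new, g_new


# First step from theta = 1 with gradients that differ by a factor of 1000.
small, _, _ = adam_step([1.0], [0.0], [0.0], [2.0], t=1)
large, _, _ = adam_step([1.0], [0.0], [0.0], [2000.0], t=1)
print(small, large)  # both move by roughly eta
```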
ESGD
Adasecant
vSGD
Rprop