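This paper first appeared as a post on Sebastian Ruder's blog on January 16, 2016. This article translates the core parts of the paper that are relevant to Hung-yi Lee's (李宏毅) course.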

Translation of the paper:

An overview of gradient descent optimization algorithms

梯度下降优化算法概述

0. Abstract 摘要:

Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.

梯度下降优化算法虽然日益流行,但通常被当作黑盒优化器使用,因为很难找到关于其优缺点的实用解释。

This article aims to provide the reader with intuitions with regard to the behaviour of different algorithms that will allow her to put them to use.

这篇论文旨在帮助读者建立对于不同算法性能表现的直觉,以便更好地使用这些算法。

In the course of this overview, we look at different variants of gradient descent, summarize challenges, introduce the most common optimization algorithms, review architectures in a parallel and distributed setting, and investigate additional strategies for optimizing gradient descent.

这篇论文介绍了几种不同的梯度下降算法,以及它们所面临的挑战。还介绍了最常用的优化算法,并行和分布式架构,以及其他梯度下降算法优化的策略。

1. Introduction 引言:

Gradient Descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks.

梯度下降是最流行的优化算法之一,也是迄今为止最常用的神经网络优化方法。

At the same time, every state-of-the-art Deep Learning library contains implementations of various algorithms to optimize gradient descent (e.g. lasagne's, caffe's, and keras' documentation).

同时,各种最新的深度学习库(如:lasagne,caffe,keras)都实现了很多种梯度下降的优化算法。

These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.

然而,这些算法通常被当作黑盒优化器使用,因为很难找到关于其优缺点的实用解释。

This article aims at providing the reader with intuitions with regard to the behaviour of different algorithms for optimizing gradient descent that will help her to put them to use.

这篇论文旨在帮助读者建立对于不同梯度下降优化算法的性能表现的直觉,以便更好地使用这些算法。

In Section 2, we are first going to look at the different variants of gradient descent. We will then briefly summarize challenges during training in Section 3.

在第二章,我们首先看一下不同的梯度下降算法,然后在第三章,简要总结一下算法训练过程中面临的挑战。

Subsequently, in Section 4, we will introduce the most common optimization algorithms by showing their motivation to resolve these challenges and how this leads to the derivation of their update rules.

接下来,在第四章介绍最常见的优化算法,说明它们应对这些挑战的动机,以及由此如何推导出各自的更新规则。

Afterwards, in Section 5, we will take a short look at algorithms and architectures to optimize gradient descent in a parallel and distributed setting.

然后,在第五章简要介绍了在并行及分布式架构下梯度下降的优化算法及框架。

Finally, we will consider additional strategies that are helpful for optimizing gradient descent in Section 6.

最后,第六章介绍了一些其他有用的梯度下降优化策略。

Gradient descent is a way to minimize an objective function \(J(\theta)\) parameterized by a model's parameters \(\theta \in R^d\) by updating the parameters in the opposite direction of the gradient of the objective function \({\nabla}_{\theta} J({\theta})\) w.r.t. the parameters.

梯度下降方法就是对于目标函数 \(J(\theta)\),计算梯度 \({\nabla}_{\theta} J({\theta})\) ,并负向更新参数 \(\theta \in R^d\),使得目标函数最小。

The learning rate \(\eta\) determines the size of the steps we take to reach a (local) minimum.

学习率 \(\eta\) 确定了我们逼近(局部)最小值的步长。

In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.

换而言之,就是我们沿着目标函数曲面的下坡方向走,直到到达谷底。
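The role of the learning rate can be seen in a minimal sketch on a one-dimensional objective. The objective \(J(\theta) = \theta^2\), the function name `gradient_descent_1d`, and the two learning rates are illustrative assumptions, not taken from the paper.

```python
def gradient_descent_1d(theta0, learning_rate, n_steps):
    """Minimize the assumed objective J(theta) = theta**2 by gradient descent."""
    theta = theta0
    for _ in range(n_steps):
        grad = 2.0 * theta                    # dJ/dtheta for J = theta**2
        theta = theta - learning_rate * grad  # step against the gradient
    return theta


# Same starting point, two hypothetical learning rates.
small_step = gradient_descent_1d(theta0=1.0, learning_rate=0.1, n_steps=50)
large_step = gradient_descent_1d(theta0=1.0, learning_rate=1.1, n_steps=50)
```

With a learning rate of 0.1 each update multiplies \(\theta\) by 0.8, so the iterate shrinks toward the minimum at 0. With 1.1 the factor is -1.2, so the iterate overshoots the valley on every step and diverges.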

2. Gradient descent variants 梯度下降的变体

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function.

梯度下降有三种变体,它们的不同之处在于计算目标函数梯度时所使用的数据量不同。

Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.

根据数据量的不同,我们在参数更新的精度和更新时间之间作出权衡。

2.1 Batch gradient descent 批量梯度下降

Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function w.r.t. the parameters \(\theta\) for the entire training dataset:

Vanilla梯度下降,也叫作批量梯度下降,通过整个训练数据集,计算损失函数关于参数 \(\theta\) 的梯度:

\(\theta = \theta - \eta \cdot {\nabla}_{\theta} J ({\theta})\) ---- (1)

As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that do not fit in memory.

由于每次更新都需要通过整个数据集来计算梯度,所以批量梯度下降的计算速度很慢,而且对于超出内存限制的数据量很难处理。

Batch gradient descent also does not allow us to update our model online, i.e. with new examples on-the-fly.

批量梯度下降也不允许在线更新模型,也就是在运行中不能添加新的样本数据。

In code, batch gradient descent looks something like this:

批量梯度下降的代码如下:

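for i in range(nb_epochs):
    # gradient of the loss over the entire dataset
    params_grad = evaluate_gradient(loss_function, data, params)
    # step against the gradient, scaled by the learning rate
    params = params - learning_rate * params_grad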

For a pre-defined number of epochs, we first compute the gradient vector params_grad of the loss function for the whole dataset w.r.t. our parameter vector params.

对于预先给定的迭代次数(epochs),我们首先在整个数据集上计算损失函数关于参数向量 params 的梯度向量 params_grad。

Note that state-of-the-art deep learning libraries provide automatic differentiation that efficiently computes the gradient w.r.t. some parameters.

注意,很多最新的深度学习库都提供了自动求导的功能,可以高效地计算关于参数的梯度。

If you derive the gradients yourself, then gradient checking is a good idea.

如果你自己推导梯度,那么最好进行梯度检查。
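One common way to do such a check is to compare the hand-derived gradient with a central finite-difference approximation. The sketch below assumes a hypothetical least-squares loss. The names `numerical_gradient`, `loss`, and `analytic_gradient` are illustrative and do not come from the paper.

```python
import numpy as np


def numerical_gradient(f, params, eps=1e-6):
    """Central-difference approximation of the gradient of f at params."""
    grad = np.zeros_like(params, dtype=float)
    for k in range(params.size):
        step = np.zeros_like(params, dtype=float)
        step[k] = eps
        grad[k] = (f(params + step) - f(params - step)) / (2.0 * eps)
    return grad


# Hypothetical data for a small least-squares problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)


def loss(params):
    residual = X @ params - y
    return 0.5 * np.mean(residual ** 2)


def analytic_gradient(params):
    # Hand-derived gradient of the loss above.
    return X.T @ (X @ params - y) / len(y)


params = rng.normal(size=3)
max_abs_diff = np.max(
    np.abs(analytic_gradient(params) - numerical_gradient(loss, params))
)
```

A small `max_abs_diff` suggests the derivation is right. A large one usually points to a sign error or a missing factor in the analytic gradient.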

We then update our parameters in the opposite direction of the gradients with the learning rate determining how big of an update we perform.

接下来我们沿着负梯度方向更新参数,更新参数的步长由学习率决定。

Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.

批量梯度下降保证最终将收敛到凸函数的全局最小值,或者非凸函数的局部最小值。
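As an illustration of the convex case, here is a minimal runnable sketch of batch gradient descent on a synthetic least-squares problem, compared against the closed-form solution. The data, the helper `mse_gradient`, the learning rate, and the epoch count are all assumptions made for this example.

```python
import numpy as np

# Synthetic regression data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_params = np.array([1.5, -2.0, 0.5])
y = X @ true_params + 0.1 * rng.normal(size=200)


def mse_gradient(X, y, params):
    """Gradient of 0.5 * mean((X @ params - y)**2) over the whole dataset."""
    return X.T @ (X @ params - y) / len(y)


params = np.zeros(3)
learning_rate = 0.1
nb_epochs = 500
for _ in range(nb_epochs):
    # One update per pass over the entire dataset.
    params = params - learning_rate * mse_gradient(X, y, params)

# The least-squares minimizer, used here as the reference global minimum.
closed_form, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because the squared-error surface is convex, the iterates approach the same point that `np.linalg.lstsq` returns, provided the learning rate is small enough for the problem.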

2.2 Stochastic gradient descent 随机梯度下降

Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example \(x^{(i)}\) and label \(y^{(i)}\) :

相对而言,随机梯度下降算法(SGD)对每一个训练样本 \(x^{(i)}\) 及其标签 \(y^{(i)}\) 都进行一次参数更新:

\(\theta = \theta - \eta \cdot {\nabla}_{\theta} J ({\theta}; x^{(i)}; y^{(i)})\) ---- (2)

Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update.

批量梯度下降在大数据集上会有很多冗余计算,因为它在每次更新参数时重复计算相似样本的梯度。

SGD does away with this redundancy by performing one update at a time.

随机梯度下降(SGD)每次通过单个样本更新参数以消除冗余。

It is therefore usually much faster and can also be used to learn online.

因此它通常速度更快且可以在线学习。

SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily as in Figure 1.

SGD以较高的方差频繁更新参数,导致目标函数剧烈震荡,如图1所示。

While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD's fluctuation, on the one hand, enables it to jump to new and potentially better local minima.

批量梯度下降的参数会收敛到参数所在波谷的局部最小值,而随机梯度下降(SGD)则由于数据波动,可能跳跃到一个新的更好的局部最小值。

On the other hand, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting.

另一方面,最终收敛到确切最小值的这一过程变得更加复杂,因为SGD的参数变化在持续震荡。

However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively.

然而,已有研究表明,当我们缓慢减小学习率时,SGD表现出与批量梯度下降相同的收敛行为,几乎必然会收敛到局部最小值(非凸优化)或全局最小值(凸优化)。

Its code fragment simply adds a loop over the training examples and evaluates the gradient w.r.t. each example.

其代码片段只是增加了一层对训练样本的循环,并对每个样本分别计算梯度。

Note that we shuffle the training data at every epoch as explained in Section 6.1.

注意,如6.1节所述,我们在每个epoch都要先对训练数据进行“洗牌”。

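import numpy as np

for i in range(nb_epochs):
    np.random.shuffle(data)  # reshuffle the training data every epoch
    for example in data:
        # gradient for a single training example
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad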
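The effect of slowly decreasing the learning rate, mentioned earlier in this section, can be sketched on a synthetic least-squares problem. The schedule `initial_rate / (1 + decay * step)` is one common choice assumed here. The paper does not prescribe a specific schedule, and the data and constants are hypothetical.

```python
import numpy as np

# Synthetic regression data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_params = np.array([2.0, -1.0])
y = X @ true_params + 0.1 * rng.normal(size=500)

params = np.zeros(2)
initial_rate = 0.1
decay = 0.01  # hypothetical decay constant
step = 0
for epoch in range(20):
    order = rng.permutation(len(y))  # reshuffle the examples every epoch
    for i in order:
        # Learning rate shrinks as the number of updates grows.
        learning_rate = initial_rate / (1.0 + decay * step)
        # Gradient of 0.5 * (x_i . params - y_i)**2 for one example.
        grad = (X[i] @ params - y[i]) * X[i]
        params = params - learning_rate * grad
        step += 1

distance_to_truth = np.linalg.norm(params - true_params)
```

Early updates are large and noisy. As the rate decays, the fluctuations shrink and the parameters settle close to the minimizer instead of continuing to overshoot it.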

2.3 Mini-batch gradient descent 小批量梯度下降

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of \(n\) training examples:

小批量梯度下降集合了上面两种方法的优点,每次对n个训练样本进行小批量的参数更新。

\(\theta = \theta - \eta \cdot {\nabla}_{\theta} J ({\theta}; x^{(i:i+n)}; y^{(i:i+n)})\) ---- (3)

This way, it a) reduces the variance of the parameter updates, which can lead to more stable convergence;

and b) can make use of highly optimized matrix operations common to state-of-the-art deep learning libraries that make computing the gradient w.r.t. a mini-batch very efficient.

这种方式,一方面减小了参数更新的方差,使收敛得更平稳;另一方面,能够利用最新深度学习库中高度优化的矩阵运算,非常高效地计算小批量的梯度。
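The variance reduction in point a) can be checked empirically. The sketch below estimates the variance of the gradient at a fixed parameter vector for batches of size 1 and size 50 on synthetic data. The data and the helper names `minibatch_gradient` and `gradient_variance` are assumptions made for illustration.

```python
import numpy as np

# Synthetic regression data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(size=1000)
params = np.zeros(2)  # fixed point at which gradients are compared


def minibatch_gradient(batch_size):
    """Squared-error gradient computed on one randomly drawn mini-batch."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ params - yb) / batch_size


def gradient_variance(batch_size, n_draws=2000):
    """Total variance of the mini-batch gradient, summed over coordinates."""
    grads = np.array([minibatch_gradient(batch_size) for _ in range(n_draws)])
    return grads.var(axis=0).sum()


var_single = gradient_variance(1)   # plain SGD: one example per update
var_batch = gradient_variance(50)   # mini-batch of 50 examples
```

Averaging over more examples gives a less noisy gradient estimate, so `var_batch` comes out much smaller than `var_single`.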

Common mini-batch sizes range between 50 and 256, but can vary for different applications.

小批量的大小一般在50~256之间,也可以根据具体应用来调整。

Mini-batch gradient descent is typically the algorithm of choice when training a neural network and the term SGD usually is employed also when mini-batches are used.

训练神经网络时通常首选小批量梯度下降,而且在使用小批量时通常也沿用SGD这一术语。

Note: In modifications of SGD in the rest of this post, we leave out the parameters \(x^{(i:i+n)}; y^{(i:i+n)}\) for simplicity.

注意:为了简便起见,下文对于SGD的改进中我们省略了\(x^{(i:i+n)}; y^{(i:i+n)}\)参数。

In code, instead of iterating over examples, we now iterate over mini-batches of size 50:

代码如下。不同于之前遍历每个单一样本,我们现在迭代的是每个大小为50个样本的小批量:

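import numpy as np

for i in range(nb_epochs):
    np.random.shuffle(data)  # reshuffle the training data every epoch
    for batch in get_batches(data, batch_size=50):
        # gradient averaged over one mini-batch of 50 examples
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad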
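The helper `get_batches` is not defined in the text. One possible implementation is sketched below, assuming `data` is a NumPy array (or any sliceable sequence) with one training example per row.

```python
import numpy as np


def get_batches(data, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


# Hypothetical dataset: 120 examples with 2 features each.
data = np.arange(240).reshape(120, 2)
batches = list(get_batches(data, batch_size=50))
batch_sizes = [len(batch) for batch in batches]
```

When the dataset size is not a multiple of the batch size, the last batch in this sketch is smaller than the others. Dropping or padding that final batch is an equally reasonable design choice.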
