The original article was first published as a post on Sebastian Ruder's blog on 16 January 2016. This article presents the core parts of the paper that are relevant to Hung-yi Lee's (李宏毅) course.

The text of the paper follows:

An overview of gradient descent optimization algorithms

0. Abstract

Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.

This article aims to provide the reader with intuitions with regard to the behaviour of different algorithms that will allow her to put them to use.

In the course of this overview, we look at different variants of gradient descent, summarize challenges, introduce the most common optimization algorithms, review architectures in a parallel and distributed setting, and investigate additional strategies for optimizing gradient descent.

1. Introduction

Gradient Descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks.

At the same time, every state-of-the-art Deep Learning library contains implementations of various algorithms to optimize gradient descent (e.g. lasagne's, caffe's, and keras' documentation).

These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.

This article aims at providing the reader with intuitions with regard to the behaviour of different algorithms for optimizing gradient descent that will help her to put them to use.

In Section 2, we are first going to look at the different variants of gradient descent. We will then briefly summarize challenges during training in Section 3.

Subsequently, in Section 4, we will introduce the most common optimization algorithms by showing their motivation to resolve these challenges and how this leads to the derivation of their update rules.

Afterwards, in Section 5, we will take a short look at algorithms and architectures to optimize gradient descent in a parallel and distributed setting.

Finally, we will consider additional strategies that are helpful for optimizing gradient descent in Section 6.

Gradient descent is a way to minimize an objective function \(J(\theta)\) parameterized by a model's parameters \(\theta \in R^d\) by updating the parameters in the opposite direction of the gradient of the objective function \(\nabla_{\theta} J(\theta)\) w.r.t. the parameters.

The learning rate \(\eta\) determines the size of the steps we take to reach a (local) minimum.

In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
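
As a concrete illustration (a minimal sketch of my own, not part of the paper), a few gradient-descent steps on the toy objective \(J(\theta) = \theta^2\), whose gradient is \(2\theta\), look like this; the starting point and learning rate are arbitrary:

# Minimal sketch (not from the paper): gradient descent on J(theta) = theta^2.
def grad_J(theta):
  return 2.0 * theta  # analytic gradient of theta^2

theta = 5.0  # arbitrary starting point
eta = 0.1    # learning rate (step size)
for step in range(50):
  theta = theta - eta * grad_J(theta)  # move against the gradient
print(theta)  # approaches the minimizer theta = 0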

2. Gradient descent variants

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function.

Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.

2.1 Batch gradient descent

Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function w.r.t. the parameters \(\theta\) for the entire training dataset:

\(\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta)\) ---- (1)

As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that do not fit in memory.

Batch gradient descent also does not allow us to update our model online, i.e. with new examples on-the-fly.

In code, batch gradient descent looks something like this:

for i in range(nb_epochs):
  params_grad = evaluate_gradient(loss_function, data, params)
  params = params - learning_rate * params_grad

For a pre-defined number of epochs, we first compute the gradient vector params_grad of the loss function for the whole dataset w.r.t. our parameter vector params.

Note that state-of-the-art deep learning libraries provide automatic differentiation that efficiently computes the gradient w.r.t. some parameters.

If you derive the gradients yourself, then gradient checking is a good idea.
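
As a hedged sketch of what such a check might look like (the helper below is my own illustration, not from the paper), one can compare an analytic gradient against a centered finite-difference estimate:

import numpy as np

def numerical_gradient(loss_function, params, eps=1e-6):
  # Centered finite-difference estimate of the gradient (illustrative only).
  grad = np.zeros_like(params)
  for i in range(params.size):
    shift = np.zeros_like(params)
    shift[i] = eps
    grad[i] = (loss_function(params + shift) - loss_function(params - shift)) / (2 * eps)
  return grad

# Check against a loss whose gradient is known analytically:
loss = lambda p: np.sum(p ** 2)   # J(p) = ||p||^2
analytic_grad = lambda p: 2 * p   # its exact gradient
p = np.random.randn(3)
assert np.allclose(numerical_gradient(loss, p), analytic_grad(p), atol=1e-4)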

We then update our parameters in the opposite direction of the gradients, with the learning rate determining how big of an update we perform.

Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.

2.2 Stochastic gradient descent

Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example \(x^{(i)}\) and label \(y^{(i)}\):

\(\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\) ---- (2)

Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update.

SGD does away with this redundancy by performing one update at a time.

It is therefore usually much faster and can also be used to learn online.

SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily, as in Figure 1.

While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD's fluctuation, on the one hand, enables it to jump to new and potentially better local minima.

On the other hand, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting.

However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively.

Its code fragment simply adds a loop over the training examples and evaluates the gradient w.r.t. each example.

Note that we shuffle the training data at every epoch, as explained in Section 6.1.

for i in range(nb_epochs):
  np.random.shuffle(data)
  for example in data:
    params_grad = evaluate_gradient(loss_function, example, params)
    params = params - learning_rate * params_grad
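
The fragment above uses a fixed learning rate. A minimal sketch of the "slowly decrease the learning rate" idea mentioned earlier (the 1/t-style schedule and its constants are my own illustrative choices, not from the paper), built on the same placeholder names as the fragment above:

initial_lr = 0.1  # illustrative starting learning rate
decay = 1e-3      # illustrative decay constant
t = 0
for i in range(nb_epochs):
  np.random.shuffle(data)
  for example in data:
    t += 1
    learning_rate = initial_lr / (1.0 + decay * t)  # learning rate shrinks over time
    params_grad = evaluate_gradient(loss_function, example, params)
    params = params - learning_rate * params_grad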

2.3 Mini-batch gradient descent

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of \(n\) training examples:

\(\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta; x^{(i:i+n)}, y^{(i:i+n)})\) ---- (3)

This way, it a) reduces the variance of the parameter updates, which can lead to more stable convergence; and b) can make use of highly optimized matrix optimizations common to state-of-the-art deep learning libraries that make computing the gradient w.r.t. a mini-batch very efficient.
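
To make point (b) concrete, here is a small sketch of my own (not from the paper): for a linear least-squares model, the gradient over an entire mini-batch is a single matrix product instead of a loop over individual examples:

import numpy as np

# Hypothetical mini-batch for a linear model y ~ X @ w with mean squared error.
X = np.random.randn(64, 10)  # 64 examples, 10 features
y = np.random.randn(64)
w = np.zeros(10)

residual = X @ w - y                        # predictions minus targets for the whole batch
grad = 2.0 / X.shape[0] * (X.T @ residual)  # vectorized gradient of the mean squared error
w = w - 0.01 * grad                         # one mini-batch update with a fixed learning rate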

Common mini-batch sizes range between 50 and 256, but can vary for different applications.

Mini-batch gradient descent is typically the algorithm of choice when training a neural network, and the term SGD usually is employed also when mini-batches are used.

Note: In modifications of SGD in the rest of this post, we leave out the parameters \(x^{(i:i+n)}; y^{(i:i+n)}\) for simplicity.

In code, instead of iterating over examples, we now iterate over mini-batches of size 50:

for i in range(nb_epochs):
  np.random.shuffle(data)
  for batch in get_batches(data, batch_size=50):
    params_grad = evaluate_gradient(loss_function, batch, params)
    params = params - learning_rate * params_grad
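
The get_batches helper is not defined in the post; a minimal sketch of what it might look like (an assumption on my part) is:

def get_batches(data, batch_size=50):
  # Yield successive mini-batches of batch_size examples (illustrative helper).
  for start in range(0, len(data), batch_size):
    yield data[start:start + batch_size]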
