https://stats.stackexchange.com/questions/164876/tradeoff-batch-size-vs-number-of-iterations-to-train-a-neural-network

It has been observed in practice that when using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize.

https://stackoverflow.com/questions/4752626/epoch-vs-iteration-when-training-neural-networks/31842945

In neural network terminology:

  • one epoch = one forward pass and one backward pass of all the training examples
  • batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
  • number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).

Example: if you have 1000 training examples, and your batch size is 500, then it will take 2 iterations to complete 1 epoch.
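
The arithmetic in the example above can be sketched in a few lines of Python. The function names `iterations_per_epoch` and `total_iterations` are made up for illustration, and the sketch assumes that a final, smaller batch still counts as one iteration (hence the ceiling); some frameworks drop that last partial batch instead.

```python
import math


def iterations_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of forward/backward passes needed to see every example once.

    Assumption: a trailing partial batch counts as a full iteration.
    """
    if num_examples <= 0 or batch_size <= 0:
        raise ValueError("num_examples and batch_size must be positive")
    return math.ceil(num_examples / batch_size)


def total_iterations(num_examples: int, batch_size: int, num_epochs: int) -> int:
    """Total number of parameter updates over the whole training run."""
    return iterations_per_epoch(num_examples, batch_size) * num_epochs


# 1000 examples with batch size 500: two iterations complete one epoch.
print(iterations_per_epoch(1000, 500))
# Ten epochs of the same setup.
print(total_iterations(1000, 500, 10))
```

With 1000 examples and a batch size of 300, the same function gives 4 iterations per epoch, the last of which would process only 100 examples.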

http://ufldl.stanford.edu/tutorial/supervised/OptimizationStochasticGradientDescent/

Stochastic Gradient Descent (SGD) simply does away with the expectation in the update and computes the gradient of the parameters using only a single or a few training examples.

Overview

Batch methods, such as limited-memory BFGS, which use the full training set to compute the next update to the parameters at each iteration, tend to converge very well to local optima. They are also straightforward to get working given a good off-the-shelf implementation (e.g. minFunc), because they have very few hyper-parameters to tune. In practice, however, computing the cost and gradient for the entire training set can be very slow, and sometimes intractable on a single machine if the dataset is too big to fit in main memory. Another issue with batch optimization methods is that they don’t give an easy way to incorporate new data in an ‘online’ setting. Stochastic Gradient Descent (SGD) addresses both of these issues by following the negative gradient of the objective after seeing only a single or a few training examples. The use of SGD in the neural network setting is motivated by the high cost of running backpropagation over the full training set. SGD can overcome this cost and still lead to fast convergence.

Stochastic Gradient Descent

The standard gradient descent algorithm updates the parameters $\theta$ of the objective $J(\theta)$ as,

$$\theta = \theta - \alpha \nabla_\theta E[J(\theta)]$$

where the expectation in the above equation is approximated by evaluating the cost and gradient over the full training set. Stochastic Gradient Descent (SGD) simply does away with the expectation in the update and computes the gradient of the parameters using only a single or a few training examples. The new update is given by,

$$\theta = \theta - \alpha \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$$

with a pair $(x^{(i)}, y^{(i)})$ from the training set.
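
As a rough illustration of the two update rules, here is a minimal NumPy sketch that fits a linear least-squares model with full-batch gradient descent and with SGD. Everything specific in it is an assumption made for the example rather than something taken from the tutorial: the squared-error objective, the synthetic noiseless data, the learning rates, and the function names `gradient`, `batch_gradient_descent` and `sgd`. Setting `batch_size=1` uses a single pair per update; a larger value gives the "few training examples" (mini-batch) variant.

```python
import numpy as np


def gradient(theta, X, y):
    """Gradient of J(theta) = mean((X @ theta - y) ** 2) / 2 over the rows given."""
    residual = X @ theta - y
    return X.T @ residual / len(y)


def batch_gradient_descent(X, y, alpha=0.1, num_steps=200):
    """Each update uses the gradient over the full training set."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_steps):
        theta = theta - alpha * gradient(theta, X, y)
    return theta


def sgd(X, y, alpha=0.05, num_epochs=50, batch_size=1, seed=0):
    """Each update uses the gradient over only `batch_size` examples."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(num_epochs):
        order = rng.permutation(n)  # reshuffle the examples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            theta = theta - alpha * gradient(theta, X[idx], y[idx])
    return theta


# Synthetic, noiseless data (an assumption that keeps the example easy to check).
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(200, 2))
true_theta = np.array([2.0, -3.0])
y = X @ true_theta

theta_batch = batch_gradient_descent(X, y)
theta_sgd = sgd(X, y)
print("batch GD:", theta_batch)
print("SGD:     ", theta_sgd)
```

Both routines should recover parameters close to `true_theta` here. On noisy data, SGD with a fixed learning rate would typically hover around the optimum rather than settle on it, which is why the learning rate is often decayed in practice.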
