https://stats.stackexchange.com/questions/164876/tradeoff-batch-size-vs-number-of-iterations-to-train-a-neural-network

It has been observed in practice that when using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize.

https://stackoverflow.com/questions/4752626/epoch-vs-iteration-when-training-neural-networks/31842945

In the neural network terminology:

  • one epoch = one forward pass and one backward pass of all the training examples
  • batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
  • number of iterations = number of passes, each pass using [batch size] examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).

Example: if you have 1000 training examples, and your batch size is 500, then it will take 2 iterations to complete 1 epoch.
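To make the bookkeeping concrete, here is a minimal sketch of that arithmetic (the variable names and the epoch count are ours, not from the answer):

```python
import math

num_examples = 1000   # size of the training set from the example above
batch_size = 500

# iterations needed to see every example once, i.e. one epoch
iterations_per_epoch = math.ceil(num_examples / batch_size)
print(iterations_per_epoch)        # -> 2

num_epochs = 10                    # assumed, just for illustration
total_iterations = num_epochs * iterations_per_epoch
print(total_iterations)            # -> 20 forward/backward passes in total
```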

http://ufldl.stanford.edu/tutorial/supervised/OptimizationStochasticGradientDescent/

Stochastic Gradient Descent (SGD) simply does away with the expectation in the update and computes the gradient of the parameters using only a single or a few training examples.

Overview

Batch methods, such as limited-memory BFGS, which use the full training set to compute the next parameter update at each iteration, tend to converge very well to local optima. They are also straightforward to get working given a good off-the-shelf implementation (e.g. minFunc), because they have very few hyperparameters to tune. However, in practice computing the cost and gradient over the entire training set is often very slow, and can be intractable on a single machine if the dataset is too big to fit in main memory. Another issue with batch optimization methods is that they do not give an easy way to incorporate new data in an 'online' setting. Stochastic Gradient Descent (SGD) addresses both of these issues by following the negative gradient of the objective after seeing only a single or a few training examples. The use of SGD in the neural network setting is motivated by the high cost of running backpropagation over the full training set. SGD can overcome this cost and still lead to fast convergence.
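As a rough sketch of the distinction (the `grad` callable and the arrays `X`, `y` are hypothetical placeholders, not part of the tutorial), the only difference between a batch update and a stochastic or mini-batch update is which rows are used to evaluate the gradient:

```python
import numpy as np

def batch_step(theta, X, y, grad, alpha):
    # Batch method: one parameter update uses the entire training set.
    return theta - alpha * grad(theta, X, y)

def sgd_step(theta, X, y, grad, alpha, batch_size=1):
    # SGD / mini-batch: one update uses only a few randomly chosen examples,
    # so newly arriving data can simply be appended to X, y and sampled later.
    idx = np.random.choice(len(X), size=batch_size, replace=False)
    return theta - alpha * grad(theta, X[idx], y[idx])
```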

Stochastic Gradient Descent

The standard gradient descent algorithm updates the parameters θ of the objective J(θ) as,

θ = θ − α∇θ E[J(θ)]

where the expectation in the above equation is approximated by evaluating the cost and gradient over the full training set. Stochastic Gradient Descent (SGD) simply does away with the expectation in the update and computes the gradient of the parameters using only a single or a few training examples. The new update is given by,

θ = θ − α∇θ J(θ; x(i), y(i))

with a pair (x(i), y(i)) from the training set.
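A self-contained sketch of this update, assuming a squared-error objective J(θ; x(i), y(i)) = ½(θᵀx(i) − y(i))² on synthetic data (both the objective and the data are our assumptions; the update rule itself is as above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))             # synthetic training inputs
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

theta = np.zeros(3)
alpha = 0.01                               # learning rate
for epoch in range(5):
    for i in rng.permutation(len(X)):      # one epoch = one pass over all examples
        grad_i = (X[i] @ theta - y[i]) * X[i]   # ∇θ J(θ; x(i), y(i)) for squared error
        theta = theta - alpha * grad_i          # θ ← θ − α ∇θ J(θ; x(i), y(i))

print(theta)   # should end up close to true_theta
```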

