Does batch_size have any effect on result quality? How should we set the optimal batch size and number of iterations?
1. The batch size, a: the number of training examples fed into the neural network at once.
2. The number of iterations, b: how many times we feed a batch into the NN.
a * b = the number of training examples in one epoch (a worked example follows the list below).
  • Batch Gradient Descent. Batch Size = Size of Training Set
  • Stochastic Gradient Descent. Batch Size = 1
  • Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set
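For example (with illustrative numbers), a training set of 2,000 examples and a batch size of a = 100 gives b = 2,000 / 100 = 20 iterations per epoch; batch gradient descent would instead make a single update over all 2,000 examples per epoch, and stochastic gradient descent would make 2,000 updates of one example each.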

from here.  The "sample size" you're talking about is referred to as batch size, B. The batch size parameter is just one of the hyper-parameters you'll be tuning when you train a neural network with mini-batch Stochastic Gradient Descent (SGD) and is data dependent. The most basic method of hyper-parameter search is to do a grid search over the learning rate and batch size to find a pair which makes the network converge.
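A minimal sketch of such a grid search, assuming a hypothetical train_and_evaluate routine that trains the network with the given hyper-parameters and returns a validation loss (the names and grid values below are placeholders, not from the original answer):

```python
import itertools

def train_and_evaluate(learning_rate, batch_size):
    """Placeholder: substitute your own training run that returns a
    validation loss for the given (learning_rate, batch_size) pair."""
    raise NotImplementedError

learning_rates = [1e-1, 1e-2, 1e-3]   # illustrative grid values
batch_sizes = [16, 64, 256]

results = {}
for lr, bs in itertools.product(learning_rates, batch_sizes):
    results[(lr, bs)] = train_and_evaluate(learning_rate=lr, batch_size=bs)

# Pick the pair with the lowest validation loss.
best_lr, best_bs = min(results, key=results.get)
print("best pair:", best_lr, best_bs)
```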

To understand what the batch size should be, it's important to see the relationship between batch gradient descent, online SGD, and mini-batch SGD. Here's the general formula for the weight update step in mini-batch SGD, which is a generalization of all three types. [2]
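In the notation used below (θ for the weights, ϵ(t) for the learning rate at epoch t, B for the batch size, and m for the mini-batch drawn from the training set x), that update is

θ_{t+1} = θ_t − ϵ(t) · ∇L_SGD(θ_t, m), where ∇L_SGD(θ, m) = (1/B) · Σ_{x_i ∈ m} ∇L(θ, x_i) is the gradient averaged over the mini-batch m.

The three special cases are: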

  1. Batch gradient descent: B = |x|
  2. Online stochastic gradient descent: B = 1
  3. Mini-batch stochastic gradient descent: 1 < B < |x|

Note that in case 1 (batch gradient descent), the loss function is no longer a random variable and is not a stochastic approximation.

SGD converges faster than normal "batch" gradient descent because it updates the weights after looking at a randomly selected subset of the training set. Let x be our training set and let m⊂x. The batch size B is just the cardinality of m: B=|m|.

Batch gradient descent updates the weights θ using the gradients of the entire dataset x; whereas SGD updates the weights using an average of the gradients for a mini-batch m. (Using the average as opposed to a sum prevents the algorithm from taking steps that are too large if the dataset is very large. Otherwise, you would need to adjust your learning rate based on the size of the dataset.) The expected value of this stochastic approximation of the gradient used in SGD is equal to the deterministic gradient used in batch gradient descent: E[∇L_SGD(θ, m)] = ∇L(θ, x).
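As a small numerical check of that unbiasedness claim (a toy example of my own, using a simple squared-error loss, not code from the referenced material), averaging the mini-batch gradient over many randomly drawn mini-batches recovers the full-batch gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy loss: L(theta, x) = mean_i (theta - x_i)^2, so the full-batch
# gradient is 2 * (theta - mean(x)).
x = rng.normal(size=1000)   # stand-in "training set"
theta = 0.5
B = 32                      # mini-batch size

full_grad = 2 * (theta - x.mean())

# Mini-batch gradient, averaged over many random mini-batches m with |m| = B.
minibatch_grads = [2 * (theta - rng.choice(x, size=B, replace=False).mean())
                   for _ in range(10_000)]

print(full_grad, np.mean(minibatch_grads))   # the two numbers should be close
```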

Each time we take a sample and update our weights it is called a mini-batch. Each time we run through the entire dataset, it's called an epoch.

Let's say that we have some data vector x ∈ R^D, an initial weight vector that parameterizes our neural network, θ_0 ∈ R^S, and a loss function L(θ, x): R^S × R^D → R that we are trying to minimize. If we have T training examples and a batch size of B, then we can split those training examples into C mini-batches:

C=⌈T/B⌉ 

For simplicity we can assume that T is evenly divisible by B. When it is not, as is often the case, each mini-batch should be weighted in proportion to its size.
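For example (illustrative numbers), T = 1,050 and B = 100 give C = ⌈1050/100⌉ = 11 mini-batches, the last of which holds only 50 examples.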

An iterative algorithm for SGD with M epochs is given below:
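Here is one way to write that loop in Python (an illustrative sketch using the symbols defined above, not code taken from the cited references):

```python
import numpy as np

def sgd(x, theta0, grad_L, epsilon, B, M, seed=0):
    """Vanilla mini-batch SGD (no momentum).

    x       : array of T training examples
    theta0  : initial weight vector theta_0
    grad_L  : function (theta, minibatch) -> gradient averaged over the mini-batch
    epsilon : learning-rate schedule, a function epoch -> learning rate
    B       : batch size
    M       : number of epochs
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    T = len(x)
    for epoch in range(M):
        # Shuffle once per epoch, then read the examples in order
        # (see the note on memory access below).
        shuffled = x[rng.permutation(T)]
        for start in range(0, T, B):          # C = ceil(T / B) mini-batches
            m = shuffled[start:start + B]     # the last batch may be smaller than B
            theta = theta - epsilon(epoch) * grad_L(theta, m)
    return theta
```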

Note: in real life we read the training-example data from memory and, due to cache pre-fetching and other memory tricks done by your computer, your algorithm will run faster if the memory accesses are coalesced, i.e. when you read the memory in order and don't jump around randomly. So most SGD implementations shuffle the dataset and then load the examples into memory in the order that they'll be read.

The major parameters for the vanilla (no momentum) SGD described above are:

  1. Learning Rate: ϵ

I like to think of epsilon as a function from the epoch count to a learning rate. This function is called the learning rate schedule.

ϵ(t):N→R 

If you want to have the learning rate fixed, just define epsilon as a constant function.
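For example (illustrative sketches, not prescriptions from the references), a constant schedule and a simple step-decay schedule could be written as:

```python
def constant_schedule(epoch, lr=0.01):
    """Fixed learning rate: epsilon(t) is the same for every epoch."""
    return lr

def step_decay_schedule(epoch, lr0=0.1, drop=0.5, every=10):
    """Multiply the initial rate by `drop` every `every` epochs (one common choice)."""
    return lr0 * drop ** (epoch // every)
```

Either function can be passed as the epsilon argument of the SGD sketch above.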

  2. Batch Size

Batch size determines how many examples you look at before making a weight update. The lower it is, the noisier the training signal will be; the higher it is, the longer it will take to compute the gradient for each step.

Citations & Further Reading:

  1. Introduction to Gradient Based Learning
  2. Practical recommendations for gradient-based training of deep architectures
  3. Efficient Mini-batch Training for Stochastic Optimization
 
from here: when using a larger batch, there is a significant degradation in the quality of the model, as measured by its ability to generalize. When you put m examples in a mini-batch, you need to do O(m) computation and use O(m) memory, but you reduce the amount of uncertainty in the gradient by a factor of only O(√m).
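A quick numerical illustration of that √m scaling (a toy example of my own, not from the quoted source): the standard deviation of the mini-batch mean gradient shrinks only like 1/√m, so quadrupling the batch size quadruples the compute but merely halves the gradient noise.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in per-example gradients with standard deviation 2.0.
per_example_grads = rng.normal(loc=1.0, scale=2.0, size=100_000)

for m in (1, 4, 16, 64, 256):
    # Spread of the mini-batch mean over many resampled batches of size m.
    batch_means = [rng.choice(per_example_grads, size=m).mean() for _ in range(2000)]
    print(m, round(float(np.std(batch_means)), 3))   # roughly 2.0 / sqrt(m)
```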
 
 
