Does batch_size have any effect on result quality? How do you set the optimal batch size and number of iterations?
1. The batch size, a: the number of training examples fed into the neural network at once.
2. The number of iterations, b: how many times a batch is fed into the network.
a * b = the number of training examples seen in one epoch (a worked example follows the list below).
  • Batch Gradient Descent. Batch Size = Size of Training Set
  • Stochastic Gradient Descent. Batch Size = 1
  • Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set
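
As a concrete illustration (numbers chosen arbitrarily): with 1,000 training examples and a batch size of 100, one epoch takes 1,000 / 100 = 10 iterations; a batch size of 1,000 would be batch gradient descent, and a batch size of 1 would be stochastic gradient descent.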

From here: The "sample size" you're talking about is referred to as the batch size, B. The batch size is just one of the hyper-parameters you'll be tuning when you train a neural network with mini-batch Stochastic Gradient Descent (SGD), and it is data dependent. The most basic method of hyper-parameter search is to do a grid search over the learning rate and the batch size to find a pair that makes the network converge.

To understand what the batch size should be, it's important to see the relationship between batch gradient descent, online SGD, and mini-batch SGD. Here's the general formula for the weight update step in mini-batch SGD, which is a generalization of all three types. [2]
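
With weights θ, learning-rate schedule ϵ(t), and a mini-batch m = {m_1, …, m_B} of size B drawn from the training set x (these symbols are all defined in the paragraphs that follow), the standard form of that update is:

    θ ← θ − ϵ(t) · (1/B) · Σ_{i=1..B} ∇L(θ, m_i)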

  1. Batch gradient descent: B = |x| (the full training set)
  2. Online stochastic gradient descent: B = 1
  3. Mini-batch stochastic gradient descent: 1 < B < |x|

Note that in case 1, the loss is computed over the entire training set, so it is no longer a random variable and the gradient is not a stochastic approximation.

SGD converges faster than normal "batch" gradient descent because it updates the weights after looking at a randomly selected subset of the training set. Let x be our training set and let m ⊂ x. The batch size B is just the cardinality of m: B = |m|.

Batch gradient descent updates the weights θ using the gradients of the entire dataset x, whereas SGD updates the weights using an average of the gradients for a mini-batch m. (Using the average as opposed to a sum prevents the algorithm from taking steps that are too large if the dataset is very large. Otherwise, you would need to adjust your learning rate based on the size of the dataset.) The expected value of this stochastic approximation of the gradient used in SGD is equal to the deterministic gradient used in batch gradient descent: E[∇L_SGD(θ, m)] = ∇L(θ, x).
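
A quick empirical check of that identity (a minimal sketch; the quadratic loss and every name below are illustrative, not taken from the original): the average of many mini-batch gradients should match the full-batch gradient.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(1000, 5))     # training set: 1000 examples in R^5
    theta = rng.normal(size=5)         # current weights

    def grad(theta, batch):
        # gradient of the average squared loss (1/(2n)) * sum((batch @ theta)**2)
        return batch.T @ (batch @ theta) / len(batch)

    full = grad(theta, x)              # deterministic gradient over all of x
    estimates = [grad(theta, x[rng.choice(1000, size=32, replace=False)])
                 for _ in range(20000)]                     # 20,000 mini-batches with B = 32
    print(np.abs(np.mean(estimates, axis=0) - full).max())  # small; shrinks as more mini-batches are averaged

The residual printed at the end is only the sampling noise from using a finite number of mini-batches.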

Each subset m that we sample to update the weights is called a mini-batch. Each full pass through the entire dataset is called an epoch.

Let's say that we have a data vector x ∈ R^D, an initial weight vector θ_0 ∈ R^S that parameterizes our neural network, and a loss function L(θ, x): R^S × R^D → R that we are trying to minimize. If we have T training examples and a batch size of B, then we can split those training examples into C mini-batches:

C = ⌈T / B⌉

For simplicity, we can assume that T is evenly divisible by B. When it is not, as is often the case, each mini-batch should be weighted in proportion to its size.
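
For instance (illustrative numbers): with T = 1050 and B = 100 we get C = ⌈1050 / 100⌉ = 11 mini-batches, and the last one holds only 50 examples; one simple convention is to scale each batch's contribution by its actual size relative to B.

    import math

    T, B = 1050, 100                                 # illustrative sizes
    C = math.ceil(T / B)                             # 11 mini-batches
    sizes = [min(B, T - i * B) for i in range(C)]    # [100, 100, ..., 100, 50]
    weights = [s / B for s in sizes]                 # the last, smaller batch gets weight 0.5
    print(C, sizes[-1], weights[-1])                 # 11 50 0.5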

An iterative algorithm for SGD with M epochs is given below:
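
A minimal Python sketch of such a loop, using the notation above (M epochs, batch size B, learning-rate schedule ϵ(t)); the function and array names are placeholders, not part of the original:

    import numpy as np

    def sgd(theta, data, loss_grad, epsilon, M, B):
        """Vanilla mini-batch SGD. loss_grad(theta, batch) returns the gradient
        averaged over batch; epsilon(t) is the learning-rate schedule."""
        rng = np.random.default_rng(0)
        T = len(data)
        for t in range(M):                       # one epoch = one full pass over the data
            order = rng.permutation(T)           # shuffle the example order once per epoch ...
            for start in range(0, T, B):         # ... then read the examples sequentially
                batch = data[order[start:start + B]]
                theta = theta - epsilon(t) * loss_grad(theta, batch)
        return theta

Shuffling the index order once and then walking through it sequentially is the access pattern described in the note below.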

Note: in real life we're reading these training example data from memory and, due to cache pre-fetching and other memory tricks done by your computer, your algorithm will run faster if the memory accesses are coalesced, i.e. when you read the memory in order and don't jump around randomly. So, most SGD implementations shuffle the dataset and then load the examples into memory in the order that they'll be read.

The major parameters for the vanilla (no momentum) SGD described above are:

  1. Learning Rate: ϵ

I like to think of epsilon as a function from the epoch count to a learning rate. This function is called the learning rate schedule.

ϵ(t): N → R

If you want to have the learning rate fixed, just define epsilon as a constant function.
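
For example (a small sketch; the constants are arbitrary), a constant schedule and a simple step-decay schedule could look like this, and either can be passed as the epsilon argument of the SGD sketch above:

    def epsilon_constant(t):
        return 0.01                        # the same learning rate at every epoch

    def epsilon_step_decay(t):
        return 0.1 * (0.5 ** (t // 10))    # halve the learning rate every 10 epochs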

  2. Batch Size

Batch size determines how many examples you look at before making a weight update. The lower it is, the noisier the training signal will be; the higher it is, the longer it will take to compute the gradient for each step.

Citations & Further Reading:

  1. Introduction to Gradient Based Learning
  2. Practical recommendations for gradient-based training of deep architectures
  3. Efficient Mini-batch Training for Stochastic Optimization
 
From here: when using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize. When you put m examples in a mini-batch, you need to do O(m) computation and use O(m) memory, but you reduce the amount of uncertainty in the gradient by a factor of only O(√m).
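
To put rough numbers on that (purely illustrative): since the standard error of an averaged gradient scales as σ/√m, going from m = 1 to m = 100 costs about 100× the computation and memory per step but shrinks the gradient noise only by a factor of √100 = 10, which is one reason very large batches are not automatically better.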
 
 
