Does batch_size have any affects in results quality? how to set the optimal batch size and number of iterations?
1. the batch size, a : the number of training data feed into neural network once.
2. the number of iterations, b : how many time we should feed into NN. 
a*b = the number of training examples. 
  • Batch Gradient Descent. Batch Size = Size of Training Set
  • Stochastic Gradient Descent. Batch Size = 1
  • Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set

from here.  The "sample size" you're talking about is referred to as batch size, BB. The batch size parameter is just one of the hyper-parameters you'll be tuning when you train a neural network with mini-batch Stochastic Gradient Descent (SGD) and is data dependent. The most basic method of hyper-parameter search is to do a grid search over the learning rate and batch size to find a pair which makes the network converge.

To understand what the batch size should be, it's important to see the relationship between batch gradient descent, online SGD, and mini-batch SGD. Here's the general formula for the weight update step in mini-batch SGD, which is a generalization of all three types. [2]

  1. Batch gradient descent, B=|x|
  2. Online stochastic gradient descent: B=1
  3. Mini-batch stochastic gradient descent: B>1 but B<|x|.

Note that with 1, the loss function is no longer a random variable and is not a stochastic approximation.

SGD converges faster than normal "batch" gradient descent because it updates the weights after looking at a randomly selected subset of the training set. Let x be our training set and let m⊂x. The batch size B is just the cardinality of m: B=|m|.

Batch gradient descent updates the weights θ using the gradients of the entire dataset x; whereas SGD updates the weights using an average of the gradients for a mini-batch m. (Using the average as opposed to a sum prevents the algorithm from taking steps that are too large if the dataset is very large. Otherwise, you would need to adjust your learning rate based on the size of the dataset.) The expected value of this stochastic approximation of the gradient used in SGD is equal to the deterministic gradient used in batch gradient descent. E[∇LSGD(θ,m)]=∇L(θ,x).

Each time we take a sample and update our weights it is called a mini-batch. Each time we run through the entire dataset, it's called an epoch.

Let's say that we have some data vector x:RD, an initial weight vector that parameterizes our neural network, θ0:RSθ0:RS, and a loss function L(θ,x):RS→RD→RS that we are trying to minimize. If we have TT training examples and a batch size of B, then we can split those training examples into C mini-batches:

C=⌈T/B⌉ 

For simplicity we can assume that T is evenly divisible by B. Although, when this is not the case, as it often is not, proper weight should be assigned to each mini-batch as a function of its size.

An iterative algorithm for SGD with MM epochs is given below:

Note: in real life we're reading these training example data from memory and, due to cache pre-fetching and other memory tricks done by your computer, your algorithm will run faster if the memory accesses are coalesced, i.e. when you read the memory in order and don't jump around randomly. So, most SGD implementations shuffle the dataset and then load the examples into memory in the order that they'll be read.

The major parameters for the vanilla (no momentum) SGD described above are:

  1. Learning Rate: ϵ

I like to think of epsilon as a function from the epoch count to a learning rate. This function is called the learning rate schedule.

ϵ(t):N→R 

If you want to have the learning rate fixed, just define epsilon as a constant function.

  1. Batch Size

Batch size determines how many examples you look at before making a weight update. The lower it is, the noisier the training signal is going to be, the higher it is, the longer it will take to compute the gradient for each step.

Citations & Further Reading:

  1. Introduction to Gradient Based Learning
  2. Practical recommendations for gradient-based training of deep architectures
  3. Efficient Mini-batch Training for Stochastic Optimization
 
 from here: when using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize. When you put m examples in a minibatch, you need to do O(m) computation and use O(m) memory, but you reduce the amount of uncertainty in the gradient by a factor of only O(sqrt(m)).
 
 

DTU DeepLearning: exercise 6的更多相关文章

  1. DTU DeepLearning: exercise 7

    torch activation functions: sigmoid, relu, tanh, softplus. https://morvanzhou.github.io/tutorials/ma ...

  2. 【DeepLearning】Exercise:Softmax Regression

    Exercise:Softmax Regression 习题的链接:Exercise:Softmax Regression softmaxCost.m function [cost, grad] = ...

  3. 【DeepLearning】Exercise:Convolution and Pooling

    Exercise:Convolution and Pooling 习题链接:Exercise:Convolution and Pooling cnnExercise.m %% CS294A/CS294 ...

  4. 【DeepLearning】Exercise:Learning color features with Sparse Autoencoders

    Exercise:Learning color features with Sparse Autoencoders 习题链接:Exercise:Learning color features with ...

  5. 【DeepLearning】Exercise: Implement deep networks for digit classification

    Exercise: Implement deep networks for digit classification 习题链接:Exercise: Implement deep networks fo ...

  6. 【DeepLearning】Exercise:Self-Taught Learning

    Exercise:Self-Taught Learning 习题链接:Exercise:Self-Taught Learning feedForwardAutoencoder.m function [ ...

  7. 【DeepLearning】Exercise:PCA and Whitening

    Exercise:PCA and Whitening 习题链接:Exercise:PCA and Whitening pca_gen.m %%============================= ...

  8. 【DeepLearning】Exercise:PCA in 2D

    Exercise:PCA in 2D 习题的链接:Exercise:PCA in 2D pca_2d.m close all %%=================================== ...

  9. 【DeepLearning】Exercise:Vectorization

    Exercise:Vectorization 习题的链接:Exercise:Vectorization 注意点: MNIST图片的像素点已经经过归一化. 如果再使用Exercise:Sparse Au ...

随机推荐

  1. webstrom 2019 注册码(可用 2019年10月14日08:59:18)

    K6IXATEF43-eyJsaWNlbnNlSWQiOiJLNklYQVRFRjQzIiwibGljZW5zZWVOYW1lIjoi5o6I5p2D5Luj55CG5ZWGOiBodHRwOi8va ...

  2. jQuery---委托事件原理

    jQuery事件发展历程 事件发展历程:从简单事件,到bind,到委托事件,到on事件绑定 //简单事件,给自己注册的事件 $("div").click(function () { ...

  3. JS 数组常见操作汇总,数组去重、降维、排序、多数组合并实现思路整理

    壹 ❀ 引 JavaScript开发中数组加工极为常见,其次在面试中被问及的概率也特别高,一直想整理一篇关于数组常见操作的文章,本文也算了却心愿了. 说在前面,文中的实现并非最佳,实现虽然有很多种,但 ...

  4. 【spring boot】SpringBoot初学(2.1) - properties读取明细

    前言 算是对<SpringBoot初学(2) - properties配置和读取>的总结吧. 概念性总结 一.Spring Boot允许外化(externalize)你的配置.可以使用pr ...

  5. JMeter-显示调试日志log

    JMeter-调试日志记录 参考文档:https://jmeter.apache.org/usermanual/hints_and_tips.html 大多数测试元素包括调试日志记录. 如果从GUI运 ...

  6. CTF长久练习平台

    0x01 XCTF(攻防世界) 攻防世界是ctf爱好者很喜欢的一个平台,不仅是界面风格像大型游戏闯关,里面的各类题目涵盖的ctf题型很广,还分为新手区和进阶区两块: 并且可以在里面组队,做一道题还有相 ...

  7. WPF实现高仿统计标题卡

    飘哇~~~,在家数瓜子仁儿,闲来无事,看东看西,也找点儿,最近正在看看WPF动画,光看也是不行,需要带着目的去学习,整合知识碎片,恰巧,看到Github中一个基于Ant Designer设计风格的后台 ...

  8. stylelint

    "number-leading-zero": "never", // 去掉小数点前面的0 "prettier.stylelintIntegration ...

  9. unicode 地址

    unicode  地址

  10. 541-反转字符串 II

    541-反转字符串 II 给定一个字符串和一个整数 k,你需要对从字符串开头算起的每个 2k 个字符的前k个字符进行反转.如果剩余少于 k 个字符,则将剩余的所有全部反转.如果有小于 2k 但大于或等 ...