DTU DeepLearning: exercise 6
- Batch Gradient Descent. Batch Size = Size of Training Set
- Stochastic Gradient Descent. Batch Size = 1
- Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set
From here: the "sample size" you're talking about is referred to as the batch size, B. The batch size is just one of the hyper-parameters you'll be tuning when you train a neural network with mini-batch Stochastic Gradient Descent (SGD), and it is data dependent. The most basic method of hyper-parameter search is a grid search over the learning rate and batch size to find a pair that makes the network converge.
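A minimal sketch of such a grid search (here `train_and_evaluate` is a hypothetical placeholder for a full training run that returns a validation loss):

```python
import random

def train_and_evaluate(learning_rate, batch_size):
    # Placeholder: in practice this would train the network with
    # mini-batch SGD using the given hyper-parameters and return
    # the resulting validation loss.
    return random.random()

learning_rates = [1e-1, 1e-2, 1e-3]
batch_sizes = [16, 64, 256]

best = None
for lr in learning_rates:
    for bs in batch_sizes:
        loss = train_and_evaluate(lr, bs)
        if best is None or loss < best[0]:
            best = (loss, lr, bs)

print("best (val loss, learning rate, batch size):", best)
```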
To understand what the batch size should be, it's important to see the relationship between batch gradient descent, online SGD, and mini-batch SGD. Here's the general formula for the weight update step in mini-batch SGD, which is a generalization of all three types [2]:

θ_{t+1} = θ_t − ε(t) · (1/B) · Σ_{b=1}^{B} ∇L(θ_t, m_b)

where ε(t) is the learning rate, m is a mini-batch of B examples (defined below), and L is the loss. The three variants correspond to different choices of B:
- Batch gradient descent: B = |x|
- Online stochastic gradient descent: B = 1
- Mini-batch stochastic gradient descent: 1 < B < |x|
Note that in the first case (batch gradient descent), the loss function is no longer a random variable and is not a stochastic approximation.
SGD converges faster than normal "batch" gradient descent because it updates the weights after looking at only a randomly selected subset of the training set. Let x be our training set and let m ⊂ x be a mini-batch. The batch size B is just the cardinality of m: B = |m|.
Batch gradient descent updates the weights θ using the gradients of the entire dataset x, whereas SGD updates the weights using an average of the gradients for a mini-batch m. (Using the average as opposed to a sum prevents the algorithm from taking steps that are too large if the dataset is very large. Otherwise, you would need to adjust your learning rate based on the size of the dataset.) The expected value of this stochastic approximation of the gradient used in SGD is equal to the deterministic gradient used in batch gradient descent: E[∇L_SGD(θ, m)] = ∇L(θ, x).
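A quick numerical check of this unbiasedness property (a sketch using NumPy with a toy squared-error loss, not code from the exercise):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1000 scalar examples, with a squared-error "loss"
# L(theta, x_i) = (theta - x_i)^2, so dL/dtheta = 2 * (theta - x_i).
x = rng.normal(size=1000)
theta = 0.5

def grad(theta, batch):
    # Average gradient over a (mini-)batch.
    return np.mean(2.0 * (theta - batch))

full_gradient = grad(theta, x)  # deterministic: uses all of x

# Average the mini-batch gradient over many random mini-batches;
# it should approach the full-batch gradient.
B = 32
estimates = [grad(theta, rng.choice(x, size=B, replace=False))
             for _ in range(10000)]
print(full_gradient, np.mean(estimates))  # two nearly equal numbers
```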
Each sample we take to update the weights is called a mini-batch. Each full pass through the entire dataset is called an epoch.
Let's say that we have some data vector x ∈ R^D, an initial weight vector that parameterizes our neural network, θ_0 ∈ R^S, and a loss function L(θ, x): R^S × R^D → R that we are trying to minimize. If we have T training examples and a batch size of B, then we can split those training examples into C mini-batches:

C = ⌈T/B⌉
For simplicity we can assume that T is evenly divisible by B. When this is not the case, as it often is not, proper weight should be assigned to each mini-batch as a function of its size.
An iterative algorithm for SGD with M epochs is given below:
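A minimal Python sketch of that loop (assuming a gradient function grad(θ, batch) that returns the average gradient over a mini-batch, and a learning-rate schedule epsilon as discussed below; the names are illustrative, not from the original exercise):

```python
import numpy as np

def sgd(theta, x, grad, epsilon, B, M, seed=0):
    # theta:   initial weights, shape (S,)
    # x:       training examples, shape (T, D)
    # grad:    (theta, batch) -> average gradient over the batch, shape (S,)
    # epsilon: learning rate schedule, epoch index -> learning rate
    # B, M:    batch size and number of epochs
    rng = np.random.default_rng(seed)
    T = len(x)
    for epoch in range(M):
        # Shuffle once per epoch, then read the examples in order
        # (see the note on memory access below).
        perm = rng.permutation(T)
        for c in range(T // B):  # C = T/B mini-batches per epoch
            batch = x[perm[c * B:(c + 1) * B]]
            theta = theta - epsilon(epoch) * grad(theta, batch)
    return theta
```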
Note: in real life we read these training examples from memory and, due to cache pre-fetching and other memory tricks done by your computer, your algorithm will run faster if the memory accesses are coalesced, i.e. when you read the memory in order and don't jump around randomly. So most SGD implementations shuffle the dataset and then load the examples into memory in the order that they'll be read.
The major parameters for the vanilla (no momentum) SGD described above are:
- Learning Rate: ϵ
I like to think of epsilon as a function from the epoch count to a learning rate. This function is called the learning rate schedule.
If you want to have the learning rate fixed, just define epsilon as a constant function.
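For instance, a schedule that halves the rate every ten epochs, and a constant schedule, might look like this (a sketch; the decay factor and step size are arbitrary choices):

```python
def step_decay(epoch, initial_rate=0.1, drop=0.5, epochs_per_drop=10):
    # Halve the learning rate every epochs_per_drop epochs.
    return initial_rate * drop ** (epoch // epochs_per_drop)

def constant(epoch, rate=0.01):
    # A fixed learning rate is just a constant function of the epoch.
    return rate
```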
- Batch Size
Batch size determines how many examples you look at before making a weight update. The lower it is, the noisier the training signal will be; the higher it is, the longer it will take to compute the gradient for each step.
Citations & Further Reading:
JDK下载安装与环境变量配置图文教程[超详细] 创建时间:2019年11月13日11时02分 文章目录 1. JDK介绍 1.1 什么是JDK? 1.2 JDK版本介绍 2. JDK下载与安装 3.w ...