DTU DeepLearning: exercise 6
- Batch Gradient Descent. Batch Size = Size of Training Set
- Stochastic Gradient Descent. Batch Size = 1
- Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set
The "sample size" you're talking about is referred to as the batch size, B. The batch size is just one of the hyper-parameters you'll be tuning when you train a neural network with mini-batch Stochastic Gradient Descent (SGD), and a good value depends on the data. The most basic method of hyper-parameter search is a grid search over the learning rate and batch size to find a pair that makes the network converge.
To understand what the batch size should be, it's important to see the relationship between batch gradient descent, online SGD, and mini-batch SGD. Here's the general formula for the weight update step in mini-batch SGD, which is a generalization of all three types. [2]
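Written with the symbols used in this text (weights θ, learning rate ϵ, mini-batch m of size B), the update is commonly stated as below. The step index t and the element index b are notational assumptions made here for illustration.

```latex
\theta_{t+1} \leftarrow \theta_t - \epsilon(t)\,\frac{1}{B}\sum_{b=0}^{B-1} \nabla_{\theta} L(\theta_t, m_b)
```

Here m_b denotes the b-th example of the mini-batch m, so the sum divided by B is the average gradient over the mini-batch. The three cases differ only in the value of B: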
- Batch gradient descent, B=|x|
- Online stochastic gradient descent: B=1
- Mini-batch stochastic gradient descent: B>1 but B<|x|.
Note that in the first case (batch gradient descent), the gradient is computed over the entire training set, so the loss function is no longer a random variable and is not a stochastic approximation.
SGD usually converges faster than normal "batch" gradient descent because it updates the weights after looking at only a randomly selected subset of the training set, rather than waiting for a pass over the whole set. Let x be our training set and let m ⊂ x. The batch size B is just the cardinality of m: B = |m|.
Batch gradient descent updates the weights θ using the gradients of the entire dataset x, whereas SGD updates the weights using an average of the gradients for a mini-batch m. (Using the average as opposed to a sum prevents the algorithm from taking steps that are too large if the dataset is very large. Otherwise, you would need to adjust your learning rate based on the size of the dataset.) The expected value of this stochastic approximation of the gradient used in SGD is equal to the deterministic gradient used in batch gradient descent: E[∇L_SGD(θ, m)] = ∇L(θ, x).
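The equality of the expected mini-batch gradient and the full gradient can be checked numerically. The sketch below is only an illustration: the scalar weight, the toy squared-error loss, and all names are assumptions, not part of the original text. It splits a shuffled dataset into equal-sized mini-batches and compares the mean of their gradients (each mini-batch being equally likely to be drawn) with the gradient over the whole set.

```python
import random

def mean_gradient(theta, examples):
    """Average gradient of 0.5 * (theta * x - y)**2 with respect to a scalar theta."""
    return sum((theta * x - y) * x for x, y in examples) / len(examples)

random.seed(0)
T, B = 12, 4                      # training-set size and batch size (T divisible by B)
data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(T)]
theta = 0.5

# Deterministic gradient used by batch gradient descent.
full_grad = mean_gradient(theta, data)

# Shuffle once, then cut the data into T // B equal-sized mini-batches.
shuffled = random.sample(data, k=T)
batches = [shuffled[i:i + B] for i in range(0, T, B)]
batch_grads = [mean_gradient(theta, m) for m in batches]

# Individual mini-batch gradients are noisy, but their mean matches the full gradient.
mean_of_batch_grads = sum(batch_grads) / len(batch_grads)
print(batch_grads, mean_of_batch_grads, full_grad)
```

The individual entries of `batch_grads` generally differ from `full_grad`; only their average agrees with it, which is the sense in which the mini-batch gradient is an unbiased estimate.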
Each sample that we draw and use for one weight update is called a mini-batch. Each full pass through the entire dataset is called an epoch.
Let's say that we have some data vector x : ℝ^D, an initial weight vector that parameterizes our neural network, θ_0 : ℝ^S, and a loss function L(θ, x) : ℝ^S → ℝ^D → ℝ that we are trying to minimize. If we have T training examples and a batch size of B, then we can split those training examples into C = ⌈T/B⌉ mini-batches.
For simplicity, we can assume that T is evenly divisible by B. When this is not the case, as often happens, each mini-batch should be given a proper weight as a function of its size.
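As a small illustration of the uneven case, the sketch below splits T = 10 stand-in examples into mini-batches of size B = 4, which leaves a final mini-batch of size 2. Weighting each mini-batch by its share of the data is one reasonable reading of "proper weight"; that choice, like the helper name `split_into_minibatches`, is an assumption rather than something the text specifies.

```python
import math

def split_into_minibatches(examples, batch_size):
    """Split a list into consecutive mini-batches; the last one may be smaller."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]

examples = list(range(10))        # T = 10 stand-in training examples
B = 4
batches = split_into_minibatches(examples, B)
C = math.ceil(len(examples) / B)  # number of mini-batches

# Weight each mini-batch by its share of the data, so that the weighted
# combination of per-batch averages equals the average over the whole set.
weights = [len(m) / len(examples) for m in batches]
batch_means = [sum(m) / len(m) for m in batches]
weighted_mean = sum(w * mu for w, mu in zip(weights, batch_means))
overall_mean = sum(examples) / len(examples)
print(C, [len(m) for m in batches], weighted_mean, overall_mean)
```

With equal weights, the short final mini-batch would count as much as a full one, so its examples would be over-represented in the epoch.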
An iterative algorithm for SGD with M epochs is given below:
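A minimal Python sketch of such a loop follows. The toy loss 0.5 * (theta * x - y)**2, the scalar weight, and every name in the code (`sgd`, `learning_rate`, and so on) are assumptions made for illustration; they are not taken from the original algorithm.

```python
import random

def sgd(data, theta0, epochs, batch_size, learning_rate, seed=0):
    """Mini-batch SGD for the toy loss 0.5 * (theta * x - y)**2 with a scalar theta."""
    rng = random.Random(seed)
    theta = theta0
    examples = list(data)
    for epoch in range(epochs):                   # M passes over the dataset
        rng.shuffle(examples)                     # new random order each epoch
        for start in range(0, len(examples), batch_size):
            minibatch = examples[start:start + batch_size]
            # Average gradient over the mini-batch.
            grad = sum((theta * x - y) * x for x, y in minibatch) / len(minibatch)
            theta -= learning_rate(epoch) * grad  # one weight update per mini-batch
    return theta

# Noise-free data generated by y = 3 * x, so the minimizer is theta = 3.
data = [(k / 10.0, 3.0 * k / 10.0) for k in range(1, 21)]
theta_hat = sgd(data, theta0=0.0, epochs=50, batch_size=4,
                learning_rate=lambda epoch: 0.1)
print(theta_hat)
```

Passing the learning rate as a function of the epoch keeps the loop unchanged whether the rate is fixed or follows a schedule.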
Note: in real life we're reading the training examples from memory and, due to cache pre-fetching and other memory tricks done by your computer, your algorithm will run faster if the memory accesses are coalesced, i.e. when you read the memory in order and don't jump around randomly. So, most SGD implementations shuffle the dataset and then load the examples into memory in the order that they'll be read.
The major parameters for the vanilla (no momentum) SGD described above are:
- Learning Rate: ϵ
I like to think of epsilon as a function from the epoch count to a learning rate. This function is called the learning rate schedule.
If you want to have the learning rate fixed, just define epsilon as a constant function.
- Batch Size
Batch size determines how many examples you look at before making a weight update. The lower it is, the noisier the training signal is going to be; the higher it is, the longer it will take to compute the gradient for each step.
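The idea of the learning rate as a function of the epoch count can be sketched as follows. The constant schedule corresponds to the fixed learning rate mentioned above; the step-decay schedule is just one common example chosen here for illustration, and both function names are hypothetical.

```python
def constant_schedule(rate):
    """Return a schedule that ignores the epoch and always yields `rate`."""
    return lambda epoch: rate

def step_decay_schedule(initial_rate, decay, every):
    """Return a schedule that multiplies the rate by `decay` once every `every` epochs."""
    return lambda epoch: initial_rate * decay ** (epoch // every)

fixed = constant_schedule(0.1)
decayed = step_decay_schedule(initial_rate=0.1, decay=0.5, every=10)

# Learning rates at a few epochs: unchanged within a block of 10, halved after each block.
rates = [decayed(epoch) for epoch in (0, 9, 10, 25)]
print(fixed(0), fixed(100), rates)
```

Either schedule can be passed wherever a training loop expects a callable that maps the epoch count to a learning rate.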