Machine Learning Week_5 Cost Function and BackPropagation
As for the back propagation algorithm, the formula given by the teacher is really useful.
But the formula alone does not explain why each step is done, or what delta means. The best way to find out is to work through a small neural network by hand, using the chain rule for derivatives: compute the derivative for each θ once, then put the results together to see how the vectorized implementation arises.
There are no meanings. There are just laws of arithmetic.
0 Neural Networks: Learning
In Week 5, you will be learning how to train Neural Networks. The Neural Network is one of the most powerful learning algorithms (when a linear classifier doesn't work, this is what I usually turn to), and this week's videos explain the 'backpropagation' algorithm for training these models. In this week's programming assignment, you'll also get to implement this algorithm and see it work for yourself.
The Neural Network programming exercise will be one of the more challenging ones of this class. So please start early and do leave extra time to get it done, and I hope you'll stick with it until you get it to work! As always, if you get stuck on the quiz and programming assignment, you should post on the Discussions to ask for help. (And if you finish early, I hope you'll go there to help your fellow classmates as well.) -- by Andrew Ng
1 Cost Function and BackPropagation
1.1 Cost Function
Let's first define a few variables that we will need to use:
L = total number of layers in the network
\(s_l\) = number of units (not counting bias unit) in layer l
K = number of output units/classes
Binary classification: y = 0 or y = 1, K=1;
Multi-class classification: K>=3;
Recall that in neural networks, we may have many output nodes. We denote \(h_\Theta(x)_k\) as being a hypothesis that results in the \(k^{th}\) output. Our cost function for neural networks is going to be a generalization of the one we used for logistic regression. Recall that the cost function for regularized logistic regression was:
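\[ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 \]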
For neural networks, it is going to be slightly more complicated:
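\[ J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)}\log((h_\Theta(x^{(i)}))_k) + (1-y_k^{(i)})\log(1-(h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{j,i}^{(l)}\right)^2 \]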
We have added a few nested summations to account for our multiple output nodes. In the first part of the equation, before the square brackets, we have an additional nested summation that loops through the number of output nodes.
The lecture notes explain the regularization part differently from what the teacher says in the video, so I have made some corrections.
Teacher
In the regularization part, concretely, we don't sum over the terms corresponding to i = 0. These correspond to the bias unit and, by analogy to what we did for logistic regression, we leave them out of the regularization term because we don't want to regularize them and shrink their values toward zero. But this is just one possible convention: even if you were to sum over i from 0 up to \(s_l\), it would work about the same and wouldn't make a big difference. The convention of not regularizing the bias term is just slightly more common. This corresponds to the formula above.
Lecture
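\[ \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{j,i}^{(l)}\right)^2 \]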
In the regularization part, after the square brackets, we must account for multiple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit). As before with logistic regression, we square every term.
Note:
the double sum simply adds up the logistic regression costs calculated for each cell in the output layer
the triple sum simply adds up the squares of all the individual Θs in the entire network.
the i in the triple sum does not refer to training example i
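As a rough illustration of the double and triple sums, here is a minimal Octave sketch for a network with one hidden layer. The toy data, the layer sizes and the variable names (`dataTerm`, `regTerm`) are assumptions made for this example. All weights are set to zero only so that every output is exactly 0.5 and the result can be checked by hand; this is not how a network should be initialized for training.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Toy data: m = 2 examples, 2 features, K = 2 classes (one-hot rows of Y).
X = [0 1; 1 0];
Y = [1 0; 0 1];
lambda = 1;
m = size(X, 1);

% All-zero weights make every output exactly 0.5 (for checking only).
Theta1 = zeros(3, 3);   % 3 hidden units x (2 inputs + bias)
Theta2 = zeros(2, 4);   % 2 outputs      x (3 hidden units + bias)

% Forward propagation for all examples at once.
A1 = [ones(m, 1), X];
A2 = [ones(m, 1), sigmoid(A1 * Theta1')];
H  = sigmoid(A2 * Theta2');             % m x K, row i holds h_Theta(x^(i))

% Double sum over examples i and output units k.
dataTerm = -sum(sum(Y .* log(H) + (1 - Y) .* log(1 - H))) / m;

% Triple sum over every weight, skipping the bias column of each matrix.
regTerm = (lambda / (2 * m)) * ...
          (sum(sum(Theta1(:, 2:end) .^ 2)) + sum(sum(Theta2(:, 2:end) .^ 2)));

J = dataTerm + regTerm;
```

With these values each of the four log terms equals log(0.5) and the regularization term is zero, so J comes out as 2·log(2), about 1.386.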
1.2 Backpropagation Algorithm
"Backpropagation" is neural-network terminology for minimizing our cost function, just like what we were doing with gradient descent in logistic and linear regression. Our goal is to compute:
\[ \min_\Theta J(\Theta) \]
That is, we want to minimize our cost function J using an optimal set of parameters in theta. In this section we'll look at the equations we use to compute the partial derivative of J(Θ):
\[ \dfrac{\partial}{\partial \Theta_{i,j}^{(l)}}J(\Theta) \]
To do so, we use the following algorithm:
Back propagation Algorithm
Given training set \(\lbrace (x^{(1)}, y^{(1)}) \cdots (x^{(m)}, y^{(m)})\rbrace\)
- Set \(\Delta^{(l)}_{i,j}\) := 0 for all (l,i,j), (hence you end up having a matrix full of zeros)
For training example t =1 to m:
Set \(a^{(1)} := x^{(t)}\)
Perform forward propagation to compute \(a^{(l)}\) for l=2,3,…,L
Using \(y^{(t)}\), compute \(\delta^{(L)} = a^{(L)} - y^{(t)}\)
Where L is our total number of layers and \(a^{(L)}\) is the vector of outputs of the activation units for the last layer. So our "error values" for the last layer are simply the differences of our actual results in the last layer and the correct outputs in y. To get the delta values of the layers before the last layer, we can use an equation that steps us back from right to left:
- Compute \(\delta^{(L-1)}, \delta^{(L-2)},\dots,\delta^{(2)}\) using \(\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ a^{(l)}\ .*\ (1 - a^{(l)})\)
The delta values of layer l are calculated by multiplying the delta values in the next layer with the theta matrix of layer l. We then element-wise multiply that with a function called g', or g-prime, which is the derivative of the activation function g evaluated with the input values given by \(z^{(l)}\).
The g-prime derivative terms can also be written out as:
\[ g'(z^{(l)}) = a^{(l)}\ .*\ (1 - a^{(l)}) \]
- \(\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a_j^{(l)} \delta_i^{(l+1)}\) or with vectorization, \(\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T\)
Hence we update our new \(\Delta\) matrix.
- \(D^{(l)}_{i,j} := \dfrac{1}{m}\left(\Delta^{(l)}_{i,j} + \lambda\Theta^{(l)}_{i,j}\right)\) , if j≠0.
- \(D^{(l)}_{i,j} := \dfrac{1}{m}\Delta^{(l)}_{i,j}\) If j=0.
The capital-delta matrix \(\Delta\) is used as an "accumulator" to add up our values as we go along, and D is the averaged (and, for j≠0, regularized) result that gives our partial derivative. Thus we get \(\frac \partial {\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}\)
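The steps above can be sketched in Octave for a network with one hidden layer (L = 3). This is only an illustration under assumed values: the XOR-style data, the weights and `lambda` are made up, and the names `Delta1`, `Delta2`, `D1`, `D2` simply mirror the symbols in the text.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

X = [0 0; 0 1; 1 0; 1 1];      % m = 4 examples, 2 features
Y = [0; 1; 1; 0];              % K = 1 output unit
lambda = 0.1;
m = size(X, 1);

Theta1 = [0.1 -0.2 0.3; -0.3 0.2 0.1];   % 2 hidden units x (2 inputs + bias)
Theta2 = [0.2 -0.1 0.4];                 % 1 output x (2 hidden units + bias)

Delta1 = zeros(size(Theta1));  % accumulators, one per Theta matrix
Delta2 = zeros(size(Theta2));

for t = 1:m
  % forward propagation for example t
  a1 = [1; X(t, :)'];
  a2 = [1; sigmoid(Theta1 * a1)];
  a3 = sigmoid(Theta2 * a2);

  % output-layer error, then step back one layer
  d3 = a3 - Y(t, :)';
  d2 = (Theta2' * d3) .* a2 .* (1 - a2);
  d2 = d2(2:end);              % drop the entry that belongs to the bias unit

  % accumulate delta_i * a_j for every weight
  Delta2 = Delta2 + d3 * a2';
  Delta1 = Delta1 + d2 * a1';
end

% average, then regularize every column except the bias column (j = 0)
D1 = Delta1 / m;
D2 = Delta2 / m;
D1(:, 2:end) = D1(:, 2:end) + (lambda / m) * Theta1(:, 2:end);
D2(:, 2:end) = D2(:, 2:end) + (lambda / m) * Theta2(:, 2:end);
```

`D1` and `D2` have the same shapes as `Theta1` and `Theta2`, so each entry can be compared directly with a numerical estimate of the corresponding partial derivative.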
1.3 Backpropagation Intuition
Note: [4:39, the last term for the calculation for \(z^3_1\) (three-color handwritten formula) should be \(a^2_2\) instead of \(a^2_1\). 6:08 - the equation for cost(i) is incorrect. The first term is missing parentheses for the log() function, and the second term should be \((1-y^{(i)})\log(1-h{_\theta}{(x^{(i)}}))\). 8:50 - \(\delta^{(4)} = y - a^{(4)}\) is incorrect and should be \(\delta^{(4)} = a^{(4)} - y\).]
Recall that the cost function for a neural network is:
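\[ J(\Theta) = -\frac{1}{m}\sum_{t=1}^{m}\sum_{k=1}^{K}\left[y_k^{(t)}\log((h_\Theta(x^{(t)}))_k) + (1-y_k^{(t)})\log(1-(h_\Theta(x^{(t)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{j,i}^{(l)}\right)^2 \]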
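If we consider simple non-multiclass classification (K = 1) and disregard regularization, the cost for training example t is computed with:
\[ cost(t) = y^{(t)}\log(h_\Theta(x^{(t)})) + (1-y^{(t)})\log(1-h_\Theta(x^{(t)})) \]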
Intuitively, \(\delta_j^{(l)}\) is the "error" for \(a^{(l)}_j\) (unit j in layer l). More formally, the delta values are actually the derivative of the cost function:
\[ \delta_j^{(l)} = \dfrac{\partial}{\partial z_j^{(l)}} cost(t) \]
Recall that our derivative is the slope of a line tangent to the cost function, so the steeper the slope the more incorrect we are. Let us consider the following neural network below and see how we could calculate some \(\delta_j^{(l)}\):

In the image above, to calculate \(\delta_2^{(2)}\), we multiply the weights \(\Theta_{12}^{(2)}\) and \(\Theta_{22}^{(2)}\) by their respective \(\delta\) values found to the right of each edge. So we get \(\delta_2^{(2)}\) = \(\Theta_{12}^{(2)}\) * \(\delta_1^{(3)}\) + \(\Theta_{22}^{(2)}\) * \(\delta_2^{(3)}\). To calculate every single possible \(\delta_j^{(l)}\), we could start from the right of our diagram. We can think of our edges as our \(\Theta_{ij}\). Going from right to left, to calculate the value of \(\delta_j^{(l)}\), you can just take the overall sum of each weight times the \(\delta\) it is coming from. Hence, another example would be \(\delta_2^{(3)}\) = \(\Theta_{12}^{(3)}\) * \(\delta_1^{(4)}\).
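To make the "weight times delta" sum concrete, here is a small Octave sketch with made-up numbers; the weights in `Theta2` and the values in `delta3` are assumptions, not taken from the lecture. As in the paragraph above, the g' factor is left out, so this shows only the weighted sum.

```octave
% Rows: units 1 and 2 of layer 3. Columns: bias, unit 1, unit 2 of layer 2.
% Because column 1 is the bias, Theta_{12} is stored at Theta2(1, 3).
Theta2 = [ 0.3  0.7  0.5;
          -0.2  0.4 -1.5];
delta3 = [0.2; 0.1];                 % assumed deltas of layer 3

% Edge by edge, as read off the diagram.
byHand = Theta2(1, 3) * delta3(1) + Theta2(2, 3) * delta3(2);

% The same sum for every unit of layer 2 at once.
weighted = Theta2' * delta3;         % entries: bias, unit 1, unit 2
delta2_2 = weighted(3);
```

Both `byHand` and `delta2_2` come out as 0.5 * 0.2 + (-1.5) * 0.1 = -0.05, which is why the vectorized formula uses the transpose of the theta matrix.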
2 Backpropagation in Practice
2.1 Implementation Note: Unrolling Parameters
With neural networks, we are working with sets of matrices:
\[ \Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}, \dots \qquad D^{(1)}, D^{(2)}, D^{(3)}, \dots \]
In order to use optimizing functions such as "fminunc()", we will want to "unroll" all the elements and put them into one long vector:
thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]
deltaVector = [ D1(:); D2(:); D3(:) ]
If Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11, then we can get back our original matrices from the "unrolled" versions as follows:
Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)
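A quick way to convince yourself that unrolling and reshaping are inverse operations is a round trip. The sketch below uses stand-in values (the integers 1 to 231) so that every entry is distinct; the names `R1`, `R2`, `R3` and `roundTripOk` are made up for this check.

```octave
% Stand-in matrices with the dimensions used above.
Theta1 = reshape(1:110,   10, 11);
Theta2 = reshape(111:220, 10, 11);
Theta3 = reshape(221:231, 1, 11);

% Unroll column by column into one long column vector.
thetaVector = [Theta1(:); Theta2(:); Theta3(:)];   % 231 x 1

% Recover the matrices from the unrolled vector.
R1 = reshape(thetaVector(1:110),   10, 11);
R2 = reshape(thetaVector(111:220), 10, 11);
R3 = reshape(thetaVector(221:231), 1, 11);

roundTripOk = isequal(R1, Theta1) && isequal(R2, Theta2) && isequal(R3, Theta3);
```

Both `(:)` and `reshape` work in column-major order, which is why the round trip gives back exactly the original matrices.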
To summarize:

2.2 Gradient Checking
Gradient checking will assure that our backpropagation works as intended. We can approximate the derivative of our cost function with:
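\[ \dfrac{\partial}{\partial\Theta}J(\Theta) \approx \dfrac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon} \]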
With multiple theta matrices, we can approximate the derivative with respect to \(Θ_j\) as follows:
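\[ \dfrac{\partial}{\partial\Theta_j}J(\Theta) \approx \dfrac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon} \]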
A small value for \({\epsilon}\) (epsilon) such as \({\epsilon = 10^{-4}}\) usually makes the approximation accurate enough. If the value for ϵ is too small, we can end up with numerical problems.
Hence, we are only adding or subtracting epsilon to the single parameter \(\Theta_j\), leaving all the others unchanged. In Octave we can do it as follows:

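epsilon = 1e-4;
n = numel(theta);                 % number of unrolled parameters
gradApprox = zeros(n, 1);         % preallocate the result
for i = 1:n
  % perturb only the i-th parameter, up and down by epsilon
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  % two-sided difference approximates the i-th partial derivative
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*epsilon);
end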
We previously saw how to calculate the deltaVector. So once we compute our gradApprox vector, we can check that gradApprox ≈ deltaVector.
Once you have verified that your backpropagation algorithm is correct, you don't need to compute gradApprox again, because the code that computes gradApprox can be very slow.
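The check can be tried on a function whose gradient is known exactly. In the sketch below the cost `J` is a stand-in (the sum of cubes) rather than a neural network cost, `analyticGrad` plays the role of deltaVector, and the relative-difference measure `relDiff` is one common way to compare the two vectors; all of these are assumptions made for the example.

```octave
J = @(theta) sum(theta .^ 3);            % stand-in cost function
analyticGrad = @(theta) 3 * theta .^ 2;  % its exact gradient

theta = [1; -2; 0.5];
epsilon = 1e-4;
n = numel(theta);
gradApprox = zeros(n, 1);
for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + epsilon;
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end

% A small relative difference means the two gradients agree.
grad = analyticGrad(theta);
relDiff = norm(gradApprox - grad) / norm(gradApprox + grad);
```

For this cubic the two-sided difference is off by exactly epsilon squared in each component, so `relDiff` is tiny; a value many orders of magnitude larger would point to a bug in the gradient code.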
2.3 Random Initialization
When running gradient descent, or one of the advanced optimization algorithms, we need to pick some initial value for the parameters theta. The advanced optimization algorithms assume you will pass in an initial value for theta:
optTheta = fminunc(@costFunction, initialTheta, options)

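Is it possible to set the initial value of theta to the vector of all zeros? Whereas this worked okay when we were using logistic regression, initializing all of your parameters to zero does not work when you are training a neural network.
A neural network composes many functions to form a nonlinear function that makes predictions. Each node in a hidden layer corresponds to a different function, that is, to a different set of parameters. For example, a hidden layer with 100 nodes computes 100 different functions.
Once all the initial parameters are identical, every hidden node computes the same single function. The whole network collapses from 100 nodes into one node: in both forward and backward propagation there is only one function, and every update is the same.
Also note what the teacher said: gradient descent on a neural network often reaches a local optimum rather than the global optimum, which means the cost function is not a convex function. This is worth pondering. The cost function of logistic regression and that of a neural network are basically the same, but the neural network's cost function nests multiple logistic regression functions. The logistic regression cost is a convex function; the neural network cost is not.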
Initializing all theta weights to zero does not work with neural networks. When we backpropagate, all nodes will update to the same value repeatedly. Instead we can randomly initialize our weights for our Θ matrices using the following method:

Hence, we initialize each \(\Theta^{(l)}_{ij}\) to a random value in \([-\epsilon,\epsilon]\). Scaling a uniform random number from [0, 1] by 2ε and then subtracting ε guarantees that we get the desired bound. The same procedure applies to all the \(\Theta\)'s. Below is some working code you could use to experiment.
Suppose Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11:
Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
rand(x,y) is just a function in Octave that returns an x-by-y matrix of random real numbers between 0 and 1.
(Note: the epsilon used above is unrelated to the epsilon from Gradient Checking)
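The symmetry problem can be seen directly in a small experiment. In this sketch, which uses a made-up training example and an assumed value of 0.12 for `INIT_EPSILON`, every weight starts at the same value, and the gradient rows for the three hidden units come out identical, so the units can never become different from each other.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));
x = [0.3; -0.7];                         % one made-up training example
y = 1;

Theta1 = 0.5 * ones(3, 3);               % 3 hidden units, all weights identical
Theta2 = 0.5 * ones(1, 4);

% One forward and backward pass.
a1 = [1; x];
a2 = [1; sigmoid(Theta1 * a1)];
a3 = sigmoid(Theta2 * a2);
d3 = a3 - y;
d2 = (Theta2' * d3) .* a2 .* (1 - a2);
d2 = d2(2:end);
grad1 = d2 * a1';                        % gradient for Theta1, one row per hidden unit

% Every hidden unit receives the same gradient row.
rowsIdentical = all(all(abs(grad1 - repmat(grad1(1, :), 3, 1)) < 1e-12));

% Random initialization in [-INIT_EPSILON, INIT_EPSILON] breaks the tie.
INIT_EPSILON = 0.12;                     % assumed value, not from the text
Theta1 = rand(3, 3) * (2 * INIT_EPSILON) - INIT_EPSILON;
inRange = all(abs(Theta1(:)) <= INIT_EPSILON);
```

Since identical rows receive identical updates, the hidden units stay copies of each other after every gradient step, which is the collapse described above.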
2.4 Putting it Together
First, pick a network architecture; choose the layout of your neural network, including how many hidden units in each layer and how many layers in total you want to have.
Number of input units = dimension of features \(x^{(i)}\)
Number of output units = number of classes
Number of hidden units per layer = usually the more the better (must balance with the cost of computation, which increases with more hidden units)
Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.
Training a Neural Network
Randomly initialize the weights
Implement forward propagation to get \(h_\Theta(x^{(i)})\) for any \(x^{(i)}\)
Implement the cost function
Implement backpropagation to compute partial derivatives
Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.
When we perform forward and back propagation, we loop on every training example:
Training a Neural Network
for i = 1:m,
   Perform forward propagation and backpropagation using example (x(i),y(i))
   (Get activations a(l) and delta terms d(l) for l = 2,...,L)
end
The following image gives us an intuition of what is happening as we are implementing our neural network:

And by the way, for neural networks, this cost function J(Θ) is non-convex, so it can theoretically be susceptible to local minima. In fact, algorithms like gradient descent and the advanced optimization methods can, in theory, get stuck in local optima. But it turns out that in practice this is not usually a huge problem: even though we can't guarantee that these algorithms will find a global optimum, algorithms like gradient descent will usually do a very good job of minimizing J(Θ) and reach a very good local minimum, even if they don't get to the global optimum. Finally, gradient descent for a neural network might still seem a little bit magical. So, let me just show one more figure to try to give some intuition about what gradient descent for a neural network is doing.
This was actually similar to the figure that I was using earlier to explain gradient descent. So, we have some cost function, and we have a number of parameters in our neural network. Right here I've just written down two of the parameter values.
So what gradient descent does is we'll start from some random initial point like that one over there, and it will repeatedly go downhill.
And so what back propagation is doing is computing the direction of the gradient, and what gradient descent is doing is it's taking little steps downhill until hopefully it gets to, in this case, a pretty good local optimum.
So, when you implement back propagation and use gradient descent or one of the advanced optimization methods, this picture sort of explains what the algorithm is doing. It's trying to find a value of the parameters where the output values in the neural network closely matches the values of the y(i)'s observed in your training set. So, hopefully this gives you a better sense of how the many different pieces of neural network learning fit together.
Ideally, you want \(h_\Theta(x^{(i)}) \approx y^{(i)}\). This will minimize our cost function. However, keep in mind that \(J(\Theta)\) is not convex and thus we can end up in a local minimum instead.
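The training steps listed in this section can be sketched end to end on a toy problem. Everything specific here is an assumption made for illustration: the task (logical OR), the two hidden units, `INIT_EPSILON = 0.12`, the learning rate `alpha`, the iteration count and the fixed random seed. Regularization and gradient checking are left out to keep the sketch short, and plain gradient descent is used in place of fminunc.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));
rand('state', 1);                         % fixed seed so the run is repeatable

X = [0 0; 0 1; 1 0; 1 1];                 % toy task: logical OR
Y = [0; 1; 1; 1];
m = size(X, 1);

% Randomly initialize the weights.
INIT_EPSILON = 0.12;
Theta1 = rand(2, 3) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(1, 3) * (2 * INIT_EPSILON) - INIT_EPSILON;
alpha = 1.0;
numIters = 500;
costHistory = zeros(numIters, 1);

for iter = 1:numIters
  % Forward propagation for all examples at once.
  A1 = [ones(m, 1), X];
  A2 = [ones(m, 1), sigmoid(A1 * Theta1')];
  H  = sigmoid(A2 * Theta2');

  % Unregularized cost.
  costHistory(iter) = -sum(Y .* log(H) + (1 - Y) .* log(1 - H)) / m;

  % Backpropagation, vectorized over the m examples.
  D3 = H - Y;                               % m x 1
  D2 = (D3 * Theta2) .* A2 .* (1 - A2);     % m x 3
  D2 = D2(:, 2:end);                        % drop the bias column
  grad2 = (D3' * A2) / m;                   % same shape as Theta2
  grad1 = (D2' * A1) / m;                   % same shape as Theta1

  % Gradient descent update.
  Theta1 = Theta1 - alpha * grad1;
  Theta2 = Theta2 - alpha * grad2;
end
```

Plotting `costHistory` shows the downhill steps described above; where the run ends up depends on the random starting point, which is the practical meaning of J(Θ) being non-convex.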
3 Autonomous Driving
In this video, I'd like to show you a fun and historically important example of neural network learning: using a neural network for autonomous driving, that is, getting a car to learn to drive itself.
The video that I'll show in a minute is something that I got from Dean Pomerleau, a colleague who works at Carnegie Mellon University on the east coast of the United States. In part of the video you'll see visualizations like this, and I want to explain what the visualization shows before starting the video.

Down here on the lower left is the view seen by the car of what's in front of it. And so here you kinda see a road that's maybe going a bit to the left, and then going a little bit to the right.
And up here on top, this first horizontal bar shows the direction selected by the human driver. The location of this bright white band shows the steering direction selected by the human driver, where far to the left corresponds to steering hard left and far to the right corresponds to steering hard right. So this location, a little bit left of center, means that the human driver at this point was steering slightly to the left. The second bar here corresponds to the steering direction selected by the learning algorithm, and again the location of the white band means that the neural network was selecting a steering direction that's slightly to the left.
In fact, before the neural network starts learning, you see that the network outputs a uniform grey band throughout this region. This uniform grey fuzz corresponds to the neural network having been randomly initialized and initially having no idea what direction to steer in. It is only after it has learned for a while that it starts to output a solid white band in just a small part of the region, corresponding to choosing a particular steering direction. That corresponds to the neural network becoming more confident in selecting a band in one particular location: rather than outputting a light grey fuzz, it outputs a white band that more consistently selects one steering direction.
Neural Network-Based Autonomous Driving. 1992 11 23
ALVINN is a system of artificial neural networks that learns to steer by watching a person drive. ALVINN is designed to control the NAVLAB 2, a modified Army Humvee equipped with sensors, computers, and actuators for autonomous navigation experiments.
The initial step in configuring ALVINN is creating a network. During training, a person drives the vehicle while ALVINN watches. Once every two seconds, ALVINN digitizes a video image of the road ahead, and records the person's steering direction.
This training image is reduced in resolution to 30 by 32 pixels and provided as input to ALVINN's three-layer network. Using the backpropagation learning algorithm, ALVINN is trained to output the same steering direction as the human driver for that image.
Initially the network steering response is random. After about two minutes of training the network learns to accurately imitate the steering reactions of the human driver. This same training procedure is repeated for other road types. After the networks have been trained the operator pushes the run switch and ALVINN begins driving.
Twelve times per second, ALVINN digitizes the image and feeds it to its neural networks. Each network, running in parallel, produces a steering direction and a measure of its confidence in its response.
The steering direction from the most confident network, in this case the network trained for the one-lane road, is used to control the vehicle.

Suddenly an intersection appears ahead of the vehicle. As the vehicle approaches the intersection, the confidence of the one-lane network decreases. As it crosses the intersection and the two-lane road ahead comes into view, the confidence of the two-lane network rises.
When its confidence rises, the two-lane network is selected to steer, safely guiding the vehicle into its lane on the two-lane road.
So that was autonomous driving using a neural network. Of course there are more recent, more modern attempts at autonomous driving. There are a few projects in the US, Europe and elsewhere that are giving more robust driving controllers than this, but I think it's still pretty remarkable and pretty amazing how a simple neural network trained with backpropagation can actually learn to drive a car somewhat well.