Introduction to debugging neural networks
http://russellsstewart.com/notes/0.html

The following advice is targeted at beginners to neural networks, and is based
on my experience giving advice to neural net newcomers in industry and at
Stanford. Neural nets are fundamentally harder to debug than most programs,
because most neural net bugs don't result in type errors or runtime errors.
They just cause poor convergence. Especially when you're new, this can be very
frustrating! But an experienced neural net trainer will be able to
systematically overcome the difficulty in spite of the ubiquitous and
seemingly ambiguous error message: "Performance Error: your neural net did not train well." To the uninitiated, the message is daunting. But to the experienced, this is a
great error. It means the boilerplate coding is out of the way, and it's time
to dig in!

How to deal with NaNs

By far the most common first question I get from students is, "Why am I
getting NaNs?" Occasionally, this has a complicated answer. But most often,
the NaNs come in the first 100 iterations, and the answer is simple: your
learning rate is too high. When the learning rate is too high, the NaNs show
up almost immediately. Try reducing the learning rate by a factor of 3 until
you no longer get NaNs in the first 100 iterations. As soon as this works,
you'll have a pretty good learning rate to get started with. In my
experience, the best heavily validated learning rates are 1-10x below the
range where you get NaNs.
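
As a concrete illustration, a minimal sketch of that search loop follows.
The build_model and train_steps helpers here are hypothetical stand-ins for
whatever model construction and training code you already have; the
divide-by-3 retry logic is the only point being made.

    import math

    def find_workable_lr(build_model, train_steps, initial_lr=1.0, probe_iters=100):
        """Divide the learning rate by 3 until the first probe_iters iterations
        finish without producing a NaN loss."""
        lr = initial_lr
        while True:
            model = build_model()                         # fresh weights for every probe
            losses = train_steps(model, lr, probe_iters)  # per-iteration training losses
            if not any(math.isnan(loss) for loss in losses):
                return lr                                 # a usable starting learning rate
            lr /= 3.0                                     # too hot: back off and retry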
If you are getting NaNs beyond the first 100 iterations, there are two
further common causes.

1) If you are using RNNs, make sure that you are using "gradient clipping",
which caps the global L2 norm of the gradients. Early in training, RNNs tend
to produce occasional learning spikes, where on 10% or fewer of the batches
the gradient magnitude is very high. Without clipping, these spikes can cause
NaNs.

2) If you have written any custom layers yourself, there is a good chance one
of them is causing the problem, often through a division by zero. Another
notoriously NaN-producing layer is the softmax layer. The softmax computation
involves an exp(x) term in both the numerator and denominator, which can
divide Inf by Inf and produce NaNs. Make sure you are using a stabilized
softmax implementation.
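
For reference, here is a minimal numpy sketch of both fixes, kept
framework-agnostic: a softmax that subtracts the row-wise max before
exponentiating so that exp() never overflows to Inf, and a clip on the global
L2 norm of the gradients. The max_norm value of 5.0 is only an illustrative
choice; most frameworks ship built-in equivalents of both, and those should
be preferred when available.

    import numpy as np

    def stable_softmax(logits):
        """Softmax with the row-wise max subtracted first, so exp() cannot overflow."""
        shifted = logits - np.max(logits, axis=-1, keepdims=True)
        exps = np.exp(shifted)
        return exps / np.sum(exps, axis=-1, keepdims=True)

    def clip_global_norm(grads, max_norm=5.0):
        """Rescale all gradients together if their global L2 norm exceeds max_norm."""
        global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        if global_norm > max_norm:
            scale = max_norm / (global_norm + 1e-6)
            grads = [g * scale for g in grads]
        return grads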

What to do when your neural net isn't learning anything

Once you stop getting NaNs, you are often rewarded with a neural net that runs
smoothly for many thousands of iterations, but never reduces the training loss
after the initial fidgeting of the first few hundred iterations. When you're
first constructing your code base, waiting for more than 2000 iterations is
rarely the answer. This is not because all networks can start learning in
under 2000 iterations. Rather, the chance you've introduced a bug when coding
up a network from scratch is so high that you'll want to go into a special
early debugging mode before waiting on high iteration counts. The name of the
game here is to reduce the scope of the problem over and over again until you
have a network that trains in less than 2000 iterations. Fortunately, there
are always two good dimensions along which to reduce complexity.

1) Reduce the size of the training set to 10 instances. Working neural nets
can usually overfit to 10 instances within just a few hundred iterations, and
many coding bugs will prevent this from happening. If your network is not
able to overfit to 10 instances of the training set, make sure your data and
labels are hooked up correctly. Try reducing the batch size to 1 to check for
batch computation errors. Add print statements throughout the code to make
sure things look the way you expect. Usually, you'll be able to find these
bugs through sheer brute force. Once you can train on 10 instances, try
training on 100. If this works okay, but not great, you're ready for the next
step.
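
As a concrete illustration of step 1, this sanity check is just a training
loop over one fixed batch of about 10 examples that should drive the loss
close to zero. The sketch below is PyTorch-flavored, but any framework works
the same way; model, loss_fn, tiny_x, and tiny_y are placeholders for your
own network, loss, and 10-instance slice of the data.

    import torch

    def can_overfit(model, loss_fn, tiny_x, tiny_y, lr=1e-3, iters=500):
        """Train on one fixed ~10-example batch; a healthy net drives the loss near zero."""
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for i in range(iters):
            opt.zero_grad()
            loss = loss_fn(model(tiny_x), tiny_y)
            loss.backward()
            opt.step()
            if i % 100 == 0:
                print(f"iter {i}: loss {loss.item():.4f}")  # should fall steadily
        return loss.item() < 1e-2                            # did we memorize the batch?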
2) Solve the simplest version of the problem that you're interested in. If
you're translating sentences, try to build a language model for the target
language first. Once that works, try to predict the first word of the
translation given only the first 3 words of the source. If you're trying to
detect objects in images, try classifying the number of objects in each image
before training a regression network. There is a trade-off between getting
a good sub-problem you're sure the network can solve, and spending the
least amount of time plumbing the code to hook up the appropriate data.
Creativity will help here.

The trick to scaling up a neural net for a new idea is to slowly relax the
simplifications made in the above two steps. This is a form of coordinate
ascent, and it works great. First, you show that the neural net can at least
memorize a few examples. Then you show that it's able to really generalize to
the validation set on a dumbed down version of the problem. You slowly up the
difficulty while making steady progress. It's not as fun as hotshotting it
the first time Karpathy style, but at least it works. At some point, you'll
find the problem is difficult enough that it can no longer be learned in 2000
iterations. That's great! But it should rarely take more than 10 times the
iterations of the previous complexity level of the problem. If you're finding
that to be the case, try to search for an intermediate level of complexity.

Tuning hyperparameters

Now that your network is learning things, you're probably in pretty good
shape. But you may find that your network is just not capable of solving the
most difficult versions of your problem. Hyperparameter tuning will be key
here. Some people who just downloaded a CNN package and ran it on their dataset
will tell you hyperparameter tuning didn't make a difference. Realize that
they're solving an existing problem with an existing architecture. If you're
solving a new problem that demands a new architecture, hyperparameter tuning
to get within the ballpark of a good setting is a must. Your best bet is
to read a hyperparameter tutorial for your specific problem, but I'll list
a few basic ideas here for completeness.
  • Visualization is key. Don't be afraid to take the time to write yourself nice visualization tools throughout training. If your method of visualization is watching the loss bump around from the terminal, consider an upgrade.
  • Weight initializations are important. Generally, larger magnitude initial weights are a good idea, but too large will get you NaNs. Thus, weight initialization will need to be simultaneously tuned with the learning rate.
  • Make sure the weights look "healthy". To learn what this means, I recommend opening weights from existing networks in an ipython notebook. Take some time to get used to what weight histograms should look like for your components in mature nets trained on standard datasets like ImageNet or the Penn Tree Bank.
  • Neural nets are not scale invariant w.r.t. inputs, especially when trained with SGD rather than second order methods, as SGD is not a scale-invariant method. Take the time to scale your input data and output labels in the same way that others before you have scaled them.
  • Decreasing your learning rate towards the end of training will almost always give you a boost. The best decay schedules usually take the form: after k epochs, divide the learning rate by 1.5 every n epochs, where k > n.
  • Use hyperparameter config files, although it's okay to put hyperparameters in the code until you start trying out different values. I use json files that I load in with a command line argument as in https://github.com/Russell91/tensorbox, but the exact format is not important (a minimal sketch of this pattern appears after this list). Avoid the urge to refactor your code as it becomes a hyperparameter loading mess! Refactors introduce bugs that cost you training cycles, and can be avoided until after you have a network you like.
  • Randomize your hyperparameter search if you can afford it. Random search generates hyperparameter combinations you wouldn't have thought of and removes a great deal of effort once your intuition is already trained on how to think about the impact of a given hyperparameter.
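
To make the last three bullets concrete, here is a minimal sketch of a decay
schedule of that form, a JSON hyperparameter file loaded from a command-line
flag, and random sampling of configurations. The flag name, file name, and
the specific hyperparameters and ranges are all made up for the example; only
the pattern matters.

    import argparse
    import json
    import random

    def decayed_lr(base_lr, epoch, k=30, n=10):
        """After k epochs, divide the learning rate by 1.5 every n epochs (k > n)."""
        if epoch < k:
            return base_lr
        return base_lr / (1.5 ** (1 + (epoch - k) // n))

    def load_hypes():
        """Read fixed hyperparameters from a JSON file named on the command line."""
        parser = argparse.ArgumentParser()
        parser.add_argument("--hypes", default="hypes.json", help="path to a JSON config file")
        args = parser.parse_args()
        with open(args.hypes) as f:
            return json.load(f)

    def random_config(search_space):
        """Draw one random hyperparameter combination from a dict of candidate lists."""
        return {name: random.choice(choices) for name, choices in search_space.items()}

    # Illustrative search space; the names and ranges are not recommendations.
    search_space = {
        "learning_rate": [1e-2, 3e-3, 1e-3, 3e-4],
        "hidden_size": [128, 256, 512],
        "init_scale": [0.01, 0.05, 0.1],
    }

    if __name__ == "__main__":
        hypes = load_hypes()                  # fixed settings from the JSON file
        trial = random_config(search_space)   # one randomly sampled combination
        print({**hypes, **trial})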

Conclusion

Debugging neural nets can be more laborious than traditional programs because
almost all errors get projected onto the single dimension of overall network
performance. Nonetheless, binary search is still your friend. By alternately
1) changing the difficulty of your problem, and 2) using a small number of
training examples, you can quickly work through the initial bugs.
Hyperparameter tuning and long periods of diligent waiting will get you the
rest of the way.
