Nuts and Bolts of Applying Deep Learning

Sep 26, 2016

This weekend was very hectic (catching up on courses and studying for a statistics quiz), but I managed to squeeze in some time to watch the Bay Area Deep Learning School livestream on YouTube. For those of you wondering what that is, BADLS is a 2-day conference hosted at Stanford University, consisting of back-to-back presentations on a variety of topics ranging from NLP and Computer Vision to Unsupervised Learning and Reinforcement Learning. Additionally, top DL software libraries such as Torch, Theano and TensorFlow were presented.

There were some super interesting talks from leading experts in the field: Hugo Larochelle from Twitter, Andrej Karpathy from OpenAI, Yoshua Bengio from the Université de Montréal, and Andrew Ng from Baidu, to name a few. Of the plethora of presentations, there was one somewhat non-technical one given by Andrew that really piqued my interest.

In this blog post, I’m gonna try and give an overview of the main ideas outlined in his talk. The goal is to pause a bit and examine the ongoing trends in Deep Learning, as well as gain some insight into applying DL in practice.

By the way, if you missed out on the livestreams, you can still view them at the following: Day 1 and Day 2.

Table of Contents:

  • Major Deep Learning Trends
  • The rise of End-to-End DL
  • Bias-Variance Tradeoff
  • Human-level Performance
  • Personal Advice

Major Deep Learning Trends

Why do DL algorithms work so well? According to Ng, the rise of the Internet, Mobile and IoT era has greatly increased the amount of data accessible to us. This correlates directly with a boost in the performance of neural network models, especially the larger ones, which have the capacity to absorb all this data.

However, in the small-data regime (the left-hand side of the x-axis), the relative ordering of the algorithms is not that well defined and really depends on who is more motivated to engineer better features or to refine and tune their model's hyperparameters.

Thus this trend is most prevalent in the big-data regime, where hand engineering is effectively replaced by end-to-end approaches, and bigger neural nets combined with a lot of data tend to outperform all other models.

Machine Learning and HPC teams. The rise of big data and the need for larger models have started to put pressure on companies to hire a Computer Systems team. This is because some HPC (high-performance computing) applications require highly specialized knowledge, and it is difficult to find researchers and engineers with sufficient knowledge of both fields. Thus, cooperation between the two teams is key to boosting performance in AI companies.

Categorizing DL models. Work in DL can be categorized into the following 4 buckets:

Most of the value in the industry today is driven by the models in the orange blob (which account for most of the innovation and monetization), but Andrew believes that unsupervised deep learning is a super-exciting field with loads of potential for the future.

The rise of End-to-End DL

A major development in the end-to-end approach is that outputs are becoming more and more complex. For example, rather than just outputting a simple class score such as 0 or 1, algorithms are starting to generate richer outputs: images in the case of GANs, full captions with RNNs and, most recently, audio as in DeepMind's WaveNet.

So what exactly does end-to-end training mean? Essentially, it means that AI practitioners are shying away from intermediate representations and going directly from one end (raw input) to the other end (output). Here's an example from speech recognition, sketched below.
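Since the slide itself isn't reproduced here, here is a rough, purely illustrative Python sketch of the contrast; the component names are hypothetical stand-ins, not a real speech API. A traditional pipeline chains several separately engineered stages, while an end-to-end model is a single learned function from raw audio to transcript.

```python
# Illustrative only: component names are hypothetical stand-ins.

def traditional_pipeline(audio, feature_extractor, acoustic_model,
                         pronunciation_model, language_model):
    features = feature_extractor(audio)      # e.g. hand-engineered MFCC features
    phonemes = acoustic_model(features)      # separately trained component
    words = pronunciation_model(phonemes)    # hand-built lexicon
    return language_model(words)             # final transcript

def end_to_end(audio, model):
    # a single network maps raw audio straight to the transcript,
    # with no hand-designed intermediate representations
    return model(audio)
```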

Are there any disadvantages to this approach? End-to-end approaches are data-hungry, meaning they only perform well when provided with a huge dataset of labelled examples. In practice, not all applications have the luxury of large labelled datasets, so other approaches, which allow hand-engineered information and domain expertise to be added into the model, have gained the upper hand. As an example, in a self-driving car setting, going directly from the raw image to the steering direction is pretty difficult. Rather, many features such as trajectory and pedestrian location are computed first as intermediate steps.

The main take-away from this section is that we should always be cautious of end-to-end approaches in applications where huge labelled datasets are hard to come by.

Bias-Variance Tradeoff

Splitting your data. In most deep learning problems, the train and test data come from different distributions. For example, suppose you are working on implementing an AI-powered rear-view mirror and have gathered 2 chunks of data: the first, larger chunk comes from many places (it could be partly bought and partly crowdsourced) and the second, much smaller chunk is actual car data.

In this case, splitting the data into train/dev/test can be tricky. One might be tempted to carve the dev set out of the training chunk like in the first example of the diagram below. (Note that the chunk on the left corresponds to data mined from the first distribution and the one on the right to the one from the second distribution.)

This is bad because we usually want our dev and test sets to come from the same distribution. The reason is that a part of the team will be spending a lot of time tuning the model to work well on the dev set; if the test set were to turn out very different from the dev set, then pretty much all that work would have been wasted effort.

Hence, a smarter way of splitting the above dataset is shown in the second line of the diagram. In practice, Andrew recommends creating dev sets from both data distributions: a train-dev set carved out of the training-distribution data, and a (test-)dev set carved out of the actual car data. In this manner, the gap between each pair of errors (train vs. train-dev, train-dev vs. dev, dev vs. test) tells you more clearly which problem to tackle; a minimal sketch of this split follows.
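As a concrete illustration, here is a minimal NumPy sketch of that split, assuming the large crowdsourced chunk and the small in-car chunk are already loaded as arrays; the array names and sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
web_data = rng.normal(size=(50_000, 10))   # stand-in for the large crowdsourced chunk
car_data = rng.normal(size=(5_000, 10))    # stand-in for the small in-car chunk

# Carve a small train-dev set out of the training distribution...
n_train_dev = 2_000
train     = web_data[:-n_train_dev]
train_dev = web_data[-n_train_dev:]        # same distribution as train

# ...and split the in-car data (the distribution we actually care about)
# evenly into dev and test, so that dev and test share the same distribution.
n_dev = len(car_data) // 2
dev  = car_data[:n_dev]
test = car_data[n_dev:]
```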

Flowchart for working with a model. Given what we have described above, here's a simplified flowchart of the actions you should take when confronted with training and tuning a DL model.
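Since the flowchart itself is a figure, the sketch below is my own rough reconstruction of its decision logic, based on the error gaps discussed in this section; the tolerance and the suggested actions are illustrative assumptions, not the exact contents of the slide.

```python
def diagnose(human_err, train_err, train_dev_err, dev_err, tol=0.5):
    """Suggest a next step from the gaps between error rates (in percent).
    A rough reconstruction of the flowchart, not the slide itself."""
    if train_err - human_err > tol:
        return "High bias: try a bigger model, train longer, or a new architecture."
    if train_dev_err - train_err > tol:
        return "High variance: get more data, add regularization, or try a new architecture."
    if dev_err - train_dev_err > tol:
        return "Train/test mismatch: gather data more similar to the test set, or synthesize it."
    return "You are close to human-level on dev; look for subsets where the model still fails."
```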

The importance of data synthesis. Andrew also stressed the importance of data synthesis as part of any deep learning workflow. While it may be painful to manually engineer training examples, the relative gain in performance you obtain once the model and its parameters fit well is huge and worth your while.
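For the in-car speech example, synthesis could mean overlaying clean recordings with background noise at a chosen signal-to-noise ratio. The snippet below is a minimal, hypothetical sketch of that idea, not a production augmentation pipeline.

```python
import numpy as np

def synthesize_noisy(clean_audio, noise_clips, snr_db=10.0, rng=None):
    """Mix a clean recording with a random noise clip at a target SNR (in dB)."""
    rng = rng or np.random.default_rng()
    # pick a noise clip and tile/truncate it to the length of the clean recording
    noise = np.resize(noise_clips[rng.integers(len(noise_clips))], clean_audio.shape)
    # scale the noise so the resulting signal-to-noise ratio matches snr_db
    signal_power = np.mean(clean_audio ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return clean_audio + scale * noise
```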

Human-level Performance

One of the most important concepts emphasized in this lecture was that of human-level performance. In the basic setting, DL models tend to plateau once they have reached or surpassed human-level accuracy. While it is important to note that human-level performance doesn't necessarily coincide with the Bayes optimal error rate, it can serve as a very reliable proxy which can be leveraged to determine your next move when training your model.

Reasons for the plateau. There could be a theoretical limit on the dataset (the Bayes error) which makes further improvement futile (e.g. a noisy subset of the data). Humans are also very good at these tasks, so trying to make progress beyond human level suffers from diminishing returns.

Here’s an example that can help illustrate the usefulness of human-level accuracy. Suppose you are working on an image recognition task and measure the following:

  • Train error: 8%
  • Dev error: 10%

If I were to tell you that human error for such a task is on the order of 1%, then this would be a blatant bias problem, and you could subsequently try increasing the size of your model, training longer, etc. However, if I told you that human-level error was on the order of 7.5%, then this would be more of a variance problem, and you'd focus your efforts on methods such as data synthesis or gathering data more similar to the test set.
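In terms of the error gaps introduced earlier, the two scenarios amount to a couple of lines of arithmetic (a sketch of the reasoning, using the numbers from the example above):

```python
train_err, dev_err = 8.0, 10.0        # percentages from the example above

# Scenario 1: human-level error is about 1%
bias_gap_1 = train_err - 1.0          # 7.0: bias dominates, so grow the model / train longer
# Scenario 2: human-level error is about 7.5%
bias_gap_2 = train_err - 7.5          # 0.5: little avoidable bias left
variance_gap = dev_err - train_err    # 2.0: variance dominates, so get more / synthesized data
```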

By the way, there's always room for improvement. Even if you are close to human-level accuracy overall, there could be subsets of the data where you perform poorly, and working on those can boost production performance greatly.

Finally, one might ask what a good way of defining human-level accuracy is. For example, in the following medical image diagnosis setting, ignoring the cost of obtaining the data, which of these error rates should serve as the human-level benchmark?

  • typical human: 5%
  • general doctor: 1%
  • specialized doctor: 0.8%
  • group of specialized doctors: 0.5%

The answer is always the best accuracy possible. This is because, as we mentioned earlier, human-level performance is a proxy for the Bayes optimal error rate, so a more accurate estimate of that limit on your performance can help you strategize your next move.

Personal Advice

Andrew ended the presentation with 2 ways to improve your skills in the field of deep learning:

  • Practice, Practice, Practice: compete in Kaggle competitions and read associated blog posts and forum discussions.
  • Do the Dirty Work: read a lot of papers and try to replicate the results. Soon enough, you’ll get your own ideas and build your own models.
 
