Why Deep Learning Works – Key Insights and Saddle Points

A quality discussion on the theoretical motivations for deep learning, including distributed representation, deep architecture, and the easily escapable saddle point.

By Matthew Mayo.

This post summarizes the key points of a recent blog post by Rinu Boney, based on a lecture by Dr. Yoshua Bengio from this year's Deep Learning Summer School in Montreal, which discusses the theoretical motivations for deep learning.


"To generalize locally, we need representative examples for all relevant variations."

Deep learning is about learning multiple levels of representations, corresponding to multiple levels of abstractions. If we are able to learn these multiple levels of representation, we are able to generalize well.

After setting the general tone of the post with the above (paraphrased) statement, the author presents a number of different artificial intelligence (AI) strategies, from rule-based systems to deep learning, and notes at which levels of learning their components operate. He then states the 3 keys to moving from machine learning (ML) to true artificial intelligence: lots of data, very flexible models, and powerful priors. Since classical ML can handle the first 2, his post deals with the third.

On the path toward AI from today's ML systems, we need learning, generalization, ways to fight the curse of dimensionality, and the ability to disentangle the underlying explanatory factors. Before explaining why non-parametric learning algorithms won't get us to true AI, he gives a nuanced definition of non-parametric. He explains why smoothness, a classical non-parametric approach, won't work with high-dimensional data, and then provides the following insight regarding dimensionality:

"If we dig deeper mathematically, it's not the number of dimensions but the number of variations of functions that we learn. In this case, smoothness is about how many ups and downs are present in the curve."


"A line is very smooth. A curve with some ups and downs is less smooth but still smooth."

So, it's clear that smoothness alone will not beat the curse of dimensionality. In fact, the smoothness prior falls short on modern, complex problems like computer vision or natural language processing. After discussing the downfalls of competing methods such as Gaussian kernels, Boney turns to moving past smoothness, and why that's necessary:

"We want to be non-parametric in the sense that we want the family of functions to grow in flexibility as we get more data. In neural networks, we change the number of hidden units depending on the amount of data."

He notes that in deep learning, 2 priors are used, namely distributed representations and deep architecture.

Why distributed representations?

"With distributed representations, it is possible to represent exponential number of regions with a linear number of parameters. The magic of distributed representation is that it can learn a very complicated function (with many ups and downs) with a low number of examples."

In distributed representations, features are individually and independently meaningful, and they remain so regardless of what the other features are. There may be some interactions, but most features are learned independently of each other. Boney states that neural networks are very good at learning representations that capture the semantic aspects, and that their generalization power is derived from these representations. As a practical exploration of the topic, he recommends Christopher Olah's article for some information on distributed representation and Natural Language Processing.
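A rough numerical sketch of the exponential-regions argument (my own toy example, not Boney's): k random hyperplanes assign each input a k-bit binary code, so roughly k * (d + 1) parameters can, when d >= k, distinguish up to 2^k regions of the input space, whereas a purely local representation such as k cluster prototypes distinguishes only k regions.

```python
# Count how many distinct binary codes (regions) k random hyperplanes induce.
import numpy as np

rng = np.random.default_rng(0)
d, k, n_points = 10, 10, 50_000               # with d >= k, up to 2^k regions are possible

W = rng.normal(size=(k, d))                   # one hyperplane (feature) per row
b = rng.normal(size=k)
X = rng.uniform(-1.0, 1.0, size=(n_points, d))

codes = X @ W.T + b > 0                       # distributed binary representation of each point
n_regions = len({tuple(row) for row in codes.astype(int)})

print(f"parameters: {k * (d + 1)}")           # linear in k
print(f"distinct regions hit by {n_points} sample points: {n_regions}")
print(f"upper bound: 2^{k} = {2 ** k}")       # exponential in k
```

The point is the contrast in scaling: the parameter count grows linearly in the number of features, while the number of distinguishable regions can grow exponentially.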

There is a lot of misunderstanding about what depth means.

"Deeper networks does not correspond to a higher capacity. Deeper doesn't mean we can represent more functions. If the function we are trying to learn has a particular characteristic obtained through composition of many operations, then it is much better to approximate these functions with a deep neural network."

Boney then comes full circle. He explains that one of the reasons neural network research was abandoned (once again) in the late 90s was that the optimization problem is non-convex. The realization from the work of the 80s and 90s that neural networks have an exponential number of local minima, along with the breakout success of kernel machines, also contributed to this downfall, as did the fact that networks could get stuck in poor solutions. Recently, however, we have evidence that non-convexity may be a non-issue, which reframes its relationship with neural networks.
 
"A saddle point is illustrated in the image above. In a global or local minima, all the directions are going up and in a global or local maxima, all the directions are going down."

Saddle Points.

"Let us consider the optimization problem in low dimensions vs high dimensions. In low dimensions, it is true that there exists lots of local minima. However in high dimensions, local minima are not really the critical points that are the most prevalent in points of interest. When we optimize neural networks or any high dimensional function, for most of the trajectory we optimize, the critical points(the points where the derivative is zero or close to zero) are saddle points. Saddle points, unlike local minima, are easily escapable."

The intuition with saddle points is that, for a local minimum located close to the global minimum, all directions should climb upward; going further downward is not possible. Local minima do exist, but they tend to be very close to the global minimum in terms of objective value. Theoretical results on certain families of large random functions show that the relationship between a critical point's index and its objective value concentrates. The index is the fraction of directions that go downward: an index of 0 corresponds to a local minimum, an index of 1 to a local maximum, and any value in between means the critical point is a saddle point.
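As a concrete (and entirely toy) version of the index idea, added by me rather than taken from Boney's post: at a critical point the gradient is zero, and the signs of the Hessian's eigenvalues tell you how many directions curve downward.

```python
# Classify a critical point by the fraction of negative Hessian eigenvalues (its index).
import numpy as np

def index_of_critical_point(hessian: np.ndarray) -> float:
    """Fraction of directions that curve downward at a critical point."""
    eigvals = np.linalg.eigvalsh(hessian)
    return float(np.mean(eigvals < 0))

# f(x, y) = x^2 - y^2 and g(x, y) = x^2 + y^2 both have a critical point at the origin.
H_saddle  = np.array([[2.0, 0.0], [0.0, -2.0]])   # Hessian of f: one direction up, one down
H_minimum = np.array([[2.0, 0.0], [0.0,  2.0]])   # Hessian of g: all directions go up

for name, H in (("x^2 - y^2", H_saddle), ("x^2 + y^2", H_minimum)):
    idx = index_of_critical_point(H)
    kind = "local minimum" if idx == 0 else "local maximum" if idx == 1 else "saddle point"
    print(f"{name}: index = {idx:.2f} -> {kind}")
```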

Boney goes on to say that there has been empirical validation corroborating this relationship between index and objective function, and that, while there is no proof the results apply to neural network optimization, some evidence suggests that the observed behavior may well correspond to the theoretical results. Stochastic gradient descent, in practice, almost always escapes from critical points that are not local minima.
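To illustrate that last claim, here is a toy experiment of my own (the function, step size, and noise level are arbitrary assumptions): on f(x, y) = x^2 - y^2, plain gradient descent started exactly on the ridge y = 0 walks straight into the saddle at the origin, while adding a little noise to the gradient perturbs y off the ridge, after which the descent direction carries it away from the saddle.

```python
# Noisy gradient steps escape the saddle of f(x, y) = x^2 - y^2; noiseless steps do not.
import numpy as np

def grad(p: np.ndarray) -> np.ndarray:
    x, y = p
    return np.array([2.0 * x, -2.0 * y])          # gradient of x^2 - y^2

def descend(noise_scale: float, steps: int = 100, lr: float = 0.05) -> np.ndarray:
    rng = np.random.default_rng(0)
    p = np.array([1.0, 0.0])                      # start exactly on the ridge (y = 0)
    for _ in range(steps):
        g = grad(p) + noise_scale * rng.normal(size=2)
        p = p - lr * g
    return p

print("noiseless GD ends at:", descend(noise_scale=0.0))   # converges to the saddle (0, 0)
print("noisy GD ends at:   ", descend(noise_scale=0.01))   # |y| grows: the saddle is escaped
```

The toy function is unbounded below, so "escaping" here just means the iterate moves away from the saddle along the descending direction; on a real loss surface it would continue downhill toward better regions.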

This all suggests that local minima may not, in fact, be an issue, because most of the critical points encountered during optimization are saddle points.

Boney follows up his saddle point discussion by pointing out a few other priors that work with deep distributed representations: human learning, semi-supervised learning, and multi-task learning. He then lists a few related papers on saddle points.

Rinu Boney has written a detailed piece on the motivations for deep learning, including a good discussion of saddle points, all of which is difficult to do justice with a few quotes and some summarization. If you are interested in a deeper discussion of the above points, visit Boney's blog and read the insightful and well-written piece yourself.

Bio: Matthew Mayo is a computer science graduate student currently working on his thesis on parallelizing machine learning algorithms. He is also a student of data mining, a data enthusiast, and an aspiring machine learning scientist.
