Unsupervised learning, attention, and other mysteries

The following interview is one of many that will be included in our free report “Future of Machine Intelligence: Perspectives from Leading Practitioners.”

Ilya Sutskever is a research scientist at Google and the author of numerous publications on neural networks and related topics. Sutskever is a co-founder of DNNresearch and was named Canada’s first Google Fellow.

Key Takeaways:

  1. Humans solve perception problems very quickly even though our neurons are relatively slow, which suggests that only a modest number of parallel steps is needed; moderately deep, large neural networks have enabled machines to succeed in a similar fashion.
  2. Unsupervised learning is still a mystery, but a full understanding of that domain has the potential to fundamentally transform the field of machine learning.
  3. Attention models represent a promising direction for powerful learning algorithms that require ever less data to be successful on harder problems.

David Beyer: Let’s start with your background. What was the evolution of your interest in machine learning, and how did you zero in on your Ph.D. work?

Ilya Sutskever: I started my Ph.D. just before deep learning became a thing. I was working on a number of different projects, mostly centered around neural networks. My understanding of the field crystallized when collaborating with James Martens on the Hessian-free optimizer. At the time, greedy layer-wise training (training one layer at a time) was extremely popular. Working on the Hessian-free optimizer helped me understand that if you just train a very large and deep neural network on a lot of data, you will almost necessarily succeed.

Taking a step back, when solving naturally occurring machine learning problems, you use some model. The fundamental question is whether you believe that this model can solve the problem for some setting of its parameters. If the answer is no, then the model will not get great results, no matter how good its learning algorithm. If the answer is yes, then it’s only a matter of getting the data and training it. And this is, in some sense, the primary question. Can the model represent a good solution to the problem?
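The question of whether a model can represent a solution at all can be illustrated with a toy case. The sketch below is not from the interview: the XOR task, the threshold units, the hand-picked weights, and all function names are assumptions chosen for illustration. It searches a grid of parameters for a single linear unit and finds that none classifies XOR perfectly, while a two-layer network with suitable parameters does.

```python
from itertools import product

# Toy task (an assumption for illustration): the XOR truth table.
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def step(z):
    # Threshold activation: 1 if the input is positive, else 0.
    return 1 if z > 0 else 0

def linear_model(w1, w2, b):
    # A single linear threshold unit: no hidden layer.
    return lambda x: step(w1 * x[0] + w2 * x[1] + b)

def two_layer_model(x):
    # Hand-picked weights, not learned.
    h_or = step(x[0] + x[1] - 0.5)    # fires if at least one input is on
    h_and = step(x[0] + x[1] - 1.5)   # fires only if both inputs are on
    return step(h_or - h_and - 0.5)   # "OR but not AND" is XOR

def accuracy(model):
    return sum(model(x) == y for x, y in XOR) / len(XOR)

# Try every parameter setting on a coarse grid from -3.0 to 3.0.
grid = [i / 2 for i in range(-6, 7)]
best_linear = max(
    accuracy(linear_model(w1, w2, b)) for w1, w2, b in product(grid, repeat=3)
)

print(best_linear)                # 0.75
print(accuracy(two_layer_model))  # 1.0
```

No setting of the single unit's parameters gets all four cases right, so no learning algorithm could rescue it; the deeper model can represent the solution, so the remaining work is finding the parameters.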

There is a compelling argument that large, deep neural networks should be able to represent very good solutions to perception problems. It goes like this: human neurons are slow, and yet humans can solve perception problems extremely quickly and accurately. If humans can solve useful problems in a fraction of a second, then you should only need a very small number of massively-parallel steps in order to solve problems like vision and speech recognition. This is an old argument — I’ve seen a paper on this from the early 80s.

This suggests that if you train a large, deep neural network with 10 or 15 layers on something like vision, then you could basically solve it. Motivated by this belief, I worked with Alex Krizhevsky toward demonstrating it. Alex had written an extremely fast implementation of 2D convolutions on a GPU, at a time when few people knew how to code for GPUs. We were able to train neural networks larger than ever before and achieve much better results than anyone else at the time.

Nowadays, everybody knows that if you want to solve a problem, you just need to get a lot of data and train a big neural net. You might not solve it perfectly, but you can definitely solve it better than you could have possibly solved it without deep learning.

DB: Not to trivialize what you’re saying, but you say throw a lot of data at a highly parallel system, and you’ll basically figure out what you need?

IS: Yes, but: although the system is highly parallel, it is its sequential nature that gives you the power. It’s true we use parallel systems because that’s the only way to make it fast and large. But if you think of what depth represents — depth is the sequential part.

And if you look at our networks, you will see that each year they are getting deeper. It’s amazing to me that these very vague, intuitive arguments turned out to correspond to what is actually happening. Each year the networks that do best in vision are deeper than they were before. Now we have networks with 25 layers of computation, or even more, depending on how you count.

DB: What are the open problems, theoretically, in making deep learning as successful as it can be?

IS: The huge open problem would be to figure out how you can do more with less data. How do you make this method less data-hungry? How can you input the same amount of data, but better formed?

This ties in with one of the greatest open problems in machine learning: unsupervised learning. How do you even think about unsupervised learning? How do you benefit from it? Once our understanding improves and unsupervised learning advances, this is where we will acquire new ideas, and see a completely unimaginable explosion of new applications.

DB: What’s our current understanding of unsupervised learning? And how is it limited in your view?

IS: Unsupervised learning is mysterious. Compare it to supervised learning. We know why supervised learning works. You have a big model, and you’re using a lot of data to define the cost — the training error — which you minimize. If you have a lot of data, your training error will be close to your test error. Eventually, you get to a low test error, which is what you wanted from the start.

But I can’t even articulate what it is we want from unsupervised learning. You want something; you want the model to understand, whatever that means. Although we currently understand very little about unsupervised learning, I am also convinced that the explanation is right under our noses.

DB: Are you aware of any promising avenues that people are exploring toward a deeper, conceptual understanding of why unsupervised learning does what it does?

IS: There are plenty of people trying various ideas, mostly related to density modeling or generative models. If you ask any practitioner how to solve a particular problem, they will tell you to get the data and apply supervised learning. There is not yet an important application where unsupervised learning makes a profound difference.

DB: Do we have any sense of what success means? Even a rough measure of how well an unsupervised model performs?

IS: Unsupervised learning is always a means for some other end. In supervised learning, the learning itself is what you care about. You’ve got your cost function, which you want to minimize. In unsupervised learning, the goal is always to help some other task, like classification or categorization. For example, I might ask a computer system to passively watch a lot of YouTube videos (so unsupervised learning happens here), then ask it to recognize objects with great accuracy (that’s the final supervised learning task).

Successful unsupervised learning enables the subsequent supervised learning algorithm to recognize objects with accuracy that would not be possible without the use of unsupervised learning. It’s a very measurable, very visible notion of success. And we haven’t achieved it yet.

DB: What are some other areas where you see exciting progress?

IS: A general direction that I believe to be extremely important is: are learning models capable of more sequential computations? I mentioned how I think that deep learning is successful because it can do more sequential computations than previous (“shallow”) models. And so models that can do even more sequential computation should be even more successful because they are able to express more intricate algorithms. It’s like allowing your parallel computer to run for more steps. We already see the beginning of this, in the form of attention models.

DB: And how do attention models differ from the current approach?

IS: In the current approach, you take your input vector and give it to the neural network. The neural network runs it, applies several processing stages to it, and then gets an output. In an attention model, you have a neural network, but you run the neural network for much longer. There is a mechanism in the neural network, which decides which part of the input it wants to “look” at. Normally, if the input is very large, you need a large neural network to process it. But if you have an attention model, you can decide on the best size of the neural network, independent of the size of the input.

DB: So then, how do you decide where to focus this attention in the network?

IS: Say you have a sentence, a sequence of, say, 100 words. The attention model will issue a query on the input sentence and create a distribution over the input words, such that a word that is more similar to the query will have higher probability, and words that are less similar to the query will have lower probability. Then you take the weighted average of them. Since every step is differentiable, we can train the attention model where to look with backpropagation, which is the reason for its appeal and success.
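The steps described above (query, distribution over words, weighted average) can be sketched in a few lines. This is a minimal illustration, not the model discussed in the interview: the dot-product similarity, the softmax normalization, the two-dimensional toy word vectors, and the function name `soft_attention` are all assumptions.

```python
import math

def soft_attention(query, words):
    """Run one attention step and return (weights, context).

    query: a list of floats; words: a list of float vectors of the same length.
    """
    # Similarity of each word to the query (dot product is an assumption).
    scores = [sum(q * w for q, w in zip(query, word)) for word in words]

    # Softmax: turn scores into a probability distribution over the words.
    # Subtracting the maximum score only improves numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted average of the word vectors.
    dim = len(words[0])
    context = [
        sum(wt * word[d] for wt, word in zip(weights, words)) for d in range(dim)
    ]
    return weights, context

# Toy "sentence" of three word vectors and one query (made-up numbers).
words = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [2.0, 0.0]
weights, context = soft_attention(query, words)
print(weights)
print(context)
```

The word most similar to the query receives the largest weight, and the context vector has the same size whether the sentence holds three words or a hundred. Every operation here is differentiable, which is what allows the choice of where to look to be trained with backpropagation.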

DB: What kind of changes do you need to make to the framework itself? What new code do you need to insert this notion of attention?

IS: Well, the great thing about attention, at least differentiable attention, is that you don’t need to insert any new code to the framework. As long as your framework supports element-wise multiplication of matrices or vectors, and exponentials, that’s all you need.
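One common way to write differentiable attention, given here as an illustrative formulation rather than the exact model under discussion (the symbols q for the query, k_i and v_i for the i-th input, and the dot-product score are assumptions):

```latex
\alpha_i = \frac{\exp\left(q^\top k_i\right)}{\sum_{j=1}^{n} \exp\left(q^\top k_j\right)},
\qquad
c = \sum_{i=1}^{n} \alpha_i \, v_i
```

Under these assumptions, each score is a sum of element-wise products, the weights need only exponentials and a normalization, and the output c is a weighted sum of the inputs, so nothing beyond standard framework primitives is required.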

DB: So, attention models address the question you asked earlier: how do we make better use of existing power with less data?

IS: That’s basically correct. There are many reasons to be excited about attention. One of them is that attention models simply work better, allowing us to achieve better results with less data. Also, bear in mind that humans clearly have attention. It is something that enables us to get results. It’s not just an academic concept. If you imagine a really smart system, surely, it, too, will have attention.

DB: What are some of the key issues around attention?

IS: Differentiable attention is computationally expensive because it requires accessing your entire input at each step of the model’s operation. And this is fine when the input is a sentence that’s only, say, 100 words, but it’s not practical when the input is a 10,000-word document. So, one of the main issues is speed. Attention should be fast, but differentiable attention is not fast. Reinforcement learning of attention is potentially faster, but training attentional control using reinforcement learning over thousands of objects would be non-trivial.
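A rough count makes the speed issue concrete. The sketch below assumes, purely for illustration, that the model takes one attention step per input word, and that a hypothetical non-differentiable alternative looks at only 10 positions per step; neither number comes from the interview.

```python
def soft_attention_reads(input_length, num_steps):
    # Differentiable attention scores every input position at every step.
    return input_length * num_steps

def hard_attention_reads(glimpse_size, num_steps):
    # Hypothetical alternative: each step looks at only a few positions.
    return glimpse_size * num_steps

sentence_cost = soft_attention_reads(input_length=100, num_steps=100)
document_cost = soft_attention_reads(input_length=10_000, num_steps=10_000)
document_hard_cost = hard_attention_reads(glimpse_size=10, num_steps=10_000)

print(sentence_cost)       # 10000
print(document_cost)       # 100000000
print(document_hard_cost)  # 100000
```

Under these assumptions, a 100-fold increase in input length costs 10,000 times more work for differentiable attention, while an approach that selects a few positions per step grows only linearly with the number of steps.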

DB: Is there an analog, in the brain, as far as we know, for unsupervised learning?

IS: The brain is a great source of inspiration if looked at correctly. The question of whether the brain does unsupervised learning depends to some extent on what you consider to be unsupervised learning. In my opinion, the answer is unquestionably yes. Look at how people behave, and notice that people are not really using supervised learning at all. Humans never use any supervision of any kind. You start reading a book, and you understand it, and all of a sudden you can do new things that you couldn’t do before. Consider a child, sitting in class. It’s not like the student is given a lot of input/output examples. The supervision is extremely indirect; so, there’s necessarily a lot of unsupervised learning going on.

DB: Your work was inspired by the human brain and its power. How far does the neuroscientific understanding of the brain extend into the realm of theorizing and applying machine learning?

IS: There is a lot of value in looking at the brain, but it has to be done carefully, and at the right level of abstraction. For example, our neural networks have units that have connections between them, and the idea of using slow interconnected processors was directly inspired by the brain. But it is a faint analogy.

Neural networks are designed to be computationally efficient in software implementations rather than biologically plausible. But the overall idea was inspired by the brain, and was successful. For example, convolutional neural networks echo our understanding that neurons in the visual cortex have very localized receptive fields. This is something that was known about the brain, and this information has been successfully carried over to our models. Overall, I think there is value in studying the brain if done carefully and responsibly.

