Unsupervised learning, attention, and other mysteries

Get notified when our free report “Future of Machine Intelligence: Perspectives from Leading Practitioners” is available for download. The following interview is one of many that will be included in the report.

Ilya Sutskever is a research scientist at Google and the author of numerous publications on neural networks and related topics. Sutskever is a co-founder of DNNresearch and was named Canada’s first Google Fellow.

Key Takeaways:

  1. Since humans can solve perception problems very quickly, despite our neurons being relatively slow, moderately deep and large neural networks have enabled machines to succeed in a similar fashion.
  2. Unsupervised learning is still a mystery, but a full understanding of that domain has the potential to fundamentally transform the field of machine learning.
  3. Attention models represent a promising direction for powerful learning algorithms that require ever less data to be successful on harder problems.

David Beyer: Let’s start with your background. What was the evolution of your interest in machine learning, and how did you zero-in on your Ph.D. work?

Ilya Sutskever: I started my Ph.D. just before deep learning became a thing. I was working on a number of different projects, mostly centered around neural networks. My understanding of the field crystallized when collaborating with James Martins on the Hessian-free optimizer. At the time, greedy layer-wise training (training one layer at a time) was extremely popular. Working on the Hessian-free optimizer helped me understand that if you just train a very large and deep neural network on a lot of data, you will almost necessarily succeed.

Taking a step back, when solving naturally occurring machine learning problems, you use some model. The fundamental question is whether you believe that this model can solve the problem for some setting of its parameters. If the answer is no, then the model will not get great results, no matter how good its learning algorithm. If the answer is yes, then it’s only a matter of getting the data and training it. And this is, in some sense, the primary question. Can the model represent a good solution to the problem?

There is a compelling argument that large, deep neural networks should be able to represent very good solutions to perception problems. It goes like this: human neurons are slow, and yet humans can solve perception problems extremely quickly and accurately. If humans can solve useful problems in a fraction of a second, then you should only need a very small number of massively-parallel steps in order to solve problems like vision and speech recognition. This is an old argument — I’ve seen a paper on this from the early 80s.

This suggests that if you train a large, deep neural network with 10 or 15 layers on something like vision, then you could basically solve it. Motivated by this belief, I worked with Alex Krizhevsky toward demonstrating it. Alex had written an extremely fast implementation of 2D convolutions on a GPU, at a time when few people knew how to code for GPUs. We were able to train neural networks larger than ever before and achieve much better results than anyone else at the time.

Nowadays, everybody knows that if you want to solve a problem, you just need to get a lot of data and train a big neural net. You might not solve it perfectly, but you can definitely solve it better than you could have possibly solved it without deep learning.

DB: Not to trivialize what you’re saying, but you say throw a lot of data at a highly parallel system, and you’ll basically figure out what you need?

IS: Yes, but: although the system is highly parallel, it is its sequential nature that gives you the power. It’s true we use parallel systems because that’s the only way to make it fast and large. But if you think of what depth represents — depth is the sequential part.

And if you look at our networks, you will see that each year they are getting deeper. It’s amazing to me that these very vague, intuitive arguments turned out to correspond to what is actually happening.  Each year the networks that do best in vision are deeper than they were before. Now we have 25-layer computational steps, or even more, depending on how you count.

DB: What are the open problems, theoretically, in making deep learning as successful as it can be?

IS: The huge open problem would be to figure out how you can do more with less data. How do you make this method less data-hungry? How can you input the same amount of data, but better formed?

This ties in with the one of greatest open problems in machine learning — unsupervised learning. How do you even think about unsupervised learning? How do you benefit from it? Once our understanding improves and unsupervised learning advances, this is where we will acquire new ideas, and see a completely unimaginable explosion of new applications.

DB: What’s our current understanding of unsupervised learning? And how is it limited in your view?

IS: Unsupervised learning is mysterious. Compare it to supervised learning. We know why supervised learning works. You have a big model, and you’re using a lot of data to define the cost — the training error — which you minimize. If you have a lot of data, your training error will be close to your test error. Eventually, you get to a low test error, which is what you wanted from the start.

But I can’t even articulate what it is we want from unsupervised learning. You want something; you want the model to understand, whatever that means. Although we currently understand very little about unsupervised learning, I am also convinced that the explanation is right under our noses.

DB: Are you aware of any promising avenues that people are exploring toward a deeper, conceptual understanding of why unsupervised learning does what it does?

IS: There are plenty of people trying various ideas, mostly related to density modeling or generative models. If you ask any practitioner how to solve a particular problem, they will tell you to get the data and apply supervised learning. There is not yet an important application where unsupervised learning makes a profound difference.

DB: Do we have any sense of what success means? Even a rough measure of how well an unsupervised model performs?

IS: Unsupervised learning is always a means for some other end. In supervised learning, the learning itself is what you care about. You’ve got your cost function, which you want to minimize. In unsupervised learning, the goal is always to help some other task, like classification or categorization. For example, I might ask a computer system to passively watch a lot of YouTube videos (so unsupervised learning happens here), then ask it to recognize objects with great accuracy (that’s the final supervised learning task).

Successful unsupervised learning enables the subsequent supervised learning algorithm to recognize objects with accuracy that would not be possible without the use of unsupervised learning. It’s a very measurable, very visible notion of success. And we haven’t achieved it yet.

DB: What are some other areas where you see exciting progress?

IS: A general direction that I believe to be extremely important is: are learning models capable of more sequential computations? I mentioned how I think that deep learning is successful because it can do more sequential computations than previous (“shallow”) models. And so models that can do even more sequential computation should be even more successful because they are able to express more intricate algorithms. It’s like allowing your parallel computer to run for more steps. We already see the beginning of this, in the form of attention models.

DB: And how do attention models differ from the current approach?

IS: In the current approach, you take your input vector and give it to the neural network. The neural network runs it, applies several processing stages to it, and then gets an output. In an attention model, you have a neural network, but you run the neural network for much longer. There is a mechanism in the neural network, which decides which part of the input it wants to “look” at. Normally, if the input is very large, you need a large neural network to process it. But if you have an attention model, you can decide on the best size of the neural network, independent of the size of the input.

DB: So then, how do you decide where to focus this attention in the network?

IS: Say you have a sentence, a sequence of, say, 100 words. The attention model will issue a query on the input sentence and create a distribution over the input words, such that a word that is more similar to the query will have higher probability, and words that are less similar to the query will have lower probability. Then you take the weighted average of them. Since every step is differentiable, we can train the attention model where to look with backpropagation, which is the reason for its appeal and success.

DB: What kind of changes do you need to make to the framework itself? What new code do you need to insert this notion of attention?

IS: Well, the great thing about attention, at least differentiable attention, is that you don’t need to insert any new code to the framework. As long as your framework supports element-wise multiplication of matrices or vectors, and exponentials, that’s all you need.

DB: So, attention models address the question you asked earlier: how do we make better use of existing power with less data?

IS: That’s basically correct. There are many reasons to be excited about attention. One of them is that attention models simply work better, allowing us to achieve better results with less data. Also, bear in mind that humans clearly have attention. It is something that enables us to get results. It’s not just an academic concept. If you imagine a really smart system, surely, it, too, will have attention.

DB: What are some of the key issues around attention?

IS: Differentiable attention is computationally expensive because it requires accessing your entire input at each step of the model’s operation. And this is fine when the input is a sentence that’s only, say, 100 words, but it’s not practical when the input is a 10,000-word document. So, one of the main issues is speed. Attention should be fast, but differentiable attention is not fast. Reinforcement learning of attention is potentially faster, but training attentional control using reinforcement learning over thousands of objects would be non-trivial.

DB: Is there an analog, in the brain, as far as we know, for unsupervised learning?

IS: The brain is a great source of inspiration if looked at correctly. The question of whether the brain does unsupervised learning or not, depends to some extent on what you consider to be unsupervised learning. In my opinion, the answer is unquestionably yes. Look at how people behave, and notice that people are not really using supervised learning at all. Humans never use any supervision of any kind. You start reading a book, and you understand it, and all of a sudden you can do new things that you couldn’t do before. Consider a child, sitting in class. It’s not like the student is given a lot of input/output examples. The supervision is extremely indirect; so, there’s necessarily a lot of unsupervised learning going on.

DB: Your work was inspired by the human brain and its power. How far does the neuroscientific understanding of the brain extend into the realm of theorizing and applying machine learning?

IS: There is a lot of value of looking at the brain, but it has to be done carefully, and at the right level of abstraction. For example, our neural networks have units that have connections between them, and the idea of using slow interconnected processors was directly inspired by the brain. But it is a faint analogy.

Neural networks are designed to be computationally efficient in software implementations rather than biologically plausible. But the overall idea was inspired by the brain, and was successful. For example, convolutional neural networks echo our understanding that neurons in the visual cortex have very localized perceptive fields. This is something that was known about the brain, and this information has been successfully carried over to our models. Overall, I think there is value in studying the brain if done carefully and responsibly.

Public domain image on article and category pages via the Google Art Project on Wikimedia Commons.

Unsupervised learning, attention, and other mysteries的更多相关文章

  1. Machine Learning Algorithms Study Notes(4)—无监督学习(unsupervised learning)

    1    Unsupervised Learning 1.1    k-means clustering algorithm 1.1.1    算法思想 1.1.2    k-means的不足之处 1 ...

  2. Unsupervised Learning: Use Cases

    Unsupervised Learning: Use Cases Contents Visualization K-Means Clustering Transfer Learning K-Neare ...

  3. Unsupervised Learning and Text Mining of Emotion Terms Using R

    Unsupervised learning refers to data science approaches that involve learning without a prior knowle ...

  4. Supervised Learning and Unsupervised Learning

    Supervised Learning In supervised learning, we are given a data set and already know what our correc ...

  5. Unsupervised learning无监督学习

    Unsupervised learning allows us to approach problems with little or no idea what our results should ...

  6. PredNet --- Deep Predictive coding networks for video prediction and unsupervised learning --- 论文笔记

    PredNet --- Deep Predictive coding networks for video prediction and unsupervised learning   ICLR 20 ...

  7. 131.005 Unsupervised Learning - Cluster | 非监督学习 - 聚类

    @(131 - Machine Learning | 机器学习) 零. Goal How Unsupervised Learning fills in that model gap from the ...

  8. Coursera 机器学习 第8章(上) Unsupervised Learning 学习笔记

    8 Unsupervised Learning8.1 Clustering8.1.1 Unsupervised Learning: Introduction集群(聚类)的概念.什么是无监督学习:对于无 ...

  9. 无监督学习(Unsupervised Learning)

    无监督学习(Unsupervised Learning) 聚类无监督学习 特点 只给出了样本, 但是没有提供标签 通过无监督学习算法给出的样本分成几个族(cluster), 分出来的类别不是我们自己规 ...

随机推荐

  1. C语言的知识与能力予以自评

    看到一个问卷不错,拟作为第三次作业的部分内容. 你对自己的未来有什么规划?做了哪些准备?答:多学习几门生存技巧,首先先学会碰壁. 你认为什么是学习?学习有什么用?现在学习动力如何?为什么?答:学习是人 ...

  2. java读取xls和xlsx数据作为数据驱动来用

    java读取Excle代码 拿来可以直接使用 :针对xls 和 xlsx package dataProvider; import java.io.File; import java.io.FileI ...

  3. SpringCloud——服务网关

    1.背景 上篇博客<SpringCloud--Eureka服务注册和发现>中介绍了注册中心Eureka.服务提供者和服务消费者.这篇博客我们将介绍服务网关. 图(1) 未使用服务网关的做法 ...

  4. [转帖]Linux 的UTC 和 GMT

    1.问题 对于装有Windows和Linux系统的机器,进入Windows显示的时间和Linux不一致,Linux中的时间比Windows提前8个小时. 2.解决方法 修改/etc/default/r ...

  5. 使用vue的mixins混入实现对正在编辑的页面离开时提示

    mixins.ts import { Vue, Component, Watch } from "vue-property-decorator" Component.registe ...

  6. 剖析Vue原理&实现双向绑定MVVM-1

    本文能帮你做什么?1.了解vue的双向数据绑定原理以及核心代码模块2.缓解好奇心的同时了解如何实现双向绑定为了便于说明原理与实现,本文相关代码主要摘自vue源码, 并进行了简化改造,相对较简陋,并未考 ...

  7. TP中循环遍历

    循环遍历(重点) 在ThinkPHP中系统提供了2个标签来实现数组在模版中的遍历: volist标签.foreach标签. Volist语法格式: Foreach语法格式: 从上述的语法格式发现vol ...

  8. Ubuntu系统下adb devices 不能显示手机设备

    1. 查看usb设备,命令:lsusb 结果如下:Bus 001 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub B ...

  9. SolrPerformanceFactors--官方文档

    原文地址:http://wiki.apache.org/solr/SolrPerformanceFactors Contents Schema Design Considerations indexe ...

  10. BZOJ 4034 树上操作(树的欧拉序列+线段树)

    刷个清新的数据结构题爽一爽? 题意: 有一棵点数为 N 的树,以点 1 为根,且树点有边权.然后有 M 个 操作,分为三种: 操作 1 :把某个节点 x 的点权增加 a . 操作 2 :把某个节点 x ...