Unsupervised learning, attention, and other mysteries

Get notified when our free report “Future of Machine Intelligence: Perspectives from Leading Practitioners” is available for download. The following interview is one of many that will be included in the report.

Ilya Sutskever is a research scientist at Google and the author of numerous publications on neural networks and related topics. Sutskever is a co-founder of DNNresearch and was named Canada’s first Google Fellow.

Key Takeaways:

Since humans can solve perception problems very quickly, despite our neurons being relatively slow, moderately deep and large neural networks have enabled machines to succeed in a similar fashion.
Unsupervised learning is still a mystery, but a full understanding of that domain has the potential to fundamentally transform the field of machine learning.
Attention models represent a promising direction for powerful learning algorithms that require ever less data to be successful on harder problems.

David Beyer: Let’s start with your background. What was the evolution of your interest in machine learning, and how did you zero-in on your Ph.D. work?

Ilya Sutskever: I started my Ph.D. just before deep learning became a thing. I was working on a number of different projects, mostly centered around neural networks. My understanding of the field crystallized when collaborating with James Martins on the Hessian-free optimizer. At the time, greedy layer-wise training (training one layer at a time) was extremely popular. Working on the Hessian-free optimizer helped me understand that if you just train a very large and deep neural network on a lot of data, you will almost necessarily succeed.

Taking a step back, when solving naturally occurring machine learning problems, you use some model. The fundamental question is whether you believe that this model can solve the problem for some setting of its parameters. If the answer is no, then the model will not get great results, no matter how good its learning algorithm. If the answer is yes, then it’s only a matter of getting the data and training it. And this is, in some sense, the primary question. Can the model represent a good solution to the problem?

There is a compelling argument that large, deep neural networks should be able to represent very good solutions to perception problems. It goes like this: human neurons are slow, and yet humans can solve perception problems extremely quickly and accurately. If humans can solve useful problems in a fraction of a second, then you should only need a very small number of massively-parallel steps in order to solve problems like vision and speech recognition. This is an old argument — I’ve seen a paper on this from the early 80s.

This suggests that if you train a large, deep neural network with 10 or 15 layers on something like vision, then you could basically solve it. Motivated by this belief, I worked with Alex Krizhevsky toward demonstrating it. Alex had written an extremely fast implementation of 2D convolutions on a GPU, at a time when few people knew how to code for GPUs. We were able to train neural networks larger than ever before and achieve much better results than anyone else at the time.

Nowadays, everybody knows that if you want to solve a problem, you just need to get a lot of data and train a big neural net. You might not solve it perfectly, but you can definitely solve it better than you could have possibly solved it without deep learning.

DB: Not to trivialize what you’re saying, but you say throw a lot of data at a highly parallel system, and you’ll basically figure out what you need?

IS: Yes, but: although the system is highly parallel, it is its sequential nature that gives you the power. It’s true we use parallel systems because that’s the only way to make it fast and large. But if you think of what depth represents — depth is the sequential part.

And if you look at our networks, you will see that each year they are getting deeper. It’s amazing to me that these very vague, intuitive arguments turned out to correspond to what is actually happening. Each year the networks that do best in vision are deeper than they were before. Now we have 25-layer computational steps, or even more, depending on how you count.

DB: What are the open problems, theoretically, in making deep learning as successful as it can be?

IS: The huge open problem would be to figure out how you can do more with less data. How do you make this method less data-hungry? How can you input the same amount of data, but better formed?

This ties in with the one of greatest open problems in machine learning — unsupervised learning. How do you even think about unsupervised learning? How do you benefit from it? Once our understanding improves and unsupervised learning advances, this is where we will acquire new ideas, and see a completely unimaginable explosion of new applications.

DB: What’s our current understanding of unsupervised learning? And how is it limited in your view?

IS: Unsupervised learning is mysterious. Compare it to supervised learning. We know why supervised learning works. You have a big model, and you’re using a lot of data to define the cost — the training error — which you minimize. If you have a lot of data, your training error will be close to your test error. Eventually, you get to a low test error, which is what you wanted from the start.

But I can’t even articulate what it is we want from unsupervised learning. You want something; you want the model to understand, whatever that means. Although we currently understand very little about unsupervised learning, I am also convinced that the explanation is right under our noses.

DB: Are you aware of any promising avenues that people are exploring toward a deeper, conceptual understanding of why unsupervised learning does what it does?

IS: There are plenty of people trying various ideas, mostly related to density modeling or generative models. If you ask any practitioner how to solve a particular problem, they will tell you to get the data and apply supervised learning. There is not yet an important application where unsupervised learning makes a profound difference.

DB: Do we have any sense of what success means? Even a rough measure of how well an unsupervised model performs?

IS: Unsupervised learning is always a means for some other end. In supervised learning, the learning itself is what you care about. You’ve got your cost function, which you want to minimize. In unsupervised learning, the goal is always to help some other task, like classification or categorization. For example, I might ask a computer system to passively watch a lot of YouTube videos (so unsupervised learning happens here), then ask it to recognize objects with great accuracy (that’s the final supervised learning task).

Successful unsupervised learning enables the subsequent supervised learning algorithm to recognize objects with accuracy that would not be possible without the use of unsupervised learning. It’s a very measurable, very visible notion of success. And we haven’t achieved it yet.

DB: What are some other areas where you see exciting progress?

IS: A general direction that I believe to be extremely important is: are learning models capable of more sequential computations? I mentioned how I think that deep learning is successful because it can do more sequential computations than previous (“shallow”) models. And so models that can do even more sequential computation should be even more successful because they are able to express more intricate algorithms. It’s like allowing your parallel computer to run for more steps. We already see the beginning of this, in the form of attention models.

DB: And how do attention models differ from the current approach?

IS: In the current approach, you take your input vector and give it to the neural network. The neural network runs it, applies several processing stages to it, and then gets an output. In an attention model, you have a neural network, but you run the neural network for much longer. There is a mechanism in the neural network, which decides which part of the input it wants to “look” at. Normally, if the input is very large, you need a large neural network to process it. But if you have an attention model, you can decide on the best size of the neural network, independent of the size of the input.

DB: So then, how do you decide where to focus this attention in the network?

IS: Say you have a sentence, a sequence of, say, 100 words. The attention model will issue a query on the input sentence and create a distribution over the input words, such that a word that is more similar to the query will have higher probability, and words that are less similar to the query will have lower probability. Then you take the weighted average of them. Since every step is differentiable, we can train the attention model where to look with backpropagation, which is the reason for its appeal and success.

DB: What kind of changes do you need to make to the framework itself? What new code do you need to insert this notion of attention?

IS: Well, the great thing about attention, at least differentiable attention, is that you don’t need to insert any new code to the framework. As long as your framework supports element-wise multiplication of matrices or vectors, and exponentials, that’s all you need.

DB: So, attention models address the question you asked earlier: how do we make better use of existing power with less data?

IS: That’s basically correct. There are many reasons to be excited about attention. One of them is that attention models simply work better, allowing us to achieve better results with less data. Also, bear in mind that humans clearly have attention. It is something that enables us to get results. It’s not just an academic concept. If you imagine a really smart system, surely, it, too, will have attention.

DB: What are some of the key issues around attention?

IS: Differentiable attention is computationally expensive because it requires accessing your entire input at each step of the model’s operation. And this is fine when the input is a sentence that’s only, say, 100 words, but it’s not practical when the input is a 10,000-word document. So, one of the main issues is speed. Attention should be fast, but differentiable attention is not fast. Reinforcement learning of attention is potentially faster, but training attentional control using reinforcement learning over thousands of objects would be non-trivial.

DB: Is there an analog, in the brain, as far as we know, for unsupervised learning?

IS: The brain is a great source of inspiration if looked at correctly. The question of whether the brain does unsupervised learning or not, depends to some extent on what you consider to be unsupervised learning. In my opinion, the answer is unquestionably yes. Look at how people behave, and notice that people are not really using supervised learning at all. Humans never use any supervision of any kind. You start reading a book, and you understand it, and all of a sudden you can do new things that you couldn’t do before. Consider a child, sitting in class. It’s not like the student is given a lot of input/output examples. The supervision is extremely indirect; so, there’s necessarily a lot of unsupervised learning going on.

DB: Your work was inspired by the human brain and its power. How far does the neuroscientific understanding of the brain extend into the realm of theorizing and applying machine learning?

IS: There is a lot of value of looking at the brain, but it has to be done carefully, and at the right level of abstraction. For example, our neural networks have units that have connections between them, and the idea of using slow interconnected processors was directly inspired by the brain. But it is a faint analogy.

Neural networks are designed to be computationally efficient in software implementations rather than biologically plausible. But the overall idea was inspired by the brain, and was successful. For example, convolutional neural networks echo our understanding that neurons in the visual cortex have very localized perceptive fields. This is something that was known about the brain, and this information has been successfully carried over to our models. Overall, I think there is value in studying the brain if done carefully and responsibly.

Public domain image on article and category pages via the Google Art Project on Wikimedia Commons.

Unsupervised learning, attention, and other mysteries的更多相关文章

Machine Learning Algorithms Study Notes(4)—无监督学习（unsupervised learning）
1 Unsupervised Learning 1.1 k-means clustering algorithm 1.1.1 算法思想 1.1.2 k-means的不足之处 1 ...
Unsupervised Learning: Use Cases
Unsupervised Learning: Use Cases Contents Visualization K-Means Clustering Transfer Learning K-Neare ...
Unsupervised Learning and Text Mining of Emotion Terms Using R
Unsupervised learning refers to data science approaches that involve learning without a prior knowle ...
Supervised Learning and Unsupervised Learning
Supervised Learning In supervised learning, we are given a data set and already know what our correc ...
Unsupervised learning无监督学习
Unsupervised learning allows us to approach problems with little or no idea what our results should ...
PredNet --- Deep Predictive coding networks for video prediction and unsupervised learning --- 论文笔记
PredNet --- Deep Predictive coding networks for video prediction and unsupervised learning ICLR 20 ...
131.005 Unsupervised Learning - Cluster | 非监督学习 - 聚类
@(131 - Machine Learning | 机器学习) 零. Goal How Unsupervised Learning fills in that model gap from the ...
Coursera 机器学习第8章（上） Unsupervised Learning 学习笔记
8 Unsupervised Learning8.1 Clustering8.1.1 Unsupervised Learning: Introduction集群(聚类)的概念.什么是无监督学习:对于无 ...
无监督学习(Unsupervised Learning)
无监督学习(Unsupervised Learning) 聚类无监督学习特点只给出了样本, 但是没有提供标签通过无监督学习算法给出的样本分成几个族(cluster), 分出来的类别不是我们自己规 ...

随机推荐

SSH 框架的心得
使用SSH框架做完了一个普通网站的前后台项目,成热写点心得,免得以后再入坑.其中使用 Strust2 2.3.33 + Spring 4.3.9 + Hibernate 5.2.10 eclipse ...
Web站点性能-微观手段
文章:网站性能优化百度百科:高性能Web站点文章:构建高性能WEB站点之吞吐率.吞吐量.TPS.性能测试
Alpha阶段敏捷冲刺 DAY5
一.举行站立式例会 1.今天我们利用晚上的时间开展了站立会议,总结了一下之前工作的问题,并且制定了明天的计划. 2.站立式会议照片二.团队报告 1.昨日已完成的工作 (1)改进了程序算法 (2)优化 ...
css3 flex属性flex-grow、flex-shrink、flex-basis学习笔记
最近在研究css3的flex.遇到的flex:1;这一块,很是很纠结,flex-grow.flex-shrink.flex-basis始终搞不清,最经搜集了大量的介绍,应该能算是明白了.网上大部分解释 ...
p2 关节
P2中使用Constraint及其子类表示关节,也就是将两个刚体按照指定的规则约束在一起,形成有规律的.相互限制的运动模拟.P2关节模拟中,两个刚体没有通过任何刚体连接,只是通过算法模拟出关节运动轨迹 ...
MySQL存储引擎InnoDB与Myisam
InnoDB与Myisam的六大区别 InnoDB与Myisam的六大区别 MyISAM InnoDB 构成上的区别: 每个MyISAM在磁盘上存储成三个文件.第一个文件的名字以表的名字开始,扩展名 ...
lock 默认公平锁还是非公平锁？公平锁是如何定义？如何实现
ReentrantLock的实现是基于其内部类FairSync(公平锁)和NonFairSync(非公平锁)实现的. 其可重入性是基于Thread.currentThread()实现的: 如果当前线程 ...
Java线程间怎么实现同步
1.Object#wait(), Object#notify()让两个线程依次执行 /** * 类AlternatePrintDemo.java的实现描述:交替打印 */ class NumberPr ...
jmeter 兼容bug 记录一笔
这个问题我也遇到过,然后网上搜到了这篇文章! 先说下问题: 我在做性能测试时,使用JMeter搞了100个并发,以100TPS的压力压测十分钟,但压力一直出现波动,而且出现波动时JMeter十分卡,如 ...
HDU4678_Mine
很有意思,很好的题目. 这样的,一个n*m的扫雷地图,告诉你哪些地方是有雷的.一个人如果点在了空白处,那么与其相邻的(八个方向)的数字以及空白都会递归地显示出来,如果点在数字上面,那么就只会显示这一个 ...

Unsupervised learning, attention, and other mysteries

Unsupervised learning, attention, and other mysteries

Unsupervised learning, attention, and other mysteries的更多相关文章

随机推荐

热门专题