Unsupervised learning, attention, and other mysteries

The following interview is one of many that will be included in the free report “Future of Machine Intelligence: Perspectives from Leading Practitioners.”

Ilya Sutskever is a research scientist at Google and the author of numerous publications on neural networks and related topics. Sutskever is a co-founder of DNNresearch and was named Canada’s first Google Fellow.

Key Takeaways:

  1. Since humans can solve perception problems very quickly, despite our neurons being relatively slow, moderately deep and large neural networks have enabled machines to succeed in a similar fashion.
  2. Unsupervised learning is still a mystery, but a full understanding of that domain has the potential to fundamentally transform the field of machine learning.
  3. Attention models represent a promising direction for powerful learning algorithms that require ever less data to be successful on harder problems.

David Beyer: Let’s start with your background. What was the evolution of your interest in machine learning, and how did you zero in on your Ph.D. work?

Ilya Sutskever: I started my Ph.D. just before deep learning became a thing. I was working on a number of different projects, mostly centered around neural networks. My understanding of the field crystallized when collaborating with James Martens on the Hessian-free optimizer. At the time, greedy layer-wise training (training one layer at a time) was extremely popular. Working on the Hessian-free optimizer helped me understand that if you just train a very large and deep neural network on a lot of data, you will almost necessarily succeed.

Taking a step back, when solving naturally occurring machine learning problems, you use some model. The fundamental question is whether you believe that this model can solve the problem for some setting of its parameters. If the answer is no, then the model will not get great results, no matter how good its learning algorithm. If the answer is yes, then it’s only a matter of getting the data and training it. And this is, in some sense, the primary question. Can the model represent a good solution to the problem?

There is a compelling argument that large, deep neural networks should be able to represent very good solutions to perception problems. It goes like this: human neurons are slow, and yet humans can solve perception problems extremely quickly and accurately. If humans can solve useful problems in a fraction of a second, then you should only need a very small number of massively-parallel steps in order to solve problems like vision and speech recognition. This is an old argument — I’ve seen a paper on this from the early 80s.

This suggests that if you train a large, deep neural network with 10 or 15 layers on something like vision, then you could basically solve it. Motivated by this belief, I worked with Alex Krizhevsky toward demonstrating it. Alex had written an extremely fast implementation of 2D convolutions on a GPU, at a time when few people knew how to code for GPUs. We were able to train neural networks larger than ever before and achieve much better results than anyone else at the time.

Nowadays, everybody knows that if you want to solve a problem, you just need to get a lot of data and train a big neural net. You might not solve it perfectly, but you can definitely solve it better than you could have possibly solved it without deep learning.

DB: Not to trivialize what you’re saying, but you say throw a lot of data at a highly parallel system, and you’ll basically figure out what you need?

IS: Yes, but: although the system is highly parallel, it is its sequential nature that gives you the power. It’s true we use parallel systems because that’s the only way to make it fast and large. But if you think of what depth represents — depth is the sequential part.

And if you look at our networks, you will see that each year they are getting deeper. It’s amazing to me that these very vague, intuitive arguments turned out to correspond to what is actually happening. Each year the networks that do best in vision are deeper than they were before. Now we have networks with 25 layers of computational steps, or even more, depending on how you count.

DB: What are the open problems, theoretically, in making deep learning as successful as it can be?

IS: The huge open problem would be to figure out how you can do more with less data. How do you make this method less data-hungry? How can you input the same amount of data, but better formed?

This ties in with one of the greatest open problems in machine learning: unsupervised learning. How do you even think about unsupervised learning? How do you benefit from it? Once our understanding improves and unsupervised learning advances, this is where we will acquire new ideas, and see a completely unimaginable explosion of new applications.

DB: What’s our current understanding of unsupervised learning? And how is it limited in your view?

IS: Unsupervised learning is mysterious. Compare it to supervised learning. We know why supervised learning works. You have a big model, and you’re using a lot of data to define the cost — the training error — which you minimize. If you have a lot of data, your training error will be close to your test error. Eventually, you get to a low test error, which is what you wanted from the start.

But I can’t even articulate what it is we want from unsupervised learning. You want something; you want the model to understand, whatever that means. Although we currently understand very little about unsupervised learning, I am also convinced that the explanation is right under our noses.
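Editor's sketch: the supervised recipe described above (define a cost from labeled data, minimize it, and rely on the training error tracking the test error once data is plentiful) can be illustrated on a toy problem. Everything here is an assumption made for illustration: the noisy linear task, the helper names `make_data`, `mse`, and `train_and_score`, and the sample sizes. It is not the setup discussed in the interview.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, true_w, noise_std=0.5):
    """Draw n labeled examples from a hypothetical noisy linear task."""
    x = rng.normal(size=(n, true_w.size))
    y = x @ true_w + rng.normal(scale=noise_std, size=n)
    return x, y

def mse(w, x, y):
    """Mean squared error: the cost that training minimizes."""
    return float(np.mean((x @ w - y) ** 2))

true_w = rng.normal(size=10)
x_test, y_test = make_data(5000, true_w)  # held-out data, never used for fitting

def train_and_score(n_train):
    """Fit on n_train examples, then report (training error, test error)."""
    x_train, y_train = make_data(n_train, true_w)
    # Least squares minimizes the training error exactly for a linear model.
    w, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)
    return mse(w, x_train, y_train), mse(w, x_test, y_test)

small_train, small_test = train_and_score(15)
large_train, large_test = train_and_score(5000)
print(f"15 examples:   train={small_train:.3f} test={small_test:.3f}")
print(f"5000 examples: train={large_train:.3f} test={large_test:.3f}")
```

With only 15 examples the two errors typically diverge; with 5,000 they should land close together, which is the sense in which a lot of data makes the training error a good proxy for the test error.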

DB: Are you aware of any promising avenues that people are exploring toward a deeper, conceptual understanding of why unsupervised learning does what it does?

IS: There are plenty of people trying various ideas, mostly related to density modeling or generative models. If you ask any practitioner how to solve a particular problem, they will tell you to get the data and apply supervised learning. There is not yet an important application where unsupervised learning makes a profound difference.

DB: Do we have any sense of what success means? Even a rough measure of how well an unsupervised model performs?

IS: Unsupervised learning is always a means for some other end. In supervised learning, the learning itself is what you care about. You’ve got your cost function, which you want to minimize. In unsupervised learning, the goal is always to help some other task, like classification or categorization. For example, I might ask a computer system to passively watch a lot of YouTube videos (so unsupervised learning happens here), then ask it to recognize objects with great accuracy (that’s the final supervised learning task).

Successful unsupervised learning enables the subsequent supervised learning algorithm to recognize objects with accuracy that would not be possible without the use of unsupervised learning. It’s a very measurable, very visible notion of success. And we haven’t achieved it yet.

DB: What are some other areas where you see exciting progress?

IS: A general direction that I believe to be extremely important is: are learning models capable of more sequential computations? I mentioned how I think that deep learning is successful because it can do more sequential computations than previous (“shallow”) models. And so models that can do even more sequential computation should be even more successful because they are able to express more intricate algorithms. It’s like allowing your parallel computer to run for more steps. We already see the beginning of this, in the form of attention models.

DB: And how do attention models differ from the current approach?

IS: In the current approach, you take your input vector and give it to the neural network. The neural network runs it, applies several processing stages to it, and then gets an output. In an attention model, you have a neural network, but you run the neural network for much longer. There is a mechanism in the neural network, which decides which part of the input it wants to “look” at. Normally, if the input is very large, you need a large neural network to process it. But if you have an attention model, you can decide on the best size of the neural network, independent of the size of the input.

DB: So then, how do you decide where to focus this attention in the network?

IS: Say you have a sentence, a sequence of, say, 100 words. The attention model will issue a query on the input sentence and create a distribution over the input words, such that a word that is more similar to the query will have higher probability, and words that are less similar to the query will have lower probability. Then you take the weighted average of them. Since every step is differentiable, we can train the attention model where to look with backpropagation, which is the reason for its appeal and success.
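Editor's sketch: the query, distribution, and weighted average described above can be written in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the model from the interview: the dot-product similarity, the unit-length 16-dimensional word vectors, and the names `softmax` and `attend` are all hypothetical choices.

```python
import numpy as np

def softmax(scores):
    """Turn raw similarity scores into a probability distribution."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

def attend(query, word_vectors):
    """Return (weights, summary) for one query over a sequence of word vectors.

    query:        shape (d,)
    word_vectors: shape (num_words, d)
    """
    # Similarity of each word to the query. A dot product is one common
    # choice; the interview does not say which similarity is used.
    scores = word_vectors @ query
    weights = softmax(scores)         # distribution over the input words
    summary = weights @ word_vectors  # weighted average of the word vectors
    return weights, summary

rng = np.random.default_rng(0)
# Hypothetical 100-word sentence with unit-length 16-dimensional embeddings.
sentence = rng.normal(size=(100, 16))
sentence /= np.linalg.norm(sentence, axis=1, keepdims=True)
query = 5.0 * sentence[42]  # a query deliberately aligned with word 42

weights, summary = attend(query, sentence)
print(weights.shape, summary.shape)  # (100,) (16,)
print(int(np.argmax(weights)))       # 42: the word most similar to the query
```

Every operation here (multiplication, sum, exponential, division) is differentiable, so gradients can flow back into whatever produced the query. Note also that `attend` accepts a sequence of any length while the summary keeps a fixed size.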

DB: What kind of changes do you need to make to the framework itself? What new code do you need to insert this notion of attention?

IS: Well, the great thing about attention, at least differentiable attention, is that you don’t need to insert any new code to the framework. As long as your framework supports element-wise multiplication of matrices or vectors, and exponentials, that’s all you need.

DB: So, attention models address the question you asked earlier: how do we make better use of existing power with less data?

IS: That’s basically correct. There are many reasons to be excited about attention. One of them is that attention models simply work better, allowing us to achieve better results with less data. Also, bear in mind that humans clearly have attention. It is something that enables us to get results. It’s not just an academic concept. If you imagine a really smart system, surely, it, too, will have attention.

DB: What are some of the key issues around attention?

IS: Differentiable attention is computationally expensive because it requires accessing your entire input at each step of the model’s operation. And this is fine when the input is a sentence that’s only, say, 100 words, but it’s not practical when the input is a 10,000-word document. So, one of the main issues is speed. Attention should be fast, but differentiable attention is not fast. Reinforcement learning of attention is potentially faster, but training attentional control using reinforcement learning over thousands of objects would be non-trivial.
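Editor's sketch: a rough count shows why scanning the whole input at every step becomes costly. The assumption that the model takes one attention step per input word is made purely for illustration, and `attention_reads` is a hypothetical helper, not a measurement of any real system.

```python
def attention_reads(input_length, num_steps):
    """Count input positions touched when every step scans the whole input."""
    return input_length * num_steps

# Hypothetical workloads: one attention step per input word.
sentence_cost = attention_reads(input_length=100, num_steps=100)
document_cost = attention_reads(input_length=10_000, num_steps=10_000)

print(sentence_cost)                   # 10000
print(document_cost)                   # 100000000
print(document_cost // sentence_cost)  # 10000
```

Under that assumption, a document 100 times longer costs 10,000 times more reads, which is why differentiable attention that is fine for a sentence becomes impractical for a long document.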

DB: Is there an analog, in the brain, as far as we know, for unsupervised learning?

IS: The brain is a great source of inspiration if looked at correctly. The question of whether the brain does unsupervised learning depends to some extent on what you consider to be unsupervised learning. In my opinion, the answer is unquestionably yes. Look at how people behave, and notice that people are not really using supervised learning at all. Humans never use any supervision of any kind. You start reading a book, and you understand it, and all of a sudden you can do new things that you couldn’t do before. Consider a child, sitting in class. It’s not like the student is given a lot of input/output examples. The supervision is extremely indirect; so, there’s necessarily a lot of unsupervised learning going on.

DB: Your work was inspired by the human brain and its power. How far does the neuroscientific understanding of the brain extend into the realm of theorizing and applying machine learning?

IS: There is a lot of value in looking at the brain, but it has to be done carefully, and at the right level of abstraction. For example, our neural networks have units that have connections between them, and the idea of using slow interconnected processors was directly inspired by the brain. But it is a faint analogy.

Neural networks are designed to be computationally efficient in software implementations rather than biologically plausible. But the overall idea was inspired by the brain, and was successful. For example, convolutional neural networks echo our understanding that neurons in the visual cortex have very localized receptive fields. This is something that was known about the brain, and this information has been successfully carried over to our models. Overall, I think there is value in studying the brain if done carefully and responsibly.

