ON THE EVOLUTION OF MACHINE LEARNING: FROM LINEAR MODELS TO NEURAL NETWORKS

We recently interviewed Reza Zadeh (@Reza_Zadeh). Reza is a Consulting Professor in the Institute for Computational and Mathematical Engineering at Stanford University and a Technical Advisor to Databricks. His work focuses on Machine Learning Theory and Applications, Distributed Computing, and Discrete Applied Mathematics. More information on his website: http://reza-zadeh.com

Tell us a bit about your work at Stanford and Databricks

At Stanford, I designed and teach distributed algorithms and optimization (CME 323) as well as a course called discrete mathematics and algorithms (CME 305). In the discrete mathematics course, I teach algorithms from a completely theoretical perspective, meaning that it is not tied to any programming language or framework, and we fill up whiteboards with many theorems and their proofs.

On the more practical side, in the distributed algorithms class, we work with the Spark cluster programming environment. I spend at least half my time on Spark. So all the theory that I teach in regard to distributed algorithms and machine learning gets implemented and made concrete by Spark, and then put in the hands of thousands of industry and academic folks who use commodity clusters.

I started running MapReduce jobs at Google back in 2006, before Hadoop was really popular or even known; but MapReduce was already mature at Google. I was 18 at the time, and even then I could see clearly that this is something that the world needs outside of Google. So I spent a lot of time building and thinking about algorithms on top of MapReduce, and always worked to stay current, long after leaving Google. When Spark came along, it was nice that it was open-source and one could see its internals, and contribute to it. I felt like it was the right time to jump on board because the idea of an RDD was the right abstraction for much of distributed computing.

From your time at Google up to the present work you’re doing with Spark, you have had the chance to see some of the evolution of machine learning as it ties to distributed computing. Can you describe that evolution?

Machine learning has been through several transition periods starting in the mid 90’s. From 1995 – 2005, there was a lot of focus on natural language, search, and information retrieval. The machine learning tools were simpler than what we’re using today; they include things like logistic regression, SVMs (support vector machines), kernels with SVMs, and PageRank. Google became immensely successful using these technologies, building major success stories like Google News and the Gmail spam classifier using easy-to-distribute algorithms for ranking and text classification – using technologies that were already mature by the mid 90’s.

Then around 2005 neural networks started making a comeback. Neural networks are a technology from the 80’s –  some would even date them back to the 60’s – and they’ve become “retrocool” thanks to their important recent advances in computer vision. Computer vision makes very productive use of (convolutional) neural networks. As that fact has become better established, neural networks are making their way into other applications, creeping into areas like natural language processing and machine translation.

But there’s a problem: neural networks are probably the most challenging of all the mentioned models to distribute. Those earlier models have all had their training successfully distributed. We can use 100 machines and train a logistic regression or SVM without much hassle. But developing a distributed neural network learning setup has been more difficult.

So guess who’s done it successfully? The only organization so far is Google; they are the pioneers, yet again. It’s very much like the scene back in 2005 when Google published the MapReduce paper, and everyone scrambled to build the same infrastructure. Google managed to distribute neural networks, get more bang for their buck, and now everyone is wishing they were in the same situation. But they’re not.

Why is an SVM or logistic regression easier to distribute than a neural network?

First of all, evaluating an SVM is a lot easier. After you’ve learned an SVM model or logistic regression model — or any linear model — the actual evaluation is very fast. Say you built a spam classifier. A new email comes along; to classify it as spam or not it takes very little time, because it’s just one dot product (in linear algebra terms). When it comes to a neural network, you have to do a lot more computation — even after you have learned the model — to figure out the model’s output. And that’s not even the biggest problem.

A typical SVM might be happy with just a million parameters, but the smallest successful neural networks I’ve seen have around 6 million–and that’s the absolutely smallest. Another problem is that the training algorithms don’t benefit from much of optimization theory. Most of the linear models that we use have mathematical guarantees on when training is finished. They can guarantee when you have found the best model you’re going to find. But the optimization algorithms that exist for neural networks don’t afford such guarantees. You don’t know after you’ve trained a neural network whether, given your setup, this is the best model you could have found. So you’re left wondering if you would have a better model if you kept on training.

Do you think that level of optimization is theoretically possible, and we just haven’t gotten there yet?

The optimization algorithms that would be necessary touch on some of the biggest open problems in computer science, so I don’t think there’s much hope for machine learning researchers to try and find algorithms that are provably optimal for neural networks. They would have to really change neural networks in a drastic way. Neural networks are not difficult to optimize because of the shortcomings of machine learning researchers. It’s because optimization is very difficult; it hits that NP-hard problem that we have difficulty solving in all of computer science, not just machine learning.

Okay, so given those difficulties and constraints …

Those three things are the big problems for neural networks. We largely get past them by way of engineering and mathematical tricks, heuristics, adding more machines, and using new hardware (e.g., the use of GPUs).

Of those tricks, what do you think shows the most promise for being generally applicable across the industry?

GPUs have been tremendously successful in adding compute horsepower to neural networks, allowing us to get past many problems. As I said, you have to do a lot of computation to even evaluate a neural network. If you have a GPU, you can speed those computations many times over, because the same operations that go into drawing lots of pixels on your screen can be used to evaluate the neural network model very efficiently. That approach has been tremendously successful to the point where Nvidia —  a primarily hardware development company — has started building specialized software tools for people who do computer vision and deep learning for neural networks.

Nvidia’s big selling points are its graphics cards and its CUDA library. Researchers now buy the most expensive GPUs they can to take full advantage of Nvidia’s software. And new software packages are being built on top of what Nvidia offers.  Outside of Google, this is the biggest advance that we’ve seen — in that we can train these models with GPUs.

Meanwhile, Google is also talking about using GPUs. Jeff Dean gave a talk in myworkshop at the NIPS (Neural Information Processing Systems) Conference last December in which he mentioned that they are beginning to use GPUs. Up to now they have been CPU-bound, which means they need to use tens of machines to get results comparable to what others are getting with just one or two graphics cards.

As neural networks become more powerful, do you see them subsuming more and more of the work that used to be the bread and butter of linear methods?

I think so, yes. Actually that’s happening right now. There’s always this issue that linear models can only discriminate linearly. In order to get non-linearities involved, you would have to add or change features, which involves a lot of work. For example, computer vision scientists spent a decade developing and tuning these things called SIFT features, which enable image classification and other vision tasks using linear methods. But then neural networks came along and SIFT features became unnecessary; the neural network approach is to make features automatically as part of the training.

But I think it’s asking for too much to say neural networks can replace all feature construction techniques. I don’t think that will happen. There will always be a place for linear models and good human-driven feature engineering. Having said that, pretty much any researcher who has been to the NIPS Conference is beginning to evaluate neural networks for their application. Everyone is testing whether their application can benefit from the non-linearities that neural networks bring.

It’s not like we never had non-linear models before. We have had them–many of them. It’s just that the neural network model happens to be particularly powerful. It can really work for some applications, and so it’s worth trying. That’s what a lot of people are doing. And when they see successes, they write papers about them. So far, I’ve seen successes in speech recognition, in computer vision, and in machine translation. It is a very wide array of difficult tasks, so there is good reason to be excited.

Why is a neural network so powerful compared to the traditional linear and non-linear methods that have existed up until now?

When you have a linear model, every feature is either going to hurt or help whatever you are trying to score. That’s the assumption inherent in linear models. So the model might determine that if the feature is large, then it’s indicative of class 1; but if it’s small, it’s indicative of class 2. Even if you go all the way up to very large values of the feature, or down to very small values of the feature, you will never have a situation where you say, “In this interval, the feature is indicative of class 1; but in another interval it’s indicative of class 2.”

That’s too limited. Say you are analyzing images, looking for pictures of dogs. It might be that only a certain subset of a feature’s values indicate whether it is a picture of a dog, and the rest of the values for that pixel, or for that patch of an image, indicate another class. You can’t draw a line to define such a complex set of relationships. Non-linear models are much more powerful, but at the same time they’re much more difficult to train. Once again, you run into those hard problems from optimization theory. That’s why for a long while we thought that neural networks weren’t good enough, because they would over-fit, or they were too powerful.  We couldn’t do precise, guaranteed optimization on them. That’s why they (temporarily) vanished from the scene.

Neural networks were effectively dead for that period that I mentioned, from the mid-90’s to about 2005. The NIPS Conference had a decline (indeed a rapid decline) in papers that had the word “neural” in their titles. Ironically, if you had the word “neural” in your title at NIPS, you were less likely to have your paper accepted. And then we started figuring out slightly better algorithms to train these things and more data became available to make the training a little bit easier. Finally, neural networks were reborn as deep learning, and specifically used for computer vision, which is where the first big successes occurred.

Within neural network theory, there are multiple branches and approaches to computer learning. Can you summarize some of the key approaches?

By far the most successful approach has been a supervised approach where an older algorithm, called backpropagation, is used to build a neural network that has many different outputs.

Let’s look at a neural network construction that has become very popular, called Convolutional Neural Networks. The idea is that the machine learning researcher builds a model constructed of several layers, each of which handles connections from the previous layer in a different way.

In the first layer, you have a window that slides a patch across an image, which becomes the input for that layer. This is called a convolutional layer because the patch “convolves”, it overlaps with itself. Then several different types of layers follow. Each have different properties, and pretty much all of them introduce non-linearities.

The last layer has 10,000 potential neuron outputs; each one of those activations correspond to a particular label which identifies the image. The first class might be a cat; the second class might be a car; and so on for all the 10,000 classes that ImageNet has. If the first neuron is firing the most out of the 10,000 then the input is identified as belonging to the first class, a cat.

This kind of approach has been around for a long time, and it’s been very popular. Now we can train it with new training algorithms and accelerate it with GPUs. We can use libraries to build and mix and match different layers, so that we can see what kinds of neural networks work best. The drawback of the supervised approach is that you must apply labels to images while training. This is a car. This is a zoo. Etc.

Right. And the unsupervised approach?

A less popular approach involves “autoencoders”, which are unsupervised neural networks. Here the neural network is not used to classify the image, but to compress it. You read the image in the same way I just described, by identifying a patch and feeding the pixels into a convolutional layer. Several other layers then follow, including a middle layer which is very small compared to the others.  It has relatively few neurons. Basically you’re reading the image, going through a bottleneck, and then coming out the other side and trying to reconstruct the image.

No labels are required for this training, because all you are doing is putting the image at both ends of the neural network and training the network to make the image fit, especially in the middle layer. Once you do that, you are in possession of a neural network that knows how to compress images. And  it’s effectively giving you features that you can use in other classifiers. So if you have only a little bit of labeled training data, no problem — you always have a lot of images. Think of these images as non-labeled training data. You can use images to build an autoencoder, then from the autoencoder pull out features that are a good fit using a little bit of training data to find the neurons in your auto-encoded neural network that are susceptible to particular patterns.

This is what Google did. They effectively built an unsupervised network off of a large collection of images from the web; it helps that they have access to a big chunk of the world’s images. With this network, they were able to determine, for example, which neurons are susceptible to firing when a cat is in the image. And from that they now have a very good cat classifier. That’s a bit of a more “researchy” approach. You still need somelabeled data to make it work, and it would be pretty hard to do outside of Google, since it requires the scale to read a lot more images than most organizations can handle.

There are some other approaches as well. Machine translation is one example; although it also uses a supervised technique.

Of the two approaches I just described, the one that I’m primarily focused on is the first type, supervised learning via backpropagation. I believe this approach makes what you’re trying to learn clear, as opposed to having faith that an autoencoder will learn what you want. Plus, once you’re done training, the neural network becomes immediately useful. You don’t have to then find some auxiliary training data to identify which neurons are particularly sensitive to a particular class. So most of what I’m focusing on building on top of Spark leverages the supervised learning approach.

Speaking of Spark, can you provide a broad overview of the kinds of research that you published? Plus, what got you into to Spark? And where do you see that set of technologies heading?

Sure. I have published in distributed machine learning. And that obviously ties very nicely with Spark, because what I work on within Spark is the machine learning library. So whether it be writing code myself, reviewing code, or soliciting contributions from people who are experts in a particular machine learning area, I’m all over the board. My students spend time on this at Stanford. They spend time on the linear algebra primitives, because that’s something we’re particularly good at. I also work with folks in Berkeley; right now I’m trying to get some deep learning into Spark and neural networks, and that happens to be something that the folks in Berkeley and Stanford work on very actively.

As far as what got me into Spark — I’ve known Matei Zaharia, the creator of Spark, since we were both undergraduates at Waterloo. We would practice on the ACM programming team, back when I was a sophomore and he was a junior. And we actually interned at Google at the same time. He was working on developer productivity tools, completely unrelated to big data. He worked at Google and never touched MapReduce, which was my focus–kind of funny given where he ended up.

Then Matei went to Facebook, where he worked on Hadoop and became immensely successful. During that time, I kept thinking about distributing machine learning and none of the frameworks that were coming out — including Hadoop — looked exciting enough for me to build on top of because I knew from my time at Google what was really possible.

At the highest scale?

Yes, and I was underwhelmed by Hadoop. Not Hadoop as the ecosystem as it’s evolved to become these days, but by Hadoop MapReduce, in particular. I knew how to build machine learning on top of it, but I concluded that MapReduce presented an inefficient way to address those challenges, and so I wasn’t excited about trying that.

And then Spark came along. I first saw it in 2011, while I was still a student.  I was running a group meeting, lining up speakers at Stanford, and I invited Matei (while he was a student at Berkeley) to come tell us about Spark. This was long before it was a very well-known project. I got excited when I saw how easy it was to build machine learning algorithms that were fault tolerant and distributed using, say, eight lines of code. I was hooked. I realized there’s some future in this, and decided I was going to try and help him out with it.

Tell us a bit about what Spark is, how it works, and why it’s particularly useful for distributed machine learning.

Spark is a cluster computing environment that gives you a distributed vector that works similar to the vectors you’re used to programming with on a single machine. Spark provides a distributed vector. You can’t do everything you could with a regular vector; for example, you don’t have arbitrary random access via indices. But you can, for example, intersect two vectors; you can union; you can sort. You can do all kinds of things that you would expect from a regular vector. These distributed vectors are called RDDs (resilient distributed datasets).

One reason Spark makes machine learning easy is that it works by keeping some important parts of the data in memory as much as possible without writing to disk. In a distributed environment, a typical way to get fault resilience is to write to disk, to replicate a disk across the network three times using HDFS. An alternative way to get fault resilience is to remember how you built your data, which is what Spark does.

What makes this suitable for machine learning is that the data can come into memory and stay there. If it doesn’t fit in memory, that’s fine too. It will get paged on and off a disk as needed. But the point is while it can fit in memory, it will stay there. This benefits any process that will go through the data many times — and that’s most of machine learning. Almost every machine learning algorithm needs to go through the data tens, if not hundreds, of times.

So as memory technologies improve at the hardware level, Spark becomes proportionally more powerful.

Yes, memory is always getting cheaper, more available, and bigger. And that absolutely benefits Spark. Also SSDs (Solid State Drives) enable Spark, again because we do page to disk when data doesn’t fit in memory. If you have an SSD, that’s going to work much faster. More and more cluster computing environments are going to include SSDs. And SSDs are going to become faster themselves, so that will definitely be helpful.

Where do you see Spark vis-a-vis MapReduce? Is there a place for both of them for different kinds of workloads and jobs?

To be clear, Hadoop as an ecosystem is going thrive and be around for a long time. I don’t think the same is true for the MapReduce component of the Hadoop ecosystem.

With regard to MapReduce, to answer your question, no, I don’t think so. I honestly think that if you’re starting a new workload, it makes no sense to start in MapReduce unless you have an existing code base that you need to maintain. Other than that, there’s no reason. It’s kind of a silly thing to do MapReduce these days: it’s the difference between assembly and C++. It doesn’t make sense to write assembly code if you can write C++ code.

If you had to project out just a little bit, what do you see the biggest developments ahead on the timeline for Spark and the Spark ecosystem? And what are the things you’re most excited about?

Spark itself is pretty stable right now. The biggest changes and improvements that are happening right now and happening in the next couple years are in the libraries. The machine learning library, the graph processing library, the SQL library, and the streaming libraries are all being rapidly developed, and every single one of them has an exciting roadmap for the next two years at least. These are all features that I want, and it’s very nice to see that they can be easily implemented. I’m also excited about community-driven contributions that aren’t general enough to put into Spark itself, but that support Spark as a community-driven set of packages on http://spark-packages.org. I think those will also be very helpful to the long-tail of users.

Over time, I think Spark will become the de facto distribution engine on which we can build machine learning algorithms, especially at scale.

I really enjoyed this, Reza. Thanks so much for your time!

ON THE EVOLUTION OF MACHINE LEARNING: FROM LINEAR MODELS TO NEURAL NETWORKS的更多相关文章

  1. Machine Learning - 第5周(Neural Networks: Learning)

    The Neural Network is one of the most powerful learning algorithms (when a linear classifier doesn't ...

  2. Machine Learning - 第4周(Neural Networks: Representation)

    Neural networks is a model inspired by how the brain works. It is widely used today in many applicat ...

  3. Machine Learning #Lab1# Linear Regression

    Machine Learning Lab1 打算把Andrew Ng教授的#Machine Learning#相关的6个实验一一实现了贴出来- 预计时间长度战线会拉的比較长(毕竟JOS的7级浮屠还没搞 ...

  4. CheeseZH: Stanford University: Machine Learning Ex1:Linear Regression

    (1) How to comput the Cost function in Univirate/Multivariate Linear Regression; (2) How to comput t ...

  5. A review of learning in biologically plausible spiking neural networks

    郑重声明:原文参见标题,如有侵权,请联系作者,将会撤销发布! Contents: ABSTRACT 1. Introduction 2. Biological background 2.1. Spik ...

  6. machine learning (2)-linear regression with one variable

    machine learning- linear regression with one variable(2) Linear regression with one variable = univa ...

  7. 【Coursera - machine learning】 Linear regression with one variable-quiz

    Question 1 Consider the problem of predicting how well a student does in her second year of college/ ...

  8. machine learning (3)---Linear Algebra Review

    Matrix Vector Multiplication 左边的矩阵向量相乘法比右边的更简洁而且计算高效 Matrix Matrix Multiplication 可以同时计算12个结果(4个房子面积 ...

  9. [Machine Learning] Probabilistic Graphical Models:二、Bayes Network Fundamentals(1、Semantics & Factorization)

    一.How to construct the dependency? 1.首字母即随机变量名称 2.I->G是更加复杂的模型,但Bayes里不考虑,因为Bayes只是无环图. 3.CPD = c ...

随机推荐

  1. DWZ集成的xhEditor编辑器浏览本地图片上传的设置

    有关xhEditor的文件上传配置官方文档链接:http://i.hdu.edu.cn/dcp/dcp/comm/xheditor/demos/demo08.html 一.xhEditor图片上传的配 ...

  2. C语言 详解多级指针与指针类型的关系

    //V推论①:指针变量的步长只与‘指针变量的值’的类型有关(指针的值的类型 == 指针指向数据的类型) //指针类型跟指针的值有关,指针是占据4个字节大小的内存空间,但是指针的类型却是各不相同的 // ...

  3. C语言 文件操作4--文件结构体FILE的理解以及缓冲区再讲

    //文件结构体FILE的理解以及缓冲区再讲 #include<stdio.h> #include<stdlib.h> //要点:文件结构 //struct _iobuf { / ...

  4. XXOOJL

    她的鞋子放在外面没拿进来很显眼,我们俩正抱在床上,刚做完什么也没穿,她也没擦.听到门外的脚步声.赶紧穿.她在我房间里没出声,但我父亲肯定看出来了.于是问我:母亲去哪了,然后他去找她. 太尴尬了,那晚我 ...

  5. C# 版本的 计时器类:精确到微秒 秒后保留一位小数 支持年月日时分秒带单位的输出

    class TimeCount { // 临时变量,存放当前类能表示的最大年份值 ; /// <summary> /// 获取毫秒能表示的最大年份数 /// </summary> ...

  6. Code First开发系列之管理并发和事务(转)

    转自:http://www.cnblogs.com/farb/p/ConcurrencyAndTransctionManagement.html 返回<8天掌握EF的Code First开发&g ...

  7. 利用Maven管理工程项目本地启动报错及解决方案

    目前利用Maven工具来构建自己的项目已比较常见.今天主要不是介绍Maven工具,而是当你本地启动这样的服务时,如果遇到报错,该如何解决?下面只是参考的解决方案,具体的解法还是得看log的信息. 1. ...

  8. Anaroid WebView 的属性汇总

    1. 打开网页时不调用系统浏览器, 而是在本WebView中显示: mWebView.setWebViewClient(new WebViewClient(){ @Override public bo ...

  9. window8配置IIS,搭建应用程序网站。

    这个里面的一定要勾选,不然会有莫名其妙的错误. 在应用程序池中,如果你是64位电脑,需勾选启用32位应用程序为 true

  10. 最短路径算法之Dijkstra算法(java实现)

    前言 Dijkstra算法是最短路径算法中为人熟知的一种,是单起点全路径算法.该算法被称为是“贪心算法”的成功典范.本文接下来将尝试以最通俗的语言来介绍这个伟大的算法,并赋予java实现代码. 一.知 ...