In machine learning, is more data always better than better algorithms?
No. There are times when more data helps, and there are times when it doesn't.
Probably one of the most famous quotes defending the power of data is that of Google's Research Director Peter Norvig, who claimed that "We don't have better algorithms. We just have more data." This quote is usually linked to the article "The Unreasonable Effectiveness of Data", co-authored by Norvig himself (you should be able to find the PDF on the web, although the original is behind the IEEE paywall). The last nail in the coffin of better models is when Norvig is misquoted as saying that "All models are wrong, and you don't need them anyway" (Norvig has written his own clarification of how he was misquoted).
The effect that Norvig et al. were referring to in their article had already been captured years before in the famous paper by Microsoft researchers Banko and Brill [2001], "Scaling to Very Very Large Corpora for Natural Language Disambiguation". In that paper, the authors included the plot below.
That figure shows that, for the given problem, very different algorithms perform virtually the same. However, adding more examples (words) to the training set monotonically increases the accuracy of the model.
So, case closed, you might think. Well... not so fast. The reality is that both Norvig's assertions and Banko and Brill's paper are right... in a context. They are, however, now and again misquoted in contexts that are completely different from the original ones. To understand why, we need to get slightly technical. (I don't plan on giving a full machine learning tutorial in this post. If you don't understand what I explain below, read my answer to "How do I learn machine learning?")
Variance or Bias?
The basic idea is that there are two possible (and almost opposite) reasons a model might not perform well.
In the first case, we might have a model that is too complicated for the amount of data we have. This situation, known as high variance, leads to model overfitting. We know that we are facing a high variance issue when the training error is much lower than the test error. High variance problems can be addressed by reducing the number of features, and... yes, by increasing the number of data points. So, what kind of models were Banko and Brill, and Norvig, dealing with? Yes, you got it right: high variance. In both cases, the authors were working on language models in which roughly every word in the vocabulary makes a feature. These are models with many features compared to the number of training examples. Therefore, they are likely to overfit. And, yes, in this case adding more examples will help.
But, in the opposite case, we might have a model that is too simple to explain the data we have. In that case, known as high bias, adding more data will not help. See below a plot of a real production system at Netflix and its performance as we add more training examples.
So, no, more data does not always help. As we have just seen, there are cases in which adding more examples to our training set will not improve the model's performance.
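To make the high variance versus high bias contrast concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption made for illustration and is not taken from the papers or systems discussed above: the data is a noisy sine curve, the "too simple" model always predicts the training mean, and the "too complicated" model is a 1-nearest-neighbour lookup. The helper names (`make_data`, `fit_mean`, `fit_1nn`, `learning_curve`) are hypothetical.

```python
import math
import random


def make_data(n, rng, noise=0.1):
    """Draw n noisy samples of y = sin(x) on [0, 2*pi] (a toy problem)."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def fit_mean(xs, ys):
    """High-bias model: ignores x and always predicts the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_1nn(xs, ys):
    """High-variance model: returns the target of the nearest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def learning_curve(fit, sizes, seed=0):
    """Return (n, train_mse, test_mse) for each training-set size."""
    rng = random.Random(seed)
    test_x, test_y = make_data(500, rng)  # fixed held-out set
    rows = []
    for n in sizes:
        train_x, train_y = make_data(n, rng)
        model = fit(train_x, train_y)
        rows.append((n, mse(model, train_x, train_y), mse(model, test_x, test_y)))
    return rows


if __name__ == "__main__":
    models = [("mean (high bias)", fit_mean), ("1-NN (high variance)", fit_1nn)]
    for name, fit in models:
        for n, train_err, test_err in learning_curve(fit, [10, 100, 1000]):
            print(f"{name:22s} n={n:5d} train={train_err:.3f} test={test_err:.3f}")
```

In this setup the 1-NN model has zero training error but a clearly higher test error, which is the gap that signals high variance, and its test error should shrink as the training set grows. The mean predictor's train and test errors stay close to each other and roughly flat: more examples cannot fix a model that is too simple.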
More features to the rescue
If you are with me so far, and you have done your homework in understanding high variance and high bias problems, you might be thinking that I have deliberately left something out of the discussion. Yes, high bias models will not benefit from more training examples, but they might very well benefit from more features. So, in the end, it is all about adding "more" data, right? Well, again, it depends.
Let's take the Netflix Prize, for example. Pretty early on in the game, there was a blog post by serial entrepreneur and Stanford professor Anand Rajaraman commenting on the use of extra features to solve the problem. The post explains how a team of students got an improvement in prediction accuracy by adding content features from IMDB.
In retrospect, it is easy to criticize the post for making a gross over-generalization from a single data point. Even more, the follow-up post references SVD as one of the "complex" algorithms not worth trying because it limits the ability to scale up to a larger number of features. Clearly, Anand's students did not win the Netflix Prize, and they probably now realize that SVD did have a major role in the winning entry.
As a matter of fact, many teams showed later that adding content features from IMDB or the like to an optimized algorithm yielded little to no improvement. Some of the members of the Gravity team, one of the top contenders for the Prize, published a detailed paper in which they showed how those content-based features would add no improvement to the highly optimized collaborative filtering matrix factorization approach. The paper was entitled "Recommending New Movies: Even a Few Ratings Are More Valuable Than Metadata".
To be fair, the title of the paper is also an over-generalization. Content-based features (or different features in general) might be able to improve accuracy in many cases. But, you get my point again: More data does not always help.
Better Data != More Data (Added this section in response to a comment)
It is important to point out that, in my opinion, better data is always better. There is no arguing against that. So any effort you can direct towards "improving" your data is always well invested. The issue is that better data does not mean more data. As a matter of fact, sometimes it might mean less!
Think of data cleansing or outlier removal as one trivial illustration of my point. But, there are many other examples that are more subtle. For example, I have seen people invest a lot of effort in implementing distributed Matrix Factorization when the truth is that they could have probably gotten by with sampling their data and gotten to very similar results. In fact, doing some form of smart sampling on your population the right way (e.g. using stratified sampling) can get you to better results than if you used the whole unfiltered data set.
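As a rough illustration of the sampling idea, here is a minimal stratified sampling sketch. It is only one way to do it, under stated assumptions: the stratification key, the `stratified_sample` helper, and the toy ratings log are all hypothetical, and the right strata for a real system depend on the problem.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from every stratum, keeping at least one record each.

    Assumes 0 < fraction <= 1 so that no stratum is asked for more
    records than it contains.
    """
    rng = random.Random(seed)

    # Group the records by stratum.
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    # Draw without replacement inside each stratum.
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


if __name__ == "__main__":
    # Hypothetical ratings log of (user_id, genre); one genre is rare.
    ratings = [(i, "drama") for i in range(900)]
    ratings += [(i, "documentary") for i in range(900, 1000)]

    sample = stratified_sample(ratings, key=lambda r: r[1], fraction=0.1)
    print(Counter(genre for _, genre in sample))
```

Unlike a plain random sample, this keeps each stratum's share of the sample equal to its share of the full data set, so a rare group is not dropped or over-represented by chance.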
The End of the Scientific Method?
Of course, whenever there is a heated debate about a possible paradigm change, there are people like Malcolm Gladwell or Chris Anderson who make a living out of heating it even more (don't get me wrong, I am a fan of both, and have read most of their books). In this case, Anderson picked up on some of Norvig's comments, and misquoted them in an article entitled "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete".
The article gives several examples of how the abundance of data helps people and companies make decisions without even having to understand the meaning of the data itself. As Norvig himself points out in his rebuttal, Anderson has a few points right, but goes above and beyond to try to make them. The result is a set of false statements, starting with the title: the data deluge does not make the scientific method obsolete. I would argue it is rather the other way around.
Data Without a Sound Approach = Noise
So, am I trying to make the point that the Big Data revolution is only hype? No way. Having more data, both in terms of more examples and more features, is a blessing. The availability of data enables more and better insights and applications. More data indeed enables better approaches. More than that, it requires better approaches.
In summary, we should dismiss simplistic voices that proclaim the uselessness of theory or models, or the triumph of data over these. As much as data is needed, so are good models and theory that explains them. But, overall, what we need is good approaches that help us understand how to interpret data, models, and the limitations of both in order to produce the best possible output.
In other words, data is important. But, data without a sound approach becomes noise.
(Note: This answer is based on a post that I previously published on my blog: "More data or better models?")