11 Clever Methods of Overfitting and how to avoid them

Overfitting is the bane of Data Science in the age of Big Data. John Langford reviews "clever" methods of overfitting, including traditional, parameter tweak, brittle measures, bad statistics, human-loop overfitting, and gives suggestions and directions for avoiding overfitting.

By John Langford (Microsoft, Hunch.net)

(Gregory Piatetsky: I recently came across this classic 2005 post by John Langford, Clever Methods of Overfitting, which addresses one of the most critical issues in Data Science. The problem of overfitting is a major bane of Big Data, and the issues described below are perhaps even more relevant now than they were then. I have made several of these mistakes myself in the past. John agreed to repost it on KDnuggets, so enjoy, and please comment if you find new methods.)

“Overfitting” is traditionally defined as training some flexible representation so that it memorizes the data but fails to predict well in the future. For this post, I will define overfitting more generally as over-representing the performance of systems. There are two styles of general overfitting: over-representing performance on particular datasets and (implicitly) over-representing performance of a method on future datasets.

We should all be aware of these methods, avoid them where possible, and take them into account otherwise. I have used “reproblem” and “old datasets”, and may have participated in “overfitting by review”—some of these are very difficult to avoid.

1. Traditional overfitting: Train a complex predictor on too few examples (a minimal sketch follows the remedy list below).

Remedy:

  1. Hold out pristine examples for testing.
  2. Use a simpler predictor.
  3. Get more training examples.
  4. Integrate over many predictors.
  5. Reject papers which do this.
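
To make remedy 1 concrete, here is a minimal illustrative sketch (not from the original post), assuming scikit-learn and NumPy: a flexible predictor trained on a handful of examples with random labels reaches near-perfect training accuracy while held-out accuracy stays at chance. The dataset, model, and sizes are illustrative assumptions.

```python
# Minimal sketch of traditional overfitting and the held-out-test remedy.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 50))        # few examples, many features
y = rng.integers(0, 2, size=60)      # labels independent of X: nothing real to learn

# Remedy 1: hold out pristine examples the predictor never sees during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

model = DecisionTreeClassifier()     # complex, unregularized predictor
model.fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))  # close to 1.0: memorization
print("test accuracy: ", model.score(X_test, y_test))    # close to 0.5: chance level
```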

2. Parameter tweak overfitting: Use a learning algorithm with many parameters. Choose the parameters based on the test set performance.

For example, choosing the features so as to optimize test set performance can achieve this.

Remedy: same as above
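
A hedged sketch of that remedy in practice (not part of the original post): tune hyperparameters only with cross-validation inside the training data, and score the untouched test set exactly once at the end. The model, parameter grid, and data are illustrative assumptions.

```python
# Hyperparameter selection without touching the test set, assuming scikit-learn.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# All tweaking happens inside cross-validation on the training split only.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}, cv=5)
search.fit(X_train, y_train)

print("chosen parameters:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))   # evaluated once, never tuned on
```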

3. Brittle measure: Use a measure of performance which is especially brittle to overfitting.

Examples: “entropy”, “mutual information”, and leave-one-out cross-validation are all surprisingly brittle. This is particularly severe when used in conjunction with another approach.

Remedy: Prefer less brittle measures of performance.
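
To illustrate what "brittle" means here, a small sketch with made-up numbers (not from the original post): 0/1 error is bounded, so one bad prediction moves it by at most 1/m, while an unbounded loss such as log-loss, which underlies entropy and mutual-information style measures, can be dominated by a single confidently wrong prediction.

```python
# Bounded vs. unbounded loss on the same ten predictions, assuming scikit-learn.
import numpy as np
from sklearn.metrics import log_loss, zero_one_loss

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
p_good = np.array([0.9, 0.8, 0.9, 0.7, 0.2, 0.1, 0.3, 0.2, 0.8, 0.1])
p_bad = p_good.copy()
p_bad[0] = 1e-6                      # one confidently wrong probability

print("0/1 error:", zero_one_loss(y_true, (p_good > 0.5).astype(int)),
      "->", zero_one_loss(y_true, (p_bad > 0.5).astype(int)))   # moves by 1/10
print("log-loss :", round(log_loss(y_true, p_good), 2),
      "->", round(log_loss(y_true, p_bad), 2))                  # blows up
```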

4. Bad statistics: Misuse statistics to overstate confidences.

One common example is pretending that cross validation performance is drawn from an i.i.d. Gaussian, then using standard confidence intervals. Cross validation errors are not independent. Another standard method is to make known-false assumptions about some system and then derive excessive confidence.

Remedy: Don’t do this. Reject papers which do this.
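
As a hedged illustration of the point about cross-validation folds (not from the original post): the naive interval treats the K fold scores as i.i.d. Gaussian draws, but the folds share training data, so their errors are correlated and the interval tends to be too narrow. A binomial-style interval on a genuinely held-out test set is a safer default. The data and model below are illustrative assumptions.

```python
# Naive CV confidence interval vs. an interval from a single untouched test set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=2)

# Anti-pattern: pretend the 10 fold scores are independent Gaussian samples.
scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=10)
naive = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
print(f"naive CV interval:      {scores.mean():.3f} +/- {naive:.3f}")

# Safer: binomial-style interval on held-out examples the model never touched.
acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)
honest = 1.96 * np.sqrt(acc * (1 - acc) / len(y_test))
print(f"held-out test interval: {acc:.3f} +/- {honest:.3f}")
```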

5. Choice of measure: Choose the best of accuracy, error rate, (A)ROC, F1, percent improvement over the previous best, percent improvement in error rate, etc., for your method. For bonus points, use ambiguous graphs.

This is fairly common and tempting.

Remedy: Use canonical performance measures, for example the performance measure directly motivated by the problem.
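
To show why the temptation exists, a small sketch with synthetic data (not from the original post): the same predictions can look very different under accuracy, F1, and ROC AUC once classes are imbalanced, which is why the measure should be fixed by the problem before any results are computed. The data and threshold are illustrative assumptions.

```python
# One set of predictions, three headline numbers, assuming scikit-learn.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

rng = np.random.default_rng(3)
y_true = (rng.random(1000) < 0.1).astype(int)       # 10% positive class
scores = 0.2 * y_true + 0.6 * rng.random(1000)      # weakly informative scores
y_pred = (scores > 0.5).astype(int)

print("accuracy:", accuracy_score(y_true, y_pred))  # flattered by the majority class
print("F1:      ", f1_score(y_true, y_pred))        # much less flattering
print("ROC AUC: ", roc_auc_score(y_true, scores))
```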

6. Incomplete Prediction: Instead of (say) making a multiclass prediction, make a set of binary predictions, then compute the optimal multiclass prediction.

Sometimes it’s tempting to leave a gap filled in by a human when you don’t otherwise succeed.

Remedy: Reject papers which do this.

7. Human-loop overfitting: Use a human as part of a learning algorithm and don’t take into account overfitting by the entire human/computer interaction.

This is subtle and comes in many forms. One example is a human using a clustering algorithm (on training and test examples) to guide learning algorithm choice.

Remedy: Make sure test examples are not available to the human.
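
A hedged sketch of that remedy in code (not from the original post): split first, lock the test set away, and let every human-in-the-loop step, whether clustering, plotting, or model choice, see only the training split. The dataset and cluster count are illustrative assumptions.

```python
# Keep exploratory, human-guided analysis on the training split only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split before any human looks at the data; X_test stays untouched.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)

# The human may cluster, visualize, and choose algorithms using X_train only.
clusters = KMeans(n_clusters=3, n_init=10, random_state=4).fit_predict(X_train)
# ... inspect clusters against y_train, pick a method, tune it on X_train ...

# Only after every human decision is final does anything touch X_test / y_test.
```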

8. Data set selection: Choose to report results on some subset of datasets where your algorithm performs well.

The reason why we test on natural datasets is because we believe there is some structure captured by the past problems that helps on future problems. Data set selection subverts this and is very difficult to detect.

Remedy: Use comparisons on standard datasets. Select datasets without using the test set. Good contest performance can’t be faked this way.

9. Reprobleming: Alter the problem so that your performance improves.

Examples: take a time series dataset and use cross validation (a sketch of this case follows the remedy below), or ignore asymmetric false positive/false negative costs. This can be completely unintentional, for example when someone uses an ill-specified UCI dataset.

Remedy: Discount papers which do this. Make sure problem specifications are clear.
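
For the time-series example above, a hedged sketch (not from the original post) of how shuffled K-fold quietly changes the problem: shuffled folds let training data contain the future of the points being predicted, while TimeSeriesSplit keeps training strictly before evaluation, which is the forecasting problem actually posed. The series and model are illustrative assumptions.

```python
# Shuffled K-fold vs. time-ordered splits on a simple autoregressive setup.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(5)
t = np.arange(301)
series = np.sin(t / 20.0) + 0.1 * rng.normal(size=301)   # slowly drifting series

X = series[:-1].reshape(-1, 1)      # predict the next value from the current one
y = series[1:]

shuffled = cross_val_score(Ridge(), X, y, cv=KFold(n_splits=5, shuffle=True, random_state=5))
ordered = cross_val_score(Ridge(), X, y, cv=TimeSeriesSplit(n_splits=5))

# The two numbers answer different questions; only the ordered one matches forecasting.
print("shuffled K-fold R^2:", shuffled.mean())
print("time-ordered   R^2:", ordered.mean())
```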

10. Old datasets: Create an algorithm for the purpose of improving performance on old datasets.

After a dataset has been released, algorithms can be made to perform well on the dataset using a process of feedback design, indicating better performance than we might expect in the future. Some conferences have canonical datasets that have been used for a decade…

Remedy: Prefer simplicity in algorithm design. Weight newer datasets higher in consideration. Making test examples not publicly available for datasets slows the feedback design process but does not eliminate it.

11. Overfitting by review: 10 people submit a paper to a conference. The one with the best result is accepted.

This is a systemic problem which is very difficult to detect or eliminate. We want to prefer presentation of good results, but doing so can result in overfitting.

Remedy: 

  1. Be more pessimistic about confidence statements in papers at high-rejection-rate conferences.
  2. Some people have advocated allowing the publishing of methods with poor performance. (I have doubts this would work.)

I have personally observed all of these methods in action, and there are doubtless others.

Selected comments on John's post:

Negative results:

  • Aleks Jakulin: How about an index of negative results in machine learning? There are Journals of Negative Results in other domains (Ecology & Evolutionary Biology, Biomedicine), and there is the Journal of Articles in Support of the Null Hypothesis. A section on negative results in machine learning conferences? This kind of information is very useful in preventing people from taking pathways that lead nowhere: if one wants to classify an algorithm into good/bad, one certainly benefits from unexpectedly bad examples too, not just unexpectedly good examples.
  • John Langford: I visited the workshop on negative results at NIPS 2002. My impression was that it did not work well. The difficulty with negative results in machine learning is that they are too easy. For example, there are a plethora of ways to say that “learning is impossible (in the worst case)”. On the applied side, it’s still common for learning algorithms to not work on simple-seeming problems. In this situation, positive results (this works) are generally more valuable than negative results (this doesn’t work).

Brittle measures:

  • What do you mean by “brittle”? Why is mutual information brittle?
  • John Langford : What I mean by brittle: Suppose you have a box which takes some feature values as input and predicts some probability of label 1 as output. You are not allowed to open this box or determine how it works other than by this process of giving it inputs and observing outputs. 

    Let x be an input.
    Let y be an output.
    Assume (x,y) are drawn from a fixed but unknown distribution D.
    Let p(x) be a prediction.

    For classification error I(|y – p(x)| ≥ 0.5) you can prove a theorem of the rough form:
    for all D, with high probability over the draw of m examples independently from D, expected classification error rate of the box with respect to D is bounded by a function of the observations. 

    What I mean by “brittle” is that no statement of this sort can be made for any unbounded loss (including log-loss which is integral to mutual information and entropy). You can of course open up the box and analyze its structure or make extra assumptions about D to get a similar but inherently more limited analysis.

    The situation with leave-one-out cross validation is not so bad, but it’s still pretty bad. In particular, there exists a very simple learning algorithm/problem pair with the property that the leave-one-out estimate has the variance and deviations of a single coin flip. Yoshua Bengio and Yves Grandvalet in fact proved that there is no unbiased estimator of the variance of K-fold cross validation. Their paper also shows that for K-fold cross validation on m examples, all moments of the deviations might only be as good as on a test set of size m/K.

    I’m not sure what a ‘valid summary’ is, but leave-one-out cross validation cannot provide results I trust, because I know how to break it.

    I have personally observed people using leave-one-out cross validation with feature selection to quickly achieve a severe overfit.
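
A hedged reconstruction of that failure mode (not John's code, and the sizes and model are illustrative assumptions): on pure-noise data, selecting the features most correlated with the labels using all examples and then running leave-one-out cross-validation reports accuracy far above the 50% that is actually achievable; refitting the selection inside each fold removes the illusion.

```python
# Leave-one-out CV with feature selection done outside vs. inside the folds.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 1000))      # 50 examples, 1000 pure-noise features
y = rng.integers(0, 2, size=50)      # labels independent of the features

# Wrong: the selector sees every label before cross-validation starts.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=LeaveOneOut())

# Right: selection is refit on each training fold inside a pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=LeaveOneOut())

print("LOO accuracy, selection outside CV:", leaky.mean())   # wildly optimistic
print("LOO accuracy, selection inside CV: ", honest.mean())  # near chance (0.5)
```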
