Common Pitfalls In Machine Learning Projects

In a recent presentation, Ben Hamner described the common pitfalls in machine learning projects he and his colleagues have observed during competitions on Kaggle.

The talk was titled “Machine Learning Gremlins” and was presented in February 2014 at Strata.

In this post we take a look at the pitfalls from Ben’s talk, what they look like and how to avoid them.

Machine Learning Process

Early in the talk, Ben presented a snapshot of the process for working through a machine learning problem end-to-end.

Figure: Machine Learning Process, taken from “Machine Learning Gremlins” by Ben Hamner

This snapshot included 9 steps, as follows:

  1. Start with a business problem
  2. Source data
  3. Split data
  4. Select an evaluation metric
  5. Perform feature extraction
  6. Model training
  7. Feature selection
  8. Model selection
  9. Production system

He commented that the process is iterative rather than linear.

He also commented that each step in this process can go wrong, derailing the whole project.

Discriminating Dogs and Cats

Ben presented a case study: building an automatic cat door that lets the cat in and keeps the dog out. This was an instructive example, as it touched on a number of key problems in working through a data problem.

Figure: Discriminating Dogs and Cats, taken from “Machine Learning Gremlins” by Ben Hamner

Sample Size

The first great takeaway from this example was that he studied the accuracy of the model against the data sample size and showed that accuracy increased as more samples were added.

He then added more data until accuracy leveled off. This was a great example of how easy it can be to get an idea of the sensitivity of your system to sample size and adjust accordingly.
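
The idea of measuring accuracy against sample size can be sketched in a few lines. The example below is only an illustration and is not taken from the talk: the synthetic two-class data, the nearest-centroid classifier and the function names (make_samples, fit_centroids, learning_curve) are all assumptions chosen to keep the sketch self-contained. It trains on progressively larger samples and reports the mean accuracy on a fixed held-out set.

```python
import random


def make_samples(n, rng):
    """Draw n labelled 2-D points (synthetic data, an assumption for this sketch).

    Class 0 is centred at (0, 0) and class 1 at (2, 2); labels alternate so
    both classes are always present when n >= 2.
    """
    samples = []
    for i in range(n):
        label = i % 2
        centre = 2.0 * label
        point = (rng.gauss(centre, 1.5), rng.gauss(centre, 1.5))
        samples.append((point, label))
    return samples


def fit_centroids(train):
    """Nearest-centroid 'model': the mean point of each class."""
    centroids = {}
    for label in (0, 1):
        points = [p for p, y in train if y == label]
        centroids[label] = tuple(sum(c) / len(points) for c in zip(*points))
    return centroids


def predict(centroids, point):
    """Return the label whose centroid is closest to the point."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(point, centroids[label]))
    return min(centroids, key=squared_distance)


def accuracy(centroids, test):
    """Fraction of held-out points classified correctly."""
    hits = sum(predict(centroids, p) == y for p, y in test)
    return hits / len(test)


def learning_curve(sizes, n_test=2000, repeats=20, seed=0):
    """Mean held-out accuracy for each training size, averaged over repeats."""
    rng = random.Random(seed)
    test = make_samples(n_test, rng)  # fixed held-out set for every size
    curve = []
    for size in sizes:
        scores = [
            accuracy(fit_centroids(make_samples(size, rng)), test)
            for _ in range(repeats)
        ]
        curve.append((size, sum(scores) / repeats))
    return curve


if __name__ == "__main__":
    for size, acc in learning_curve([4, 16, 64, 256, 1024]):
        print(f"{size:5d} training samples -> mean held-out accuracy {acc:.3f}")
```

On a real problem you would swap in your own data and model; the point is only that once the curve flattens, collecting more samples of the same kind is unlikely to help much.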

Wrong Problem

The second great takeaway from this example was that the system failed: it let in all the cats in the neighborhood.

It was a clever example highlighting the importance of understanding the constraints of the problem that needs to be solved, rather than the problem that you want to solve.

Pitfalls In Machine Learning Projects

Ben went on to discuss four common pitfalls when working on machine learning problems.

Although these problems are common, he pointed out that they can be identified and addressed relatively easily.

Figure: Overfitting, taken from “Machine Learning Gremlins” by Ben Hamner

  • Data Leakage: Making use of data in the model to which a production system would not have access. This is particularly common in time series problems. It can also happen with data like system IDs that may indicate a class label. Run a model and take a careful look at the attributes that contribute to its success. Sanity check them and consider whether they make sense. (Check out the referenced paper “Leakage in Data Mining”, PDF.)
  • Overfitting: Modeling the training data so closely that the model also fits the noise. The result is a poor ability to generalize. This becomes more of a problem in higher dimensions with more complex class boundaries.
  • Data Sampling and Splitting: Related to data leakage, you need to be very careful that the train/test/validation sets are indeed independent samples. Much thought and work is required for time series problems to ensure that you can replay data to the system chronologically and validate model accuracy.
  • Data Quality: Check the consistency of your data. Ben gave an example of flight data where some aircraft were landing before taking off. Inconsistent, duplicate, and corrupt data needs to be identified and explicitly handled. It can directly hurt the modeling problem and the ability of a model to generalize.
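
Two of these pitfalls, data quality and splitting, lend themselves to a small sketch. The flight records, field layout and helper names below (RAW_FLIGHTS, clean, chronological_split) are hypothetical and made up for illustration; they are not from the talk. The sketch rejects duplicate records and flights that land before they take off, then splits the remaining records in time order so that the test set only contains data later than anything the model was trained on.

```python
from datetime import datetime

# Hypothetical flight records: (flight_id, takeoff, landing, delayed_label)
RAW_FLIGHTS = [
    ("F1", "2014-01-01 08:00", "2014-01-01 10:00", 0),
    ("F2", "2014-01-02 09:00", "2014-01-02 08:30", 1),  # lands before take-off
    ("F3", "2014-01-03 07:00", "2014-01-03 09:15", 0),
    ("F3", "2014-01-03 07:00", "2014-01-03 09:15", 0),  # exact duplicate
    ("F4", "2014-01-04 12:00", "2014-01-04 14:00", 1),
    ("F5", "2014-01-05 06:30", "2014-01-05 08:00", 0),
    ("F6", "2014-01-06 18:00", "2014-01-06 20:45", 1),
]

TIME_FORMAT = "%Y-%m-%d %H:%M"


def parse(rows):
    """Convert the timestamp strings into datetime objects."""
    return [
        (fid, datetime.strptime(t, TIME_FORMAT), datetime.strptime(l, TIME_FORMAT), y)
        for fid, t, l, y in rows
    ]


def clean(rows):
    """Drop exact duplicates and impossible records, reporting what was rejected."""
    seen, kept, rejected = set(), [], []
    for row in rows:
        flight_id, takeoff, landing, _ = row
        if row in seen:
            rejected.append((flight_id, "duplicate"))
        elif landing <= takeoff:
            rejected.append((flight_id, "lands before take-off"))
        else:
            kept.append(row)
        seen.add(row)
    return kept, rejected


def chronological_split(rows, n_test):
    """Train on the earliest records and test on the latest n_test; never shuffle."""
    ordered = sorted(rows, key=lambda r: r[1])  # order by take-off time
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


if __name__ == "__main__":
    kept, rejected = clean(parse(RAW_FLIGHTS))
    for flight_id, reason in rejected:
        print(f"rejected {flight_id}: {reason}")
    train, test = chronological_split(kept, n_test=2)
    print("train:", [r[0] for r in train])
    print("test: ", [r[0] for r in test])
```

Reporting the rejected records, rather than silently dropping them, keeps the handling explicit. Splitting by time rather than at random is one simple guard against leaking future information into training.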

Summary

Ben’s talk “Machine Learning Gremlins” is a quick and practical talk.

You will get a useful crash course in the common pitfalls we are all susceptible to when working on a data problem.
