Common Pitfalls In Machine Learning Projects
In a recent presentation, Ben Hamner described the common pitfalls in machine learning projects he and his colleagues have observed during competitions on Kaggle.
The talk was titled “Machine Learning Gremlins” and was presented in February 2014 at Strata.
In this post we take a look at the pitfalls from Ben’s talk, what they look like and how to avoid them.
Machine Learning Process
Early in the talk, Ben presented a snapshot of the process for working a machine learning problem end-to-end.
Figure: Machine Learning Process (taken from “Machine Learning Gremlins” by Ben Hamner)
This snapshot included 9 steps, as follows:
- Start with a business problem
- Source data
- Split data
- Select an evaluation metric
- Perform feature extraction
- Model Training
- Feature Selection
- Model Selection
- Production System
He commented that the process is iterative rather than linear.
He also commented that each step in this process can go wrong, derailing the whole project.
Discriminating Dogs and Cats
Ben presented a case study problem for building an automatic cat door that can let the cat in and keep the dog out. This was an instructive example as it touched on a number of key problems in working a data problem.
Figure: Discriminating Dogs and Cats (taken from “Machine Learning Gremlins” by Ben Hamner)
Sample Size
The first great takeaway from this example was that he measured the accuracy of the model against the size of the data sample and showed that more samples correlated with greater accuracy.
He then added more data until accuracy leveled off. This was a great example of how easy it can be to get an idea of the sensitivity of your system to sample size and adjust accordingly.
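The sample-size check described above can be sketched in a few lines. This is a minimal illustration rather than Ben's actual experiment: the data is synthetic (two overlapping one-dimensional Gaussian classes), the classifier is a simple nearest-centroid rule, and all function names are assumptions made for this example.

```python
import random


def make_samples(n, rng):
    """Draw n labeled points from two overlapping 1-D Gaussian classes.

    Synthetic stand-in data: class 0 is centered at 0.0, class 1 at 3.0.
    Labels alternate so any prefix of length >= 2 contains both classes.
    """
    samples = []
    for i in range(n):
        label = i % 2
        samples.append((rng.gauss(3.0 * label, 1.0), label))
    return samples


def fit_centroids(train):
    """Nearest-centroid 'model': the mean of each class."""
    means = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        means[label] = sum(values) / len(values)
    return means


def predict(means, x):
    """Assign x to the class whose mean is closest."""
    return min(means, key=lambda label: abs(x - means[label]))


def accuracy(means, data):
    """Fraction of points in data that are classified correctly."""
    return sum(predict(means, x) == y for x, y in data) / len(data)


def learning_curve(sizes, seed=0):
    """Train on growing prefixes of one pool and score on a fixed held-out set."""
    rng = random.Random(seed)
    pool = make_samples(max(sizes), rng)
    test = make_samples(2000, rng)
    return [(n, accuracy(fit_centroids(pool[:n]), test)) for n in sizes]


curve = learning_curve([2, 4, 8, 16, 32, 64, 128, 256, 512, 1024])
for n, acc in curve:
    print(f"{n:5d} samples -> accuracy {acc:.3f}")
```

Reading the printed table from top to bottom shows where accuracy stops improving, which is the point at which collecting more data of the same kind is unlikely to help.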
Wrong Problem
The second great takeaway from this example was that the system failed: it let in every cat in the neighborhood.
It was a clever example highlighting the importance of understanding the constraints of the problem that needs to be solved, rather than the problem that you want to solve.
Pitfalls In Machine Learning Projects
Ben went on to discuss four common pitfalls when working on machine learning problems.
Although these problems are common, he points out that they can be identified and addressed relatively easily.
Figure: Overfitting (taken from “Machine Learning Gremlins” by Ben Hamner)
- Data Leakage: Making use of data in the model that a production system would not have access to. This is particularly common in time series problems. It can also happen with data such as system IDs that may indicate a class label. Run a model and take a careful look at the attributes that contribute to its success. Sanity-check them and consider whether they make sense. (Check out the referenced paper “Leakage in Data Mining”, PDF.)
- Overfitting: Modeling the training data so closely that the model also captures the noise in it. The result is a poor ability to generalize. This becomes more of a problem in higher dimensions with more complex class boundaries.
- Data Sampling and Splitting: Related to data leakage, you need to be very careful that the train/test/validation sets are indeed independent samples. Much thought and work is required for time series problems to ensure that you can replay data to the system chronologically and validate model accuracy.
- Data Quality: Check the consistency of your data. Ben gave an example of flight data where some aircraft were landing before taking off. Inconsistent, duplicate, and corrupt data needs to be identified and explicitly handled. It can directly hurt the modeling problem and the ability of a model to generalize.
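The splitting and data quality points above lend themselves to a short sketch. The flight records, field names, and helper functions below are hypothetical and invented for illustration; they are not from the talk. The sketch drops duplicates, flags flights that land before they take off, and then splits the remaining records chronologically so that the model is never trained on data from after the test period.

```python
from datetime import datetime

# Hypothetical flight records; the field names are assumptions for this example.
flights = [
    {"id": "A4", "takeoff": datetime(2014, 1, 4, 9, 0), "landing": datetime(2014, 1, 4, 11, 0)},
    {"id": "A1", "takeoff": datetime(2014, 1, 1, 8, 0), "landing": datetime(2014, 1, 1, 10, 0)},
    {"id": "A2", "takeoff": datetime(2014, 1, 2, 8, 0), "landing": datetime(2014, 1, 2, 7, 0)},
    {"id": "A3", "takeoff": datetime(2014, 1, 3, 8, 0), "landing": datetime(2014, 1, 3, 9, 30)},
    {"id": "A3", "takeoff": datetime(2014, 1, 3, 8, 0), "landing": datetime(2014, 1, 3, 9, 30)},
    {"id": "A5", "takeoff": datetime(2014, 1, 5, 8, 0), "landing": datetime(2014, 1, 5, 12, 0)},
]


def drop_duplicates(records):
    """Keep the first occurrence of each (id, takeoff, landing) combination."""
    seen = set()
    unique = []
    for r in records:
        key = (r["id"], r["takeoff"], r["landing"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def find_inconsistent(records):
    """Return records whose landing time is not after the takeoff time."""
    return [r for r in records if r["landing"] <= r["takeoff"]]


def chronological_split(records, train_fraction=0.75):
    """Train on the earliest records and test on the latest ones."""
    ordered = sorted(records, key=lambda r: r["takeoff"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


unique = drop_duplicates(flights)
bad = find_inconsistent(unique)
clean = [r for r in unique if r not in bad]
train, test = chronological_split(clean)

print("inconsistent:", [r["id"] for r in bad])
print("train:", [r["id"] for r in train])
print("test:", [r["id"] for r in test])
```

Whether inconsistent records should be dropped, repaired, or kept with a flag depends on the problem; the sketch simply drops them to keep the example short.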
Summary
Ben’s talk “Machine Learning Gremlins” is a quick and practical talk.
You will get a useful crash course in the common pitfalls we are all susceptible to when working on a data problem.