Machine Learning Done Wrong
Statistical modeling is a lot like engineering.
In engineering, there are various ways to build a key-value store, and each design makes a different set of assumptions about the usage pattern. In statistical modeling, there are various algorithms to build a classifier, and each algorithm makes a different set of assumptions about the data.
When dealing with small amounts of data, it’s reasonable to try as many algorithms as possible and to pick the best one since the cost of experimentation is low. But as we hit “big data”, it pays off to analyze the data upfront and then design the modeling pipeline (pre-processing, modeling, optimization algorithm, evaluation, productionization) accordingly.
As pointed out in my previous post, there are dozens of ways to solve a given modeling problem. Each model makes different assumptions, and it's not obvious how to navigate them and identify which are reasonable. In industry, most practitioners pick the modeling algorithm they are most familiar with rather than the one that best suits the data. In this post, I would like to share some common mistakes (the don'ts). I'll save the best practices (the dos) for a future post.
1. Take the default loss function for granted
Many practitioners train and pick the best model using the default loss function (e.g., squared error). In practice, the off-the-shelf loss function rarely aligns with the business objective. Take fraud detection as an example. When trying to detect fraudulent transactions, the business objective is to minimize the fraud loss. The off-the-shelf loss function of a binary classifier weighs false positives and false negatives equally. To align with the business objective, the loss function should not only penalize false negatives more than false positives, but also penalize each false negative in proportion to the dollar amount. Also, data sets in fraud detection usually contain highly imbalanced labels; in these cases, bias the loss function in favor of the rare class (e.g., through up/down sampling or re-weighting), as in the sketch below.
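As a hedged sketch of both ideas in scikit-learn (the library choice and all data below are illustrative assumptions, not from the original post): per-sample weights tie the loss to dollar amounts, and class weighting counteracts the label imbalance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))               # hypothetical transaction features
y = (rng.random(1000) < 0.02).astype(int)    # ~2% fraud: highly imbalanced labels
amount = rng.lognormal(mean=3.0, size=1000)  # hypothetical dollar amounts

# Penalize each missed fraud in proportion to its dollar amount;
# legitimate transactions keep weight 1.
weights = np.where(y == 1, amount, 1.0)

# class_weight="balanced" additionally up-weights the rare fraud class.
clf = LogisticRegression(class_weight="balanced")
clf.fit(X, y, sample_weight=weights)
```

The same sample_weight trick works with most scikit-learn estimators, so the business-aligned loss doesn't lock you into one model family.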
2. Use plain linear models for non-linear interaction
When building a binary classifier, many practitioners immediately jump to logistic regression because it's simple. But many also forget that logistic regression is a linear model, so non-linear interactions among predictors need to be encoded manually. Returning to fraud detection, high-order interaction features like "billing address = shipping address and transaction amount < $50" are required for good model performance. So one should prefer non-linear models such as SVMs with a kernel or tree-based classifiers that bake in higher-order interactions.
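A hedged sketch of the contrast (the feature names and the $50 threshold come from the example above; everything else is assumed): a linear model needs the interaction encoded by hand, while a tree-based ensemble can learn the same split structure from the raw features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
same_address = rng.integers(0, 2, 1000)      # billing address == shipping address?
amount = rng.lognormal(3.0, size=1000)       # transaction amount in dollars
# Hypothetical ground truth: fraud when addresses differ AND amount < $50.
y = ((same_address == 0) & (amount < 50)).astype(int)

X_raw = np.column_stack([same_address, amount])

# For logistic regression, the interaction must be encoded manually:
interaction = ((same_address == 0) & (amount < 50)).astype(float)
X_linear = np.column_stack([X_raw, interaction])
LogisticRegression().fit(X_linear, y)

# A tree-based model can discover the same split structure on raw features:
GradientBoostingClassifier().fit(X_raw, y)
```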
3. Forget about outliers
Outliers are interesting. Depending on the context, they either deserve special attention or should be completely ignored. Take the example of revenue forecasting. If unusual spikes of revenue are observed, it's probably a good idea to pay extra attention to them and figure out what caused the spike. But if the outliers are due to mechanical error, measurement error or anything else that’s not generalizable, it’s a good idea to filter out these outliers before feeding the data to the modeling algorithm.
Some models are more sensitive to outliers than others. For instance, AdaBoost might treat outliers as "hard" cases and put tremendous weight on them, while a decision tree might simply count each outlier as one misclassification. If the data set contains a fair number of outliers, it's important to either use a modeling algorithm that is robust to outliers or filter them out, as sketched below.
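Both options fit in a few lines; the synthetic data, the 1.5x-IQR filtering rule, and the choice of HuberRegressor are illustrative assumptions, not prescriptions.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.1, size=200)
y[:5] += 50                           # inject a few measurement-error outliers

# Option 1: filter points outside 1.5 * IQR before modeling.
q1, q3 = np.percentile(y, [25, 75])
keep = (y > q1 - 1.5 * (q3 - q1)) & (y < q3 + 1.5 * (q3 - q1))
x_clean, y_clean = x[keep], y[keep]

# Option 2: a robust loss (Huber) that down-weights large residuals.
model = HuberRegressor().fit(x.reshape(-1, 1), y)
```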
4. Use high-variance models when n << p
SVM is one of the most popular off-the-shelf modeling algorithms, and one of its most powerful features is the ability to fit the model with different kernels. SVM kernels can be thought of as a way to automatically combine existing features to form a richer feature space. Since this powerful feature comes almost for free, most practitioners use a kernel by default when training an SVM model. However, when n << p (number of samples << number of features) -- common in domains such as medical data -- the richer feature space implies a much higher risk of overfitting the data. In fact, high-variance models should be avoided entirely when n << p.
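A minimal sketch of this regime (the sample counts and scikit-learn usage are my assumptions for illustration): with far more features than samples, a plain linear SVM is the safer default than a kernelized one.

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))   # n=50 samples, p=5000 features: n << p
y = rng.integers(0, 2, 50)

# Lower-variance choice: a plain linear SVM, no kernel-induced feature space.
linear_clf = LinearSVC().fit(X, y)

# An RBF kernel enlarges the feature space further and tends to overfit here.
kernel_clf = SVC(kernel="rbf").fit(X, y)
```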
5. L1/L2/... regularization without standardization
Applying an L1 or L2 penalty to large coefficients is a common way to regularize linear or logistic regression. However, many practitioners are not aware of the importance of standardizing features before applying regularization.
Returning to fraud detection, imagine a linear regression model with a transaction amount feature. Without regularization, if the unit of transaction amount is dollars, the fitted coefficient will be around 100 times larger than it would be if the unit were cents. With regularization, since L1/L2 penalize larger coefficients more, the transaction amount gets penalized more when the unit is dollars. Hence, the regularization is biased and tends to penalize features on smaller scales. To mitigate the problem, standardize all features as a preprocessing step to put them on an equal footing.
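A minimal sketch of the fix (a scikit-learn Pipeline is my assumed tooling): standardizing inside the pipeline puts the dollar-scale feature and the rest on the same footing before the penalty is applied, and also keeps test-set statistics out of the scaler.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
amount_dollars = rng.lognormal(3.0, size=500)   # large-scale feature
other_feature = rng.normal(size=500)            # unit-scale feature
X = np.column_stack([amount_dollars, other_feature])
y = 0.5 * amount_dollars + rng.normal(size=500)

# Scaling happens before the L2 penalty, so both features are
# penalized on a comparable scale.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
```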
6. Use a linear model without considering multicollinear predictors
Imagine building a linear model with two variables, X1 and X2, and suppose the ground-truth model is Y = X1 + X2. Ideally, if the data is observed with a small amount of noise, the linear regression solution recovers the ground truth. However, if X1 and X2 are collinear, then as far as most optimization algorithms are concerned, Y = 2*X1, Y = 3*X1 - X2, and Y = 100*X1 - 99*X2 are all equally good. The problem might not be detrimental, as it doesn't bias the estimation, but it does make the problem ill-conditioned and the coefficient weights uninterpretable.
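The ill-conditioning is easy to reproduce numerically. The following sketch (synthetic data, plain NumPy assumed) fits a nearly collinear design and typically prints a wildly unstable coefficient pair, along with the huge condition number that flags the problem.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = x1 + rng.normal(scale=1e-6, size=1000)   # nearly collinear with x1
y = x1 + x2 + rng.normal(scale=0.1, size=1000)

X = np.column_stack([x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)               # typically far from the true (1, 1), e.g. huge +/- pair
print(np.linalg.cond(X))  # enormous condition number flags the ill-conditioning
```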
7. Interpret absolute values of coefficients from linear or logistic regression as feature importance
Because many off-the-shelf linear regression packages return a p-value for each coefficient, many practitioners believe that, for linear models, the bigger the absolute value of a coefficient, the more important the corresponding feature. This is rarely true because (a) changing the scale of a variable changes the absolute value of its coefficient, and (b) if features are multicollinear, coefficients can shift from one feature to another. Also, the more features the data set has, the more likely the features are multicollinear and the less reliable it is to interpret feature importance from coefficients.
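A quick sketch of point (a), with assumed data and units: refitting the same model with the amount expressed in cents instead of dollars shrinks the coefficient by roughly 100x, even though the feature's importance is unchanged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
amount_dollars = rng.lognormal(3.0, size=500)
y = 0.5 * amount_dollars + rng.normal(size=500)

for unit, x in [("dollars", amount_dollars), ("cents", amount_dollars * 100)]:
    coef = LinearRegression().fit(x.reshape(-1, 1), y).coef_[0]
    print(unit, coef)     # same relationship, coefficient differs ~100x
```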
If you like the post, you can follow me (@chengtao_chu) on Twitter or subscribe to my blog "ML in the Valley". Also, special thanks to Ian Wong (@ihat) for reading a draft of this.
Cheng-Tao Chu
Director of Analytics at Codecademy. Specialties: data engineering and machine learning. Formerly: Google, LinkedIn and Square.