Lessons learned developing a practical large scale machine learning system
原文:http://googleresearch.blogspot.jp/2010/04/lessons-learned-developing-practical.html
Lessons learned developing a practical large scale machine learning system
When faced with a hard prediction problem, one possible approach is to attempt to perform statistical miracles on a small training set. If data is abundant then often a more fruitful approach is to design a highly scalable learning system and use several orders of magnitude more training data.
This general notion recurs in many other fields as well. For example, processing large quantities of data helps immensely for information retrieval and machine translation.
Several years ago we began developing a large scale machine learning system, and have been refining it over time. We gave it the codename “Seti” because it searches for signals in a large space. It scales to massive data sets and has become one of the most broadly used classification systems at Google.
After building a few initial prototypes, we quickly settled on a system with the following properties:
- Binary classification (produces a probability estimate of the class label)
- Parallelized
- Scales to process hundreds of billions of instances and beyond
- Scales to billions of features and beyond
- Automatically identifies useful combinations of features
- Accuracy is competitive with state-of-the-art classifiers
- Reacts to new data within minutes
Seti’s accuracy appears to be pretty decent. For example, tests on standard smaller datasets indicate that it is comparable with modern classifiers.
Seti has the flexibility to be used on a broad range of training set sizes and feature sets. These sizes are substantially larger than those typically used in academia (e.g., the largest UCI datasethas 4 million instances). A sample of the data sets used with Seti gives the following statistics:
| Training set size | Unique features | |
| Mean | 100 Billion | 1 Billion |
| Median | 1 Billion | 10 Million |
A good machine learning system is all about accuracy, right?
In the process of designing Seti we made plenty of mistakes. However, we made some good key decisions as well. Here are a few of the practical lessons that we learned. Some are obvious in hindsight, but we did not necessarily realize their importance at the time.
Lesson: Keep it simple (even at the expense of a little accuracy).
- Ease of use: Teams are more willing to experiment with a machine learning system that is simple to set up and use. Those teams are not necessarily die-hard machine learning experts, and so they do not want to waste much time figuring out how to get a system up and running.
- System reliability: Teams are much more willing to deploy a reliable machine learning system in a live environment. They want a system that is dependable and unlikely to crash or need constant attention. Early versions of Seti had marginally better accuracy on large data sets, but were complex, stressed the network and GFS architecture considerably, and needed constant babysitting. The number of teams willing to deploy these versions was low.
Seti is typically used in places where a machine learning system will provide a significant improvement in accuracy over the existing system. The gains are usually large enough that most teams do not care about the small differences in accuracy between different flavors of algorithms. And, in practice, the small differences are often washed out by other effects such as better data filtering, adding another useful feature, parameter tuning, etc. Teams much prefer having a stable, scalable and easy-to-use classification system. We found that these other aspects can be the difference between a deployable system and one that gets abandoned.
It is perhaps less academically interesting to design an algorithm that is slightly worse in accuracy, but that has greater ease of use and system reliability. However, in our experience, it is very valuable in practice.
Lesson: Start with a few specific applications in mind.
- We could examine what the small number of domains had in common. By building something that would work for a few domains, it was likely the resulting system would be useful for others.
- More importantly, it helped us quickly decide what aspects were unnecessary. We noticed that it was surprisingly easy to over-generalize or over-engineer a machine learning system. The domains grounded our project in reality and drove our decision making. Without them, even deciding how broad to make the input file format would have been harder (e.g., is it important to permit binary/categorical/real-valued features? Multiple classes? Fractional labels? Weighted instances?).
- Working with a few different teams as initial guinea pigs allowed us to learn about common teething problems, and helped us smooth the process of deployment for future teams.
Lesson: Know when to say “no”.
Seti is often used in places where there is a good chance of significantly improving predictive accuracy over the incumbent system. And we usually advise teams against trying the system when we believe there is likely to be only a small improvement.
Large-scale machine learning is an important and exciting area of research. It can be applied to many real world problems. We hope that we have given a flavor of the challenges that we face, and some of the practical lessons that we have learned.
Lessons learned developing a practical large scale machine learning system的更多相关文章
- 【原】Coursera—Andrew Ng机器学习—课程笔记 Lecture 17—Large Scale Machine Learning 大规模机器学习
Lecture17 Large Scale Machine Learning大规模机器学习 17.1 大型数据集的学习 Learning With Large Datasets 如果有一个低方差的模型 ...
- [C12] 大规模机器学习(Large Scale Machine Learning)
大规模机器学习(Large Scale Machine Learning) 大型数据集的学习(Learning With Large Datasets) 如果你回顾一下最近5年或10年的机器学习历史. ...
- 大规模机器学习(Large Scale Machine Learning)
本博客是针对Andrew Ng在Coursera上的machine learning课程的学习笔记. 目录 在大数据集上进行学习(Learning with Large Data Sets) 随机梯度 ...
- (原创)Stanford Machine Learning (by Andrew NG) --- (week 10) Large Scale Machine Learning & Application Example
本栏目来源于Andrew NG老师讲解的Machine Learning课程,主要介绍大规模机器学习以及其应用.包括随机梯度下降法.维批量梯度下降法.梯度下降法的收敛.在线学习.map reduce以 ...
- 斯坦福第十七课:大规模机器学习(Large Scale Machine Learning)
17.1 大型数据集的学习 17.2 随机梯度下降法 17.3 微型批量梯度下降 17.4 随机梯度下降收敛 17.5 在线学习 17.6 映射化简和数据并行 17.1 大型数据集的学习
- 吴恩达机器学习笔记60-大规模机器学习(Large Scale Machine Learning)
一.随机梯度下降算法 之前了解的梯度下降是指批量梯度下降:如果我们一定需要一个大规模的训练集,我们可以尝试使用随机梯度下降法(SGD)来代替批量梯度下降法. 在随机梯度下降法中,我们定义代价函数为一个 ...
- Ng第十七课:大规模机器学习(Large Scale Machine Learning)
17.1 大型数据集的学习 17.2 随机梯度下降法 17.3 微型批量梯度下降 17.4 随机梯度下降收敛 17.5 在线学习 17.6 映射化简和数据并行 17.1 大型数据集的学习 ...
- Coursera在线学习---第十节.大规模机器学习(Large Scale Machine Learning)
一.如何学习大规模数据集? 在训练样本集很大的情况下,我们可以先取一小部分样本学习模型,比如m=1000,然后画出对应的学习曲线.如果根据学习曲线发现模型属于高偏差,则应在现有样本上继续调整模型,具体 ...
- 吴恩达机器学习笔记(十一) —— Large Scale Machine Learning
主要内容: 一.Batch gradient descent 二.Stochastic gradient descent 三.Mini-batch gradient descent 四.Online ...
随机推荐
- 【推导】【数学期望】【冒泡排序】Petrozavodsk Winter Training Camp 2018 Day 5: Grand Prix of Korea, Sunday, February 4, 2018 Problem C. Earthquake
题意:两地之间有n条不相交路径,第i条路径由a[i]座桥组成,每座桥有一个损坏概率,让你确定一个对所有桥的检测顺序,使得检测所需的总期望次数最小. 首先,显然检测的时候,是一条路径一条路径地检测,跳跃 ...
- weighttp 使用
Weighttp 地址 http://redmine.lighttpd.net/projects/weighttp/wiki Weighttp的介绍:weighttp is a lightweigh ...
- CentOS下重新安装yum的方法
不小心误删除了VPS下面的yum,大家都知道yum在linux中是很重要的一个功能,软件的下载,系统的更新都要靠他.没有yum,系统基本处于半残废状态. yum的安装操作: 在SSH里面依次输入下面的 ...
- 【BZOJ】4720: [Noip2016]换教室
4720: [Noip2016]换教室 Time Limit: 20 Sec Memory Limit: 512 MBSubmit: 1690 Solved: 979[Submit][Status ...
- bzoj 1027 floyd求有向图最小环
结合得好巧妙.... 化简后的问题是: 给你两个点集A,B,求B的一个子集BB,使得BB的凸包包含A的凸包,求BB的最小大小. 先特判答案为1,2的情况,答案为3的情况,我们先构造一个有向图: 对于B ...
- poj 3164
朱刘算法 步骤: 1.计算出每个点边权最小的边的权(如果除根以外有其他的点没有入边,则不存在最小树形图),并记下边的另一个端点(称其为这个点的前趋) 2.沿着每个点向上走,如果在走到根或环上的点之前, ...
- Codeforces Round #354 (Div. 2) E. The Last Fight Between Human and AI 数学
E. The Last Fight Between Human and AI 题目连接: http://codeforces.com/contest/676/problem/E Description ...
- TYVJ 1463 智商问题 分块
TYVJ 1463 智商问题 Time Limit: 1.5 Sec Memory Limit: 512 MB 题目连接 http://www.tyvj.cn/p/1463 背景 各种数据结构帝~ ...
- HTTP请求头和响应头
这篇文章简单总结一下HTTP请求头和响应头,并举一些web开发中响应头的用例. 1. HTTP请求头 accept:浏览器通过这个头告诉服务器,它所支持的数据类型.如:text/html, ima ...
- UITableView分页
UITableView分页上拉加载简单,ARC环境,源码如下,以作备份: 原理是,点击最后一个cell,触发一个事件来处理数据,然后reloadData RootViewController.m + ...