Machine Learning - XI. Machine Learning System Design (Week 6)
http://blog.csdn.net/pipisorry/article/details/44119187
Study notes on Andrew Ng's Machine Learning course
Machine Learning System Design
Prioritizing What to Work On
The first decision we must make is how to represent x, that is, the features of the email.
Note: choosing features
1. One option is to manually choose a hundred words to use for this representation.
2. In practice, look through the training set, pick the n most frequently occurring words, where n is usually between ten thousand and fifty thousand, and use those as your features.
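The second approach can be sketched as follows. This is a minimal illustration, not the course's implementation: the helper names (`tokenize`, `build_vocabulary`, `featurize`), the toy emails, and the choice of binary word-presence features are all assumptions made for the example.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(emails, n):
    """Return the n most frequent words across the training emails."""
    counts = Counter()
    for email in emails:
        counts.update(tokenize(email))
    return [word for word, _ in counts.most_common(n)]

def featurize(email, vocabulary):
    """Binary feature vector: x[j] = 1 if vocabulary[j] appears in the email."""
    present = set(tokenize(email))
    return [1 if word in present else 0 for word in vocabulary]

# Toy training set, made up for illustration; a real n would be 10,000 to 50,000.
training_emails = [
    "Buy cheap watches now, buy now",
    "Meeting moved to Monday, see you then",
    "Cheap deal: buy now",
]
vocabulary = build_vocabulary(training_emails, n=3)
x = featurize("Buy a watch now", vocabulary)
print(vocabulary, x)
```

Each email becomes a vector with one entry per vocabulary word, so the feature dimension equals n regardless of email length.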
Reducing the error rate through data preprocessing
Note:
1. Getting lots of data will often help, but not all the time.
2. When spammers send email, they very often try to obscure the origins of the email, for example by using fake email headers or by sending it through very unusual sets of computer servers and very unusual routes in order to get the spam to you.
3. The spam classifier might not equate "w4tches" with "watches", so it may have a harder time realizing that something is spam when it contains these deliberate misspellings. This is why spammers do it.
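One possible preprocessing step for the deliberate misspellings in note 3 is to map look-alike characters back to letters before building features. The sketch below is only an assumption about how this could be done; the `LOOKALIKES` table and the `normalize` helper are hypothetical and far from complete.

```python
# Hypothetical substitution table; a real filter would need a much broader one.
LOOKALIKES = str.maketrans({"4": "a", "0": "o", "1": "l", "3": "e", "$": "s", "@": "a"})

def normalize(word):
    """Lowercase a word and replace look-alike characters with letters."""
    return word.lower().translate(LOOKALIKES)

print(normalize("w4tches"))
```

A blanket substitution like this also rewrites legitimate digits (for example in prices), so whether it helps is something to check on the cross-validation set.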
Error Analysis
{Error analysis gives you a more systematic way to decide among different ideas for improving the algorithm. It is a quick way to identify errors and the hard examples, so that you can focus your efforts on those.}
Recommended steps for designing a machine learning system
Note:
Error analysis on the emails can inspire you to design new features, or reveal the current shortcomings of the system and give you the inspiration you need to come up with improvements to it.
An example of error analysis
Note:
1. Compute accuracy: Accuracy = (true positives + true negatives) / (total examples).
2. By counting up the number of emails in these different categories, you might discover, for example, that the algorithm is doing particularly poorly on emails trying to steal passwords. That may suggest it is worth your effort to look more carefully at that type of email and see if you can come up with better features to categorize them correctly.
3. If many of the misclassified emails share a cue such as unusual punctuation, that is a strong sign that it might actually be worth your while to spend the time to develop more sophisticated features based on the punctuation.
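The counting described in these notes can be sketched as a simple tally over hand-labelled errors. The data below is invented for illustration: the category names and cues are assumptions, and in practice they would come from manually inspecting misclassified cross-validation emails.

```python
from collections import Counter

# Hypothetical hand labels: (email category, cues that might have helped).
misclassified = [
    ("steal passwords", {"unusual punctuation"}),
    ("pharma", set()),
    ("steal passwords", {"unusual punctuation", "deliberate misspelling"}),
    ("replica", {"unusual email routing"}),
    ("steal passwords", {"unusual punctuation"}),
    ("other", set()),
]

# How many errors fall into each category, and how often each cue appears.
by_category = Counter(category for category, _ in misclassified)
by_cue = Counter(cue for _, cues in misclassified for cue in cues)

for name, count in by_category.most_common():
    print(f"{name}: {count}")
for name, count in by_cue.most_common():
    print(f"{name}: {count}")
```

The category and cue with the largest counts are the natural candidates for new features.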
Numerical Evaluation of Your Learning Algorithm
Note:
1. Using stemming software can help, but it can also hurt.
2. We'll see later examples where coming up with this sort of single real-number evaluation metric may need a little more work. Once you have one, it lets you make these decisions much more quickly.
Error Metrics for Skewed Classes (Precision/Recall)
Skewed classes: the number of positive examples is much, much smaller than the number of negative examples.
Note:
1. A non-learning algorithm that just predicts y = 0 all the time can achieve an even lower error than the 1% error of the learned classifier.
2. Going from 99.2% accuracy to 99.5% accuracy: did we make a good change to the algorithm or not?
With skewed classes it becomes much harder to use just classification accuracy, because you can get very high classification accuracy (very low error) and it is not always clear whether doing so really improves the quality of your classifier; predicting y = 0 all the time doesn't seem like a particularly good classifier.
Faced with such skewed classes, we therefore use a different error metric called precision/recall.
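The problem with accuracy on skewed classes can be seen in a few lines. The class balance here (5 positives in 1000 examples, i.e. 0.5%) is an assumption chosen for illustration, not a figure from these notes.

```python
# Assumed skew: 5 positives in 1000 examples (0.5%), chosen for illustration.
y_true = [1] * 5 + [0] * 995

# A "classifier" that ignores its input and always predicts y = 0.
y_pred = [0] * len(y_true)

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)
positives_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy = {accuracy:.3f}")  # prints "accuracy = 0.995"
print(f"positives found = {positives_found}")
```

The accuracy looks excellent even though not a single positive example is detected, which is why accuracy alone is misleading here.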
Precision/Recall
Note:
1. A learning algorithm that predicts y = 0 all the time has recall equal to zero, which lets us recognize that it isn't a very good classifier.
2. By convention, y = 1 rather than y = 0 denotes the presence of the rare class that we're trying to detect.
Summary: precision/recall is often a much better way to evaluate our learning algorithms than classification error or classification accuracy when the classes are very skewed.
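For reference, the standard definitions are precision = TP / (TP + FP) and recall = TP / (TP + FN), with y = 1 as the rare class. The sketch below computes both from label lists; the function name and the convention of returning 0.0 when a denominator is zero are assumptions made for this example.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall), treating y = 1 as the positive (rare) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels: 4 actual positives, 3 predicted positives, 2 of them correct.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(precision_recall(y_true, y_pred))

# The always-predict-0 classifier gets zero recall.
print(precision_recall(y_true, [0] * len(y_true)))
```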
[1.6 Types of errors - common error metrics]
Trading Off Precision and Recall
Note:
1. Suppose we want to tell someone we think they have cancer only if we are very confident. Then, instead of setting the threshold at 0.5, we set it higher.
2. The precision-recall curve can take many different shapes, depending on the details of the classifier.
3. Reasoning about how a threshold change affects P/R: lowering the threshold means more y = 1 predictions, while the denominator of recall (the number of actual positives) does not change! First work out whether recall goes up or down, then infer how precision changes.
4. Accuracy = (true positives + true negatives) / (total examples)
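The effect of the threshold can be illustrated by sweeping it over a fixed set of predicted probabilities. The labels and scores below are invented for this sketch, and `precision_recall_at` is a hypothetical helper, not part of the course material.

```python
def precision_recall_at(threshold, y_true, scores):
    """Predict y = 1 when the score is >= threshold, then return (precision, recall)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up true labels and classifier outputs h(x) for eight examples.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.8, 0.6, 0.55, 0.4, 0.35, 0.2, 0.1]

for threshold in (0.3, 0.5, 0.7):
    precision, recall = precision_recall_at(threshold, y_true, scores)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, raising the threshold lowers recall and raises precision; on real data recall can only stay the same or fall as the threshold rises, while precision usually, but not always, improves.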
Is there a way to choose this threshold automatically?
How do we decide which of these algorithms is best?
One way of combining precision and recall into a single number is the F score.
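A minimal sketch of this comparison, assuming the F score meant here is the usual F1 score, F1 = 2PR / (P + R). The three (precision, recall) pairs and the algorithm names are hypothetical values chosen for illustration.

```python
def f1_score(precision, recall):
    """F1 = 2PR / (P + R); defined as 0.0 when both P and R are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) pairs measured on a cross-validation set.
candidates = {
    "algorithm_1": (0.5, 0.4),
    "algorithm_2": (0.7, 0.1),
    "algorithm_3": (0.02, 1.0),
}

for name, (p, r) in candidates.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")

best = max(candidates, key=lambda name: f1_score(*candidates[name]))
print("best:", best)
```

Because F1 is low whenever either precision or recall is low, it penalizes degenerate classifiers such as one that predicts y = 1 for everything. The same function could be used to pick a threshold by evaluating F1 at several thresholds on the cross-validation set and keeping the best one.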
Data For Machine Learning (how data affects algorithm performance)
{The issue of how much data to train on}
Note:
1. Rather than including high-order polynomial features of x.
2. Even though we have a lot of parameters, if the training set is much larger than the number of parameters, then hopefully these algorithms will be unlikely to overfit.
3. Finally, putting these two together: the training set error is small and the test set error is close to the training error, which together imply that hopefully the test set error will also be small.
4. A sufficiently large training set is unlikely to be overfit.
Summary:
If you have a lot of data and you train a learning algorithm with a lot of parameters, that can be a good way to get a high-performance learning algorithm.
Review:
from:http://blog.csdn.net/pipisorry/article/details/44245513
Copyright notice: this is an original blog post and may not be reproduced without permission.