attribute vs feature:

  attribute: e.g., 'Mileage'

  feature: an attribute plus its value, e.g., 'Mileage = 15000'

Note that some regression algorithms can be used for classification as well, and vice versa. For example, Logistic Regression is commonly used for classification, as it can output a value that corresponds to the probability of belonging to a given class.
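For intuition, here is a minimal sketch (illustrative, not the book's code) of how logistic regression turns a linear score into a class probability using the logistic (sigmoid) function:

```python
import math

def predict_proba(weights, bias, features):
    """Logistic regression: squash the linear score w.x + b into a probability in (0, 1)."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))

# A positive score maps above 0.5, a negative score below, a zero score to exactly 0.5.
p = predict_proba([2.0, -1.0], 0.5, [1.0, 0.5])  # score = 2.0, so p > 0.5
```

An instance is then typically assigned to the positive class when this probability exceeds some threshold, e.g. 0.5.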

some of the most important supervised learning algorithms:

  k-Nearest Neighbors

  Linear Regression

  Logistic Regression

  Support Vector Machines (SVMs)

  Decision Trees and Random Forests

  Neural networks

unsupervised learning:

  Clustering:

    k-Means

    Hierarchical Cluster Analysis (HCA)

    Expectation Maximization

  Visualization and dimensionality reduction:

    Principal Component Analysis (PCA)

    Kernel PCA

    Locally-Linear Embedding (LLE)

    t-distributed Stochastic Neighbor Embedding (t-SNE)

  Association rule learning:

    Apriori

    Eclat

online vs batch learning:

  whether or not the system can learn incrementally from a stream of incoming data.

  batch learning:

    it must be trained using all the available data. this generally takes a lot of time and computing resources.

  online learning:

    data is fed either individually or in small groups called mini-batches.

    it is great for systems that receive data as a continuous flow and need to adapt to change rapidly or autonomously.

    once it has learned about new data instances, it does not need them anymore, so you can discard them. this can save a huge amount of space.

    one important parameter of online learning is the learning rate: how fast the system should adapt to changing data. if you set a high learning rate, then your system will rapidly adapt to new data, but it will also tend to quickly forget the old data. conversely, if you set a low learning rate, the system will learn more slowly, but it will also be less sensitive to noise in the new data or to sequences of nonrepresentative data points.
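To make the learning-rate trade-off concrete, here is a toy sketch (a hypothetical one-weight model, not from the book) that performs one SGD update per incoming instance:

```python
def online_update(w, x, y, learning_rate):
    """One SGD step for the model y_hat = w * x with squared-error loss."""
    error = w * x - y
    return w - learning_rate * error * x

# Stream of (x, y) pairs generated by the true relation y = 3 * x.
stream = [(1.0, 3.0), (2.0, 6.0), (1.5, 4.5), (1.0, 3.0)]

w_fast, w_slow = 0.0, 0.0
for x, y in stream:
    w_fast = online_update(w_fast, x, y, learning_rate=0.3)   # adapts quickly
    w_slow = online_update(w_slow, x, y, learning_rate=0.01)  # adapts slowly
```

After this short stream the high-rate weight is already near the true slope of 3, while the low-rate weight has barely moved; with noisy labels, however, the same high rate would chase the noise.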

    a big challenge with online learning is that if bad data is fed to the system, the system's performance will gradually decline. to reduce this risk, you need to monitor your system closely and promptly switch learning off (and possibly revert to a previously working state) if you detect a drop in performance. you may also want to monitor the input data and react to abnormal data (e.g., using an anomaly detection algorithm).

model-based vs instance-based learning:

  the approach to generalization.

  model-based:

    it tunes some parameters to fit the model to the training set, and then hopefully it will be able to make good predictions on new cases as well.

    measure: the cost function

  instance-based:

    it just learns the examples by heart and uses a similarity measure to generalize to new instances.

    e.g., a similarity measure between two emails could be to count the number of words they have in common.
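A literal sketch of that similarity measure (illustrative only):

```python
def common_word_count(email_a, email_b):
    """Similarity measure: number of distinct words the two emails share."""
    return len(set(email_a.lower().split()) & set(email_b.lower().split()))

sim = common_word_count("win a free prize now", "claim your free prize")
# the two emails share 'free' and 'prize'
```

A new email would then be classified like the training instances most similar to it.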

Main Challenges of Machine Learning: bad algorithms and bad data

1). insufficient quantity of training data

  data matters more than algorithms for complex problems

  however, small- and medium-sized datasets are still very common, and it is not always easy or cheap to get extra training data, so don't abandon algorithms just yet.

2). nonrepresentative training data

  it is crucial that your training data be representative of the new cases you want to generalize to.

3). poor-quality data

  if some instances are clearly outliers, it may help to simply discard them or try to fix the errors manually.

  if some instances are missing a few features, you must decide whether you want to ignore this attribute altogether, ignore these instances, fill in the missing values, or train one model with the feature and one model without it, and so on.
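As a sketch of the "fill in the missing values" option (a hypothetical helper that uses the feature's mean, one common choice):

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

mileage = [15000, None, 9000, 12000]
filled = fill_missing_with_mean(mileage)  # the None is replaced by the mean of the rest
```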

4). irrelevant features

  feature selection: selecting the most useful features to train on among existing features.

  feature extraction: combining existing features to produce a more useful one.

  creating new features by gathering new data.

5). overfitting the training data

  overfitting happens when the model is too complex relative to the amount and noisiness of the training data.

  the solutions:

    (1). to simplify the model by selecting one with fewer parameters, by reducing the number of attributes in the training data, or by constraining the model (regularization).

    (2). to gather more training data

    (3). to reduce the noise in the training data (e.g., fix data errors and remove outliers)
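To illustrate the regularization idea in option (1), here is a sketch (an assumed one-parameter model y = w * x, not the book's code) where an L2 penalty shrinks the fitted weight, constraining the model:

```python
def ridge_weight(xs, ys, alpha):
    """Closed-form L2-regularized fit for y = w * x (no intercept):
    minimizes sum((w*x - y)**2) + alpha * w**2, giving w = sum(x*y) / (sum(x*x) + alpha)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
w_free = ridge_weight(xs, ys, alpha=0.0)   # unconstrained least-squares fit
w_reg = ridge_weight(xs, ys, alpha=10.0)   # the penalty shrinks the weight toward 0
```

Note that alpha here is a regularization hyperparameter: it is set before training, not learned from the data.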

6). underfitting the training data

  the opposite of overfitting: the model is too simple, so its predictions are bound to be inaccurate even on the training examples.

  the solutions (the reverse of the first overfitting solution above):

    (1). selecting a more powerful model,with more parameters

    (2). feeding better features to the learning algorithm

    (3). reducing the constraints on the model (e.g., reducing the regularization hyperparameter)

a hyperparameter is a parameter of a learning algorithm (not of the model). it is not affected by the learning algorithm itself; it must be set prior to training and remains constant during training.

testing and validating:

  the error rate (of a model) on new cases is called the generalization error (or out-of-sample error).

  evaluating a model is simple enough: just use a test set.

  suppose you are hesitating between two models (say a linear model and a polynomial model): how can you decide? one option is to train both and compare how well they generalize using the test set.

  suppose that the linear model generalizes better, but you want to apply some regularization to avoid overfitting. the question is: how do you choose the value of the regularization hyperparameter? one option is to train 100 different models using 100 different values for this hyperparameter. (this leads to the following problem)

  suppose you find the best hyperparameter value that produces a model with the lowest generalization error, say just 5% error. so you launch this model into production, but unfortunately it does not perform as well as expected and produces 15% error. what just happened? the problem is that you measured the generalization error multiple times on the test set, and you adapted the model and hyperparameters to produce the best model for that particular set. this means that the model is unlikely to perform as well on new data.

  (continued) a common solution to this problem is to have a second holdout set called the validation set. you train multiple models with various hyperparameters using the training set, select the model and hyperparameters that perform best on the validation set, and when you're happy with your model you run a single final test against the test set to get an estimate of the generalization error.

  to avoid 'wasting' too much training data in validation sets, a common technique is to use cross-validation: the training set is split into complementary subsets, and each model is trained against a different combination of these subsets and validated against the remaining parts. once the model type and hyperparameters have been selected, a final model is trained using these hyperparameters on the full training set, and the generalization error is measured on the test set.
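A minimal sketch of the k-fold idea described above (illustrative helper names, not scikit-learn's API): split the indices into k complementary folds, validate each candidate on one held-out fold while training on the rest, and average the scores.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k complementary validation folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, train_and_score):
    """Train on k-1 folds, score on the held-out fold, and average over the k folds.
    `train_and_score(train, val)` is any user-supplied fit-then-evaluate callable."""
    scores = []
    for fold in k_fold_indices(len(data), k):
        held_out = set(fold)
        val = [data[i] for i in fold]
        train = [data[i] for i in range(len(data)) if i not in held_out]
        scores.append(train_and_score(train, val))
    return sum(scores) / k
```

In practice you would shuffle the data before splitting; this sketch keeps the order to stay short.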

No Free Lunch: assumptions

  a model is a simplified version of the observations. the simplifications are meant to discard the superfluous details that are unlikely to generalize to new instances. to decide what data to discard and what data to keep, you must make assumptions. for example, a linear model makes the assumption that the data is fundamentally linear and that the distance between the instances and the straight line is just noise, which can safely be ignored. if you make absolutely no assumption about the data, then there is no reason to prefer one model over any other. this is called the No Free Lunch (NFL) theorem.
