Machine Learning Study Notes Series

Week 1

Part 1: Introduction

The definition of machine learning

(1) Older, informal definition (Arthur Samuel): "the field of study that gives computers the ability to learn without being explicitly programmed."

(2) Modern definition (Tom Mitchell): "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Example: playing checkers.

E = the experience of playing many games of checkers

T = the task of playing checkers.

P = the probability that the program will win the next game.

The classification of machine learning

(1) Supervised learning

(2) Unsupervised learning

Supervised Learning

The definition of supervised learning

In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.

The classification of supervised learning

(1) Regression: map input variables to some continuous function.

(2) Classification: map input variables into discrete categories.

Example:

(a) Regression - Given a picture of a person, we have to predict their age from that picture.

(b) Classification - Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.

The definition of unsupervised learning

Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.

We can derive this structure by clustering the data based on relationships among the variables in the data.

With unsupervised learning there is no feedback based on the prediction results.

The classification of unsupervised learning

(1) Clustering

(2) Non-clustering

Example:

Clustering: Take a collection of 1,000,000 different genes, and find a way to automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, and so on.

Non-clustering: The "Cocktail Party Algorithm" allows you to find structure in a chaotic environment (i.e. identifying individual voices and music from a mesh of sounds at a cocktail party).

Part 2: Model and Cost Function

Model Representation

$x^{(i)}$: the “input” variables (living area in this example), also called input features

$y^{(i)}$: the “output” or target variable

A pair $(x^{(i)}, y^{(i)})$: a training example

The list of training examples $(x^{(i)}, y^{(i)})$, $i = 1, \dots, m$: the training set

The superscript $(i)$: an index into the training set (it is not an exponent)

X: the space of input values

Y: the space of output values

In this example, X = Y = ℝ.

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is this: the training set is fed to a learning algorithm, which outputs h; h then takes a new input x and returns a predicted y.

When the target variable that we’re trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem.

When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.

Cost Function

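Cost function (also called the "squared error function" or "mean squared error"): measures the accuracy of our hypothesis function. It takes the average of the squared differences between the hypothesis outputs and the actual outputs. The mean is halved as a convenience, because the derivative of the square term cancels the 1/2.

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2 = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2$$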
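As a minimal sketch of the formula above, the cost can be computed directly in Python. The linear form h(x) = θ0 + θ1·x, the function names, and the toy data are assumptions made for illustration; they are not taken from the course material.

```python
def hypothesis(theta0, theta1, x):
    """Assumed linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Squared-error cost J = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# Hypothetical toy data (illustrative numbers only).
xs = [1.0, 2.0, 4.0]
ys = [2.0, 3.0, 5.0]

# The line y = 1 + x passes through every point, so the cost is zero.
print(cost(1.0, 1.0, xs, ys))
```

A hypothesis that misses the points gives a strictly positive cost, which is what makes J usable as a measure of fit.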

The following image summarizes what the cost function does:

Cost Function - Intuition I

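Thinking about it visually, our training data set is scattered on the x-y plane. We are trying to fit a straight line (defined by hθ(x)) that passes through these scattered data points.

Our objective is to get the best possible line. The best possible line is the one for which the average squared vertical distance of the scattered points from the line is smallest. Ideally, the line passes through all the points of our training data set. In that case, the value of J(θ0, θ1) is 0. The following example shows the ideal situation where the cost function is 0.

When θ1 = 1, we get a slope of 1 and the line passes through every data point in our model. Conversely, when θ1 = 0.5, the vertical distances from our fit to the data points increase.

This increases our cost function to 0.58. Plotting several other points yields the following graph:

Thus, as a goal, we should try to minimize the cost function. In this case, θ1 = 1 is our global minimum.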
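The value 0.58 can be checked by hand with a short sketch. It assumes the three training points (1, 1), (2, 2), (3, 3) and θ0 = 0; the data set is not given in these notes, so treat it as an assumption that happens to reproduce the quoted number.

```python
def cost(theta1, xs, ys):
    """J(theta1) for the simplified hypothesis h(x) = theta1 * x (theta0 = 0)."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Assumed training set: three points on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

for theta1 in (0.0, 0.5, 1.0, 1.5):
    print(theta1, round(cost(theta1, xs, ys), 2))
# For theta1 = 0.5 this prints 0.58; for theta1 = 1.0 it prints 0.0.
```

With θ1 = 0.5 the squared errors are 0.25, 1 and 2.25, so J = 3.5 / 6 ≈ 0.58, and the cost falls to 0 at θ1 = 1.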

Cost Function - Intuition II

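A contour plot is a graph that contains many contour lines. A contour line of a two-variable function has a constant value at all points along the same line. An example of such a graph is the one below.

Taking any color and going along the "circle", one would expect to get the same value of the cost function. For example, the three green points found on the green line above have the same value of J(θ0, θ1) and, as a result, lie along the same line. The circled x shows the value of the cost function for the graph on the left when θ0 = 800 and θ1 = -0.15. Taking another h(x) and plotting its contour plot, one gets the following graphs:

When θ0 = 360 and θ1 = 0, the value of J(θ0, θ1) in the contour plot gets closer to the center, reducing the cost function error. Giving our hypothesis function a slightly positive slope now results in a better fit to the data.

The graph above minimizes the cost function as much as possible; consequently, θ1 and θ0 end up around 0.12 and 250 respectively. Plotting those values on the graph to the right puts our point in the center of the innermost "circle".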
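The idea that every point on one contour line shares the same cost can be illustrated numerically. The sketch below assumes a toy data set lying on y = x and two hand-picked parameter pairs; none of these numbers come from the figures described above.

```python
import math


def cost(theta0, theta1, xs, ys):
    """Squared-error cost for the assumed hypothesis h(x) = theta0 + theta1 * x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Assumed toy data on the line y = x, so the minimum is at (theta0, theta1) = (0, 1).
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

# Two different parameter pairs, placed symmetrically around the minimum.
j_a = cost(0.5, 0.8, xs, ys)
j_b = cost(-0.5, 1.2, xs, ys)
same_contour = math.isclose(j_a, j_b)
print(j_a, j_b, same_contour)

# The center of the contours is the minimum of J.
print(cost(0.0, 1.0, xs, ys))
```

Both pairs give the same cost, so they would be drawn on the same contour line, while the center point has the lowest cost of all.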

Part 3: Parameter Learning

Gradient Descent

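So we have our hypothesis function and we have a way of measuring how well it fits the data. Now we need to estimate the parameters in the hypothesis function. That is where gradient descent comes in.

Imagine that we graph our hypothesis function based on its parameters θ0 and θ1 (actually, we are graphing the cost function as a function of the parameter estimates). We are not graphing x and y themselves, but the parameter range of our hypothesis function and the cost that results from selecting a particular set of parameters.

We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis. The points on our graph are the values of the cost function for our hypothesis with those specific θ parameters. The graph below depicts such a setup.

We will know that we have succeeded when our cost function is at the very bottom of the pits in the graph, i.e. when its value is the minimum. The red arrows show the minimum points in the graph.

The way we do this is by taking the derivative (the tangent line to a function) of our cost function. The slope of the tangent is the derivative at that point, and it gives us a direction to move in. We step down the cost function in the direction of steepest descent. The size of each step is determined by the parameter α, which is called the learning rate.

For example, the distance between each "star" in the graph above represents a step determined by our parameter α. A smaller α results in a smaller step and a larger α results in a larger step. The direction in which the step is taken is determined by the partial derivative of J(θ0, θ1). Depending on where one starts on the graph, one could end up at different points. The image above shows two different starting points that end up in two different places.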

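The gradient descent algorithm is:

repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$

where j = 0, 1 is the feature index number.

At each iteration, the parameters θ0 and θ1 (in general, every θj) should be updated simultaneously. Updating one parameter before computing the update for the other within the same iteration yields a wrong implementation.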
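The difference between a simultaneous and a sequential update can be sketched as follows. The cost J = θ0² + θ0·θ1 + θ1² and all function names are hypothetical, chosen only because each partial derivative depends on both parameters.

```python
def grad0(theta0, theta1):
    """Partial derivative of the hypothetical J with respect to theta0."""
    return 2.0 * theta0 + theta1


def grad1(theta0, theta1):
    """Partial derivative of the hypothetical J with respect to theta1."""
    return theta0 + 2.0 * theta1


def simultaneous_step(theta0, theta1, alpha):
    """Correct: both gradients are evaluated at the OLD parameter values."""
    temp0 = theta0 - alpha * grad0(theta0, theta1)
    temp1 = theta1 - alpha * grad1(theta0, theta1)
    return temp0, temp1


def sequential_step_wrong(theta0, theta1, alpha):
    """Incorrect: theta1's gradient already sees the NEW theta0."""
    theta0 = theta0 - alpha * grad0(theta0, theta1)
    theta1 = theta1 - alpha * grad1(theta0, theta1)
    return theta0, theta1


print(simultaneous_step(1.0, 1.0, 0.1))
print(sequential_step_wrong(1.0, 1.0, 0.1))
```

Starting from (1, 1) with α = 0.1, the simultaneous update moves both parameters to about 0.7, while the sequential version moves θ1 to about 0.73 because it mixes old and new values.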

Gradient Descent Intuition

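In this video we explored the scenario where we use a single parameter θ1 and plot its cost function to implement gradient descent. Our formula for a single parameter is:

repeat until convergence:

$$\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)$$

Regardless of the sign of the slope d/dθ1 J(θ1), θ1 eventually converges to its minimum value. The following graph shows that when the slope is negative, the value of θ1 increases, and when it is positive, the value of θ1 decreases.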
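A small sketch of this sign behaviour, assuming the hypothetical cost J(θ1) = (θ1 − 1)², whose minimum is at θ1 = 1:

```python
def derivative(theta1):
    """Derivative of the hypothetical cost J(theta1) = (theta1 - 1)^2."""
    return 2.0 * (theta1 - 1.0)


def step(theta1, alpha=0.1):
    """One gradient descent update for a single parameter."""
    return theta1 - alpha * derivative(theta1)


left = step(0.0)    # left of the minimum: the slope is negative, theta1 increases
right = step(2.0)   # right of the minimum: the slope is positive, theta1 decreases
print(left, right)
```

In both cases the update moves θ1 toward the minimum, because subtracting a negative slope adds to θ1 and subtracting a positive slope reduces it.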

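As a side note, we should adjust our parameter α to ensure that the gradient descent algorithm converges in a reasonable time. Failure to converge, or taking too long to reach the minimum, implies that our step size is wrong.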
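The effect of the step size can be seen on a toy problem. The sketch assumes the hypothetical cost J(θ1) = θ1² (derivative 2·θ1, minimum at 0) and three arbitrarily chosen learning rates.

```python
def run(alpha, theta1=1.0, iterations=20):
    """Gradient descent on the hypothetical cost J(theta1) = theta1^2."""
    for _ in range(iterations):
        theta1 = theta1 - alpha * 2.0 * theta1
    return theta1


for alpha in (0.01, 0.4, 1.1):
    print(alpha, run(alpha))
```

After 20 iterations, α = 0.01 has barely moved toward 0 (too slow), α = 0.4 is essentially at the minimum, and α = 1.1 has overshot further on every step and moved away from the minimum (divergence).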

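How does gradient descent converge with a fixed step size α?

The intuition behind the convergence is that d/dθ1 J(θ1) approaches 0 as we approach the bottom of our convex function. Because each update is α times the derivative, the steps shrink on their own even though α stays fixed. At the minimum, the derivative is always 0, and thus we get:

$$\theta_1 := \theta_1 - \alpha \cdot 0$$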
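This shrinking of the steps can be observed directly. The sketch again assumes the hypothetical cost J(θ1) = θ1² and a fixed α = 0.1, and records the size of each update.

```python
def step_sizes(alpha=0.1, theta1=1.0, iterations=10):
    """Record |alpha * dJ/dtheta1| at each iteration for J(theta1) = theta1^2."""
    sizes = []
    for _ in range(iterations):
        update = alpha * 2.0 * theta1   # alpha times the derivative
        sizes.append(abs(update))
        theta1 -= update
    return sizes


sizes = step_sizes()
print([round(s, 4) for s in sizes])
```

Each recorded step is smaller than the previous one, even though α never changes, so there is no need to decrease α over time for this kind of cost function.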

Gradient Descent For Linear Regression

Note: [At 6:15 "h(x) = -900 - 0.1x" should be "h(x) = 900 - 0.1x"]

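When applied specifically to linear regression, a new form of the gradient descent equation can be derived. We can substitute our actual cost function and our actual hypothesis function and modify the equation to: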

repeat until convergence: {

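$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)$$

$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( (h_\theta(x_i) - y_i)\, x_i \right)$$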

}

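where m is the size of the training set, θ0 is a constant that changes simultaneously with θ1, and xi, yi are the values of the given training set (data).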
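The two update rules above can be sketched as a short batch gradient descent loop. The function name, the starting point (0, 0), the learning rate, the iteration count, and the toy data are all assumptions made for illustration.

```python
def batch_gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Errors h(x_i) - y_i over the whole training set, using the old parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Hypothetical data generated from y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

theta0, theta1 = batch_gradient_descent(xs, ys)
print(round(theta0, 3), round(theta1, 3))
```

On this data the parameters approach θ0 ≈ 1 and θ1 ≈ 2. A fixed iteration count stands in for "repeat until convergence"; a real implementation would more likely stop when the change in J falls below a tolerance.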

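Note that we have separated the two cases for θj into separate equations for θ0 and θ1, and that for θ1 we multiply by xi at the end because of the derivative. The following is a derivation of ∂/∂θj J(θ) for a single example: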
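A sketch of that derivation, assuming a single training example (x, y), the cost J(θ) = ½(hθ(x) − y)², and the linear hypothesis hθ(x) = θ0·x0 + θ1·x1 with x0 = 1 (the x0 = 1 convention is an assumption used here to cover both cases at once):

```latex
\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
  &= \frac{\partial}{\partial \theta_j} \, \frac{1}{2} \left( h_\theta(x) - y \right)^2 \\
  &= 2 \cdot \frac{1}{2} \left( h_\theta(x) - y \right) \cdot
     \frac{\partial}{\partial \theta_j} \left( h_\theta(x) - y \right) \\
  &= \left( h_\theta(x) - y \right) \cdot
     \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{1} \theta_k x_k - y \right) \\
  &= \left( h_\theta(x) - y \right) x_j
\end{aligned}
```

For j = 0 the factor x0 = 1 disappears, and for j = 1 the factor is the input x itself. Averaging over the m training examples gives the two update rules above.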

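The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis becomes more and more accurate.

So this is simply gradient descent on the original cost function J. This method looks at every example in the entire training set on every step, and is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global optimum and no other local optima; thus gradient descent always converges to the global minimum (assuming the learning rate α is not too large). Indeed, J is a convex quadratic function. Here is an example of gradient descent as it is run to minimize a quadratic function.

The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48, 30). The x's in the figure (joined by straight lines) mark the successive values of θ that gradient descent went through as it converged to its minimum.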
