Machine Learning | Andrew Ng | Coursera: Notes on Andrew Ng's Machine Learning course
Week 1:
Machine Learning:
- A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
- Supervised Learning: We already know what our correct output should look like.
- Regression: Try to map input variables to some continuous function.
- Classification: Try to map input variables into discrete categories.
- Unsupervised Learning: We have little or no idea what our results should look like.
- Clustering: Find a way to automatically group data into groups that are somehow similar or related by different variables.
- Non-clustering: Find structure in a chaotic environment, like the "Cocktail Party Algorithm".
Model Representation:
- x(i): Input features
- y(i): Target variable
- (x(i), y(i)): Training example
- (x(i), y(i)); i = 1, ..., m: Training set
- m: Number of training examples
- h(x): Hypothesis, hθ(x) = θ0 + θ1x
- Cost function J(θ0, θ1): Takes an average of the squared differences between the results of the hypothesis for the inputs x and the actual outputs y.
- Algorithm: J(θ0, θ1) = 1/(2m) * Σ(i = 1..m) (hθ(x(i)) - y(i))^2
(The mean is halved (1/2) as a convenience for the computation of gradient descent, as the derivative of the square function cancels out the 1/2 term.)
- We use a contour plot to show how to minimize the cost function.
- Gradient descent helps us estimate the parameters in the hypothesis function.
- Algorithm: θj := θj - α * ∂/∂θj J(θ0, θ1)
(repeat until convergence)
- j = 0, 1: Feature index number
- α: Learning rate, or the size of each step. If α is too small, gradient descent can be slow. If α is too large, gradient descent can overshoot the minimum.
- Partial derivative of J: Direction of each step
- At each iteration, one should simultaneously update all of the parameters θj.
Gradient Descent For Linear Regression:
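- Algorithm (repeat until convergence): θ0 := θ0 - α * (1/m) * Σ(i = 1..m) (hθ(x(i)) - y(i)); θ1 := θ1 - α * (1/m) * Σ(i = 1..m) (hθ(x(i)) - y(i)) * x(i)
- This method looks at every example in the entire training set on every step, and is called batch gradient descent.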
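To make the cost function and the simultaneous update concrete, here is a minimal Python sketch of batch gradient descent for the one-feature hypothesis h(x) = θ0 + θ1x. The function names, the toy data set and the values of alpha and iterations are assumptions made for this illustration only.

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost: J = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def batch_gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x, using every example on every step."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old thetas.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Toy data generated from y = 1 + 2x (an assumption for this example).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = batch_gradient_descent(xs, ys)
```

On this noise-free data the parameters should approach θ0 = 1 and θ1 = 2, and the cost should approach zero.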
- I learned linear algebra in college, so I skip that part in these notes.
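- n: Number of features
- x(i): Input of the ith training example
- x(i)j: Value of feature j in the ith training example
- hθ(x): θ0x0 + θ1x1 + θ2x2 + θ3x3 + ⋯ + θnxn = θ^T x
(assume x0 = 1)
- Algorithm (repeat until convergence, for j = 0, ..., n): θj := θj - α * (1/m) * Σ(i = 1..m) (hθ(x(i)) - y(i)) * x(i)j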
- Feature Scaling:
- Feature Scaling: Divide the input values by the range (max - min) of the input variable, to get every feature into approximately a -1 <= xi <= 1 range.
- Mean Normalization: Subtract the average value of an input variable from its values, so that the new average value of the input variable is zero.
xi := (xi - μi) / si, where μi is the average of all the values for feature i and si is either the range of values (max - min) or the standard deviation.
- Learning Rate: Make a plot with the number of iterations on the x-axis and J(θ) on the y-axis. If J(θ) ever increases, you probably need to decrease α. It has been proven that if the learning rate α is sufficiently small, J(θ) will decrease on every iteration. To choose α, try 0.001, 0.003, 0.01, ...
- Features and Polynomial Regression: We can improve our features and the form of our hypothesis function in a couple of different ways.
- We can combine multiple features into one. For example, we can get a new feature x3 by taking x1 * x2.
- We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
- If you choose your features this way, then feature scaling becomes very important.
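Mean normalization with feature scaling can be sketched in a few lines of Python. The function name, the use_std switch and the house sizes are hypothetical; the squared feature shows why scaling matters once polynomial features are introduced.

```python
def mean_normalize(values, use_std=False):
    """Return (x - mu) / s for each value.

    s is the range (max - min) by default, or the standard deviation.
    Assumes the values are not all identical (otherwise s would be zero).
    """
    mu = sum(values) / len(values)
    if use_std:
        s = (sum((v - mu) ** 2 for v in values) / len(values)) ** 0.5
    else:
        s = max(values) - min(values)
    return [(v - mu) / s for v in values]


sizes = [1000.0, 1500.0, 2000.0]                      # hypothetical house sizes
scaled = mean_normalize(sizes)                        # roughly within [-1, 1]
scaled_sq = mean_normalize([s ** 2 for s in sizes])   # size^2 spans up to 4,000,000 before scaling
```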
- Normal Equation formula: θ = (X^T X)^(-1) X^T y
- Example:
- There is no need to do feature scaling with the normal equation.
- If (X^TX) is non-invertible:
- Delete redundant features, such as x1 = size in feet^2 and x2 = size in m^2.
- Delete features to make sure that m > n, or use regularization.
- The classification problem is just like the regression problem, except that the values we now want to predict take on only a small number of discrete values.
- x(i): Feature
- y(i): Label for the training example
- We change the form of our hypothesis to satisfy 0 <= h(x) <= 1 by plugging θ^Tx into the logistic function.
- Formula: hθ(x) = g(θ^Tx), where g(z) = 1 / (1 + e^(-z))
- Decision Boundary: The line that separates the area where y = 0 from the area where y = 1. It is created by the hypothesis function (θ^Tx = 0).
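- Cost Function: Cost(hθ(x), y) = -log(hθ(x)) if y = 1, and -log(1 - hθ(x)) if y = 0.
We can compress our cost function's two conditional cases into one case: Cost(hθ(x), y) = -y * log(hθ(x)) - (1 - y) * log(1 - hθ(x))
- Gradient Descent: θj := θj - α * (1/m) * Σ(i = 1..m) (hθ(x(i)) - y(i)) * x(i)j
This update has the same form as the one we used in linear regression, but h(x) is now the logistic function of θ^Tx.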
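The logistic hypothesis, the compressed cost and one gradient descent step can be sketched in Python as follows. All names, the tiny data set and the learning rate are assumptions for illustration; each row of X starts with x0 = 1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x)."""
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))


def cost(theta, X, y):
    """J = -(1/m) * sum(y * log(h) + (1 - y) * log(1 - h))."""
    total = 0.0
    for row, yi in zip(X, y):
        h = hypothesis(theta, row)
        total += -yi * math.log(h) - (1 - yi) * math.log(1 - h)
    return total / len(X)


def gradient_step(theta, X, y, alpha):
    """One simultaneous update of every theta_j."""
    m = len(X)
    errors = [hypothesis(theta, row) - yi for row, yi in zip(X, y)]
    return [t - alpha * sum(e * row[j] for e, row in zip(errors, X)) / m
            for j, t in enumerate(theta)]


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]  # x0 = 1, one real feature
y = [0, 0, 1, 1]
theta = [0.0, 0.0]
initial_cost = cost(theta, X, y)   # log(2), since every h starts at 0.5
for _ in range(500):
    theta = gradient_step(theta, X, y, alpha=0.1)
```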
Optimization Algorithms:
- Conjugate gradient
- BFGS
- L-BFGS
- We can pass our own cost function to Octave's "fminunc()" to use these algorithms.
Multiclass Classification:
- Train a logistic regression classifier hθ(x) for each class i to predict the probability that y = i. To make a prediction on a new x, pick the class that maximizes hθ(x).
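The prediction step of one-vs-all can be sketched as below, assuming the per-class parameter vectors have already been trained. The three theta vectors here are hand-picked, hypothetical values rather than trained ones.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_one_vs_all(thetas, x):
    """Return the index of the class whose classifier gives the highest h_theta(x)."""
    scores = [sigmoid(sum(t * xj for t, xj in zip(theta, x))) for theta in thetas]
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical parameters for three classes; x = [x0, x1] with x0 = 1.
thetas = [[2.0, -1.0],    # class 0: likely for small x1
          [0.0, 0.0],     # class 1: constant 0.5
          [-6.0, 1.0]]    # class 2: likely for large x1

small, middle, large = (predict_one_vs_all(thetas, [1.0, x1]) for x1 in (0.0, 4.0, 10.0))
```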
Overfitting:
- Even though the fitted curve passes through the data perfectly, we would not expect this to be a very good predictor.
- Options to address overfitting:
- Reduce the number of features.
- Regularization.
- Regularized Linear Regression:
- Cost Function: J(θ) = 1/(2m) * [Σ(i = 1..m) (hθ(x(i)) - y(i))^2 + λ * Σ(j = 1..n) θj^2]
(λ is the regularization parameter; θ0 is not penalized.)
- Gradient Descent:
- Normal Equation:
- Regularized Logistic Regression:
- Cost Function:
- Gradient Descent:
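As a sketch of regularized gradient descent for linear regression, the Python below adds the (λ/m) * θj term to every gradient except that of θ0. Function names, data and hyperparameters are assumptions chosen for this example.

```python
def predict(theta, row):
    return sum(t * xj for t, xj in zip(theta, row))


def regularized_cost(theta, X, y, lam):
    """J = 1/(2m) * [sum of squared errors + lam * sum(theta_j^2 for j >= 1)]."""
    squared_errors = sum((predict(theta, row) - yi) ** 2 for row, yi in zip(X, y))
    penalty = lam * sum(t ** 2 for t in theta[1:])   # theta0 is not penalized
    return (squared_errors + penalty) / (2 * len(X))


def regularized_step(theta, X, y, alpha, lam):
    m = len(X)
    errors = [predict(theta, row) - yi for row, yi in zip(X, y)]
    new_theta = []
    for j, t in enumerate(theta):
        grad = sum(e * row[j] for e, row in zip(errors, X)) / m
        if j > 0:
            grad += lam / m * t                      # shrink every theta_j except theta0
        new_theta.append(t - alpha * grad)
    return new_theta


def fit(X, y, lam, alpha=0.1, iterations=2000):
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        theta = regularized_step(theta, X, y, alpha, lam)
    return theta


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]   # x0 = 1
y = [1.0, 3.0, 5.0, 7.0]                               # generated from y = 1 + 2x
plain = fit(X, y, lam=0.0)     # slope close to 2
shrunk = fit(X, y, lam=10.0)   # slope pulled towards 0 by the penalty
```

With λ = 0 the fit recovers the slope of the data; with a large λ the slope is shrunk while θ0 grows to compensate, which is the underfitting direction.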
Week 4:
- If we had one hidden layer, it would look like:
- The values for each of the "activation" nodes:
- Each layer gets its own matrix of weights Θ(j):
(If layer j has sj units and layer j+1 has sj+1 units, then Θ(j) has dimension sj+1 × (sj + 1). The '+1' comes from the bias nodes: the output nodes do not include the bias node, while the inputs do.)
- Vectorized:
- We can choose different theta matrices to construct fundamental logical operations (such as AND and OR) with a small neural network.
- We can construct more complex operations by using hidden layers.
- Multiclass Classification: We use the one-vs-all method and let the hypothesis function return a vector of values.
- L: Total number of layers in the network
- sl: Number of units (not counting the bias unit) in layer l
- K: Number of output units/classes
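Forward propagation through one hidden layer can be sketched in Python as below. The weights are the widely used logic-gate values (AND, NOR and OR units combined into XNOR); treat them, and the function names, as illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(theta, activations):
    """Forward one layer: prepend the bias unit 1, then apply g(Theta * a)."""
    a = [1.0] + activations
    return [sigmoid(sum(w * ai for w, ai in zip(row, a))) for row in theta]


# Each matrix has s_(j+1) rows and s_j + 1 columns (the extra column is the bias weight).
THETA_HIDDEN = [[-30.0, 20.0, 20.0],    # x1 AND x2
                [10.0, -20.0, -20.0]]   # (NOT x1) AND (NOT x2)
THETA_OUTPUT = [[-10.0, 20.0, 20.0]]    # a1 OR a2


def xnor(x1, x2):
    hidden = layer(THETA_HIDDEN, [float(x1), float(x2)])
    return layer(THETA_OUTPUT, hidden)[0]


truth_table = {(x1, x2): round(xnor(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)}
```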
Backpropagation Algorithm:
- "Backpropagation" is neural-network terminology for minimizing our cost function.
- Algorithm:For t = 1 to m:
- We get
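- Unroll all the elements of the theta matrices into one long vector before passing them to an optimization function.
Reshape that vector to get back the original matrices.
- Gradient Checking: We can approximate the derivative with respect to θj as follows: ∂/∂θj J(θ) ≈ (J(θ1, ..., θj + ε, ..., θn) - J(θ1, ..., θj - ε, ..., θn)) / (2ε)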
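Gradient checking can be sketched as a two-sided difference for each θj, compared against an analytic gradient. The cost function below is a hypothetical stand-in with a known derivative, and epsilon = 1e-4 is an assumed value.

```python
def numerical_gradient(cost, theta, epsilon=1e-4):
    """Approximate dJ/dtheta_j by (J(theta + eps) - J(theta - eps)) / (2 * eps)."""
    grad = []
    for j in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[j] += epsilon
        minus[j] -= epsilon
        grad.append((cost(plus) - cost(minus)) / (2 * epsilon))
    return grad


def example_cost(theta):
    """Hypothetical cost: J = theta0^2 + 3 * theta0 * theta1."""
    return theta[0] ** 2 + 3 * theta[0] * theta[1]


def example_gradient(theta):
    """Analytic gradient of example_cost, standing in for backpropagation."""
    return [2 * theta[0] + 3 * theta[1], 3 * theta[0]]


theta = [1.0, 2.0]
approx = numerical_gradient(example_cost, theta)
exact = example_gradient(theta)
max_difference = max(abs(a - e) for a, e in zip(approx, exact))
```

A small max_difference suggests the analytic gradient is implemented correctly; the numerical version is too slow to use for training itself.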
- Training:
- Set 70% of the data to be the training set and the remaining 30% to be the test set.
- In order to choose the model for your hypothesis, we can test each degree of polynomial on a cross validation set. (60% training set, 20% cross validation set, 20% test set)
- High bias is underfitting and high variance is overfitting. Ideally, we need to find a golden mean between these two.
- High Bias:
- High Variance:
- In order to choose the model and the regularization term λ, we need to:
- If a learning algorithm is suffering from high bias, getting more training data will not help much.
- If a learning algorithm is suffering from high variance, getting more training data is likely to help.
- A small neural network with fewer parameters is prone to underfitting. It is also computationally cheaper.
- A large neural network with more parameters is prone to overfitting. It is also computationally expensive.
- The recommended approach:
- Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
- Plot learning curves to decide if more data, more features, etc. are likely to help.
- Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.
- It is very important to get error results as a single, numerical value.
- Precision
- Skewed Classes: The ratio of positive to negative examples is very close to one of two extremes.
(y = 1 in the presence of the rare class that we want to detect)
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1 Score: (2 * P * R) / (P + R)
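The three metrics can be computed from label lists as in this minimal Python sketch. The function name and the example labels are assumptions; y = 1 marks the rare class.

```python
def precision_recall_f1(actual, predicted):
    """Return (precision, recall, F1) for binary labels where 1 is the rare class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]   # 2 true positives, 1 false positive, 1 false negative
precision, recall, f1 = precision_recall_f1(actual, predicted)

# A classifier that always predicts 0 has 62.5% accuracy here but an F1 score of 0.
always_zero = precision_recall_f1(actual, [0] * len(actual))
```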
- Because multiplying by a constant doesn't change the value of theta that achieves the minimum, we multiply the objective function of logistic regression by m.
- We can use either (A + λB) or (CA + B) to control the relative weighting of the two terms; C plays a role similar to 1/λ.
- A support vector machine just makes a prediction of y being equal to one or zero directly. So the hypothesis predicts one if θ^Tx >= 0, and zero otherwise.
- The SVM decision boundary will become like this:
- The black line gives SVM a robustness because it has a large margin:
- Given (xi, yi), we choose li = xi as landmarks, then let fi = sim(x, li).
- We compute new features depending on proximity to landmarks, so our function becomes theta0 + theta1*f1 + theta2*f2 + ...
- Gaussian Kernels: fi = exp(-||x - li||^2 / (2 * sigma^2))
- C and Sigma:
- Do perform feature scaling before using the Gaussian kernel.
- Linear kernel: Meaning no kernel.
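The landmark features with a Gaussian kernel can be sketched in Python as follows. The function names and the sample points are hypothetical.

```python
import math


def gaussian_kernel(x, landmark, sigma):
    """sim(x, l) = exp(-||x - l||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def kernel_features(x, landmarks, sigma):
    """Map x to the feature vector [f1, f2, ...], one similarity per landmark."""
    return [gaussian_kernel(x, l, sigma) for l in landmarks]


landmarks = [[0.0, 0.0], [3.0, 4.0]]             # e.g. two training examples used as landmarks
features = kernel_features([0.0, 0.0], landmarks, sigma=1.0)
wider = kernel_features([0.0, 0.0], landmarks, sigma=5.0)
```

The feature is 1 at a landmark and falls towards 0 far from it; a larger sigma makes the fall-off slower, which is why features on very different scales should be scaled first.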
Unsupervised Learning:
Clustering:
- We give an unlabeled training set to an algorithm and ask the algorithm to find some structure in the data for us.
- K-means Algorithm:
- Cost Function:
- Random Initialization: Randomly pick K training examples and set Mu1 to MuK equal to these K examples.
- Elbow Method:
- A better way to choose the number of clusters is to ask for what purpose you are running K-means.
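A minimal Python sketch of K-means, alternating the cluster assignment step and the move centroid step, with random initialization from the training examples. The function names, the fixed iteration count, the seed and the toy points are assumptions for illustration.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign(points, centroids):
    """Cluster assignment step: index of the closest centroid for each point."""
    return [min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points]


def k_means(points, initial_centroids, iterations=10):
    centroids = [list(c) for c in initial_centroids]
    for _ in range(iterations):
        assignments = assign(points, centroids)
        # Move centroid step: each centroid becomes the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, c in zip(points, assignments) if c == k]
            if members:   # an empty cluster keeps its previous centroid
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assign(points, centroids)


points = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.0], [8.0, 9.0]]
rng = random.Random(0)            # fixed seed, only to make the run repeatable
initial = rng.sample(points, 2)   # random initialization: pick K training examples
centroids, assignments = k_means(points, initial)
```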
- Motivation for dimensionality reduction: Data compression, or speeding up our learning algorithm.
- Visualization: We can use dimensionality reduction to reduce data from high dimensions down to 2 or 3 dimensions, so that we can plot it and understand our data better.
- PCA: Find a lower dimensional surface onto which to project the data, so as to minimize the squared distance between each point and the location where it gets projected.
- Reduce from 2D to 1D: Find a vector onto which to project the data to minimize the projection error.
- Reduce from nD to kD: Find k vectors onto which to project the data to minimize the projection error.
- Data preprocessing: Feature scaling/Mean normalization
- Algorithm:
- If we want to reduce the data from n dimensions down to k dimensions, we take the first k vectors from U (n * n) as Ureduce (n * k).
- z = Ureduce' * x.
- Reconstruction from Compressed Representation:Xapprox = Ureduce * z.
- Applying:
(Implement PCA only if your algorithm doesn't do what you want without it.)
Anomaly Detection:
Density Estimation:
- We build a model of the probability of x. If p(x-test) is less than some epsilon, we flag it as an anomaly.
- Gaussian Distribution (Normal Distribution): x ~ N(μ, σ^2), p(x; μ, σ^2) = 1 / (sqrt(2π) * σ) * exp(-(x - μ)^2 / (2σ^2))
- Parameter Estimation:
- Algorithm:
- Evaluation: Assume we have some labeled data of anomalous and non-anomalous examples. Use a training set (unlabeled, assumed to be normal examples), a cross validation set and a test set.
- Anomaly Detection vs. Supervised Learning:
- Non-gaussian Features: Let xNew = log(x) (log-normal distribution), or xNew = x^(0.1).
- Choose Features: Choose features that might take on unusually large or small values in the event of an anomaly.
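The density estimation approach can be sketched in Python as below: estimate a mean and variance per feature, multiply the per-feature Gaussian densities, and compare p(x) with epsilon. The function names, the training data and epsilon = 0.01 are assumptions; in practice epsilon would be chosen on the cross validation set.

```python
import math


def estimate_gaussian(X):
    """Parameter estimation: per-feature mean and variance (dividing by m)."""
    m = len(X)
    columns = list(zip(*X))
    mu = [sum(col) / m for col in columns]
    sigma2 = [sum((v - mu_j) ** 2 for v in col) / m for col, mu_j in zip(columns, mu)]
    return mu, sigma2


def density(x, mu, sigma2):
    """p(x) = product over features of the univariate Gaussian density."""
    p = 1.0
    for xj, mu_j, s2 in zip(x, mu, sigma2):
        p *= math.exp(-(xj - mu_j) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    return p


def is_anomaly(x, mu, sigma2, epsilon):
    return density(x, mu, sigma2) < epsilon


# Hypothetical training set of normal examples with two features.
X_train = [[1.0, 2.0], [1.2, 1.9], [0.8, 2.1], [1.0, 2.0]]
mu, sigma2 = estimate_gaussian(X_train)
epsilon = 0.01   # assumed threshold for this example
typical = is_anomaly([1.0, 2.0], mu, sigma2, epsilon)
unusual = is_anomaly([3.0, 2.0], mu, sigma2, epsilon)
```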
Multivariate Gaussian Distribution:
Recommender Systems:
- n.u = number of users
- n.m = number of movies
- r(i,j) = 1 if user j has rated movie i
- y(i,j) = rating given by user j to movie i (only if r(i,j) = 1)
- theta(j) = parameter vector for user j
- x(i) = feature vector for movie i
Content Based Recommendations:
- We assume we have features for different movies.
- For each user j, learn a parameter vector theta(j). Predict user j as rating movie i with (theta(j))' * x(i) stars.
- Optimization Objective:
- Gradient Descent:
Collaborative Filtering:
- We assume that each of our users has told us how much they like the romantic movies and how much they like action packed movies.
- Optimization Algorithm:
- Given x and the movie ratings, we can estimate theta.
- Given theta and the movie ratings, we can estimate x.
- Optimization Objective:
- Mean Normalization: Compute the average rating that each movie obtained and subtract off the mean rating. So the predicted rating of user j for movie i becomes (theta(j))' * x(i) + the average rating of movie i.
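Mean normalization of the ratings can be sketched in Python as below, using the y(i,j) and r(i,j) notation from these notes. The function names and the small rating matrix are hypothetical.

```python
def mean_normalize_ratings(Y, R):
    """Y[i][j]: rating of movie i by user j; R[i][j] = 1 if that rating exists.

    Returns the normalized ratings (unrated entries stay 0) and each movie's mean.
    """
    means, Y_norm = [], []
    for y_row, r_row in zip(Y, R):
        rated = [y for y, r in zip(y_row, r_row) if r == 1]
        mu = sum(rated) / len(rated) if rated else 0.0
        means.append(mu)
        Y_norm.append([y - mu if r == 1 else 0.0 for y, r in zip(y_row, r_row)])
    return Y_norm, means


def predict(theta_j, x_i, mu_i):
    """Predicted rating: (theta(j))' * x(i) plus the movie's average rating."""
    return sum(t * x for t, x in zip(theta_j, x_i)) + mu_i


Y = [[5.0, 4.0, 0.0],
     [1.0, 0.0, 3.0]]
R = [[1, 1, 0],
     [1, 0, 1]]
Y_norm, means = mean_normalize_ratings(Y, R)

# A user with no ratings ends up with theta close to 0, so the prediction is the movie mean.
new_user_prediction = predict([0.0, 0.0], [0.9, 0.1], means[0])
```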
Week 10:
Large Scale Machine Learning:
Stochastic Gradient Descent:
- Algorithm:
- Randomly shuffle the data set.
- For i = 1...m:
- SGD will only try to fit one training example at a time. This way we can make progress in gradient descent without having to scan all m training examples first.
- We will usually take 1-10 passes through data set to get near the global minimum.
- Convergence: Plot the average cost of the hypothesis over the last 1000 or so training examples. We can compute and save these costs during the gradient descent iterations.
- One strategy for trying to actually converge at the global minimum is to slowly decrease α over time.
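A minimal Python sketch of stochastic gradient descent for linear regression: shuffle once, then update theta after every single example. The function name, data, seed, alpha and the number of passes are assumptions; the data set is tiny, so it uses far more passes than the 1-10 that are typical for large data sets.

```python
import random


def sgd(X, y, alpha=0.1, passes=10, seed=0):
    """Stochastic gradient descent: one training example per parameter update."""
    theta = [0.0] * len(X[0])
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)   # randomly shuffle the data set once
    for _ in range(passes):
        for i in order:
            error = sum(t * xj for t, xj in zip(theta, X[i])) - y[i]
            theta = [t - alpha * error * xj for t, xj in zip(theta, X[i])]
    return theta


X = [[1.0, k / 10] for k in range(11)]   # x0 = 1, x1 in [0, 1]
y = [1.0 + 2.0 * row[1] for row in X]    # generated from y = 1 + 2x
theta = sgd(X, y, passes=500)
```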
Mini-Batch Gradient Descent:
- Use b examples in each iteration. (b = mini-batch size)
- Algorithm:
- The advantage is that we can use vectorized implementations over the b examples.
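Mini-batch gradient descent can be sketched in Python as below, averaging the gradient over b examples per update. The names, data and hyperparameters are assumptions; plain Python loops stand in for the vectorized computation that makes mini-batches attractive in practice.

```python
def mini_batch_gradient_descent(X, y, b=2, alpha=0.1, passes=500):
    """Update theta once per mini-batch of b consecutive examples."""
    theta = [0.0] * len(X[0])
    for _ in range(passes):
        for start in range(0, len(X), b):
            batch_X, batch_y = X[start:start + b], y[start:start + b]
            errors = [sum(t * xj for t, xj in zip(theta, row)) - yi
                      for row, yi in zip(batch_X, batch_y)]
            theta = [t - alpha * sum(e * row[j] for e, row in zip(errors, batch_X)) / len(batch_X)
                     for j, t in enumerate(theta)]
    return theta


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]   # x0 = 1
y = [1.0, 3.0, 5.0, 7.0]                               # generated from y = 1 + 2x
theta = mini_batch_gradient_descent(X, y)
```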
Online Learning:
- With a continuous stream of users to a website, we can run an endless loop that gets (x,y), where we collect some user actions for the features in x to predict some behavior y.
- You can update θ for each individual (x,y) pair as you collect them. This way, you can adapt to new pools of users, since you are continuously updating theta.
Map Reduce and Data Parallelism:
- Many learning algorithms can be expressed as computing sums of functions over the training set.
- We can divide up batch gradient descent and dispatch the cost function for a subset of the data to many different machines so that we can train our algorithm in parallel.
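The split of the gradient sum into per-machine partial sums can be sketched in Python as below. The chunks are processed sequentially here purely for illustration; in a real map-reduce setup each partial_gradient call would run on a different machine or core. All names are hypothetical.

```python
def partial_gradient(theta, X_part, y_part):
    """'Map' step: sum of (h(x) - y) * x_j over one machine's share of the data."""
    sums = [0.0] * len(theta)
    for row, yi in zip(X_part, y_part):
        error = sum(t * xj for t, xj in zip(theta, row)) - yi
        for j, xj in enumerate(row):
            sums[j] += error * xj
    return sums


def combined_gradient(theta, X, y, machines=2):
    """'Reduce' step: add the partial sums and divide by m."""
    chunk = (len(X) + machines - 1) // machines
    partials = [partial_gradient(theta, X[s:s + chunk], y[s:s + chunk])
                for s in range(0, len(X), chunk)]
    return [sum(p[j] for p in partials) / len(X) for j in range(len(theta))]


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]   # x0 = 1
y = [1.0, 3.0, 5.0, 7.0]
theta = [0.0, 0.0]
single = combined_gradient(theta, X, y, machines=1)
split = combined_gradient(theta, X, y, machines=2)
```

The combined gradient is the same as the one computed on a single machine, so the gradient descent update itself is unchanged.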
Week 11:
Photo OCR:
- Pipeline:
- Text detection
- Character segmentation
- Character classification
- Use sliding windows and expansion for text detection and character segmentation.
- Ceiling Analysis
Artificial Data Synthesis:
- Creating new data from scratch (using random fonts as an example)
- Taking existing labeled examples and introducing distortions to them, to create extra labeled examples.