[Udacity] Linear Regression

Concept in English
Coding Portion
Performance metrics for evaluating a regression: the R-squared metric
Comparing classification and regression

Continuous supervised learning

Regression

Continuous: the output values have an order and can be compared by magnitude.

1. Concept in English

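Slope (斜率): the change in the output for each unit change in the input.

Intercept (截距): the value of the output when the input is 0.

Coefficient (系数): the number that multiplies an input variable; with a single input, this is the slope.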
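These terms fit together in the equation of a line, y = slope * x + intercept. A minimal sketch with made-up numbers (the function name and the slope and intercept values are illustrative assumptions, not results from the course data):

```python
def predict_line(x, slope, intercept):
    """Evaluate the line y = slope * x + intercept at x."""
    return slope * x + intercept


# Hypothetical coefficients, chosen only for illustration.
slope = 5.0        # y grows by 5 for every unit increase in x
intercept = -10.0  # value of y when x is 0

y_at_0 = predict_line(0, slope, intercept)    # equals the intercept: -10.0
y_at_27 = predict_line(27, slope, intercept)  # 5.0 * 27 - 10.0 = 125.0
print(y_at_0, y_at_27)
```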

2. Coding Portion

Google: sklearn regression



import numpy
import matplotlib.pyplot as plt

from ages_net_worths import ageNetWorthData
from sklearn.linear_model import LinearRegression

ages_train, ages_test, net_worths_train, net_worths_test = ageNetWorthData()

reg = LinearRegression()
reg.fit(ages_train, net_worths_train)

### get Katie's net worth (she's 27)
### sklearn predictions are returned in an array, so you'll want to index into
### the output to get what you want, e.g. net_worth = predict([[27]])[0][0] (not
### exact syntax, the point is the [0] at the end). In addition, make sure the
### argument to your prediction function is in the expected format - if you get
### a warning about needing a 2d array for your data, a list of lists will be
### interpreted by sklearn as such (e.g. [[27]]).
km_net_worth = 1.0 ### fill in the line of code to get the right value
km_net_worth = reg.predict([[27]])[0][0]

### get the slope
### again, you'll get a 2-D array, so stick the [0][0] at the end
slope = 0. ### fill in the line of code to get the right value
slope = reg.coef_[0][0]
# print(reg.coef_)

### get the intercept
### here you get a 1-D array, so stick [0] on the end to access
### the info we want
intercept = 0. ### fill in the line of code to get the right value
intercept = reg.intercept_[0]

### get the score on test data
test_score = 0. ### fill in the line of code to get the right value
test_score = reg.score(ages_test, net_worths_test)

### get the score on the training data
training_score = 0. ### fill in the line of code to get the right value
training_score = reg.score(ages_train, net_worths_train)

### return all the values
def submitFit():
    # all of the values in the returned dictionary are expected to be
    # numbers for the purpose of the grader.
    return {"networth": km_net_worth,
            "slope": slope,
            "intercept": intercept,
            "stats on test": test_score,
            "stats on training": training_score}

3. Performance metrics for evaluating a regression

Evaluating goodness of fit

3.1 Minimizing the sum of squared errors

SSE: Sum of Squared Errors
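As a minimal sketch, the SSE is the sum over all points of the squared difference between the actual value and the predicted value. The function name and the sample numbers here are illustrative assumptions:

```python
def sum_of_squared_errors(actual, predicted):
    """Return the sum of (actual - predicted)^2 over all points."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


# Made-up values, for illustration only.
actual = [2.0, 4.0, 6.0]
predicted = [2.5, 3.5, 7.0]

sse = sum_of_squared_errors(actual, predicted)
print(sse)  # 0.25 + 0.25 + 1.0 = 1.5
```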

Algorithms that minimize it:

1. Ordinary Least Squares (OLS)

2. Gradient Descent
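Both approaches can be sketched for the single-input case. This is a rough illustration, not how sklearn implements its fit: the function names, the learning rate, the step count and the data are all assumptions. The gradient descent version minimizes the mean squared error (the SSE divided by the number of points), which has the same minimizer as the SSE.

```python
def fit_ols(xs, ys):
    """Closed-form OLS for one input variable: returns (slope, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def fit_gradient_descent(xs, ys, learning_rate=0.01, steps=20000):
    """Minimize the mean squared error with repeated small gradient steps."""
    n = len(xs)
    slope, intercept = 0.0, 0.0
    for _ in range(steps):
        # Signed error of the current line at every point.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2.0 / n * sum(errors)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Made-up data lying close to the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.1, 7.9, 10.2]

ols_slope, ols_intercept = fit_ols(xs, ys)
gd_slope, gd_intercept = fit_gradient_descent(xs, ys)
print(ols_slope, ols_intercept)
print(gd_slope, gd_intercept)
```

On this data the two methods should agree to several decimal places. OLS gets there in one step, while gradient descent depends on a learning rate that is small enough and on enough iterations.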

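Drawback: the more data points you add, the larger the sum of squared errors gets (it can never decrease), but a larger sum does not mean a worse fit.

Solution: the R-squared metric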
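A small sketch of this drawback, using made-up points and a hypothetical fixed line y = 2x: repeating the same points ten times leaves the per-point error unchanged, yet the SSE becomes ten times larger.

```python
def sum_of_squared_errors(actual, predicted):
    """Return the sum of (actual - predicted)^2 over all points."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def line(x):
    # Hypothetical fitted line, used unchanged for both data sets.
    return 2.0 * x


small_x = [1.0, 2.0, 3.0, 4.0]
small_y = [2.5, 3.5, 6.5, 7.5]  # every point is 0.5 away from the line

# Ten copies of the same points: same line, same error per point.
large_x = small_x * 10
large_y = small_y * 10

sse_small = sum_of_squared_errors(small_y, [line(x) for x in small_x])
sse_large = sum_of_squared_errors(large_y, [line(x) for x in large_x])
print(sse_small, sse_large)  # 1.0 10.0
```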

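3.2 The R-squared metric

The higher the R-squared, the better the fit (maximum = 1).

Definition: how much of the change in the output is explained by the change in the input.

Advantage: it does not depend on the number of training points.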
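To illustrate that advantage, here is a sketch under the same kind of assumptions (made-up points, a hypothetical fixed line y = 2x, a hand-written r_squared helper rather than sklearn's). Repeating the data ten times leaves R-squared unchanged, because the residual and total sums of squares grow by the same factor.

```python
def r_squared(actual, predicted):
    """Return 1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def line(x):
    # Hypothetical fitted line, used unchanged for both data sets.
    return 2.0 * x


small_x = [1.0, 2.0, 3.0, 4.0]
small_y = [2.5, 3.5, 6.5, 7.5]

# Ten copies of the same points.
large_x = small_x * 10
large_y = small_y * 10

r2_small = r_squared(small_y, [line(x) for x in small_x])
r2_large = r_squared(large_y, [line(x) for x in large_x])
print(r2_small, r2_large)  # both are 1 - 1/17, about 0.941
```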

R-squared in sklearn

print("r-squared score:", reg.score(x, y))

R-squared can be less than 0!

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.
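The three cases in that definition can be checked by hand. This is a minimal sketch that computes 1 - u/v directly, taking u as the residual sum of squares and v as the total sum of squares. The helper and the sample values are assumptions for illustration, not sklearn's code.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - u/v, with u the residual and v the total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    v = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - u / v


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, y_true)            # u = 0, so R^2 = 1.0
constant = r_squared(y_true, [2.5] * 4)        # always the mean, so R^2 = 0.0
bad = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])  # u = 20, v = 5, so R^2 = -3.0
print(perfect, constant, bad)
```

A negative value means the predictions are worse than simply predicting the mean of y every time.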

4. Comparing classification and regression

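Property	Supervised classification	Regression
Output type	Label (discrete)	Value (continuous)
What you are looking for (visualization)	Decision boundary	Best-fit curve
How the model is evaluated	Accuracy	Sum of squared errors or R-squared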
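The last row of the table can be sketched side by side: accuracy counts exact matches between discrete labels, while the sum of squared errors measures how far continuous predictions are from the true values. The labels, values and function names are made up for illustration.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of labels predicted exactly right (classification)."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def sum_of_squared_errors(true_values, predicted_values):
    """Sum of squared differences between values (regression)."""
    return sum((t - p) ** 2 for t, p in zip(true_values, predicted_values))


# Classification: discrete labels, a prediction is either right or wrong.
labels_true = ["fast", "slow", "slow", "fast"]
labels_pred = ["fast", "slow", "fast", "fast"]
acc = accuracy(labels_true, labels_pred)  # 3 of 4 correct: 0.75

# Regression: continuous values, a prediction can be slightly or badly off.
values_true = [10.0, 20.0, 30.0]
values_pred = [12.0, 19.0, 30.0]
sse = sum_of_squared_errors(values_true, values_pred)  # 4 + 1 + 0 = 5.0
print(acc, sse)
```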