Code and data: https://github.com/zle1992/MachineLearningInAction

logistic regression

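Pros: low computational cost, easy to understand and implement; it is a kind of linear model.

Cons: prone to underfitting, and classification accuracy can be low. It can, however, be used to predict probabilities.

Works with: numeric and nominal values.

Preparing the data: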

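 import numpy as np

 def loadDataSet():
     # `filename` is a module-level variable set in the main block at the end
     dataMat, labelMat = [], []
     with open(filename, "r") as fr:  # open the data file
         for line in fr:
             lineArr = line.split()  # split each line on whitespace
             # each row is [1.0, x1, x2]; the constant 1.0 pairs with the intercept weight
             dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])  # dataMat is a 2-D list
             labelMat.append(int(lineArr[2]))
     return dataMat, labelMat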

1 Classification with logistic regression and the sigmoid function

The sigmoid function:

 def sigmoid(inX):

     return 1.0/(1+np.exp(-inX))
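
As a quick illustration of how the sigmoid output is turned into a class, here is a minimal sketch. The input values in `z` are made up for the example; the 0.5 threshold matches the one used later in `classifyVector`.

```python
import numpy as np

def sigmoid(inX):
    return 1.0 / (1 + np.exp(-inX))

# Hypothetical scores (w . x) for three samples
z = np.array([-5.0, 0.0, 5.0])

p = sigmoid(z)                    # probabilities in (0, 1); sigmoid(0) is exactly 0.5
labels = (p > 0.5).astype(int)    # class 1 only when the probability exceeds 0.5

print(p)
print(labels)
```

Note that the sigmoid is symmetric around 0, so `sigmoid(-z)` equals `1 - sigmoid(z)`.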

2 Finding the best regression coefficients with an optimization method

Gradient ascent:

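Pseudocode for gradient ascent:
Initialize every regression coefficient to 1
Repeat R times:
    Compute the gradient over the whole dataset
    Update the coefficient vector by alpha x gradient
Return the regression coefficients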
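
The gradient in the pseudocode can be written out explicitly. The derivation below is a standard sketch, not taken from the original post; the symbols $X$ (data matrix with one sample $x_i$ per row), $y$ (label vector), $w$ (coefficients), $h$ (predicted probabilities) and $\alpha$ (step size) are notation assumed here. It shows why the update uses the difference between the labels and the predictions.

```latex
\begin{aligned}
h_i &= \sigma(w^\top x_i) = \frac{1}{1 + e^{-w^\top x_i}} \\
\ell(w) &= \sum_{i=1}^{m} \bigl[\, y_i \log h_i + (1 - y_i) \log (1 - h_i) \,\bigr] \\
\nabla_w \ell(w) &= \sum_{i=1}^{m} (y_i - h_i)\, x_i = X^\top (y - h) \\
w &\leftarrow w + \alpha\, X^\top (y - h)
\end{aligned}
```

Because the log-likelihood $\ell(w)$ is being maximized, the update adds the gradient rather than subtracting it.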

Code:

 def sigmoid(inX):

     return 1.0/(1+np.exp(-inX))

 def gradAscent(dataMat, labelMat):
     # plain arrays with @ are used because np.mat is not available in NumPy 2.0 and later
     dataMatrix = np.array(dataMat)                   # list -> (m, n) array
     labelMatrix = np.array(labelMat).reshape(-1, 1)  # labels as an (m, 1) column
     m, n = np.shape(dataMatrix)  # 100 rows, 3 columns for testSet.txt
     alpha = 0.001                # step size (learning rate)
     maxCycles = 500
     weight = np.ones((n, 1))     # random initial values might work better
     # weight = np.random.rand(n, 1)
     for k in range(maxCycles):
         h = sigmoid(dataMatrix @ weight)  # h is an (m, 1) column vector
         error = labelMatrix - h           # error is a column vector too
         weight = weight + alpha * (dataMatrix.T @ error)  # update step
         # print(k, "  ", weight)
     return weight
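
To try batch gradient ascent without the data files, the following self-contained sketch runs the same update on synthetic data. The two Gaussian clusters, the helper name `grad_ascent` and the random seed are assumptions made for illustration; only the update rule, `alpha = 0.001` and the 500 cycles follow the code above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def grad_ascent(X, y, alpha=0.001, max_cycles=500):
    """Batch gradient ascent on the logistic log-likelihood."""
    w = np.ones(X.shape[1])
    for _ in range(max_cycles):
        h = sigmoid(X @ w)                # predictions for all samples, shape (m,)
        w = w + alpha * (X.T @ (y - h))   # one step along the full-dataset gradient
    return w

# Synthetic data (an assumption for illustration): two well-separated clusters
rng = np.random.default_rng(0)
neg = rng.normal(loc=-2.0, scale=0.5, size=(50, 2))
pos = rng.normal(loc=2.0, scale=0.5, size=(50, 2))
X = np.column_stack([np.ones(100), np.vstack([neg, pos])])  # leading 1.0 column for the intercept
y = np.concatenate([np.zeros(50), np.ones(50)])

w = grad_ascent(X, y)
accuracy = np.mean((sigmoid(X @ w) > 0.5) == (y == 1))
print(w, accuracy)
```

Here the labels are kept as a 1-D array, so the weights come back with shape `(3,)` rather than as a column vector.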

3 Analyzing the data: plotting the decision boundary

 def plotfit(wei):
     import matplotlib.pyplot as plt
     weight = np.array(wei).ravel()  # accept a column vector or a 1-D array; flatten to 1-D
     dataMat, labelMat = loadDataSet()
     dataArr = np.array(dataMat)
     fig = plt.figure()
     ax = fig.add_subplot(111)
     # scatter plot of the samples, colored by class (same approach as the KNN plot)
     ax.scatter(dataArr[:, 1], dataArr[:, 2], s=50, c=np.array(labelMat) + 5)
     x = np.arange(-3.0, 3.0, 0.1)  # x1 values along which the fitted line is drawn
     y = (-weight[0] - weight[1] * x) / weight[2]  # boundary: w0 + w1*x1 + w2*x2 = 0
     ax.plot(x, y)
     plt.xlabel("x1")
     plt.ylabel("x2")
     plt.show()
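
The line drawn by `plotfit` is the set of points where the sigmoid input is zero, i.e. where the predicted probability is exactly 0.5. The sketch below checks this numerically; the weight values and the helper name `boundary_x2` are hypothetical and chosen only for the example.

```python
import numpy as np

def boundary_x2(weight, x1):
    """Solve w0 + w1*x1 + w2*x2 = 0 for x2."""
    w0, w1, w2 = np.asarray(weight, dtype=float).ravel()
    return (-w0 - w1 * x1) / w2

w = np.array([4.0, 0.5, -0.6])        # hypothetical weights, not fitted values
x1 = np.arange(-3.0, 3.0, 0.1)
x2 = boundary_x2(w, x1)

# On the boundary the linear score is zero, so sigmoid(score) is 0.5
scores = w[0] + w[1] * x1 + w[2] * x2
print(np.abs(scores).max())
```

The formula divides by `w2`, so it only applies when that weight is non-zero.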

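4 Training: stochastic gradient ascent

Pseudocode:
Initialize every regression coefficient to 1
For each sample in the dataset:
    Compute the gradient for that sample
    Update the coefficients by alpha x gradient
Return the regression coefficients

Plain gradient ascent computes the gradient over the whole dataset, so it works with matrix operations; h and error are both vectors.

Stochastic gradient ascent computes the gradient from one sample at a time, which reduces the computation per update; h and error are both scalars.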

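 def stocGradAscent0(dataMatrix, labelMatrix):
     # expects dataMatrix as an (m, n) NumPy array and labelMatrix as a length-m sequence
     m, n = np.shape(dataMatrix)
     alpha = 0.1
     weight = np.ones(n)
     for i in range(m):
         h = sigmoid(sum(dataMatrix[i] * weight))  # use only sample i, so h is a scalar
         error = labelMatrix[i] - h                # error is a scalar too
         weight = weight + alpha * error * dataMatrix[i]
     return weight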
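
The difference between the batch and the per-sample computation is easiest to see from the shapes involved. The following is a minimal sketch on a made-up three-sample dataset; the numbers are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

# Tiny made-up dataset: each row is [1.0, x1, x2]
X = np.array([[1.0, 0.5, 1.5],
              [1.0, -1.0, -0.5],
              [1.0, 2.0, 0.3]])
y = np.array([1.0, 0.0, 1.0])
w = np.ones(3)

# Batch: one prediction and one error per sample, as vectors
h_batch = sigmoid(X @ w)
err_batch = y - h_batch

# Stochastic: a single sample, so both values are scalars
h_one = sigmoid(X[0] @ w)
err_one = y[0] - h_one

print(h_batch.shape, np.ndim(h_one))
```

The scalar `h_one` is the first entry of `h_batch`; the stochastic version simply applies the update after each sample instead of summing over all of them first.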

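The algorithm above uses a fixed step size, which is unstable and makes the weights oscillate, so the version below uses a varying step size.

The step is large while the weights are still far from the target and small once they are close. In the code this is approximated by shrinking alpha as the iteration counters j and i grow.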
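
The schedule used in `stocGradAscent1` is `alpha = 4/(1.0 + j + i) + 0.01`, where `j` counts passes over the data and `i` counts updates within a pass. The sketch below only evaluates that formula at a few points to show how it behaves; the helper name `step_size` is introduced here for illustration.

```python
def step_size(j, i):
    """Step size for pass j and update i, as in stocGradAscent1."""
    return 4 / (1.0 + j + i) + 0.01

# Step size at the first update of a few passes
alphas = [step_size(j, 0) for j in (0, 10, 149)]
print(alphas)
```

The constant 0.01 keeps the step from ever reaching zero, so late samples still have some influence on the weights.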

 def stocGradAscent1(dataMat, labelMat, numIter=150):
     # convert once to an array: a plain list times a number repeats the list instead of scaling it
     dataArr = np.array(dataMat, dtype=float)
     m, n = np.shape(dataArr)
     weight = np.ones(n)
     # weight = np.random.rand(n)
     for j in range(numIter):
         dataIndex = list(range(m))  # a range does not support del, so copy the indices into a list
         for i in range(m):
             alpha = 4 / (1.0 + j + i) + 0.01  # shrinks as j and i grow, never below 0.01
             # np.random.uniform(0, k) draws a float in [0, k); int() turns it into a list position
             randIndex = int(np.random.uniform(0, len(dataIndex)))
             sampleIndex = dataIndex[randIndex]  # a sample not yet used in this pass
             h = sigmoid(sum(dataArr[sampleIndex] * weight))
             error = labelMat[sampleIndex] - h
             weight = weight + alpha * error * dataArr[sampleIndex]
             del dataIndex[randIndex]  # remove it so it is not drawn again in this pass
     return weight
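
Drawing each sample once per pass in random order can also be done by shuffling the indices up front. The sketch below swaps the draw-and-delete bookkeeping for `numpy.random.Generator.permutation`; this is an alternative way to get the same sampling scheme, not the original post's method. The function name `stoc_grad_ascent_perm`, the seed and the synthetic clusters are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def stoc_grad_ascent_perm(X, y, num_iter=150, seed=0):
    """Stochastic gradient ascent visiting every sample once per pass, in shuffled order."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    m, n = X.shape
    w = np.ones(n)
    for j in range(num_iter):
        for i, idx in enumerate(rng.permutation(m)):  # each index appears exactly once
            alpha = 4 / (1.0 + j + i) + 0.01          # same decaying step size as above
            h = sigmoid(X[idx] @ w)                   # scalar prediction for one sample
            w = w + alpha * (y[idx] - h) * X[idx]
    return w

# Synthetic data (an assumption for illustration): two well-separated clusters
rng = np.random.default_rng(1)
neg = rng.normal(loc=-2.0, scale=0.5, size=(50, 2))
pos = rng.normal(loc=2.0, scale=0.5, size=(50, 2))
X = np.column_stack([np.ones(100), np.vstack([neg, pos])])
y = np.concatenate([np.zeros(50), np.ones(50)])

w = stoc_grad_ascent_perm(X, y, num_iter=20)
accuracy = np.mean((sigmoid(X @ w) > 0.5) == (y == 1))
print(w, accuracy)
```

Passing a seed makes the run repeatable, which is convenient when comparing step-size schedules.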

5 Predicting horse mortality from colic symptoms

 def classifyVector(inX, weight):  # takes the feature vector to classify, returns its class
     prob = sigmoid(sum(inX * weight))
     if prob > 0.5:
         return 1.0
     else:
         return 0.0

 def colicTest():
     trainingSet, trainingSetlabels = [], []
     with open("horseColicTraining.txt") as frTrain:
         for lines in frTrain:
             currtline = lines.strip().split('\t')  # strip() removes the trailing '\n' of each line
             linearr = [float(currtline[i]) for i in range(21)]  # first 21 fields: str -> float features
             trainingSet.append(linearr)  # trainingSet is a 2-D list
             trainingSetlabels.append(float(currtline[21]))  # the 22nd field is the class label
     trainWeights = stocGradAscent1(trainingSet, trainingSetlabels, 500)
     errorCount = 0
     numTestVec = 0.0
     with open("horseColicTest.txt") as frTest:
         for lines in frTest:
             numTestVec += 1.0
             currtline = lines.strip().split('\t')
             linearr = [float(currtline[i]) for i in range(21)]  # one row of the test set
             # int(float(...)) also accepts labels written as "1.000000"
             if int(classifyVector(np.array(linearr), trainWeights)) != int(float(currtline[21])):
                 errorCount += 1  # predicted class differs from the true label
     # compute and report the error rate once, after the whole test set has been read
     errorRate = float(errorCount) / numTestVec
     print("the error rate of this test is : %f" % errorRate)
     return errorRate
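
The per-row loop in the test phase can also be expressed with array operations. The following is a small sketch of that idea on made-up numbers; the helper name `error_rate`, the weights and the four samples are assumptions for illustration and have nothing to do with the horse colic data.

```python
import numpy as np

def error_rate(X, y, w):
    """Fraction of samples whose thresholded prediction differs from the label."""
    prob = 1.0 / (1 + np.exp(-(X @ w)))
    pred = (prob > 0.5).astype(int)   # same 0.5 threshold as classifyVector
    return float(np.mean(pred != y))

# Made-up example: rows are [1.0, x1], weights are [intercept, slope]
X = np.array([[1.0, 2.0],
              [1.0, -2.0],
              [1.0, 3.0],
              [1.0, -1.0]])
y = np.array([1, 0, 0, 0])
w = np.array([0.0, 1.0])

rate = error_rate(X, y, w)
print(rate)
```

With these numbers the predictions are 1, 0, 1, 0, so exactly one of the four samples (the third) is misclassified.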

 def multiTest():  # average the error rate over several runs
     numTests = 10
     errorSum = 0.0
     for k in range(numTests):
         errorSum += colicTest()
     print("after %d iterations the average error rate is : %f" % (numTests, errorSum / float(numTests)))

Main:

 if __name__ == '__main__':
     filename = "testSet.txt"
     dataMat, labelMat = loadDataSet()
     # weight = gradAscent(dataMat, labelMat)
     weight = stocGradAscent1(dataMat, labelMat)
     print(weight)
     plotfit(weight)  # plot the decision boundary on the small dataset
     multiTest()      # evaluate on the real dataset
