Machine Learning 1: linear regression homework

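Studying machine learning without writing any code is pointless. So, picking up from the last post, here is the linear regression homework.

The hw1 assignment is here: http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/hw1.pdf

The assignment is to predict the PM2.5 index in Taiwan. Since this is a regression problem, it naturally uses the linear regression from the previous lecture.

I uploaded the data to https://pan.baidu.com/s/1dFhwT13 for anyone interested in trying it.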

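The data is split into a training set and a test set, both in CSV format. Only the PM2.5 rows are used here; the other measurements are ignored. Looking at the test data makes the task clear: the PM2.5 readings from the previous 9 hours are the features used to predict the reading for the 10th hour, and the predicted 10th-hour values are saved to a CSV file as the result.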
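The windowing described above can be sketched in a few lines. This is only an illustration: the helper name `split_window` and the sample readings are made up, not taken from the dataset, and the sketch is written for Python 3.

```python
def split_window(readings, n_features=9):
    """Split hourly readings into (features, label).

    The first n_features readings are the features and the reading
    right after them is the label. Any later readings are ignored.
    """
    if len(readings) < n_features + 1:
        raise ValueError("need at least n_features + 1 readings")
    features = list(readings[:n_features])
    label = readings[n_features]
    return features, label


# Hypothetical hourly PM2.5 readings, for illustration only.
hours = [26, 39, 36, 35, 31, 28, 25, 20, 19, 30, 41, 44]
features, label = split_window(hours)
print(features)  # the first 9 readings
print(label)     # the 10th reading
```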

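Enough talk, on to the code. My development environment is still win7 + pycharm4.0, and the code is Python 2.

Step 1: read train.csv and extract the PM2.5 training data. There are 240 training samples in total; the first 9 hours of each are the features and the 10th hour is the label.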

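 # -*- coding:UTF-8 -*-
 __author__ = 'tao'

 import csv
 import math

 import numpy as np

 filename = 'F:/台湾机器学习/data/train.csv'
 # The path contains Chinese characters, so decode the UTF-8 byte string into a
 # unicode object before passing it to open() (Python 2 on Windows).
 ufilename = unicode(filename, "utf8")

 list = []    # features: 9 hourly PM2.5 readings per sample (the name shadows the builtin)
 result = []  # labels: the 10th-hour reading, stored as a one-element list
 row = 0
 colum = 0

 with open(ufilename, 'r') as f:
     data = f.readlines()  # read every line of the file into data
     for line in data:
         odom = line.split(',')  # split the line into its comma-separated fields
         colum = len(odom)
         if 'PM2.5' in odom:
             lists = map(int, odom[3:12])     # fields 3..11: the first 9 hourly readings
             results = map(int, odom[12:13])  # field 12: the 10th hourly reading
             list.append(lists)
             result.append(results)
             # print odom
         row = row + 1

 # print("raw data: {0} rows x {1} columns".format(row, colum))
 print("{0} training samples".format(len(list)))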

Step 2: use gradient descent to train the weights and the bias.
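For reference, the model, the cost, and the per-sample update can be written out as follows. The symbols $m$ (number of training samples), $x^{(i)}_j$ (the $j$-th hourly reading of sample $i$) and $\alpha$ (learning rate) are notation introduced here, and this is the standard squared-error formulation rather than something spelled out in the assignment.

```latex
\begin{align}
\hat{y}^{(i)} &= b_0 + \sum_{j=0}^{8} w_j \, x^{(i)}_j \\
J(w, b_0) &= \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2 \\
w_j &\leftarrow w_j - \alpha \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)}_j \\
b_0 &\leftarrow b_0 - \alpha \left( \hat{y}^{(i)} - y^{(i)} \right)
\end{align}
```

The last two lines are the stochastic form: each sample moves the parameters against the gradient of its own squared error.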

# y = w0*x0 + w1*x1 + w2*x2 + w3*x3 + w4*x4 + w5*x5 + w6*x6 + w7*x7 + w8*x8 + b0
alpha = 0.0001
b_0 = np.random.rand()
th_0 = np.random.rand()
th_1 = np.random.rand()
th_2 = np.random.rand()
th_3 = np.random.rand()
th_4 = np.random.rand()
th_5 = np.random.rand()
th_6 = np.random.rand()
th_7 = np.random.rand()
th_8 = np.random.rand()

length = len(list)
jtheta = 0  # cost of the previous epoch
for k in range(1000):
    sum_sq = 0  # running sum of squared errors over this epoch
    for i in range(length):
        xset = np.array(list[i])  # one row of X values
        yset = result[i][0]       # the value to estimate
        # error of the current sample only
        err = b_0 + th_0 * xset[0] + th_1 * xset[1] + th_2 * xset[2] + th_3 * xset[3] + th_4 * xset[4] + th_5 * xset[5] + th_6 * xset[6] + th_7 * xset[7] + th_8 * xset[8] - yset
        sum_sq = sum_sq + err * err
        # err was computed before any update, so all parameters move from the same error
        step = alpha / length * err
        b_0 = b_0 - step
        th_0 = th_0 - step * xset[0]
        th_1 = th_1 - step * xset[1]
        th_2 = th_2 - step * xset[2]
        th_3 = th_3 - step * xset[3]
        th_4 = th_4 - step * xset[4]
        th_5 = th_5 - step * xset[5]
        th_6 = th_6 - step * xset[6]
        th_7 = th_7 - step * xset[7]
        th_8 = th_8 - step * xset[8]
    jtheta_1 = 0.5 / length * sum_sq     # J = 1/(2m) * sum of squared errors
    comp = math.fabs(jtheta_1 - jtheta)  # change in cost since the previous epoch
    print "%10.5f   %10.5f  %10.5f  %10.5f %10.5f   %10.5f  %10.5f  %10.5f %10.5f   %10.5f  %10.5f  %10.5f \n" % (comp, jtheta_1, b_0, th_0, th_1, th_2, th_3, th_4, th_5, th_6, th_7, th_8)
    jtheta = jtheta_1

print("-- trained weights --")
print " %10.5f %10.5f  %10.5f %10.5f   %10.5f  %10.5f  %10.5f %10.5f   %10.5f  %10.5f \n" % (b_0, th_0, th_1, th_2, th_3, th_4, th_5, th_6, th_7, th_8)
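The same per-sample update is shorter when the nine weights live in one NumPy vector. The following is a minimal sketch under stated assumptions: it is Python 3, the function name `sgd_linear_regression` is hypothetical, and it runs on synthetic noise-free data rather than the PM2.5 data, so the learning rate here says nothing about what works on the real set.

```python
import numpy as np


def sgd_linear_regression(X, y, alpha=0.05, epochs=200):
    """Fit y ~ X @ w + b with per-sample (stochastic) gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        for i in range(n_samples):
            err = b + X[i] @ w - y[i]  # error of this sample only
            w -= alpha * err * X[i]    # gradient of 0.5 * err**2 w.r.t. w
            b -= alpha * err           # gradient of 0.5 * err**2 w.r.t. b
    return w, b


# Synthetic, noise-free data (not the PM2.5 data): 200 samples, 9 features in [0, 1).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 9))
true_w = rng.uniform(-1.0, 1.0, size=9)
true_b = 0.5
y = X @ true_w + true_b

w, b = sgd_linear_regression(X, y)
# Largest weight error and bias error; both should be close to zero.
print(np.abs(w - true_w).max(), abs(b - true_b))
```

Because the synthetic targets are an exact linear function of the features, the recovered weights can be compared directly with `true_w`, which is a quick way to check that the update rule is wired correctly before pointing it at real data.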

Step 3: check the training set. This step is optional; I used it while debugging to see how accurate the predictions are on the training set.

 # check the training set
 for k in range(len(list)):
     xset = np.array(list[k])
     nptresult = np.array(result[k])
     # print(xset)
     # print("predicted: {0}".format(b_0 + th_0 * xset[0] + th_1 * xset[1] + th_2 * xset[2] + th_3 * xset[3] + th_4 * xset[4] + th_5 * xset[5] + th_6 * xset[6] + th_7 * xset[7] + th_8 * xset[8]))
     # print("actual: {0}".format(nptresult))
     error = b_0 + th_0 * xset[0] + th_1 * xset[1] + th_2 * xset[2] + th_3 * xset[3] + th_4 * xset[4] + th_5 * xset[5] + th_6 * xset[6] + th_7 * xset[7] + th_8 * xset[8] - nptresult
     print("training-set error: {0}".format(error))
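A long list of per-sample errors is hard to judge at a glance, so a single summary number such as the root-mean-square error can be easier to track. A minimal sketch, assuming plain Python 3 lists of predictions and labels; the helper `rmse` and the numbers are hypothetical and not part of the original script.

```python
import math


def rmse(predictions, targets):
    """Root-mean-square error between two equal-length, non-empty sequences."""
    if len(predictions) != len(targets) or len(predictions) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up numbers, for illustration only.
print(rmse([20.0, 31.0, 18.0], [22.0, 30.0, 18.0]))
```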

Step 4: run the model on the test set and save the predictions.

First read the test set, the same way as the training set:

 # read the test set
 testfilename = 'F:/台湾机器学习/data/test_X.csv'
 # Chinese characters in the path again, so decode it to unicode as for train.csv
 utestfilename = unicode(testfilename, "utf8")

 testlist = []
 testrow = 0
 testcolum = 0

 with open(utestfilename, 'r') as f:
     data = f.readlines()  # read every line of the file into data
     for line in data:
         odom = line.split(',')  # split the line into its comma-separated fields
         testcolum = len(odom)
         if 'PM2.5' in odom:
             testlists = map(int, odom[2:11])  # fields 2..10: the 9 hourly readings
             testlist.append(testlists)
             # print odom
         testrow = testrow + 1

 print("test data: {0} rows x {1} columns".format(testrow, testcolum))
 print("{0} test samples".format(len(testlist)))
 print(testlist)

Then save the predictions to a CSV file:

 # write the final predictions
 with open('d:\\csv_result.csv', 'wb') as csvfile:
     writer = csv.writer(csvfile)
     writer.writerow(['id', 'value'])
     for k in range(len(testlist)):
         xset = np.array(testlist[k])
         pred = b_0 + th_0 * xset[0] + th_1 * xset[1] + th_2 * xset[2] + th_3 * xset[3] + th_4 * xset[4] + th_5 * xset[5] + th_6 * xset[6] + th_7 * xset[7] + th_8 * xset[8]
         int_result = int(pred)
         if int_result < 0:  # a PM2.5 reading cannot be negative
             int_result = 0
         id_list = ('id_{0}'.format(k), '{0}'.format(int_result))
         print(id_list)
         writer.writerow(id_list)
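For anyone on Python 3, the same output step can be sketched as below. This is an illustration under assumptions: the function name `write_predictions` and the sample predictions are made up, and it writes to an in-memory buffer so it can be run without touching the disk.

```python
import csv
import io


def write_predictions(predictions, out):
    """Write id/value rows, truncating to int and clipping negatives to 0."""
    writer = csv.writer(out)
    writer.writerow(["id", "value"])
    for k, pred in enumerate(predictions):
        value = max(0, int(pred))
        writer.writerow(["id_{0}".format(k), value])


# Made-up predictions, for illustration only.
buf = io.StringIO()
write_predictions([12.7, -3.2, 40.0], buf)
print(buf.getvalue())
```

When writing to a real file in Python 3, open it with `open(path, "w", newline="")` so the csv module controls the line endings itself.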


I also tried mini-batch gradient descent, and did not notice anything new.

# y = w0*x0 + w1*x1 + w2*x2 + w3*x3 + w4*x4 + w5*x5 + w6*x6 + w7*x7 + w8*x8 + b0
alpha = 0.0001  # lower this if the weights diverge
b_0 = np.random.rand()
th = np.random.rand(1, 9)
batch = 20
length = len(list)

for k in range(5000):
    # walk through the training set one mini-batch at a time
    for start in range(0, length, batch):
        xs = np.array(list[start:start + batch], dtype=float)    # (n, 9) X values
        ys = np.array(result[start:start + batch], dtype=float)  # (n, 1) values to estimate
        err = b_0 + np.dot(xs, th.T) - ys                        # (n, 1) error of each sample
        # average the gradient over the batch, then update once
        b_0 = b_0 - alpha * err.mean()
        th = th - alpha * np.dot(err.T, xs) / len(xs)
    print " %10.5f  %10.5f %10.5f   %10.5f  %10.5f  %10.5f %10.5f   %10.5f  %10.5f  %10.5f \n" % (b_0, th[0][0], th[0][1], th[0][2], th[0][3], th[0][4], th[0][5], th[0][6], th[0][7], th[0][8])

print("-- trained weights --")
print " %10.5f %10.5f  %10.5f %10.5f   %10.5f  %10.5f  %10.5f %10.5f   %10.5f  %10.5f \n" % (b_0, th[0][0], th[0][1], th[0][2], th[0][3], th[0][4], th[0][5], th[0][6], th[0][7], th[0][8])
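A common variation is to reshuffle the samples every epoch so that the batches differ from pass to pass. The sketch below shows that idea in Python 3 on synthetic noise-free data; the function name `minibatch_gd`, the shuffling, and the hyperparameters are assumptions made for the illustration and are not taken from the original script.

```python
import numpy as np


def minibatch_gd(X, y, alpha=0.1, batch=20, epochs=1000, seed=0):
    """Fit y ~ X @ w + b, updating once per mini-batch."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n_samples)  # reshuffle every epoch
        for start in range(0, n_samples, batch):
            idx = order[start:start + batch]
            err = X[idx] @ w + b - y[idx]             # one error per sample in the batch
            w -= alpha * (X[idx].T @ err) / len(idx)  # mean gradient w.r.t. w
            b -= alpha * err.mean()                   # mean gradient w.r.t. b
    return w, b


# Synthetic, noise-free data (not the PM2.5 data): 200 samples, 9 features in [0, 1).
data_rng = np.random.default_rng(1)
X = data_rng.uniform(0.0, 1.0, size=(200, 9))
true_w = data_rng.uniform(-1.0, 1.0, size=9)
true_b = 0.5
y = X @ true_w + true_b

w, b = minibatch_gd(X, y)
# Largest weight error and bias error; both should be close to zero.
print(np.abs(w - true_w).max(), abs(b - true_b))
```

Averaging the gradient over the batch keeps the step size comparable across batch sizes, which is why the update divides by `len(idx)` rather than summing.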
