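A Personalized Recommender System Based on Baseline, SVD, and Stochastic Gradient Descent
Koren's papers use the Netflix data set, which is too large and takes a very long time to run on an ordinary PC. Since the main purpose of this article is to introduce and summarize the methods, the MovieLens data set is used instead.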
Variables
For an introduction to some of the variables, see "A Personalized Recommender System Based on Baseline and Stochastic Gradient Descent".
This article presents two simple personalized recommender systems and compares their experimental results, using RMSE as the evaluation metric.
(1) SVD + stochastic gradient descent.
(2) baseline + SVD + stochastic gradient descent.
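Since both methods are compared by RMSE, a minimal sketch of that metric may help. The function name `rmse` and the sample ratings are made up for illustration; the listings in this article compute the same quantity inline over nested dictionaries.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return sqrt(squared_error / len(actual))


# Hypothetical predictions and true ratings, for illustration only.
example = rmse([3.5, 4.0, 2.0], [4.0, 4.0, 1.0])
print('RMSE: %.4f' % example)
```

A lower RMSE means the predicted ratings sit closer to the true ones on average, with large errors penalized more heavily than small ones.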
Method 1: SVD + stochastic gradient descent
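SVD (predicted rating of user u for item i, where p_u and q_i are the user and item feature vectors):
$$\hat{r}_{ui} = p_u^T q_i$$
Cost function (summed over the set $\kappa$ of training ratings):
$$\min_{p_*, q_*} \sum_{(u,i) \in \kappa} \left[ (r_{ui} - p_u^T q_i)^2 + \lambda \left( \|p_u\|^2 + \|q_i\|^2 \right) \right]$$
Gradient updates (stochastic gradient descent is used to drive the objective above as low as possible within the configured number of iterations). With $e_{ui} = r_{ui} - \hat{r}_{ui}$, and the constant factor 2 absorbed into the learning rate $\gamma$:
$$p_u \leftarrow p_u + \gamma (e_{ui} q_i - \lambda p_u)$$
$$q_i \leftarrow q_i + \gamma (e_{ui} p_u - \lambda q_i)$$
Here $\gamma$ and $\lambda$ correspond to gama and lamda in the code.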
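To make the update concrete, here is a minimal sketch of one SGD step for a single rating, assuming the feature vectors are plain Python lists. The name `sgd_step` and the toy values are hypothetical; the defaults mirror gama = 0.02 and lamda = 0.3 from the listing. One assumption differs from the listing: this sketch updates the item factor with the pre-update user factor, while the article's code reuses the freshly updated one.

```python
def sgd_step(p_u, q_i, rating, gamma=0.02, lam=0.3):
    """Apply one SGD update for a single (user, item, rating) triple.

    p_u and q_i are lists of latent factors and are updated in place.
    Returns the prediction error computed before the update.
    """
    prediction = sum(p * q for p, q in zip(p_u, q_i))
    error = rating - prediction
    for k in range(len(p_u)):
        p_old = p_u[k]  # keep the old user factor for the item update
        p_u[k] += gamma * (error * q_i[k] - lam * p_u[k])
        q_i[k] += gamma * (error * p_old - lam * q_i[k])
    return error


# Toy two-dimensional factors, chosen only for illustration.
p_u = [0.5, 0.1]
q_i = [0.2, 0.4]
err = sgd_step(p_u, q_i, rating=4.0)
print('error before update: %.2f' % err)
```

Each factor moves in the direction that reduces the error on this rating, while the regularization term pulls it back toward zero.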
Implementation:
'''
Created on Dec 13, 2012

@Author: Dennis Wu
@E-mail: hansel.zh@gmail.com
@Homepage: http://blog.csdn.net/wuzh670

Data set download from : http://www.grouplens.org/system/files/ml-100k.zip
'''
from math import sqrt
import random


def load_data():
    # Parse the tab-separated rating files into {userId: {itemId: rating}}.
    train = {}
    test = {}
    filename_train = 'data/ua.base'
    filename_test = 'data/ua.test'

    with open(filename_train) as f:
        for line in f:
            (userId, itemId, rating, timestamp) = line.strip().split('\t')
            train.setdefault(userId, {})
            train[userId][itemId] = float(rating)

    with open(filename_test) as f:
        for line in f:
            (userId, itemId, rating, timestamp) = line.strip().split('\t')
            test.setdefault(userId, {})
            test[userId][itemId] = float(rating)

    return train, test


def calMean(train):
    # Overall mean rating of the training set.
    stat = 0
    num = 0
    for u in train.keys():
        for i in train[u].keys():
            stat += train[u][i]
            num += 1
    mean = stat*1.0/num
    return mean


def initialFeature(feature, userNum, movieNum):
    # Initialize every latent factor with random.uniform(0, 1).
    # Users, items and factors are keyed by 1-based string ids.
    random.seed(0)
    user_feature = {}
    item_feature = {}
    for i in range(1, userNum + 1):
        si = str(i)
        user_feature.setdefault(si, {})
        for j in range(1, feature + 1):
            user_feature[si].setdefault(str(j), random.uniform(0, 1))

    for i in range(1, movieNum + 1):
        si = str(i)
        item_feature.setdefault(si, {})
        for j in range(1, feature + 1):
            item_feature[si].setdefault(str(j), random.uniform(0, 1))
    return user_feature, item_feature


def svd(train, test, userNum, movieNum, feature, user_feature, item_feature):
    gama = 0.02      # learning rate
    lamda = 0.3      # regularization weight
    slowRate = 0.99  # learning-rate decay applied after every pass
    step = 0
    preRmse = 1000000000.0
    nowRmse = 0.0

    while step < 100:
        rmse = 0.0
        n = 0
        for u in train.keys():
            for i in train[u].keys():
                # predicted rating: dot product of user and item features
                pui = 0
                for k in range(1, feature + 1):
                    sk = str(k)
                    pui += user_feature[u][sk] * item_feature[i][sk]
                eui = train[u][i] - pui
                rmse += pow(eui, 2)
                n += 1
                # gradient step on every feature of this user and item
                for k in range(1, feature + 1):
                    sk = str(k)
                    user_feature[u][sk] += gama*(eui*item_feature[i][sk] - lamda*user_feature[u][sk])
                    item_feature[i][sk] += gama*(eui*user_feature[u][sk] - lamda*item_feature[i][sk])

        nowRmse = sqrt(rmse*1.0/n)
        print('step: %d Rmse: %s' % ((step+1), nowRmse))
        if (nowRmse < preRmse):
            preRmse = nowRmse

        gama *= slowRate
        step += 1

    return user_feature, item_feature


def calRmse(test, user_feature, item_feature, feature):
    # RMSE of the learned model on the test set.
    rmse = 0.0
    n = 0
    for u in test.keys():
        for i in test[u].keys():
            pui = 0
            for k in range(1, feature + 1):
                sk = str(k)
                pui += user_feature[u][sk] * item_feature[i][sk]
            eui = pui - test[u][i]
            rmse += pow(eui, 2)
            n += 1
    rmse = sqrt(rmse*1.0 / n)
    return rmse


if __name__ == "__main__":
    # load data
    train, test = load_data()
    print('load data success')

    # initialize user and item features, respectively
    user_feature, item_feature = initialFeature(100, 943, 1682)
    print('initial user and item feature, respectively success')

    # svd + stochastic gradient descent
    user_feature, item_feature = svd(train, test, 943, 1682, 100, user_feature, item_feature)
    print('svd + stochastic gradient descent success')

    # compute the rmse of test set
    print('the Rmse of test set is: %s' % calRmse(test, user_feature, item_feature, 100))
Method 2: baseline + SVD + stochastic gradient descent
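baseline + SVD (predicted rating, where $\mu$ is the overall mean rating and $b_u$, $b_i$ are the user and item biases):
$$\hat{r}_{ui} = \mu + b_u + b_i + p_u^T q_i$$
Objective function (summed over the set $\kappa$ of training ratings):
$$\min_{b_*, p_*, q_*} \sum_{(u,i) \in \kappa} \left[ (r_{ui} - \mu - b_u - b_i - p_u^T q_i)^2 + \lambda \left( b_u^2 + b_i^2 + \|p_u\|^2 + \|q_i\|^2 \right) \right]$$
Gradient updates (stochastic gradient descent is used to drive the objective above as low as possible within the configured number of iterations). With $e_{ui} = r_{ui} - \hat{r}_{ui}$, and the constant factor 2 absorbed into the learning rate $\gamma$:
$$b_u \leftarrow b_u + \gamma (e_{ui} - \lambda b_u)$$
$$b_i \leftarrow b_i + \gamma (e_{ui} - \lambda b_i)$$
$$p_u \leftarrow p_u + \gamma (e_{ui} q_i - \lambda p_u)$$
$$q_i \leftarrow q_i + \gamma (e_{ui} p_u - \lambda q_i)$$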
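As a rough illustration of how the bias terms enter the update, the sketch below performs one step for a single rating and returns the new values instead of mutating its inputs. The name `sgd_step_with_bias` and all numbers are hypothetical; the defaults mirror gama = 0.02 and lamda = 0.3 from the listing, and both factor vectors are updated from their pre-update values, which is an assumption of this sketch.

```python
def sgd_step_with_bias(mean, b_u, b_i, p_u, q_i, rating, gamma=0.02, lam=0.3):
    """One SGD step for the baseline + SVD model on a single rating.

    Returns the updated (b_u, b_i, p_u, q_i) and the pre-update error.
    """
    prediction = mean + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    error = rating - prediction
    new_b_u = b_u + gamma * (error - lam * b_u)
    new_b_i = b_i + gamma * (error - lam * b_i)
    new_p = [p + gamma * (error * q - lam * p) for p, q in zip(p_u, q_i)]
    new_q = [q + gamma * (error * p - lam * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p, new_q, error


# Toy values, chosen only for illustration.
b_u, b_i, p_u, q_i, err = sgd_step_with_bias(
    mean=3.5, b_u=0.1, b_i=-0.2, p_u=[0.5, 0.1], q_i=[0.2, 0.4], rating=4.0)
print('error before update: %.2f' % err)
```

Because the mean and the biases already explain much of the rating, the error handed to the feature vectors is far smaller than in the plain SVD case with the same toy factors.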
Method 2: implementation
'''
Created on Dec 13, 2012

@Author: Dennis Wu
@E-mail: hansel.zh@gmail.com
@Homepage: http://blog.csdn.net/wuzh670

Data set download from : http://www.grouplens.org/system/files/ml-100k.zip
'''
from math import sqrt
import random


def load_data():
    # Parse the tab-separated rating files into {userId: {itemId: rating}}.
    train = {}
    test = {}
    filename_train = 'data/ua.base'
    filename_test = 'data/ua.test'

    with open(filename_train) as f:
        for line in f:
            (userId, itemId, rating, timestamp) = line.strip().split('\t')
            train.setdefault(userId, {})
            train[userId][itemId] = float(rating)

    with open(filename_test) as f:
        for line in f:
            (userId, itemId, rating, timestamp) = line.strip().split('\t')
            test.setdefault(userId, {})
            test[userId][itemId] = float(rating)

    return train, test


def calMean(train):
    # Overall mean rating of the training set.
    stat = 0
    num = 0
    for u in train.keys():
        for i in train[u].keys():
            stat += train[u][i]
            num += 1
    mean = stat*1.0/num
    return mean


def initialBias(train, userNum, movieNum, mean):
    # Item bias: sum of deviations from the mean, divided by (count + 25).
    # User bias: sum of residuals after the item bias, divided by (count + 10).
    bu = {}
    bi = {}
    biNum = {}
    buNum = {}

    for u in range(1, userNum + 1):
        su = str(u)
        for i in train[su].keys():
            bi.setdefault(i, 0)
            biNum.setdefault(i, 0)
            bi[i] += (train[su][i] - mean)
            biNum[i] += 1

    for i in range(1, movieNum + 1):
        si = str(i)
        biNum.setdefault(si, 0)
        if biNum[si] >= 1:
            bi[si] = bi[si]*1.0/(biNum[si]+25)
        else:
            bi[si] = 0.0

    for u in range(1, userNum + 1):
        su = str(u)
        for i in train[su].keys():
            bu.setdefault(su, 0)
            buNum.setdefault(su, 0)
            bu[su] += (train[su][i] - mean - bi[i])
            buNum[su] += 1

    for u in range(1, userNum + 1):
        su = str(u)
        buNum.setdefault(su, 0)
        if buNum[su] >= 1:
            bu[su] = bu[su]*1.0/(buNum[su]+10)
        else:
            bu[su] = 0.0

    return bu, bi


def initialFeature(feature, userNum, movieNum):
    # Initialize every latent factor with random.uniform(0, 1).
    # Users, items and factors are keyed by 1-based string ids.
    random.seed(0)
    user_feature = {}
    item_feature = {}
    for i in range(1, userNum + 1):
        si = str(i)
        user_feature.setdefault(si, {})
        for j in range(1, feature + 1):
            user_feature[si].setdefault(str(j), random.uniform(0, 1))

    for i in range(1, movieNum + 1):
        si = str(i)
        item_feature.setdefault(si, {})
        for j in range(1, feature + 1):
            item_feature[si].setdefault(str(j), random.uniform(0, 1))
    return user_feature, item_feature


def svd(train, test, mean, userNum, movieNum, feature, user_feature, item_feature, bu, bi):
    gama = 0.02      # learning rate
    lamda = 0.3      # regularization weight
    slowRate = 0.99  # learning-rate decay applied after every pass
    step = 0
    preRmse = 1000000000.0
    nowRmse = 0.0

    while step < 100:
        rmse = 0.0
        n = 0
        for u in train.keys():
            for i in train[u].keys():
                # predicted rating: baseline plus dot product of features
                pui = 1.0 * (mean + bu[u] + bi[i])
                for k in range(1, feature + 1):
                    sk = str(k)
                    pui += user_feature[u][sk] * item_feature[i][sk]
                eui = train[u][i] - pui
                rmse += pow(eui, 2)
                n += 1
                # gradient step on the biases, then on every feature
                bu[u] += gama * (eui - lamda * bu[u])
                bi[i] += gama * (eui - lamda * bi[i])
                for k in range(1, feature + 1):
                    sk = str(k)
                    user_feature[u][sk] += gama*(eui*item_feature[i][sk] - lamda*user_feature[u][sk])
                    item_feature[i][sk] += gama*(eui*user_feature[u][sk] - lamda*item_feature[i][sk])

        nowRmse = sqrt(rmse*1.0/n)
        print('step: %d Rmse: %s' % ((step+1), nowRmse))
        if (nowRmse < preRmse):
            preRmse = nowRmse

        gama *= slowRate
        step += 1
    return user_feature, item_feature, bu, bi


def calRmse(test, bu, bi, user_feature, item_feature, mean, feature):
    # RMSE of the learned model on the test set.
    rmse = 0.0
    n = 0
    for u in test.keys():
        for i in test[u].keys():
            pui = 1.0 * (mean + bu[u] + bi[i])
            for k in range(1, feature + 1):
                sk = str(k)
                pui += user_feature[u][sk] * item_feature[i][sk]
            eui = pui - test[u][i]
            rmse += pow(eui, 2)
            n += 1
    rmse = sqrt(rmse*1.0 / n)
    return rmse


if __name__ == "__main__":
    # load data
    train, test = load_data()
    print('load data success')

    # calculate overall mean rating
    mean = calMean(train)
    print('Calculate overall mean rating success')

    # initialize user and item biases, respectively
    bu, bi = initialBias(train, 943, 1682, mean)
    print('initial user and item Bias, respectively success')

    # initialize user and item features, respectively
    user_feature, item_feature = initialFeature(100, 943, 1682)
    print('initial user and item feature, respectively success')

    # baseline + svd + stochastic gradient descent
    user_feature, item_feature, bu, bi = svd(train, test, mean, 943, 1682, 100, user_feature, item_feature, bu, bi)
    print('baseline + svd + stochastic gradient descent success')

    # compute the rmse of test set
    print('the Rmse of test set is: %s' % calRmse(test, bu, bi, user_feature, item_feature, mean, 100))
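Experiment parameters:
gama = 0.02 (learning rate), lamda = 0.3 (regularization weight)
feature = 100, maxstep = 100, slowRate = 0.99 (the learning rate is multiplied by slowRate after every iteration, so the gradient steps become smaller and smaller as training proceeds)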
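The effect of slowRate can be seen by listing the learning rate used in each pass. This is a small sketch with a hypothetical helper, `learning_rate_schedule`, whose defaults are the parameter values listed above; it assumes the rate is decayed once after every pass, as in the listings.

```python
def learning_rate_schedule(gamma0=0.02, slow_rate=0.99, max_step=100):
    """Return the learning rate used in each of the max_step passes."""
    return [gamma0 * slow_rate ** step for step in range(max_step)]


rates = learning_rate_schedule()
print('first pass: %.5f, last pass: %.5f' % (rates[0], rates[-1]))
```

With these settings the rate in the last pass is roughly 37% of the initial one, so late iterations make noticeably smaller corrections than early ones.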
Method 1 result: RMSE on the test set: 1.00422938926
Method 2 result: RMSE on the test set: 0.963661477881
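To put the two reported numbers side by side, the following short calculation derives the absolute and relative RMSE reduction of Method 2 over Method 1. The variable names are illustrative; the two values are the ones reported above.

```python
rmse_svd = 1.00422938926            # Method 1, as reported above
rmse_baseline_svd = 0.963661477881  # Method 2, as reported above

absolute_gain = rmse_svd - rmse_baseline_svd
relative_gain = absolute_gain / rmse_svd
print('absolute: %.4f, relative: %.2f%%' % (absolute_gain, 100 * relative_gain))
```

Adding the baseline terms lowers the test RMSE by about 0.04, roughly a 4% relative reduction under these settings.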
REFERENCES
1. Y. Koren. Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model. Proc. 14th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD '08), pp. 426–434, 2008.
2. Y. Koren. The BellKor Solution to the Netflix Grand Prize. 2009.