主要内容:

1、k近邻

2、python实现

1、什么是k近邻(KNN)

在入门-1中,简单地实现了基于用户协同过滤的最近邻算法,所谓最近邻,就是找到距离最近或最相似的用户,将他的物品推荐出来。

而这里,k近邻(K Nearest Neighbor)的意思就是,找出最近或最相似的k个用户,将他们的评分(相似度权重求和)最高的几个物品进行推荐。

2、python实现

代码中有两个数据集,

一个是直接写在的代码中的users;

一个是包含在BX-Book-Ratings.csv、BX-Books.csv、BX-Users.csv文件中;(下载地址:http://www.guidetodatamining.com/assets/data/BX-Dump.zip)

代码:

import codecs
from math import sqrt users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0,
"Norah Jones": 4.5, "Phoenix": 5.0,
"Slightly Stoopid": 1.5,
"The Strokes": 2.5, "Vampire Weekend": 2.0}, "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5,
"Deadmau5": 4.0, "Phoenix": 2.0,
"Slightly Stoopid": 3.5, "Vampire Weekend": 3.0}, "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0,
"Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,
"Slightly Stoopid": 1.0}, "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0,
"Deadmau5": 4.5, "Phoenix": 3.0,
"Slightly Stoopid": 4.5, "The Strokes": 4.0,
"Vampire Weekend": 2.0}, "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
"Norah Jones": 4.0, "The Strokes": 4.0,
"Vampire Weekend": 1.0}, "Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0,
"Norah Jones": 5.0, "Phoenix": 5.0,
"Slightly Stoopid": 4.5, "The Strokes": 4.0,
"Vampire Weekend": 4.0}, "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0,
"Norah Jones": 3.0, "Phoenix": 5.0,
"Slightly Stoopid": 4.0, "The Strokes": 5.0}, "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,
"Phoenix": 4.0, "Slightly Stoopid": 2.5,
"The Strokes": 3.0}
} class recommender: def __init__(self, data, k=1, metric='pearson', n=5):
""" initialize recommender
currently, if data is dictionary the recommender is initialized
to it.
For all other data types of data, no initialization occurs
k is the k value for k nearest neighbor
metric is which distance formula to use
n is the maximum number of recommendations to make"""
self.k = k
self.n = n
self.username2id = {}
self.userid2name = {}
self.productid2name = {}
# for some reason I want to save the name of the metric
self.metric = metric
if self.metric == 'pearson':
self.fn = self.pearson
#
# if data is dictionary set recommender data to it
#
if type(data).__name__ == 'dict':
self.data = data def convertProductID2name(self, id):
"""Given product id number return product name"""
if id in self.productid2name:
return self.productid2name[id]
else:
return id def userRatings(self, id, n):
"""Return n top ratings for user with id"""
print ("Ratings for " + self.userid2name[id])
ratings = self.data[id]
print(len(ratings))
ratings = list(ratings.items())
ratings = [(self.convertProductID2name(k), v)
for (k, v) in ratings]
# finally sort and return
ratings.sort(key=lambda artistTuple: artistTuple[1],
reverse = True)
ratings = ratings[:n]
for rating in ratings:
print("%s\t%i" % (rating[0], rating[1])) def loadBookDB(self, path=''):
"""loads the BX book dataset. Path is where the BX files are
located"""
self.data = {}
i = 0
#
# First load book ratings into self.data
#
f = codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8')
for line in f:
i += 1
#separate line into fields
fields = line.split(';')
user = fields[0].strip('"')
book = fields[1].strip('"')
rating = int(fields[2].strip().strip('"'))
if user in self.data:
currentRatings = self.data[user]
else:
currentRatings = {}
currentRatings[book] = rating
self.data[user] = currentRatings
f.close()
#
# Now load books into self.productid2name
# Books contains isbn, title, and author among other fields
#
f = codecs.open(path + "BX-Books.csv", 'r', 'utf8')
for line in f:
i += 1
#separate line into fields
fields = line.split(';')
isbn = fields[0].strip('"')
title = fields[1].strip('"')
author = fields[2].strip().strip('"')
title = title + ' by ' + author
self.productid2name[isbn] = title
f.close()
#
# Now load user info into both self.userid2name and
# self.username2id
#
f = codecs.open(path + "BX-Users.csv", 'r', 'utf8')
for line in f:
i += 1
#print(line)
#separate line into fields
fields = line.split(';')
userid = fields[0].strip('"')
location = fields[1].strip('"')
if len(fields) > 3:
age = fields[2].strip().strip('"')
else:
age = 'NULL'
if age != 'NULL':
value = location + ' (age: ' + age + ')'
else:
value = location
self.userid2name[userid] = value
self.username2id[location] = userid
f.close()
print(i) def pearson(self, rating1, rating2):
sum_xy = 0
sum_x = 0
sum_y = 0
sum_x2 = 0
sum_y2 = 0
n = 0
for key in rating1:
if key in rating2:
n += 1
x = rating1[key]
y = rating2[key]
sum_xy += x * y
sum_x += x
sum_y += y
sum_x2 += pow(x, 2)
sum_y2 += pow(y, 2)
if n == 0:
return 0
# now compute denominator
denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n)
* sqrt(sum_y2 - pow(sum_y, 2) / n))
if denominator == 0:
return 0
else:
return (sum_xy - (sum_x * sum_y) / n) / denominator def computeNearestNeighbor(self, username):
"""creates a sorted list of users based on their distance to
username"""
distances = []
for instance in self.data:
if instance != username:
distance = self.fn(self.data[username],
self.data[instance])
distances.append((instance, distance))
# sort based on distance -- closest first
distances.sort(key=lambda artistTuple: artistTuple[1],
reverse=True)
return distances def recommend(self, user):
"""Give list of recommendations"""
recommendations = {}
# first get list of users ordered by nearness
nearest = self.computeNearestNeighbor(user)
#
# now get the ratings for the user
#
userRatings = self.data[user]
#
# determine the total distance
totalDistance = 0.0
for i in range(self.k):
totalDistance += nearest[i][1]
# now iterate through the k nearest neighbors
# accumulating their ratings
for i in range(self.k):
# compute slice of pie
weight = nearest[i][1] / totalDistance
# get the name of the person
name = nearest[i][0]
# get the ratings for this person
neighborRatings = self.data[name]
# get the name of the person
# now find bands neighbor rated that user didn't
for artist in neighborRatings:
if not artist in userRatings:
if artist not in recommendations:
recommendations[artist] = (neighborRatings[artist]
* weight)
else:
recommendations[artist] = (recommendations[artist]
+ neighborRatings[artist]
* weight)
# now make list from dictionary
recommendations = list(recommendations.items())
recommendations = [(self.convertProductID2name(k), v)
for (k, v) in recommendations]
# finally sort and return
recommendations.sort(key=lambda artistTuple: artistTuple[1],
reverse = True)
# Return the first n items
return recommendations[:self.n] if __name__ == '__main__':
# users as dataset
r=recommender(users)
print r.recommend('Jordyn')
print r.recommend('Hailey') # file as dataset
r.loadBookDB('BX-Dump/BX-Dump/')
print r.recommend('') print r.userRatings('', 5)

(数据挖掘-入门-3)基于用户的协同过滤之k近邻的更多相关文章

  1. 推荐召回--基于用户的协同过滤UserCF

    目录 1. 前言 2. 原理 3. 数据及相似度计算 4. 根据相似度计算结果 5. 相关问题 5.1 如何提炼用户日志数据? 5.2 用户相似度计算很耗时,有什么好的方法? 5.3 有哪些改进措施? ...

  2. 基于用户的协同过滤电影推荐user-CF python

    协同过滤包括基于物品的协同过滤和基于用户的协同过滤,本文基于电影评分数据做基于用户的推荐 主要做三个部分:1.读取数据:2.构建用户与用户的相似度矩阵:3.进行推荐: 查看数据u.data 主要用到前 ...

  3. Mahout实现基于用户的协同过滤算法

    Mahout中对协同过滤算法进行了封装,看一个简单的基于用户的协同过滤算法. 基于用户:通过用户对物品的偏好程度来计算出用户的在喜好上的近邻,从而根据近邻的喜好推测出用户的喜好并推荐. 图片来源 程序 ...

  4. 【推荐系统实战】:C++实现基于用户的协同过滤(UserCollaborativeFilter)

    好早的时候就打算写这篇文章,可是还是參加阿里大数据竞赛的第一季三月份的时候实验就完毕了.硬生生是拖到了十一假期.自己也是醉了... 找工作不是非常顺利,希望写点东西回想一下知识.然后再攒点人品吧,仅仅 ...

  5. 基于用户的协同过滤的电影推荐算法(tensorflow)

    数据集: https://grouplens.org/datasets/movielens/ ml-latest-small 协同过滤算法理论基础 https://blog.csdn.net/u012 ...

  6. (数据挖掘-入门-6)十折交叉验证和K近邻

    主要内容: 1.十折交叉验证 2.混淆矩阵 3.K近邻 4.python实现 一.十折交叉验证 前面提到了数据集分为训练集和测试集,训练集用来训练模型,而测试集用来测试模型的好坏,那么单一的测试是否就 ...

  7. 案例:Spark基于用户的协同过滤算法

    https://mp.weixin.qq.com/s?__biz=MzA3MDY0NTMxOQ==&mid=2247484291&idx=1&sn=4599b4e31c2190 ...

  8. 基于用户的协同过滤(UserCF)

  9. Music Recommendation System with User-based and Item-based Collaborative Filtering Technique(使用基于用户及基于物品的协同过滤技术的音乐推荐系统)【更新】

    摘要: 大数据催生了互联网,电子商务,也导致了信息过载.信息过载的问题可以由推荐系统来解决.推荐系统可以提供选择新产品(电影,音乐等)的建议.这篇论文介绍了一个音乐推荐系统,它会根据用户的历史行为和口 ...

随机推荐

  1. python开发_webbrowser_浏览器控制模块

    ''' python的webbrowser模块支持对浏览器进行一些操作 主要有以下三个方法: webbrowser.open(url, new=0, autoraise=True) webbrowse ...

  2. zoj 3157 Weapon 逆序数/树状数组

    B - Weapon Time Limit:1000MS     Memory Limit:32768KB     64bit IO Format:%lld & %llu Submit Sta ...

  3. Asky极简教程:零基础1小时学编程,已更新前8节

    Asky极简架构 开源Asky极简架构.超轻量级.高并发.水平扩展.微服务架构 <Asky极简教程:零基础1小时学编程>开源教程 零基础入门,从零开始全程演示,如何开发一个大型互联网系统, ...

  4. postgresql的ALTER经常使用操作

    postgresql版本号:psql (9.3.4) 1.添加一列ALTER TABLE table_name ADD column_name datatype; 2.删除一列 ALTER TABLE ...

  5. 如何在windows server 2008上配置NLB群集

    参考:http://zlwdouhao.blog.51cto.com/731028/781828 前些天写了一篇关于NLB群集模式多播和单播的简单介绍.那么下面我们一起来探讨一下,如何在windows ...

  6. HTML中显示的文字自动换行

    在html中控制自动换行 http://www.cnblogs.com/zjxbetter/articles/1323449.html eg: <table> <tr> < ...

  7. MVC之Ajax如影随行

    一.Ajax的前世今生 我一直觉得google是一家牛逼的公司,为什么这样说呢?<舌尖上的中国>大家都看了,那些美食估计你是百看不厌,但是里边我觉得其实也有这样的一个哲学:关于食材,对于种 ...

  8. 用Qemu模拟vexpress-a9 (七) --- 嵌入式设备上安装telnet服务

    转载: http://blog.csdn.net/liuqz2009/article/details/6921789 Telnet协议是登陆远程网 络主机最简单的方法之一,只是安全性非常低.对targ ...

  9. ubuntu下如何查看自己的外网IP

    1.1 安装使用curl命令实现      sudo apt-get install curl1.2 输入命令      curl ifconfig.me

  10. struts2点滴记录

    1.s:textfield 赋值方法 <s:textfield name="Tname" value="%{#session.Teacher.name}" ...