(Data Mining - Getting Started - 3) User-Based Collaborative Filtering: k-Nearest Neighbors

Contents:

1. k-nearest neighbors

2. Python implementation

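1. What is k-nearest neighbors (kNN)?

Getting Started - 1 gave a simple implementation of the nearest-neighbor algorithm for user-based collaborative filtering. Nearest neighbor means finding the single user who is closest, or most similar, to the target user and recommending that user's items.

Here, k-nearest neighbors (kNN) means finding the k closest or most similar users and recommending the items with the highest scores, where an item's score is the sum of those users' ratings weighted by their similarity.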
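The weighting step can be sketched in a few lines. The similarity values and ratings below are hypothetical numbers chosen for illustration, not taken from the datasets in this post, and the function name `weighted_scores` and its input shape are assumptions as well.

```python
def weighted_scores(neighbors, k):
    """Combine the ratings of the k most similar neighbors.

    neighbors: list of (similarity, ratings_dict) pairs (assumed input shape).
    Each neighbor's weight is its similarity divided by the sum of the
    top-k similarities, so the k weights add up to 1.
    """
    top = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(sim for sim, _ in top)
    if total == 0:
        return {}
    scores = {}
    for sim, ratings in top:
        weight = sim / total
        for item, rating in ratings.items():
            scores[item] = scores.get(item, 0.0) + rating * weight
    return scores


# Hypothetical similarities and ratings, for illustration only.
neighbors = [
    (0.8, {"item_a": 3.5}),
    (0.7, {"item_a": 5.0}),
    (0.5, {"item_a": 4.5}),
    (0.1, {"item_a": 1.0}),
]
scores = weighted_scores(neighbors, k=3)
print(scores)
```

With k=3 the fourth neighbor is ignored, and the remaining three contribute with weights 0.4, 0.35, and 0.25.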

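2. Python implementation

The code uses two datasets:

one is the users dictionary written directly in the code;

the other is contained in the files BX-Book-Ratings.csv, BX-Books.csv, and BX-Users.csv (download: http://www.guidetodatamining.com/assets/data/BX-Dump.zip).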
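The loader in the code splits each line on `;` and strips the surrounding double quotes. The same idea could be sketched with Python's standard `csv` module. The sample lines here are made up for illustration. The assumed layout (`"user";"isbn";"rating"`, every field quoted) is inferred from the parsing code in this post, not from the dataset's documentation. The name `load_ratings` is hypothetical.

```python
import csv
import io

# Hypothetical sample in the assumed layout: "user";"isbn";"rating"
sample = '"u1";"isbn-001";"5"\n"u1";"isbn-002";"0"\n"u2";"isbn-001";"8"\n'


def load_ratings(stream):
    """Build {user: {isbn: rating}} from semicolon-delimited, quoted lines."""
    data = {}
    for user, isbn, rating in csv.reader(stream, delimiter=';', quotechar='"'):
        data.setdefault(user, {})[isbn] = int(rating)
    return data


ratings = load_ratings(io.StringIO(sample))
print(ratings)
```

The nested dictionary has the same shape as the users dictionary, which is why one recommender can work with both datasets.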

Code:

import codecs
from math import sqrt

users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0,
                      "Norah Jones": 4.5, "Phoenix": 5.0,
                      "Slightly Stoopid": 1.5,
                      "The Strokes": 2.5, "Vampire Weekend": 2.0},
         "Bill": {"Blues Traveler": 2.0, "Broken Bells": 3.5,
                  "Deadmau5": 4.0, "Phoenix": 2.0,
                  "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0,
                  "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,
                  "Slightly Stoopid": 1.0},
         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0,
                 "Deadmau5": 4.5, "Phoenix": 3.0,
                 "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                 "Vampire Weekend": 2.0},
         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
                    "Norah Jones": 4.0, "The Strokes": 4.0,
                    "Vampire Weekend": 1.0},
         "Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0,
                    "Norah Jones": 5.0, "Phoenix": 5.0,
                    "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                    "Vampire Weekend": 4.0},
         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0,
                 "Norah Jones": 3.0, "Phoenix": 5.0,
                 "Slightly Stoopid": 4.0, "The Strokes": 5.0},
         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,
                      "Phoenix": 4.0, "Slightly Stoopid": 2.5,
                      "The Strokes": 3.0}
        }

class recommender:

    def __init__(self, data, k=1, metric='pearson', n=5):
        """ initialize recommender
        currently, if data is dictionary the recommender is initialized
        to it.
        For all other data types of data, no initialization occurs
        k is the k value for k nearest neighbor
        metric is which distance formula to use
        n is the maximum number of recommendations to make"""
        self.k = k
        self.n = n
        self.username2id = {}
        self.userid2name = {}
        self.productid2name = {}
        # for some reason I want to save the name of the metric
        self.metric = metric
        if self.metric == 'pearson':
            self.fn = self.pearson
        #
        # if data is dictionary set recommender data to it
        #
        if isinstance(data, dict):
            self.data = data

    def convertProductID2name(self, id):
        """Given product id number return product name"""
        if id in self.productid2name:
            return self.productid2name[id]
        else:
            return id

    def userRatings(self, id, n):
        """Return n top ratings for user with id"""
        print("Ratings for " + self.userid2name[id])
        ratings = self.data[id]
        print(len(ratings))
        ratings = list(ratings.items())
        ratings = [(self.convertProductID2name(k), v)
                   for (k, v) in ratings]
        # finally sort and return
        ratings.sort(key=lambda artistTuple: artistTuple[1],
                     reverse=True)
        ratings = ratings[:n]
        for rating in ratings:
            print("%s\t%i" % (rating[0], rating[1]))

    def loadBookDB(self, path=''):
        """loads the BX book dataset. Path is where the BX files are
        located"""
        self.data = {}
        i = 0
        #
        # First load book ratings into self.data
        #
        f = codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8')
        for line in f:
            i += 1
            # separate line into fields
            fields = line.split(';')
            user = fields[0].strip('"')
            book = fields[1].strip('"')
            rating = int(fields[2].strip().strip('"'))
            if user in self.data:
                currentRatings = self.data[user]
            else:
                currentRatings = {}
            currentRatings[book] = rating
            self.data[user] = currentRatings
        f.close()
        #
        # Now load books into self.productid2name
        # Books contains isbn, title, and author among other fields
        #
        f = codecs.open(path + "BX-Books.csv", 'r', 'utf8')
        for line in f:
            i += 1
            # separate line into fields
            fields = line.split(';')
            isbn = fields[0].strip('"')
            title = fields[1].strip('"')
            author = fields[2].strip().strip('"')
            title = title + ' by ' + author
            self.productid2name[isbn] = title
        f.close()
        #
        #  Now load user info into both self.userid2name and
        #  self.username2id
        #
        f = codecs.open(path + "BX-Users.csv", 'r', 'utf8')
        for line in f:
            i += 1
            # separate line into fields
            fields = line.split(';')
            userid = fields[0].strip('"')
            location = fields[1].strip('"')
            if len(fields) > 3:
                age = fields[2].strip().strip('"')
            else:
                age = 'NULL'
            if age != 'NULL':
                value = location + '  (age: ' + age + ')'
            else:
                value = location
            self.userid2name[userid] = value
            self.username2id[location] = userid
        f.close()
        # total number of lines read across the three files
        print(i)

    def pearson(self, rating1, rating2):
        """Pearson correlation between two users, computed in a single
        pass over the items both of them rated. Returns 0 when there are
        no common items or the denominator is 0."""
        sum_xy = 0
        sum_x = 0
        sum_y = 0
        sum_x2 = 0
        sum_y2 = 0
        n = 0
        for key in rating1:
            if key in rating2:
                n += 1
                x = rating1[key]
                y = rating2[key]
                sum_xy += x * y
                sum_x += x
                sum_y += y
                sum_x2 += pow(x, 2)
                sum_y2 += pow(y, 2)
        if n == 0:
            return 0
        # now compute denominator
        denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n)
                       * sqrt(sum_y2 - pow(sum_y, 2) / n))
        if denominator == 0:
            return 0
        else:
            return (sum_xy - (sum_x * sum_y) / n) / denominator

    def computeNearestNeighbor(self, username):
        """creates a sorted list of users based on their similarity to
        username"""
        distances = []
        for instance in self.data:
            if instance != username:
                distance = self.fn(self.data[username],
                                   self.data[instance])
                distances.append((instance, distance))
        # the metric is a similarity, so sort descending: most similar first
        distances.sort(key=lambda artistTuple: artistTuple[1],
                       reverse=True)
        return distances

    def recommend(self, user):
        """Give list of recommendations"""
        recommendations = {}
        # first get list of users ordered by nearness
        nearest = self.computeNearestNeighbor(user)
        # now get the ratings for the user
        userRatings = self.data[user]
        # never look at more neighbors than actually exist
        k = min(self.k, len(nearest))
        # determine the total similarity of the k nearest neighbors
        totalDistance = 0.0
        for i in range(k):
            totalDistance += nearest[i][1]
        # without any similarity mass there is nothing to weight by
        if totalDistance == 0:
            return []
        # now iterate through the k nearest neighbors
        # accumulating their ratings
        for i in range(k):
            # compute slice of pie
            weight = nearest[i][1] / totalDistance
            # get the name of the person
            name = nearest[i][0]
            # get the ratings for this person
            neighborRatings = self.data[name]
            # now find bands neighbor rated that user didn't
            for artist in neighborRatings:
                if artist not in userRatings:
                    if artist not in recommendations:
                        recommendations[artist] = (neighborRatings[artist]
                                                   * weight)
                    else:
                        recommendations[artist] = (recommendations[artist]
                                                   + neighborRatings[artist]
                                                   * weight)
        # now make list from dictionary
        recommendations = list(recommendations.items())
        recommendations = [(self.convertProductID2name(k), v)
                           for (k, v) in recommendations]
        # finally sort and return
        recommendations.sort(key=lambda artistTuple: artistTuple[1],
                             reverse=True)
        # Return the first n items
        return recommendations[:self.n]

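if __name__ == '__main__':
    # users as dataset
    r = recommender(users)
    print(r.recommend('Jordyn'))
    print(r.recommend('Hailey'))
    # file as dataset
    r.loadBookDB('BX-Dump/BX-Dump/')
    # the user id is blank in the source; substitute an id from BX-Users.csv
    print(r.recommend(''))
    # userRatings prints its own output and returns nothing
    r.userRatings('', 5)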
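The pearson method in the class computes the Pearson correlation coefficient in a single pass over the commonly rated items. As a sanity check, the sketch below restates that computation as a standalone function and applies it to small hand-made rating dictionaries. The function name `pearson_similarity` and the toy data are hypothetical and not part of the original code.

```python
from math import sqrt


def pearson_similarity(rating1, rating2):
    """Single-pass Pearson correlation over the items both users rated."""
    common = [key for key in rating1 if key in rating2]
    n = len(common)
    if n == 0:
        return 0
    sum_x = sum(rating1[key] for key in common)
    sum_y = sum(rating2[key] for key in common)
    sum_xy = sum(rating1[key] * rating2[key] for key in common)
    sum_x2 = sum(rating1[key] ** 2 for key in common)
    sum_y2 = sum(rating2[key] ** 2 for key in common)
    denominator = (sqrt(sum_x2 - sum_x ** 2 / n)
                   * sqrt(sum_y2 - sum_y ** 2 / n))
    if denominator == 0:
        return 0
    return (sum_xy - sum_x * sum_y / n) / denominator


# Toy ratings for illustration only.
a = {"x": 1.0, "y": 2.0, "z": 3.0}
same_trend = {"x": 2.0, "y": 4.0, "z": 6.0}   # rises exactly as a does
opposite = {"x": 3.0, "y": 2.0, "z": 1.0}     # falls where a rises

print(pearson_similarity(a, same_trend))
print(pearson_similarity(a, opposite))
```

Ratings that move together give a value close to 1, and ratings that move in opposite directions give a value close to -1. This is why the neighbor list is sorted in descending order.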
