(Data Mining - Introduction - 3) User-Based Collaborative Filtering: k-Nearest Neighbors

Contents:

1. k-nearest neighbors

2. Python implementation

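1. What is k-nearest neighbors (KNN)?

Introduction-1 gave a simple implementation of the nearest-neighbor algorithm for user-based collaborative filtering. Nearest neighbor means finding the single user who is closest (or most similar) to the target user and recommending that user's items.

Here, k-nearest neighbors (K Nearest Neighbor) means finding the k closest or most similar users and recommending the few items with the highest scores, where an item's score is the sum of those neighbors' ratings weighted by similarity.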
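
The weighting step described above can be sketched in a few lines. This is a minimal illustration, not the full listing: the neighbor names, similarities, and ratings are made up, and the helper name `weighted_scores` is hypothetical. Each neighbor's weight is its similarity divided by the sum of the k similarities, and only items the target user has not rated are scored.

```python
# Hypothetical data for illustration only: the k=2 most similar users
# as (name, similarity) pairs, and the ratings they gave.
neighbors = [("Ann", 0.75), ("Bob", 0.25)]
ratings = {
    "Ann": {"item_a": 4.0, "item_b": 2.0},
    "Bob": {"item_a": 1.0, "item_c": 5.0},
}
already_rated = {"item_b"}  # items the target user has rated


def weighted_scores(neighbors, ratings, already_rated):
    """Sum each neighbor's ratings, weighted by its share of total similarity."""
    total = sum(sim for _, sim in neighbors)
    scores = {}
    for name, sim in neighbors:
        weight = sim / total
        for item, rating in ratings[name].items():
            if item not in already_rated:
                scores[item] = scores.get(item, 0.0) + rating * weight
    # highest score first
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


scores = weighted_scores(neighbors, ratings, already_rated)
print(scores)  # [('item_a', 3.25), ('item_c', 1.25)]
```

Here item_a scores 4.0 * 0.75 + 1.0 * 0.25 = 3.25, and item_b is skipped because the target user already rated it.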

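2. Python implementation

The code uses two datasets:

one is the users dictionary written directly in the code;

the other is contained in the files BX-Book-Ratings.csv, BX-Books.csv, and BX-Users.csv (download: http://www.guidetodatamining.com/assets/data/BX-Dump.zip).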
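
As a rough sketch of how ratings in that layout map onto the same nested-dictionary shape as users: the rows below are made up, and the semicolon-separated, double-quoted format is an assumption inferred from the split(';') and strip('"') calls in the listing. The helper name `load_ratings` is hypothetical. Python's csv module is used here instead of manual splitting, since it keeps a quoted field intact even if it contains a semicolon.

```python
import csv
import io

# Made-up sample rows in the assumed layout: "user";"item";"rating"
sample = io.StringIO(
    '"u1";"isbn-1";"5"\n'
    '"u1";"isbn-2";"0"\n'
    '"u2";"isbn-1";"8"\n'
)


def load_ratings(lines):
    """Build {user: {item: rating}} from semicolon-separated rating rows."""
    data = {}
    for user, item, rating in csv.reader(lines, delimiter=';', quotechar='"'):
        data.setdefault(user, {})[item] = int(rating)
    return data


ratings = load_ratings(sample)
print(ratings)  # {'u1': {'isbn-1': 5, 'isbn-2': 0}, 'u2': {'isbn-1': 8}}
```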

Code:

import codecs

from math import sqrt

users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0,

                      "Norah Jones": 4.5, "Phoenix": 5.0,

                      "Slightly Stoopid": 1.5,

                      "The Strokes": 2.5, "Vampire Weekend": 2.0},

         "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5,

                 "Deadmau5": 4.0, "Phoenix": 2.0,

                 "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},

         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0,

                  "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,

                  "Slightly Stoopid": 1.0},

         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0,

                 "Deadmau5": 4.5, "Phoenix": 3.0,

                 "Slightly Stoopid": 4.5, "The Strokes": 4.0,

                 "Vampire Weekend": 2.0},

         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,

                    "Norah Jones": 4.0, "The Strokes": 4.0,

                    "Vampire Weekend": 1.0},

         "Jordyn":  {"Broken Bells": 4.5, "Deadmau5": 4.0,

                     "Norah Jones": 5.0, "Phoenix": 5.0,

                     "Slightly Stoopid": 4.5, "The Strokes": 4.0,

                     "Vampire Weekend": 4.0},

         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0,

                 "Norah Jones": 3.0, "Phoenix": 5.0,

                 "Slightly Stoopid": 4.0, "The Strokes": 5.0},

         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,

                      "Phoenix": 4.0, "Slightly Stoopid": 2.5,

                      "The Strokes": 3.0}

        }

class recommender:

    def __init__(self, data, k=1, metric='pearson', n=5):

        """ initialize recommender

        currently, if data is dictionary the recommender is initialized

        to it.

        For all other data types of data, no initialization occurs

        k is the k value for k nearest neighbor

        metric is which distance formula to use

        n is the maximum number of recommendations to make"""

        self.k = k

        self.n = n

        self.username2id = {}

        self.userid2name = {}

        self.productid2name = {}

        # for some reason I want to save the name of the metric

        self.metric = metric

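        if self.metric == 'pearson':

            self.fn = self.pearson

        else:

            # pearson is the only metric implemented in this listing

            raise ValueError("unsupported metric: " + str(metric))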

        #

        # if data is dictionary set recommender data to it

        #

        if isinstance(data, dict):

            self.data = data

    def convertProductID2name(self, id):

        """Given product id number return product name"""

        if id in self.productid2name:

            return self.productid2name[id]

        else:

            return id

    def userRatings(self, id, n):

        """Return n top ratings for user with id"""

        print ("Ratings for " + self.userid2name[id])

        ratings = self.data[id]

        print(len(ratings))

        ratings = list(ratings.items())

        ratings = [(self.convertProductID2name(k), v)

                   for (k, v) in ratings]

        # finally sort and return

        ratings.sort(key=lambda artistTuple: artistTuple[1],

                     reverse = True)

        ratings = ratings[:n]

        for rating in ratings:

            print("%s\t%i" % (rating[0], rating[1]))

    def loadBookDB(self, path=''):

        """loads the BX book dataset. Path is where the BX files are

        located"""

        self.data = {}

        i = 0

        #

        # First load book ratings into self.data

        #

        f = codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8')

        for line in f:

            i += 1

            #separate line into fields

            fields = line.split(';')

            user = fields[0].strip('"')

            book = fields[1].strip('"')

            rating = int(fields[2].strip().strip('"'))

            if user in self.data:

                currentRatings = self.data[user]

            else:

                currentRatings = {}

            currentRatings[book] = rating

            self.data[user] = currentRatings

        f.close()

        #

        # Now load books into self.productid2name

        # Books contains isbn, title, and author among other fields

        #

        f = codecs.open(path + "BX-Books.csv", 'r', 'utf8')

        for line in f:

            i += 1

            #separate line into fields

            fields = line.split(';')

            isbn = fields[0].strip('"')

            title = fields[1].strip('"')

            author = fields[2].strip().strip('"')

            title = title + ' by ' + author

            self.productid2name[isbn] = title

        f.close()

        #

        #  Now load user info into both self.userid2name and

        #  self.username2id

        #

        f = codecs.open(path + "BX-Users.csv", 'r', 'utf8')

        for line in f:

            i += 1

            #print(line)

            #separate line into fields

            fields = line.split(';')

            userid = fields[0].strip('"')

            location = fields[1].strip('"')

            if len(fields) > 3:

                age = fields[2].strip().strip('"')

            else:

                age = 'NULL'

            if age != 'NULL':

                value = location + '  (age: ' + age + ')'

            else:

                value = location

            self.userid2name[userid] = value

            self.username2id[location] = userid

        f.close()

        print(i)

    def pearson(self, rating1, rating2):

        sum_xy = 0

        sum_x = 0

        sum_y = 0

        sum_x2 = 0

        sum_y2 = 0

        n = 0

        for key in rating1:

            if key in rating2:

                n += 1

                x = rating1[key]

                y = rating2[key]

                sum_xy += x * y

                sum_x += x

                sum_y += y

                sum_x2 += pow(x, 2)

                sum_y2 += pow(y, 2)

        if n == 0:

            return 0

        # now compute denominator

        denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n)

                       * sqrt(sum_y2 - pow(sum_y, 2) / n))

        if denominator == 0:

            return 0

        else:

            return (sum_xy - (sum_x * sum_y) / n) / denominator

    def computeNearestNeighbor(self, username):

        """creates a sorted list of users based on their distance to

        username"""

        distances = []

        for instance in self.data:

            if instance != username:

                distance = self.fn(self.data[username],

                                   self.data[instance])

                distances.append((instance, distance))

        # sort based on distance -- closest first

        distances.sort(key=lambda artistTuple: artistTuple[1],

                       reverse=True)

        return distances

    def recommend(self, user):

       """Give list of recommendations"""

       recommendations = {}

       # first get list of users  ordered by nearness

       nearest = self.computeNearestNeighbor(user)

       #

       # now get the ratings for the user

       #

       userRatings = self.data[user]

       #

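       # determine the total distance, using at most as many

       # neighbors as exist

       k = min(self.k, len(nearest))

       totalDistance = 0.0

       for i in range(k):

          totalDistance += nearest[i][1]

       # the weights below are undefined when the similarities sum to zero

       if totalDistance == 0:

          return []

       # now iterate through the k nearest neighbors

       # accumulating their ratings

       for i in range(k):

          # compute slice of pie

          weight = nearest[i][1] / totalDistance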

          # get the name of the person

          name = nearest[i][0]

          # get the ratings for this person

          neighborRatings = self.data[name]

          # get the name of the person

          # now find bands neighbor rated that user didn't

          for artist in neighborRatings:

             if not artist in userRatings:

                if artist not in recommendations:

                   recommendations[artist] = (neighborRatings[artist]

                                              * weight)

                else:

                   recommendations[artist] = (recommendations[artist]

                                              + neighborRatings[artist]

                                              * weight)

       # now make list from dictionary

       recommendations = list(recommendations.items())

       recommendations = [(self.convertProductID2name(k), v)

                          for (k, v) in recommendations]

       # finally sort and return

       recommendations.sort(key=lambda artistTuple: artistTuple[1],

                            reverse = True)

       # Return the first n items

       return recommendations[:self.n]

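if __name__ == '__main__':

    # users as dataset

    r = recommender(users)

    print(r.recommend('Jordyn'))

    print(r.recommend('Hailey'))

    # file as dataset

    r.loadBookDB('BX-Dump/BX-Dump/')

    # the user id is left empty in the original post; pass an id that

    # exists in BX-Users.csv, otherwise the calls below raise KeyError

    print(r.recommend(''))

    # userRatings prints its own output and returns None

    r.userRatings('', 5)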
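
The pearson method in the listing computes the correlation in a single pass over the items two users have both rated, using running sums rather than first computing the means. As a quick sanity check, the sketch below compares that single-pass form with the textbook mean-centered definition. The function names and the three small rating dictionaries are hypothetical, chosen so that the expected correlations are close to +1 and -1.

```python
from math import sqrt


def pearson_single_pass(r1, r2):
    """Running-sums form, mirroring the pearson method in the listing."""
    common = [key for key in r1 if key in r2]
    n = len(common)
    if n == 0:
        return 0
    sum_x = sum(r1[k] for k in common)
    sum_y = sum(r2[k] for k in common)
    sum_xy = sum(r1[k] * r2[k] for k in common)
    sum_x2 = sum(r1[k] ** 2 for k in common)
    sum_y2 = sum(r2[k] ** 2 for k in common)
    denominator = (sqrt(sum_x2 - sum_x ** 2 / n)
                   * sqrt(sum_y2 - sum_y ** 2 / n))
    if denominator == 0:
        return 0
    return (sum_xy - sum_x * sum_y / n) / denominator


def pearson_definition(r1, r2):
    """Mean-centered definition, computed only over co-rated items."""
    common = [key for key in r1 if key in r2]
    n = len(common)
    if n == 0:
        return 0
    mean_x = sum(r1[k] for k in common) / n
    mean_y = sum(r2[k] for k in common) / n
    cov = sum((r1[k] - mean_x) * (r2[k] - mean_y) for k in common)
    dev_x = sqrt(sum((r1[k] - mean_x) ** 2 for k in common))
    dev_y = sqrt(sum((r2[k] - mean_y) ** 2 for k in common))
    if dev_x * dev_y == 0:
        return 0
    return cov / (dev_x * dev_y)


# Hypothetical ratings: b rates in step with a, c rates in the opposite order.
a = {"A": 1.0, "B": 2.0, "C": 3.0}
b = {"A": 2.0, "B": 4.0, "C": 6.0, "D": 1.0}
c = {"A": 3.0, "B": 2.0, "C": 1.0}

print(pearson_single_pass(a, b), pearson_definition(a, b))  # both close to 1
print(pearson_single_pass(a, c), pearson_definition(a, c))  # both close to -1
```

Because the listing sorts neighbors with reverse=True, a higher correlation means a closer neighbor, so the value it calls a distance is really a similarity.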
