1. Definition
Collaborative filtering (CF) is a technique used by some recommender systems.[1] The term has two senses, a general one and a narrower one.[2]
General sense: collaborative filtering is the process of filtering for information or patterns using techniques that involve collaboration among multiple agents, viewpoints, data sources, etc.[2] Put simply, data from different sources is filtered according to what the sources have in common.
Narrow sense: in the newer, narrower sense, collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preference or taste information from many users (collaborating). The underlying assumption is that if person A has the same opinion as person B on one issue, A is more likely to share B's opinion on a different issue x than to share the opinion on x of a randomly chosen person. Preferences collected from many users can therefore be used to predict whether a given user will be interested in a topic.
[Figure: how narrow-sense CF works]



2. Workflow

Collaborative filtering algorithms usually require (1) active participation from users, (2) an easy way for users to express their interests to the system, and (3) algorithms that can match people with similar interests.

A typical collaborative filtering workflow is:

  1. A user expresses his or her preferences by rating items (e.g. books, movies or CDs) in the system. These ratings can be viewed as an approximate representation of the user's interest in the corresponding domain.
  2. The system matches this user's ratings against those of other users and finds the people with the most "similar" tastes.
  3. The system recommends items that these similar users have rated highly but that this user has not yet rated (the absence of a rating is usually taken to mean that the user is unfamiliar with the item, for example a movie not yet watched).

A key problem of collaborative filtering is how to combine and weight the preferences of a user's neighbors. Sometimes users rate the recommended items right away, so the system gains an increasingly accurate representation of user preferences over time.
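The three workflow steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ratings` data, the `recommend` helper, and the `min_rating` threshold are all hypothetical, and the similarity here is a simplified cosine computed over co-rated items only.

```python
from math import sqrt

# Hypothetical toy data (step 1): user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob":   {"A": 5, "B": 5, "C": 1, "D": 5},
    "carol": {"A": 1, "B": 2, "C": 5, "E": 4},
}

def cosine(u, v):
    """Simplified cosine similarity restricted to items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def recommend(user, ratings, min_rating=4):
    # Step 2: find the single most similar other user.
    others = [name for name in ratings if name != user]
    nearest = max(others, key=lambda name: cosine(ratings[user], ratings[name]))
    # Step 3: suggest items the neighbour rated highly that `user` has not rated.
    return sorted(
        item for item, r in ratings[nearest].items()
        if r >= min_rating and item not in ratings[user]
    )

print(recommend("alice", ratings))  # ['D']
```

Using a single nearest neighbour keeps the sketch short; combining several weighted neighbours is the harder problem discussed in the rest of this article.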

3. Formulas

Memory-based

Typical examples of the memory-based approach are neighbourhood-based CF and item-based/user-based top-N recommendation.[3] For example, in user-based approaches, the rating $r_{u,i}$ that user u gives to item i is calculated as an aggregation of the ratings that some similar users gave to the item:

$$ r_{u,i} = \operatorname{aggr}_{u' \in U} \, r_{u',i} $$

where U denotes the set of the top N users who are most similar to user u and who have rated item i.

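Some examples of the aggregation function (aggr) are:

  Mean:
  $$ r_{u,i} = \frac{1}{N} \sum_{u' \in U} r_{u',i} $$
  Similarity-weighted mean:
  $$ r_{u,i} = k \sum_{u' \in U} \operatorname{simil}(u,u') \, r_{u',i} $$
  Adjusted similarity-weighted mean:
  $$ r_{u,i} = \bar{r}_u + k \sum_{u' \in U} \operatorname{simil}(u,u') \, (r_{u',i} - \bar{r}_{u'}) $$

where k is a normalizing factor defined as $k = 1 / \sum_{u' \in U} |\operatorname{simil}(u,u')|$, and $\bar{r}_u$ is the average rating of user u over all the items rated by that user.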
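As a rough sketch, the three aggregation functions (mean, similarity-weighted mean, and adjusted similarity-weighted mean) can be written as follows. The function names and the sample numbers are assumptions made for illustration; each function takes the similarities simil(u, u') and the ratings of item i for the N neighbours in U.

```python
def mean_aggr(neighbour_ratings):
    """Plain average of the neighbours' ratings of item i."""
    return sum(neighbour_ratings) / len(neighbour_ratings)

def weighted_aggr(sims, neighbour_ratings):
    """Similarity-weighted average, with k = 1 / sum(|simil(u, u')|)."""
    k = 1 / sum(abs(s) for s in sims)
    return k * sum(s * r for s, r in zip(sims, neighbour_ratings))

def adjusted_aggr(user_mean, sims, neighbour_ratings, neighbour_means):
    """Weighted average of each neighbour's deviation from their own mean,
    added back onto the target user's mean rating."""
    k = 1 / sum(abs(s) for s in sims)
    deviation = sum(
        s * (r - m) for s, r, m in zip(sims, neighbour_ratings, neighbour_means)
    )
    return user_mean + k * deviation

# Hypothetical numbers: two neighbours with similarities 0.9 and 0.5.
sims = [0.9, 0.5]
neighbour_ratings = [5, 3]      # their ratings of item i
neighbour_means = [4.0, 3.5]    # their average ratings over all items
user_mean = 3.0                 # the target user's average rating

print(mean_aggr(neighbour_ratings))
print(round(weighted_aggr(sims, neighbour_ratings), 4))
print(round(adjusted_aggr(user_mean, sims, neighbour_ratings, neighbour_means), 4))
```

The adjusted form compensates for users who rate systematically high or low, because it aggregates deviations from each neighbour's own mean rather than raw ratings.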

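Similarity-based algorithms:

The neighborhood-based algorithm calculates the similarity between two users or items and produces a prediction for the user by taking the weighted average of all the ratings. Computing the similarity between items or users is an important part of this approach. Several measures are used for this, most commonly Pearson correlation (which is unaffected by shifting a user's ratings by a constant) and vector cosine-based similarity.

The Pearson correlation similarity of two users x, y is defined as

$$ \operatorname{simil}(x,y) = \frac{\sum_{i \in I_{xy}} (r_{x,i} - \bar{r}_x)(r_{y,i} - \bar{r}_y)}{\sqrt{\sum_{i \in I_{xy}} (r_{x,i} - \bar{r}_x)^2} \, \sqrt{\sum_{i \in I_{xy}} (r_{y,i} - \bar{r}_y)^2}} $$

where $I_{xy}$ is the set of items rated by both user x and user y.

The cosine-based approach defines the cosine similarity between two users x and y as:[4]

$$ \operatorname{simil}(x,y) = \cos(\vec{x},\vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\| \, \|\vec{y}\|} = \frac{\sum_{i \in I_{xy}} r_{x,i} \, r_{y,i}}{\sqrt{\sum_{i \in I_x} r_{x,i}^2} \, \sqrt{\sum_{i \in I_y} r_{y,i}^2}} $$

where $I_x$ and $I_y$ are the sets of items rated by user x and by user y, respectively.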
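A minimal sketch of the two similarity measures, assuming each user is represented as a hypothetical `{item: rating}` dictionary. Pearson sums over the co-rated items and uses each user's mean over all items that user rated; cosine uses the co-rated items in the numerator and each user's full rating vector in the norms. Returning 0.0 when the denominator vanishes is a convention chosen here, not part of the definitions.

```python
from math import sqrt

def pearson(rx, ry):
    """Pearson correlation over co-rated items, using each user's overall mean."""
    common = set(rx) & set(ry)
    mean_x = sum(rx.values()) / len(rx)
    mean_y = sum(ry.values()) / len(ry)
    num = sum((rx[i] - mean_x) * (ry[i] - mean_y) for i in common)
    den = sqrt(sum((rx[i] - mean_x) ** 2 for i in common)) * sqrt(
        sum((ry[i] - mean_y) ** 2 for i in common)
    )
    return num / den if den else 0.0

def cosine(rx, ry):
    """Cosine similarity: dot product over co-rated items, norms over all rated items."""
    common = set(rx) & set(ry)
    num = sum(rx[i] * ry[i] for i in common)
    den = sqrt(sum(r * r for r in rx.values())) * sqrt(sum(r * r for r in ry.values()))
    return num / den if den else 0.0

# Hypothetical users: y rates every item exactly one point higher than x.
x = {"A": 1, "B": 2, "C": 3}
y = {"A": 2, "B": 3, "C": 4}

print(pearson(x, y))  # close to 1.0: the constant shift does not matter
print(cosine(x, y))   # slightly below 1.0: cosine is sensitive to the shift
```

The example illustrates why Pearson is often preferred when users differ in how generously they rate: a constant offset between two users leaves the Pearson score unchanged.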

The user-based top-N recommendation algorithm uses a similarity-based vector model to identify the k users most similar to an active user. After the k most similar users are found, a k-nearest-neighbour (kNN) step aggregates their rows of the user-item matrix to identify the set of items to recommend. A popular method for finding the similar users is locality-sensitive hashing, which implements the nearest neighbor mechanism in linear time.
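The following sketch shows one way a user-based top-N recommender could be put together: a brute-force search for the k nearest neighbours, followed by a similarity-weighted average of their ratings for items the active user has not rated. The data, the `top_n` helper, and its parameters are hypothetical, and locality-sensitive hashing is not used here; an exhaustive scan stands in for it.

```python
import heapq
from math import sqrt

# Hypothetical toy data: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob":   {"A": 5, "B": 5, "C": 1, "D": 5},
    "carol": {"A": 1, "B": 2, "C": 5, "E": 4},
    "dave":  {"A": 4, "B": 5, "D": 4, "E": 2},
}

def cosine(rx, ry):
    """Cosine similarity: dot product over co-rated items, norms over all rated items."""
    common = set(rx) & set(ry)
    num = sum(rx[i] * ry[i] for i in common)
    den = sqrt(sum(r * r for r in rx.values())) * sqrt(sum(r * r for r in ry.values()))
    return num / den if den else 0.0

def top_n(user, ratings, k=2, n=2):
    # Similarity of the active user to every other user (exhaustive scan).
    sims = {v: cosine(ratings[user], ratings[v]) for v in ratings if v != user}
    neighbours = heapq.nlargest(k, sims, key=sims.get)
    # Similarity-weighted average rating for each item the user has not rated.
    num, den = {}, {}
    for v in neighbours:
        for item, r in ratings[v].items():
            if item in ratings[user]:
                continue
            num[item] = num.get(item, 0.0) + sims[v] * r
            den[item] = den.get(item, 0.0) + abs(sims[v])
    scores = {item: num[item] / den[item] for item in num if den[item] > 0}
    return heapq.nlargest(n, scores, key=scores.get)

print(top_n("alice", ratings, k=2, n=2))  # ['D', 'E']
```

With these numbers the two nearest neighbours of alice are bob and dave; both rated D highly, so D ranks ahead of E.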

Advantages:

  • The results are easy to explain, which is an important aspect of recommendation systems;
  • It is easy to create and use;
  • New data can be added easily and incrementally;
  • It does not need to consider the content of the items being recommended;
  • It scales well with co-rated items.

Disadvantages:

  • It depends on human ratings;
  • Its performance decreases when data gets sparse, which is common with web-related items;
  • It scales poorly to large datasets.

Although the approach can handle new users efficiently because it relies on a data structure, adding new items is more complicated, since that representation usually relies on a specific vector space: the new item must be included and all the elements re-inserted into the structure.
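To make the sparsity problem concrete, the sketch below measures how full a small rating matrix is and lists the user pairs that share no co-rated item. For such pairs a memory-based similarity has nothing to work with. The data and variable names are hypothetical.

```python
from itertools import combinations

# Hypothetical sparse data: most user-item cells are empty.
ratings = {
    "u1": {"A": 5, "B": 3},
    "u2": {"C": 4},
    "u3": {"B": 2, "D": 5},
}

items = sorted({item for user_ratings in ratings.values() for item in user_ratings})
n_ratings = sum(len(user_ratings) for user_ratings in ratings.values())

# Fraction of user-item cells that actually hold a rating.
density = n_ratings / (len(ratings) * len(items))

# User pairs with no item in common: their similarity cannot be estimated.
no_overlap = [
    (x, y)
    for x, y in combinations(sorted(ratings), 2)
    if not set(ratings[x]) & set(ratings[y])
]

print(f"density = {density:.2f}")
print("pairs without co-rated items:", no_overlap)
```

Here two of the three user pairs have no overlap at all, even though the matrix is far denser than typical real-world rating data.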


Model-based

Model-based CF.

Models are developed using data mining and machine learning algorithms to find patterns in training data. These models are then used to make predictions for real data. There are many model-based CF algorithms, including Bayesian networks, clustering models, latent semantic models such as singular value decomposition, probabilistic latent semantic analysis, multiple multiplicative factor, latent Dirichlet allocation, and Markov decision process based models.[5]

This approach has a more holistic goal: to uncover latent factors that explain the observed ratings.[6] Most of the models are based on a classification or clustering technique that identifies the user based on the test set. The number of parameters can be reduced using types of principal component analysis (PCA).
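As one illustration of a latent factor model, the sketch below fits a low-rank factorization of the rating matrix with stochastic gradient descent (SGD), so that each rating is approximated by the dot product of a user vector and an item vector. This is a commonly used technique chosen here for brevity, not a method prescribed by the text above; the data, the hyperparameters (`lr`, `reg`, `RANK`, the number of epochs), and all function names are assumptions.

```python
import random

# Hypothetical toy data: (user index, item index, rating on a 1-5 scale).
observed = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 5), (1, 1, 5), (1, 3, 5),
    (2, 0, 1), (2, 2, 5), (2, 4, 4),
]
N_USERS, N_ITEMS, RANK = 3, 5, 2

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rmse(data, P, Q):
    """Root-mean-square error of the model on the given ratings."""
    return (sum((r - dot(P[u], Q[i])) ** 2 for u, i, r in data) / len(data)) ** 0.5

def init_factors(n_rows, rank, rng):
    """Small positive random starting values for the latent vectors."""
    return [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_rows)]

def sgd_epoch(data, P, Q, lr=0.02, reg=0.02):
    """One pass over the observed ratings with L2-regularized SGD updates."""
    for u, i, r in data:
        err = r - dot(P[u], Q[i])
        for f in range(len(P[u])):
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lr * (err * qi - reg * pu)
            Q[i][f] += lr * (err * pu - reg * qi)

rng = random.Random(0)  # fixed seed so repeated runs give the same result
P = init_factors(N_USERS, RANK, rng)  # user latent factors
Q = init_factors(N_ITEMS, RANK, rng)  # item latent factors

before = rmse(observed, P, Q)
for _ in range(500):
    sgd_epoch(observed, P, Q)
after = rmse(observed, P, Q)

print(f"training RMSE before={before:.3f} after={after:.3f}")
print(f"predicted rating of user 0 for unrated item 3: {dot(P[0], Q[3]):.2f}")
```

Once the factors are learned, a prediction for any user-item pair is a single dot product, which is why such models can serve recommendations without scanning the neighbours' ratings.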

Advantages: the model-based approach handles sparsity better than memory-based approaches, which helps it scale to large data sets. It improves prediction performance and gives an intuitive rationale for the recommendations.

Disadvantages: building the model is expensive, and there is a tradeoff between prediction performance and scalability. Reducing the model (for example, cutting the number of parameters) can lose useful information, and a number of models have difficulty explaining their predictions.
