Spark MLlib Concepts 4: Collaborative Filtering (CF)

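1. Definition

Collaborative filtering has two senses, a narrow one and a broad one:

Collaborative filtering in the broad sense: filtering data that comes from different sources according to what those sources have in common.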

Collaborative filtering (CF) is a technique used by some recommender systems.^[1] Collaborative filtering has two senses, a narrow one and a more general one.^[2] In general, collaborative filtering is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc.^[2]

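Collaborative filtering in the narrow sense: it assumes that if users A and B share the same interest in a topic Y, then A is more likely to share B's opinion on another topic X than the opinion of a randomly chosen person. Based on the interests collected from many users, it can therefore predict whether a given user will be interested in a topic.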

In the newer, narrower sense, collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue x than to have the opinion on x of a person chosen randomly.

The figure below illustrates how CF in the narrow sense works:

Source: <http://en.wikipedia.org/wiki/Collaborative_filtering>

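2. Workflow

Collaborative filtering generally requires (1) active user participation, (2) a way for users to express their interests, and (3) an algorithm that can find users with similar interests.

A typical recommender system workflow is:

Users rate some items to express their interests or preferences;
The system uses the ratings to find the "most similar" users;
Within a group of users with similar interests, if some users A rated an item X highly while another user Y has not rated X yet (has not taken part yet, for example has not seen that movie), the system recommends X to Y.

The key problem is how to combine and weight the preferences of "neighboring" users. Sometimes a user who is recommended an item X rates it right away, so the recommender can keep improving its accuracy.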

Collaborative filtering algorithms often require (1) users’ active participation, (2) an easy way to represent users’ interests to the system, and (3) algorithms that are able to match people with similar interests.

Typically, the workflow of a collaborative filtering system is:

A user expresses his or her preferences by rating items (e.g. books, movies or CDs) of the system. These ratings can be viewed as an approximate representation of the user's interest in the corresponding domain.
The system matches this user’s ratings against other users’ and finds the people with most “similar” tastes.
With similar users, the system recommends items that the similar users have rated highly but that this user has not yet rated (the absence of a rating is usually taken to mean the user is unfamiliar with the item).

A key problem of collaborative filtering is how to combine and weight the preferences of user neighbors. Sometimes, users can immediately rate the recommended items. As a result, the system gains an increasingly accurate representation of user preferences over time.
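The three workflow steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ratings are made up, the `agreement` similarity (negative mean absolute difference on co-rated items) is a deliberately crude stand-in, and only the single most similar user is consulted.

```python
# Toy ratings: user -> {item: rating on a 1-5 scale}. All data is invented.
ratings = {
    "alice": {"Matrix": 5, "Titanic": 1, "Inception": 5},
    "bob":   {"Matrix": 5, "Titanic": 2, "Inception": 4, "Interstellar": 5},
    "carol": {"Matrix": 1, "Titanic": 5, "Notebook": 5},
}

def agreement(a, b):
    """Crude similarity (an assumption of this sketch): negative mean
    absolute rating difference over the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return float("-inf")
    return -sum(abs(a[i] - b[i]) for i in common) / len(common)

def recommend(user, ratings, min_rating=4):
    """Step 2: find the most similar user. Step 3: return the items that
    user rated highly and that `user` has not rated yet."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: agreement(ratings[user], ratings[u]))
    return sorted(
        item for item, r in ratings[nearest].items()
        if r >= min_rating and item not in ratings[user]
    )

print(recommend("alice", ratings))  # ['Interstellar']
```

Here bob is alice's nearest neighbor, so the one item bob rated highly that alice has not rated is recommended.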

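3. Formulas

Memory-based

In general there are three kinds of recommenders of this type: neighborhood-based, user-based, and item-based. For example,

the rating of user u for item i can be computed from the ratings of "similar users" through an aggregation function (aggr).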

Typical examples of this mechanism are neighbourhood-based CF and item-based/user-based top-N recommendations.[3] For example, in user-based approaches, the rating that user u gives to item i is calculated as an aggregation of the ratings that some similar users gave to the item:

$$r_{u,i} = \operatorname{aggr}_{u' \in U} r_{u',i}$$

where U denotes the set of the top N users that are most similar to user u and have rated item i.

Some examples of the aggregation function include:

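Average:

$$r_{u,i} = \frac{1}{N} \sum_{u' \in U} r_{u',i}$$

Similarity-weighted average:

$$r_{u,i} = k \sum_{u' \in U} \operatorname{simil}(u,u')\, r_{u',i}$$

Adjusted similarity-weighted average:

$$r_{u,i} = \bar{r}_u + k \sum_{u' \in U} \operatorname{simil}(u,u')\,(r_{u',i} - \bar{r}_{u'})$$

where k is a normalizing factor defined as $k = 1 / \sum_{u' \in U} |\operatorname{simil}(u,u')|$, and $\bar{r}_u$ is the average rating of user u for all the items rated by that user.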
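As a rough illustration, the three aggregation functions (plain average, similarity-weighted average, and adjusted similarity-weighted average) can be written directly in Python. The data layout is an assumption of this sketch: `neighbors` is a list of `(similarity, ratings_dict)` pairs standing in for the set U, already restricted to users who rated the item, and the similarity values are simply given rather than computed.

```python
def mean_rating(user_ratings):
    """Average rating of one user over all items that user rated."""
    return sum(user_ratings.values()) / len(user_ratings)

def predict_average(neighbors, item):
    """Plain average of the neighbors' ratings for `item`."""
    values = [r[item] for _, r in neighbors]
    return sum(values) / len(values)

def predict_weighted(neighbors, item):
    """Similarity-weighted average; k normalizes by the sum of |similarity|."""
    k = 1.0 / sum(abs(s) for s, _ in neighbors)
    return k * sum(s * r[item] for s, r in neighbors)

def predict_adjusted(target_ratings, neighbors, item):
    """Adjusted weighted average: work on deviations from each user's own
    mean, then add the target user's mean back."""
    k = 1.0 / sum(abs(s) for s, _ in neighbors)
    deviation = sum(s * (r[item] - mean_rating(r)) for s, r in neighbors)
    return mean_rating(target_ratings) + k * deviation

# Hypothetical data: the target user u and two neighbors who rated item "x".
target = {"a": 4, "b": 2}                      # mean rating 3
neighbors = [
    (1.0, {"a": 5, "x": 3}),                   # mean 4, rates x below own mean
    (0.5, {"b": 1, "x": 5}),                   # mean 3, rates x above own mean
]

print(predict_average(neighbors, "x"))           # 4.0
print(predict_weighted(neighbors, "x"))          # about 3.67
print(predict_adjusted(target, neighbors, "x"))  # 3.0
```

The adjusted variant gives a different answer because it compensates for users who rate systematically high or low.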

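Similarity-based algorithms:

Computing the similarity between users or between items is a central part of this approach. Pearson correlation (which is invariant to a constant shift in a user's ratings) and cosine similarity are the usual choices.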

The neighborhood-based algorithm calculates the similarity between two users or items and produces a prediction for the user by taking the weighted average of all the ratings. Similarity computation between items or users is an important part of this approach. Multiple mechanisms such as Pearson correlation and vector cosine-based similarity are used for this.

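The Pearson correlation similarity of two users x, y is defined as

$$\operatorname{simil}(x,y) = \frac{\sum_{i \in I_{xy}} (r_{x,i} - \bar{r}_x)(r_{y,i} - \bar{r}_y)}{\sqrt{\sum_{i \in I_{xy}} (r_{x,i} - \bar{r}_x)^2}\,\sqrt{\sum_{i \in I_{xy}} (r_{y,i} - \bar{r}_y)^2}}$$

where I_xy is the set of items rated by both user x and user y.

The cosine-based approach defines the cosine-similarity between two users x and y as:^[4]

$$\operatorname{simil}(x,y) = \cos(\vec{x},\vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\,\|\vec{y}\|} = \frac{\sum_{i \in I_{xy}} r_{x,i}\, r_{y,i}}{\sqrt{\sum_{i \in I_x} r_{x,i}^2}\,\sqrt{\sum_{i \in I_y} r_{y,i}^2}}$$

where I_x and I_y are the sets of items rated by user x and user y, respectively.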
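Both similarity measures are short enough to sketch in plain Python. The representation is an assumption: each user is a dict mapping item to rating, and, following the definition of the average rating used for the aggregation functions, each user's mean is taken over all items that user rated. Returning 0.0 when a denominator is zero is also a choice of this sketch, not part of the definitions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation over the items rated by both users (I_xy)."""
    common = set(x) & set(y)
    if not common:
        return 0.0
    mean_x = sum(x.values()) / len(x)
    mean_y = sum(y.values()) / len(y)
    num = sum((x[i] - mean_x) * (y[i] - mean_y) for i in common)
    den = (sqrt(sum((x[i] - mean_x) ** 2 for i in common))
           * sqrt(sum((y[i] - mean_y) ** 2 for i in common)))
    return num / den if den else 0.0

def cosine(x, y):
    """Cosine similarity: dot product over co-rated items, divided by the
    norms of each user's full rating vector."""
    common = set(x) & set(y)
    num = sum(x[i] * y[i] for i in common)
    den = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    return num / den if den else 0.0

x = {"a": 1, "b": 2, "c": 3}
shifted = {"a": 3, "b": 4, "c": 5}    # same tastes as x, but rates 2 points higher
opposite = {"a": 3, "b": 2, "c": 1}   # reversed preferences

print(pearson(x, shifted), cosine(x, shifted))
print(pearson(x, opposite))
```

For `x` and `shifted`, Pearson is 1 (up to floating-point rounding) while cosine is slightly below 1, which is the shift invariance mentioned above; for `opposite`, Pearson is -1.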

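Once the N most similar users have been found, a k-nearest-neighbor (kNN) step aggregates their ratings to produce recommendations for the target user.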

The user-based top-N recommendation algorithm identifies the k most similar users to an active user using a similarity-based vector model. After the k most similar users are found, their corresponding user-item matrices are aggregated to identify the set of items to be recommended. A popular method to find the similar users is locality-sensitive hashing, which implements the nearest neighbor mechanism in linear time.
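A minimal sketch of user-based top-N recommendation follows. It makes several assumptions: neighbors are found by a brute-force scan with cosine similarity (not locality-sensitive hashing), candidate items are scored by a similarity-weighted average of the neighbors' ratings, and the users, items, and ratings are invented.

```python
from math import sqrt

def cosine(x, y):
    """Cosine similarity between two users' rating dicts."""
    num = sum(x[i] * y[i] for i in set(x) & set(y))
    den = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    return num / den if den else 0.0

def top_n_recommend(user, ratings, k=2, n=2):
    """Pick the k most similar users, then rank the items they rated
    (and `user` did not) by similarity-weighted average rating."""
    target = ratings[user]
    neighbors = sorted(
        ((cosine(target, r), name) for name, r in ratings.items() if name != user),
        reverse=True,
    )[:k]
    scores, weights = {}, {}
    for sim, name in neighbors:
        if sim <= 0:
            continue  # ignore users with no positive similarity
        for item, r in ratings[name].items():
            if item in target:
                continue  # only recommend items the user has not rated
            scores[item] = scores.get(item, 0.0) + sim * r
            weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda i: (-scores[i] / weights[i], i))
    return ranked[:n]

ratings = {
    "u1": {"A": 5, "B": 4},
    "u2": {"A": 5, "B": 5, "C": 5, "D": 2},
    "u3": {"A": 4, "B": 4, "C": 4, "E": 5},
    "u4": {"A": 1, "F": 5},
}

print(top_n_recommend("u1", ratings, k=2, n=2))  # ['E', 'C']
```

Item F is never suggested to u1 because u4, the only user who rated it, is not among the two nearest neighbors.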

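Advantages:

Results are easy to explain (which matters a lot for recommender systems);
Easy to build and use;
New data can be added incrementally;
No need to consider the content of the items;
Scales well with co-rated items?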

The advantages with this approach include: the explainability of the results, which is an important aspect of recommendation systems; it is easy to create and use; new data can be added easily and incrementally; it need not consider the content of the items being recommended; and the mechanism scales well with co-rated items.

Disadvantages:

Depends on human ratings;
Performs poorly on sparse data, and sparse data is very common;
Scales poorly to large datasets.

There are several disadvantages with this approach. First, it depends on human ratings. Second, its performance decreases when data gets sparse, which is frequent with web-related items. This hinders the scalability of the approach and causes problems with large datasets. Although it can efficiently handle new users because it relies on a data structure, adding new items becomes more complicated since that representation usually relies on a specific vector space. That would require including the new item and re-inserting all the elements in the structure.

Model-based

Model-based CF.

Models are developed using data mining and machine learning algorithms to find patterns based on training data. These are used to make predictions for real data. There are many model-based CF algorithms. These include Bayesian networks, clustering models, latent semantic models such as singular value decomposition, probabilistic latent semantic analysis, multiple multiplicative factor, latent Dirichlet allocation, and Markov decision process based models.^[5]

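This approach has the more holistic goal of uncovering latent factors. Most models are built on some classification or clustering technique, and the number of parameters can be reduced with PCA.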

This approach has a more holistic goal to uncover latent factors that explain observed ratings.^[6] Most of the models are based on creating a classification or clustering technique to identify the user based on the test set. The number of the parameters can be reduced based on types of principal component analysis.

Advantages: handles sparse data well and scales well to large datasets.

There are several advantages with this paradigm. It handles the sparsity better than memory based ones. This helps with scalability with large data sets. It improves the prediction performance. It gives an intuitive rationale for the recommendations.

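Disadvantages: prediction quality has to be traded off against scalability, and reducing the number of parameters can discard useful information.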

The disadvantages with this approach are in the expensive model building. One needs to have a tradeoff between prediction performance and scalability. One can lose useful information due to reduction models. A number of models have difficulty explaining the predictions.
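To make the idea of latent factors concrete, here is a small latent-factor model trained with stochastic gradient descent in plain Python. It is a generic sketch, not Spark MLlib's API and not any specific algorithm from the list above: the function names, hyperparameters (`rank`, `lr`, `reg`, `epochs`), and the toy ratings are all assumptions chosen for illustration.

```python
import random

def predict(P, Q, u, i):
    """Predicted rating: dot product of user and item factor vectors."""
    return sum(a * b for a, b in zip(P[u], Q[i]))

def rmse(P, Q, ratings):
    """Root-mean-square error over the observed (user, item, rating) triples."""
    return (sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
            / len(ratings)) ** 0.5

def train_mf(ratings, n_users, n_items, rank=2, lr=0.02, reg=0.02,
             epochs=500, seed=0):
    """Learn user factors P and item factors Q by SGD on squared error
    with L2 regularization. With epochs=0 it returns the random init."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(rank):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

# Invented data: users 0 and 1 prefer items 0 and 1; users 2 and 3 prefer items 2 and 3.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 2, 5), (2, 3, 4), (2, 0, 1),
    (3, 2, 4), (3, 3, 5), (3, 1, 1),
]

P0, Q0 = train_mf(ratings, 4, 4, epochs=0)
P, Q = train_mf(ratings, 4, 4)
print("RMSE before training:", rmse(P0, Q0, ratings))
print("RMSE after training: ", rmse(P, Q, ratings))
print("Predicted rating of user 0 for unseen item 3:", predict(P, Q, 0, 3))
```

Each user and each item is described by only `rank` numbers, which is how such a model copes with sparse ratings; the cost is the training loop itself and the fact that the learned factors have no obvious interpretation.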
