Mahout Recommendation Algorithms: Slope One
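I. How the algorithm works
Unlike user-based and item-based collaborative filtering, Slope One estimates a user's rating of an item with a simple linear model of the form f(x) = x + b. In the figure below, the goal is to estimate UserB's preference for ItemJ.
Figure (1)
Applied to real data, this approach raises two questions:
1. Why use UserA in particular for the calculation?
2. How should very sparse data be handled? Sparse data is by far the most common case.
Figure (2)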
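The ratings implied by the calculations below are: the first user rates Item1, Item2 and Item3 as 5, 3 and 2; Make rates Item1 and Item2 as 3 and 4; Lucy rates Item2 and Item3 as 2 and 5.
Average difference between Item1 and Item2: ((5-3)+(3-4))/2 = 0.5
Average difference between Item1 and Item3: (5-2)/1 = 3
Estimated rating of Item1 by Lucy: ((2+0.5)*2 + (5+3)*1)/(2+1) = 4.333
Average difference between Item3 and Item1: (2-5)/1 = -3
Average difference between Item3 and Item2: ((2-3)+(5-2))/2 = 1
Estimated rating of Item3 by Make: ((3-3)*1 + (4+1)*2)/(1+2) = 3.333
The example shows that two quantities are needed for every item pair: the average rating difference, and the number of users who contributed to that average.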
Training pseudocode given by Mahout:
for every item i
  for every other item j
    for every user u expressing preference for both i and j
      add the difference in u's preference for i and j to an average
Recommendation pseudocode:
for every item i the user u expresses no preference for
  for every item j that user u expresses a preference for
    find the average preference difference between j and i
    add this diff to u's preference value for j
    add this to a running average
return the top items, ranked by these averages
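The two pseudocode loops can be sketched in plain Java without any Mahout classes. This is a minimal illustration, not Mahout's implementation: the class name SlopeOneSketch, its methods, and the three users' ratings (inferred from the arithmetic of the worked example) are all assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names and data are assumptions, not Mahout code.
class SlopeOneSketch {
    // diffSum.get(i).get(j): sum over users of (rating of i - rating of j)
    private final Map<Long, Map<Long, Double>> diffSum = new HashMap<>();
    // diffCount.get(i).get(j): number of users who rated both i and j
    private final Map<Long, Map<Long, Integer>> diffCount = new HashMap<>();

    /** Training: accumulate the rating difference of every item pair a user rated. */
    void train(Iterable<Map<Long, Double>> allUserPrefs) {
        for (Map<Long, Double> prefs : allUserPrefs) {
            for (Map.Entry<Long, Double> a : prefs.entrySet()) {
                for (Map.Entry<Long, Double> b : prefs.entrySet()) {
                    if (a.getKey().equals(b.getKey())) {
                        continue;
                    }
                    diffSum.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                           .merge(b.getKey(), a.getValue() - b.getValue(), Double::sum);
                    diffCount.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                             .merge(b.getKey(), 1, Integer::sum);
                }
            }
        }
    }

    /** Prediction: count-weighted average of (user's rating of j + average diff of target vs j). */
    double predict(Map<Long, Double> userPrefs, long target) {
        Map<Long, Double> sums = diffSum.getOrDefault(target, Collections.emptyMap());
        Map<Long, Integer> counts = diffCount.getOrDefault(target, Collections.emptyMap());
        double total = 0.0;
        int weight = 0;
        for (Map.Entry<Long, Double> pref : userPrefs.entrySet()) {
            Integer n = counts.get(pref.getKey());
            if (n == null) {
                continue; // no user rated both the target and this item
            }
            double averageDiff = sums.get(pref.getKey()) / n;
            total += (pref.getValue() + averageDiff) * n;
            weight += n;
        }
        return weight == 0 ? Double.NaN : total / weight;
    }

    private static Map<Long, Double> prefs(long[] items, double[] values) {
        Map<Long, Double> m = new HashMap<>();
        for (int i = 0; i < items.length; i++) {
            m.put(items[i], values[i]);
        }
        return m;
    }

    /** Estimates Lucy's rating of Item1 from the (assumed) example ratings. */
    static double lucyItem1() {
        Map<Long, Double> first = prefs(new long[] {1, 2, 3}, new double[] {5, 3, 2});
        Map<Long, Double> make = prefs(new long[] {1, 2}, new double[] {3, 4});
        Map<Long, Double> lucy = prefs(new long[] {2, 3}, new double[] {2, 5});
        SlopeOneSketch model = new SlopeOneSketch();
        model.train(Arrays.asList(first, make, lucy));
        return model.predict(lucy, 1L);
    }

    public static void main(String[] args) {
        System.out.println(lucyItem1()); // about 4.333
    }
}
```

The sketch stores a sum and a count per item pair rather than the average itself, so that a new rating can be folded in without revisiting earlier users.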
II. Single-machine implementation
(1) Building the diffs
1. In-memory construction (MemoryDiffStorage)
private void buildAverageDiffs() throws TasteException {
  log.info("Building average diffs...");
  try {
    buildAverageDiffsLock.writeLock().lock();
    averageDiffs.clear();
    long averageCount = 0L;
    LongPrimitiveIterator it = dataModel.getUserIDs();
    while (it.hasNext()) {
      averageCount = processOneUser(averageCount, it.nextLong());
    }
    pruneInconsequentialDiffs();
    updateAllRecommendableItems();
  } finally {
    buildAverageDiffsLock.writeLock().unlock();
  }
}

private void pruneInconsequentialDiffs() {
  // Go back and prune inconsequential diffs. "Inconsequential" means, here, only represented by one
  // data point, so possibly unreliable
  Iterator<Map.Entry<Long,FastByIDMap<RunningAverage>>> it1 = averageDiffs.entrySet().iterator();
  while (it1.hasNext()) {
    FastByIDMap<RunningAverage> map = it1.next().getValue();
    Iterator<Map.Entry<Long,RunningAverage>> it2 = map.entrySet().iterator();
    while (it2.hasNext()) {
      RunningAverage average = it2.next().getValue();
      if (average.getCount() <= 1) {
        it2.remove();
      }
    }
    if (map.isEmpty()) {
      it1.remove();
    } else {
      map.rehash();
    }
  }
  averageDiffs.rehash();
}

private void updateAllRecommendableItems() throws TasteException {
  FastIDSet ids = new FastIDSet(dataModel.getNumItems());
  for (Map.Entry<Long,FastByIDMap<RunningAverage>> entry : averageDiffs.entrySet()) {
    ids.add(entry.getKey());
    LongPrimitiveIterator it = entry.getValue().keySetIterator();
    while (it.hasNext()) {
      ids.add(it.next());
    }
  }
  allRecommendableItemIDs.clear();
  allRecommendableItemIDs.addAll(ids);
  allRecommendableItemIDs.rehash();
}

private long processOneUser(long averageCount, long userID) throws TasteException {
  log.debug("Processing prefs for user {}", userID);
  // Save off prefs for the life of this loop iteration
  PreferenceArray userPreferences = dataModel.getPreferencesFromUser(userID);
  int length = userPreferences.length();
  for (int i = 0; i < length; i++) { // Loop to length-1, not length-2, not for diffs but average item pref
    float prefAValue = userPreferences.getValue(i);
    long itemIDA = userPreferences.getItemID(i);
    FastByIDMap<RunningAverage> aMap = averageDiffs.get(itemIDA);
    if (aMap == null) {
      aMap = new FastByIDMap<RunningAverage>();
      averageDiffs.put(itemIDA, aMap);
    }
    for (int j = i + 1; j < length; j++) {
      // This is a performance-critical block
      long itemIDB = userPreferences.getItemID(j);
      RunningAverage average = aMap.get(itemIDB);
      if (average == null && averageCount < maxEntries) {
        average = buildRunningAverage();
        aMap.put(itemIDB, average);
        averageCount++;
      }
      if (average != null) {
        average.addDatum(userPreferences.getValue(j) - prefAValue);
      }
    }
    RunningAverage itemAverage = averageItemPref.get(itemIDA);
    if (itemAverage == null) {
      itemAverage = buildRunningAverage();
      averageItemPref.put(itemIDA, itemAverage);
    }
    itemAverage.addDatum(prefAValue);
  }
  return averageCount;
}

private RunningAverage buildRunningAverage() {
  return stdDevWeighted ? new FullRunningAverageAndStdDev() : new FullRunningAverage();
}
2. MapReduce construction (FileDiffStorage)
The MapReduce job that computes the diffs is described further below. It is an offline computation: the diffs cannot be updated in real time, but the approach suits large data sets. Because the MapReduce job computes the values for all item pairs, its results are more accurate than those of the single-machine mode. Once the job has finished, copy its output to the local machine and load it with new FileDiffStorage(new File("diff"), 500). FileDiffStorage does not support adding or removing preferences (and, since the diffs are precomputed, it could not).
private void buildDiffs() {
  if (buildAverageDiffsLock.writeLock().tryLock()) {
    try {
      averageDiffs.clear();
      allRecommendableItemIDs.clear();
      FileLineIterator iterator = new FileLineIterator(dataFile, false);
      String firstLine = iterator.peek();
      while (firstLine.isEmpty() || firstLine.charAt(0) == COMMENT_CHAR) {
        iterator.next();
        firstLine = iterator.peek();
      }
      long averageCount = 0L;
      while (iterator.hasNext()) {
        averageCount = processLine(iterator.next(), averageCount);
      }
      pruneInconsequentialDiffs();
      updateAllRecommendableItems();
    } catch (IOException ioe) {
      log.warn("Exception while reloading", ioe);
    } finally {
      buildAverageDiffsLock.writeLock().unlock();
    }
  }
}

private long processLine(String line, long averageCount) {
  if (line.isEmpty() || line.charAt(0) == COMMENT_CHAR) {
    return averageCount;
  }
  String[] tokens = SEPARATOR.split(line);
  Preconditions.checkArgument(tokens.length >= 3 && tokens.length != 5, "Bad line: %s", line);
  long itemID1 = Long.parseLong(tokens[0]);
  long itemID2 = Long.parseLong(tokens[1]);
  double diff = Double.parseDouble(tokens[2]);
  int count = tokens.length >= 4 ? Integer.parseInt(tokens[3]) : 1;
  boolean hasMkSk = tokens.length >= 5;
  if (itemID1 > itemID2) {
    long temp = itemID1;
    itemID1 = itemID2;
    itemID2 = temp;
  }
  FastByIDMap<RunningAverage> level1Map = averageDiffs.get(itemID1);
  if (level1Map == null) {
    level1Map = new FastByIDMap<RunningAverage>();
    averageDiffs.put(itemID1, level1Map);
  }
  RunningAverage average = level1Map.get(itemID2);
  if (average != null) {
    throw new IllegalArgumentException("Duplicated line for item-item pair " + itemID1 + " / " + itemID2);
  }
  if (averageCount < maxEntries) {
    if (hasMkSk) {
      double mk = Double.parseDouble(tokens[4]);
      double sk = Double.parseDouble(tokens[5]);
      average = new FullRunningAverageAndStdDev(count, diff, mk, sk);
    } else {
      average = new FullRunningAverage(count, diff);
    }
    level1Map.put(itemID2, average);
    averageCount++;
  }
  allRecommendableItemIDs.add(itemID1);
  allRecommendableItemIDs.add(itemID2);
  return averageCount;
}

private void pruneInconsequentialDiffs() {
  // Go back and prune inconsequential diffs. "Inconsequential" means, here, only represented by one
  // data point, so possibly unreliable
  Iterator<Map.Entry<Long,FastByIDMap<RunningAverage>>> it1 = averageDiffs.entrySet().iterator();
  while (it1.hasNext()) {
    FastByIDMap<RunningAverage> map = it1.next().getValue();
    Iterator<Map.Entry<Long,RunningAverage>> it2 = map.entrySet().iterator();
    while (it2.hasNext()) {
      RunningAverage average = it2.next().getValue();
      if (average.getCount() <= 1) {
        it2.remove();
      }
    }
    if (map.isEmpty()) {
      it1.remove();
    } else {
      map.rehash();
    }
  }
  averageDiffs.rehash();
}

private void updateAllRecommendableItems() {
  for (Map.Entry<Long,FastByIDMap<RunningAverage>> entry : averageDiffs.entrySet()) {
    allRecommendableItemIDs.add(entry.getKey());
    LongPrimitiveIterator it = entry.getValue().keySetIterator();
    while (it.hasNext()) {
      allRecommendableItemIDs.add(it.next());
    }
  }
  allRecommendableItemIDs.rehash();
}
(2) Estimating a preference
private float doEstimatePreference(long userID, long itemID) throws TasteException {
  double count = 0.0;
  double totalPreference = 0.0;
  PreferenceArray prefs = getDataModel().getPreferencesFromUser(userID);
  RunningAverage[] averages = diffStorage.getDiffs(userID, itemID, prefs);
  int size = prefs.length();
  for (int i = 0; i < size; i++) {
    RunningAverage averageDiff = averages[i];
    if (averageDiff != null) {
      double averageDiffValue = averageDiff.getAverage();
      if (weighted) {
        double weight = averageDiff.getCount();
        if (stdDevWeighted) {
          double stdev = ((RunningAverageAndStdDev) averageDiff).getStandardDeviation();
          if (!Double.isNaN(stdev)) {
            weight /= 1.0 + stdev;
          }
          // If stdev is NaN, then it is because count is 1. Because we're weighting by count,
          // the weight is already relatively low. We effectively assume stdev is 0.0 here and
          // that is reasonable enough. Otherwise, dividing by NaN would yield a weight of NaN
          // and disqualify this pref entirely
        }
        totalPreference += weight * (prefs.getValue(i) + averageDiffValue);
        count += weight;
      } else {
        totalPreference += prefs.getValue(i) + averageDiffValue;
        count += 1.0;
      }
    }
  }
  if (count <= 0.0) {
    RunningAverage itemAverage = diffStorage.getAverageItemPref(itemID);
    return itemAverage == null ? Float.NaN : (float) itemAverage.getAverage();
  } else {
    return (float) (totalPreference / count);
  }
}
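To see how the two weighting flags change the result, the estimation loop can be reduced to a standalone sketch that works on plain arrays. The class WeightedEstimateSketch and its parameters are hypothetical; the fallback to the item's average rating is left out, so the sketch simply returns NaN when no diff is available. The numbers in main reuse Lucy's ratings from the worked example, and the standard deviation of 3.0 is an arbitrary illustrative value.

```java
// Illustrative sketch only; not Mahout code.
class WeightedEstimateSketch {
    /**
     * userPrefs[i]: the user's rating of item i
     * avgDiffs[i]:  average diff between the target item and item i
     * counts[i]:    number of users behind avgDiffs[i] (0 means no diff available)
     * stdDevs[i]:   standard deviation of the diffs, NaN when unknown
     */
    static double estimate(double[] userPrefs, double[] avgDiffs, int[] counts,
                           double[] stdDevs, boolean weighted, boolean stdDevWeighted) {
        double total = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < userPrefs.length; i++) {
            if (counts[i] == 0) {
                continue;
            }
            double weight = 1.0;
            if (weighted) {
                weight = counts[i];
                // As in the listing above, the stdev adjustment only applies
                // inside the count-weighted branch.
                if (stdDevWeighted && !Double.isNaN(stdDevs[i])) {
                    weight /= 1.0 + stdDevs[i];
                }
            }
            total += weight * (userPrefs[i] + avgDiffs[i]);
            weightSum += weight;
        }
        return weightSum <= 0.0 ? Double.NaN : total / weightSum;
    }

    public static void main(String[] args) {
        double[] prefs = {2.0, 5.0};       // Lucy's ratings of Item2 and Item3
        double[] diffs = {0.5, 3.0};       // average diffs of Item1 against Item2 and Item3
        int[] counts = {2, 1};
        double[] stdDevs = {3.0, Double.NaN}; // hypothetical values
        System.out.println(estimate(prefs, diffs, counts, stdDevs, false, false)); // 5.25
        System.out.println(estimate(prefs, diffs, counts, stdDevs, true, false));  // about 4.333
        System.out.println(estimate(prefs, diffs, counts, stdDevs, true, true));
    }
}
```

With count weighting, the diff backed by two users pulls the estimate toward 2.5; dividing by 1 + stdev then reduces the influence of that same diff because its underlying values disagree.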
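(3) Recommending
An online recommender system should keep only a single SlopeOneRecommender instance.
Method signature | Description | Notes
public void setPreference(long userID, | Adds a preference. Online systems need this frequently. | Adds the preference dynamically, then updates the average diffs between that item and the other items.
public void removePreference(long userID, | Removes a preference. Rarely used. | After the removal, updates the average diffs between that item and the other items.
public List<RecommendedItem> recommend(long userID, int howMany, IDRescorer rescorer) | Produces recommendations. The IDRescorer applies business rules by adjusting item scores. | 1. Takes the items the user has not yet rated as candidates. 2. Estimates a score for each candidate and returns the top K.
public float estimatePreference(long userID,long itemID) | Estimates the user's rating of the item. | Returns the actual rating if the user has one for the item; otherwise returns an estimate.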
1. The recommend method
public List<RecommendedItem> recommend(long userID, int howMany, IDRescorer rescorer) throws TasteException {
  Preconditions.checkArgument(howMany >= 1, "howMany must be at least 1");
  log.debug("Recommending items for user ID '{}'", userID);
  FastIDSet possibleItemIDs = diffStorage.getRecommendableItemIDs(userID);
  TopItems.Estimator<Long> estimator = new Estimator(userID);
  List<RecommendedItem> topItems = TopItems.getTopItems(howMany, possibleItemIDs.iterator(), rescorer, estimator);
  log.debug("Recommendations are: {}", topItems);
  return topItems;
}
2. Getting the recommendation candidates
public FastIDSet getRecommendableItemIDs(long userID) throws TasteException {
  FastIDSet result;
  try {
    buildAverageDiffsLock.readLock().lock();
    result = allRecommendableItemIDs.clone();
  } finally {
    buildAverageDiffsLock.readLock().unlock();
  }
  Iterator<Long> it = result.iterator();
  while (it.hasNext()) {
    if (dataModel.getPreferenceValue(userID, it.next()) != null) {
      it.remove();
    }
  }
  return result;
}
3. Estimating candidate scores and returning the top-K list
public static List<RecommendedItem> getTopItems(int howMany,
                                                LongPrimitiveIterator possibleItemIDs,
                                                IDRescorer rescorer,
                                                Estimator<Long> estimator) throws TasteException {
  Preconditions.checkArgument(possibleItemIDs != null, "argument is null");
  Preconditions.checkArgument(estimator != null, "argument is null");
  Queue<RecommendedItem> topItems = new PriorityQueue<RecommendedItem>(howMany + 1,
      Collections.reverseOrder(ByValueRecommendedItemComparator.getInstance()));
  boolean full = false;
  double lowestTopValue = Double.NEGATIVE_INFINITY;
  while (possibleItemIDs.hasNext()) {
    long itemID = possibleItemIDs.next();
    if (rescorer == null || !rescorer.isFiltered(itemID)) {
      double preference;
      try {
        preference = estimator.estimate(itemID);
      } catch (NoSuchItemException nsie) {
        continue;
      }
      double rescoredPref = rescorer == null ? preference : rescorer.rescore(itemID, preference);
      if (!Double.isNaN(rescoredPref) && (!full || rescoredPref > lowestTopValue)) {
        topItems.add(new GenericRecommendedItem(itemID, (float) rescoredPref));
        if (full) {
          topItems.poll();
        } else if (topItems.size() > howMany) {
          full = true;
          topItems.poll();
        }
        lowestTopValue = topItems.peek().getValue();
      }
    }
  }
  int size = topItems.size();
  if (size == 0) {
    return Collections.emptyList();
  }
  List<RecommendedItem> result = Lists.newArrayListWithCapacity(size);
  result.addAll(topItems);
  Collections.sort(result, ByValueRecommendedItemComparator.getInstance());
  return result;
}
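The top-K selection in getTopItems relies on a bounded priority queue whose head is always the weakest candidate kept so far. A minimal sketch of that technique using only the JDK follows; the class TopKSketch, the nested Scored type, and the sample scores are assumptions made for illustration, and the rescorer and filtering steps are left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch only; not Mahout code.
class TopKSketch {
    static final class Scored {
        final long id;
        final double score;

        Scored(long id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    /** Returns the IDs of the howMany highest-scoring items, best first. Assumes howMany >= 1. */
    static List<Long> topK(long[] ids, double[] scores, int howMany) {
        Comparator<Scored> byScore = Comparator.comparingDouble(s -> s.score);
        // Min-heap: the head is the lowest score currently kept.
        PriorityQueue<Scored> heap = new PriorityQueue<>(howMany + 1, byScore);
        for (int i = 0; i < ids.length; i++) {
            if (Double.isNaN(scores[i])) {
                continue; // no estimate for this item
            }
            heap.add(new Scored(ids[i], scores[i]));
            if (heap.size() > howMany) {
                heap.poll(); // evict the weakest candidate
            }
        }
        List<Scored> kept = new ArrayList<>(heap);
        kept.sort(byScore.reversed());
        List<Long> result = new ArrayList<>(kept.size());
        for (Scored s : kept) {
            result.add(s.id);
        }
        return result;
    }

    public static void main(String[] args) {
        long[] ids = {1, 2, 3, 4, 5};
        double[] scores = {3.0, Double.NaN, 4.5, 1.0, 4.0};
        System.out.println(topK(ids, scores, 2)); // [3, 5]
    }
}
```

Keeping at most howMany + 1 entries in the heap means memory stays proportional to K rather than to the number of candidates, which matters when the candidate set is every item the user has not rated.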
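III. MapReduce implementation (computing the diffs)
1. Compute the item-to-item differences for each user
Map: the input is a text file whose lines have the form userId\t itemId\t val. The output key is userId and the output value is itemId\t val.
Reduce:
for (user u : users) {
  items = the items rated by u
  for (int i = 0; i < items.size; i++) {
    itemA = items[i];
    for (int j = i + 1; j < items.size; j++) {
      itemB = items[j];
      itemABdiff = val(itemB) - val(itemA);
      out.write(itemA\t itemB, itemABdiff);
    }
  }
}
2. Compute the global average for each item pair
Map: passes the data through unchanged, so that all records for the same item pair reach the same reducer.
Reduce: the input key is itemA\t itemB and the input value is itemABdiff. The reducer computes the average of each group (using FullRunningAverageAndStdDev). The output key is an EntityEntityWritable and the output value is a FullRunningAverageAndStdDevWritable.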
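The two jobs can be simulated in memory to check the data flow before running anything on a cluster. The sketch below is a single-process imitation, not the Hadoop job itself: the class DiffJobSketch, the user IDs u1 to u3, and the sample ratings are assumptions. Items are sorted by ID within each user so that every user emits the same key for a given pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; simulates the two MapReduce phases in one process.
class DiffJobSketch {
    /** Maps "itemA\titemB" to {average diff, number of users}. */
    static Map<String, double[]> averageDiffs(List<String> lines) {
        // Job 1, map: key each rating by user ID.
        Map<String, TreeMap<Long, Double>> byUser = new HashMap<>();
        for (String line : lines) {
            String[] t = line.split("\t");
            byUser.computeIfAbsent(t[0], k -> new TreeMap<>())
                  .put(Long.parseLong(t[1]), Double.parseDouble(t[2]));
        }
        // Job 1, reduce: emit one diff per item pair of each user.
        // Grouping by pair key stands in for the shuffle of job 2.
        Map<String, List<Double>> byPair = new TreeMap<>();
        for (TreeMap<Long, Double> prefs : byUser.values()) {
            List<Map.Entry<Long, Double>> items = new ArrayList<>(prefs.entrySet());
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    String key = items.get(i).getKey() + "\t" + items.get(j).getKey();
                    double diff = items.get(j).getValue() - items.get(i).getValue();
                    byPair.computeIfAbsent(key, k -> new ArrayList<>()).add(diff);
                }
            }
        }
        // Job 2, reduce: average the diffs of each pair.
        Map<String, double[]> result = new TreeMap<>();
        for (Map.Entry<String, List<Double>> e : byPair.entrySet()) {
            double sum = 0.0;
            for (double d : e.getValue()) {
                sum += d;
            }
            int n = e.getValue().size();
            result.put(e.getKey(), new double[] {sum / n, n});
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "u1\t1\t5", "u1\t2\t3", "u1\t3\t2",
            "u2\t1\t3", "u2\t2\t4",
            "u3\t2\t2", "u3\t3\t5");
        for (Map.Entry<String, double[]> e : averageDiffs(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue()[0] + "\t" + (int) e.getValue()[1]);
        }
    }
}
```

Each output line has the shape itemID1, itemID2, diff, count, which matches the first four tokens that processLine in FileDiffStorage parses.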
IV. Worked examples
(1) Single-machine mode
MemoryDiffStorage mds = new MemoryDiffStorage(new FileDataModel(new File("pereference")), Weighting.WEIGHTED, 1000);
SlopeOneRecommender sr = new SlopeOneRecommender(new FileDataModel(new File("pereference")), Weighting.WEIGHTED, Weighting.WEIGHTED, mds);
System.out.println(sr.recommend(1, 10, new IDRescorer() {
  @Override
  public double rescore(long id, double originalScore) {
    int clickCount = 10; // click count of item id
    return originalScore * clickCount;
  }

  @Override
  public boolean isFiltered(long id) {
    // Return false if id belongs to the same category as the items to recommend, otherwise return true.
    return false;
  }
}));
(2) MapReduce mode
String[] arg = {"-i", "p", "-o", "diff"};
SlopeOneAverageDiffsJob.main(arg);
DiffStorage ds = new FileDiffStorage(new File("diff"), 1000);
SlopeOneRecommender sr = new SlopeOneRecommender(new FileDataModel(new File("pereference")), Weighting.WEIGHTED, Weighting.WEIGHTED, ds);
System.out.println(sr.recommend(1, 10, new IDRescorer() {
  @Override
  public double rescore(long id, double originalScore) {
    int clickCount = 10; // click count of item id
    return originalScore * clickCount;
  }

  @Override
  public boolean isFiltered(long id) {
    // Return false if id belongs to the same category as the items to recommend, otherwise return true.
    return false;
  }
}));
摘要 本文主要介绍在部署cassandra集群以及使用cassandra过程中遇到的一些问题. 文章只发布在CSDN 和个人站点 更多nosql文章可以访问stone fang 个人主页 正文 Q1: ...