MapReduce Programming: A Recommender System
1 Collaborative filtering
Collaborative filtering is one of the most commonly used algorithms in today's recommender systems. It comes in two variants: user-based CF (user-CF) and item-based CF (item-CF).
The movie recommender in this post uses item-CF, mainly because the number of users is far larger than the number of movies, so the item matrix is much cheaper to build; in addition, item-based recommendations tend to be more convincing to the users of a movie recommender. This post therefore only briefly introduces user-CF and focuses on item-CF.
1.1 User-based collaborative filtering
a. Compute the pairwise similarity between users to obtain a user-similarity matrix;
b. Predict each user's preference with the formula:
p(u,i) = \sum_{v \in S(u,K) \cap N(i)} W_{uv} \cdot R_{vi}
where p(u,i) is user u's predicted interest in item i, S(u,K) is the set of the K users whose interests are closest to those of user u, N(i) is the set of users who have interacted with item i, W_{uv} is the interest similarity between users u and v, and R_{vi} is user v's interest in (rating of) item i.
c. Make recommendations according to the predicted preference scores.
1.2 Item-based collaborative filtering
1.2.1 Computing item similarity
Item similarity can be computed in several ways; here a co-occurrence matrix is used. The element in row m and column n represents the similarity between item m and item n: whenever a single user has watched both movie m and movie n, the similarity of m and n is incremented by 1.
The co-occurrence matrix is then normalized row by row; note that the normalized matrix is no longer symmetric:
w'_{mn} = w_{mn} / \sum_{k} w_{mk}
1.2.2 Predicting a user's rating for unwatched movies
The predicted rating is computed as:
p(u,i) = \sum_{j} w'_{ij} \cdot r_{uj}
where the sum runs over the movies j that user u has rated, w'_{ij} is the normalized co-occurrence of movies i and j, and r_{uj} is user u's rating of movie j.
The final prediction matrix can therefore be obtained by directly multiplying the normalized co-occurrence matrix with the rating matrix:
P = W' \times R

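As a quick sanity check with made-up numbers (two movies, one user who rated them 5 and 3), a row-normalized co-occurrence matrix times the rating vector gives:
\begin{pmatrix} 0.6 & 0.4 \\ 0.3 & 0.7 \end{pmatrix} \begin{pmatrix} 5 \\ 3 \end{pmatrix} = \begin{pmatrix} 0.6 \cdot 5 + 0.4 \cdot 3 \\ 0.3 \cdot 5 + 0.7 \cdot 3 \end{pmatrix} = \begin{pmatrix} 4.2 \\ 3.6 \end{pmatrix}
so the user's predicted scores for the two movies are 4.2 and 3.6.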
1.2.3 Recommendation
Based on the predicted ratings, take the top-k movies the user has not yet watched to form the recommendation list.
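The top-k selection itself is not implemented in the MapReduce jobs below. A minimal in-memory sketch, assuming the predictions for a single user have already been loaded into a map and that TopKSelector, predictions, and k are hypothetical names, could look like this:
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
public class TopKSelector {
    // Sketch only: pick the k movies with the highest predicted rating for one user.
    // 'predictions' maps movieId -> predicted rating; movies the user has already
    // watched are assumed to have been filtered out beforehand.
    public static List<String> topK(Map<String, Double> predictions, int k) {
        List<Map.Entry<String, Double>> entries =
                new ArrayList<Map.Entry<String, Double>>(predictions.entrySet());
        // sort by predicted rating, highest first
        Collections.sort(entries, new Comparator<Map.Entry<String, Double>>() {
            public int compare(Map.Entry<String, Double> a, Map.Entry<String, Double> b) {
                return Double.compare(b.getValue(), a.getValue());
            }
        });
        List<String> result = new ArrayList<String>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}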
2 MapReduce workflow
2.1 Input data format
Each input line has the form userID,movieID,rating

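A few illustrative lines (the IDs and ratings below are made up purely for the examples that follow):
1,10001,5.0
1,10002,3.0
2,10001,4.0
2,10003,2.5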
2.2 Overall flow
Five jobs are chained together (see Driver.java below): MR1 groups the ratings by user, MR2 builds the co-occurrence matrix, MR3 normalizes it row by row, MR4 multiplies the matrix cells, and MR5 sums the partial products into the final predicted ratings.
2.3 MR1
MR1 handles data preprocessing: it merges all records belonging to the same user.
The mapper splits each input line and emits the user ID as the key and movieID:rating as the value.
The reducer merges the values of one user into a single comma-separated list.

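For illustration, with the made-up sample input above the mapper turns 1,10001,5.0 into the key/value pair (1, 10001:5.0), and the reducer merges each user's records into one tab-separated output line:
1	10001:5.0,10002:3.0
2	10001:4.0,10003:2.5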
2.4 MR2
MR2 builds the co-occurrence matrix.
The mapper takes every ordered pair of movies watched by the same user (a movie paired with itself included) and emits the pair with a count of 1.
The reducer merges these counts, yielding one cell of the co-occurrence matrix per key (rowID:columnID).

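Continuing the made-up sample, user 1's list 10001:5.0,10002:3.0 makes the mapper emit (10001:10001, 1), (10001:10002, 1), (10002:10001, 1) and (10002:10002, 1); user 2 contributes the analogous pairs for 10001 and 10003. The reducer then sums the ones per key, so for example the cell 10001:10001 becomes 2 (both users watched movie 10001) while 10001:10002 stays 1.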
2.5 MR3
MR3 normalizes the co-occurrence matrix.
The mapper reads the co-occurrence-matrix cells produced by the previous job and sends each cell to the reducer keyed by row ID (normalization is done per row, so the row ID must be the key).
The reducer sums up a whole row, divides each original value by that sum to get the normalized value, and then writes each cell to HDFS keyed by column ID (writing by column prepares for the matrix multiplication that follows).
MR3's input and output are illustrated below.

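With the made-up counts from the sample, row 10001 of the co-occurrence matrix is {10001: 2, 10002: 1, 10003: 1}, so the mapper sends (10001, 10001:2), (10001, 10002:1) and (10001, 10003:1) to one reducer. The row sum is 4, and the reducer writes the normalized cells keyed by column:
10001	10001=0.5
10002	10001=0.25
10003	10001=0.25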
2.6 MR4
MR4 performs the cell-by-cell multiplication.
Mapper 1 reads the cells of the normalized co-occurrence matrix and emits them keyed by column ID (they were already stored by column, so it simply reads and forwards them).
Mapper 2 reads the raw-data file, i.e. every cell of the rating matrix, and emits each cell keyed by its row ID (the movie ID).
In the reducer, the values received for key x come from column x of the co-occurrence matrix and row x of the rating matrix. Recall that cell (i, j) of the final prediction matrix equals the co-occurrence cell (i, x) times the rating cell (x, j), summed over all x. Because each reducer gathers all cells sharing the same x from both matrices, every pair of them can be multiplied. The separators = and : are used to tell the two kinds of cells apart. After splitting them into the two groups, the reducer multiplies each co-occurrence cell by each rating cell and writes the product to HDFS, keyed by the resulting combination of prediction-matrix row and column.

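Using the same made-up sample, the reducer for key 10001 receives 10001=0.5, 10002=0.25 and 10003=0.25 from the normalized co-occurrence matrix and 1:5.0, 2:4.0 from the rating file. Multiplying them pairwise yields partial products such as:
1:10002	1.25   (user 1, movie 10002: 0.25 × 5.0)
2:10003	1.0    (user 2, movie 10003: 0.25 × 4.0)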
2.7 MR5
MR5 adds up the partial products: for each user:movie key it sums all the products produced by MR4, which gives the predicted rating.

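For example, if MR4 produced the partial products 1:10003	1.25 and 1:10003	0.75 for the same key (numbers made up), MR5 would output 1:10003	2.0, i.e. user 1's predicted rating for movie 10003.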
3 Main code
DataDividerByUser.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import java.io.IOException;
public class DataDividerByUser {
    public static class DataDividerMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        // map method
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // input: user,movie,rating
            String[] user_movie_rating = value.toString().split(",");
            int userId = Integer.parseInt(user_movie_rating[0]);
            String outPutKey = user_movie_rating[1] + ":" + user_movie_rating[2];
            // divide data by user
            context.write(new IntWritable(userId), new Text(outPutKey));
        }
    }

    public static class DataDividerReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        // reduce method
        @Override
        public void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            // merge data for one user
            for (Text value : values) {
                sb.append(value.toString());
                sb.append(",");
            }
            sb.deleteCharAt(sb.length() - 1);
            context.write(key, new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setMapperClass(DataDividerMapper.class);
        job.setReducerClass(DataDividerReducer.class);
        job.setJarByClass(DataDividerByUser.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        TextInputFormat.setInputPaths(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
CoOccurrenceMatrixGenerator.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import java.io.IOException;
public class CoOccurrenceMatrixGenerator {
    public static class MatrixGeneratorMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        // map method
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // value = userid \t movie1:rating,movie2:rating...
            String[] movie_rating = value.toString().split("\t")[1].split(",");
            // key = movie1:movie2, value = 1
            // emit every pair of movies in one user's rating list: <movieA, movieB>
            for (int i = 0; i < movie_rating.length; i++) {
                for (int j = 0; j < movie_rating.length; j++) {
                    String outPutKey = movie_rating[i].split(":")[0] + ":" + movie_rating[j].split(":")[0];
                    context.write(new Text(outPutKey), new IntWritable(1));
                }
            }
        }
    }

    public static class MatrixGeneratorReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        // reduce method
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // key = movie1:movie2, values = iterable<1, 1, 1>
            // count how many users watched both movies
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setMapperClass(MatrixGeneratorMapper.class);
        job.setReducerClass(MatrixGeneratorReducer.class);
        job.setJarByClass(CoOccurrenceMatrixGenerator.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        TextInputFormat.setInputPaths(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
Normalize.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
public class Normalize {
    public static class NormalizeMapper extends Mapper<LongWritable, Text, Text, Text> {
        // map method
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // input: movieA:movieB \t relation
            String movieA = value.toString().split("\t")[0].split(":")[0];
            String movieB = value.toString().split("\t")[0].split(":")[1];
            String relation = value.toString().split("\t")[1];
            // collect the relationship list for movieA
            context.write(new Text(movieA), new Text(movieB + ":" + relation));
        }
    }

    public static class NormalizeReducer extends Reducer<Text, Text, Text, Text> {
        // reduce method
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // key = movieA, value = <movieB:relation, movieC:relation...>
            // normalize each unit of the co-occurrence matrix
            Map<String, Double> map = new HashMap<String, Double>();
            double sum = 0;
            for (Text value : values) {
                String[] movie_relation = value.toString().split(":");
                map.put(movie_relation[0], Double.parseDouble(movie_relation[1]));
                sum += Double.parseDouble(movie_relation[1]);
            }
            for (Map.Entry<String, Double> entry : map.entrySet()) {
                String outputKey = entry.getKey();
                String outputValue = key.toString() + "=" + String.valueOf(entry.getValue() / sum);
                context.write(new Text(outputKey), new Text(outputValue));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setMapperClass(NormalizeMapper.class);
        job.setReducerClass(NormalizeReducer.class);
        job.setJarByClass(Normalize.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        TextInputFormat.setInputPaths(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
Multiplication.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
public class Multiplication {
    public static class CooccurrenceMapper extends Mapper<LongWritable, Text, Text, Text> {
        // map method
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // input: movieB \t movieA=relation
            // pass data to the reducer keyed by movieB (the column of the co-occurrence matrix)
            String[] movieB_movieARelation = value.toString().split("\t");
            context.write(new Text(movieB_movieARelation[0]), new Text(movieB_movieARelation[1]));
        }
    }

    public static class RatingMapper extends Mapper<LongWritable, Text, Text, Text> {
        // map method
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // input: user,movie,rating
            // pass data to the reducer keyed by movie (the row of the rating matrix)
            String[] user_movie_rating = value.toString().split(",");
            String outputValue = user_movie_rating[0] + ":" + user_movie_rating[2];
            context.write(new Text(user_movie_rating[1]), new Text(outputValue));
        }
    }

    public static class MultiplicationReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        // reduce method
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // key = movieB, values = <movieA=relation, movieC=relation..., userA:rating, userB:rating...>
            // collect the data for each movie, then do the multiplication
            Map<String, Double> coMap = new HashMap<String, Double>();
            Map<String, Double> ratingMap = new HashMap<String, Double>();
            for (Text value : values) {
                String s = value.toString();
                if (s.contains("=")) {
                    coMap.put(s.split("=")[0], Double.parseDouble(s.split("=")[1]));
                } else {
                    ratingMap.put(s.split(":")[0], Double.parseDouble(s.split(":")[1]));
                }
            }
            for (Map.Entry<String, Double> entry1 : coMap.entrySet()) {
                for (Map.Entry<String, Double> entry2 : ratingMap.entrySet()) {
                    double mult = entry1.getValue() * entry2.getValue();
                    // output key = user:movieA, value = partial product for the prediction matrix
                    String outputKey = entry2.getKey() + ":" + entry1.getKey();
                    context.write(new Text(outputKey), new DoubleWritable(mult));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(Multiplication.class);
        job.setReducerClass(MultiplicationReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        // each input path gets its own mapper, so ChainMapper/setMapperClass are not needed here
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, CooccurrenceMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, RatingMapper.class);
        TextOutputFormat.setOutputPath(job, new Path(args[2]));
        job.waitForCompletion(true);
    }
}
Sum.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import java.io.IOException;
/**
* Created by Michelle on 11/12/16.
*/
public class Sum {
    public static class SumMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        // map method
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // pass data to reducer
            String[] key_value = value.toString().split("\t");
            context.write(new Text(key_value[0]), new DoubleWritable(Double.parseDouble(key_value[1])));
        }
    }

    public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        // reduce method
        @Override
        public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            // user:movie \t relation
            // calculate the sum
            double sum = 0;
            for (DoubleWritable value : values) {
                sum += value.get();
            }
            context.write(key, new DoubleWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setMapperClass(SumMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setJarByClass(Sum.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        TextInputFormat.setInputPaths(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
Driver.java
public class Driver {
    public static void main(String[] args) throws Exception {
        String rawInput = args[0];
        String userMovieListOutputDir = args[1];
        String coOccurrenceMatrixDir = args[2];
        String normalizeDir = args[3];
        String multiplicationDir = args[4];
        String sumDir = args[5];

        String[] path1 = {rawInput, userMovieListOutputDir};
        String[] path2 = {userMovieListOutputDir, coOccurrenceMatrixDir};
        String[] path3 = {coOccurrenceMatrixDir, normalizeDir};
        String[] path4 = {normalizeDir, rawInput, multiplicationDir};
        String[] path5 = {multiplicationDir, sumDir};

        DataDividerByUser.main(path1);
        CoOccurrenceMatrixGenerator.main(path2);
        Normalize.main(path3);
        Multiplication.main(path4);
        Sum.main(path5);
    }
}