MapReduce Example: Content-Based Recommendation (Part 1)

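Environment:
Hadoop 1.x on CentOS 6.5, running as a simulated distributed cluster on three virtual machines.

Data: the Amazon product co-purchasing network metadata, downloaded from http://snap.stanford.edu/data/amazon-meta.html (a proxy may be needed to reach the site).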

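Goal:

Extract from the data which products each user has bought, then recommend products to users based on what they bought and on how the products relate to one another.

The downloaded data is organized in records like the following: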

Id: 1
ASIN: 0827229534
title: Patterns of Preaching: A Sermon Sampler
group: Book
salesrank: 396585
similar: 5 0804215715 156101074X 0687023955 0687074231 082721619X
categories: 2
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]
|Clergy[12360]|Preaching[12368]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]
|Clergy[12360]|Sermons[12370]
reviews: total: 2 downloaded: 2 avg rating: 5
2000-7-28 cutomer: A2JW67OY8U6HHK rating: 5 votes: 10 helpful: 9
2003-12-14 cutomer: A2VE83MZF98ITY rating: 5 votes: 6 helpful: 5
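
Each review line carries the customer ID that the rest of this post is built around. A minimal sketch of pulling those fields out with a regular expression is shown here; it uses the same pattern as the record reader later in this post, while the class name `ReviewLineDemo` and the `parse` helper are made up for illustration. Note that the pattern starts with `\s+`, so it assumes review lines are indented in the raw file, and that "cutomer" is the spelling used in the data itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReviewLineDemo {
    // Same pattern as in TestAmazonDataReader; "cutomer" is the data set's own spelling.
    static final Pattern REVIEW = Pattern.compile(
            "\\s+([^\\s]+)\\s+cutomer:\\s+([^\\s]+)\\s+rating:\\s+([^\\s]+)\\s+votes:\\s+([^\\s]+)\\s+helpful:\\s+([^\\s]+).*");

    /** Returns "customer|rating|votes|helpful", or null if the line is not a review line. */
    static String parse(String line) {
        Matcher m = REVIEW.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return m.group(2) + "|" + m.group(3) + "|" + m.group(4) + "|" + m.group(5);
    }

    public static void main(String[] args) {
        // An indented review line, as the pattern expects.
        System.out.println(parse("    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9"));
        // A non-review line yields null.
        System.out.println(parse("title: Patterns of Preaching: A Sermon Sampler"));
    }
}
```

Group 1 of the pattern is the review date, which is matched but not kept.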

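Approach:

The program breaks down into two steps: 1. Extract the products each user has bought. 2. Using the data produced by step 1, combine each user's level of interest with the relationships between products to generate recommendations.

This post covers the first step.

The main difficulty in this step is writing the custom input format.

1. Custom input format

Records like the one shown above need a custom input format to be extracted.

job.setInputFormatClass(TestAmazonDataFormat.class);

So how do we write a custom input format?

We first need to understand how files are handled in HDFS. A file placed in HDFS is stored as blocks, and a MapReduce job processes it as input splits. To operate on the data we therefore need the split's information: file name, path, start offset, length, which nodes hold it, and so on.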
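
To make the idea of a split concrete, here is a simplified sketch of how a file of a given length can be cut into (start, length) ranges. It is not Hadoop's actual code: `SplitSketch` and `computeSplits` are hypothetical names, the 64 MB split size is just an example value, and Hadoop's own split calculation applies further rules (such as minimum and maximum split sizes) that are ignored here.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitSketch {
    /** Returns {start, length} pairs that together cover a file of fileLength bytes. */
    static List<long[]> computeSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            // The last split may be shorter than splitSize.
            splits.add(new long[] {start, Math.min(splitSize, fileLength - start)});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Example: a 150 MB file cut with a 64 MB split size.
        for (long[] s : computeSplits(150 * mb, 64 * mb)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

A record reader is handed one such range at a time. The reader in this post opens the file named by the split and reads it from the beginning, so it does not use the start offset or the length.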

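Passing in the file information:

package ren.snail;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Input format that hands each split to the custom record reader
public class TestAmazonDataFormat extends FileInputFormat<Text, Text> {
    // The reader always reads its file from the beginning, so keep each file in a single split.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit inputSplit, TaskAttemptContext attempt)
            throws IOException, InterruptedException {
        // The framework calls initialize(inputSplit, attempt) on the returned reader;
        // that call is where the file information is passed in.
        return new TestAmazonDataReader();
    }
}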

Reading the file:

package ren.snail;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * @author Srinath Perera (hemapani@apache.org)
 */
public class TestAmazonDataReader extends RecordReader<Text, Text> {
    // Matches an indented review line; "cutomer" is the spelling used in the data set.
    private static Pattern pattern1 = Pattern.compile(
            "\\s+([^\\s]+)\\s+cutomer:\\s+([^\\s]+)\\s+rating:\\s+([^\\s]+)\\s+votes:\\s+([^\\s]+)\\s+helpful:\\s+([^\\s]+).*");
    private BufferedReader reader;
    private int count = 0;
    private Text key;
    private Text value;
    private StringBuffer currentLineData = new StringBuffer();
    String line = null;

    public TestAmazonDataReader() {
    }

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext attempt) throws IOException, InterruptedException {
        Path path = ((FileSplit) inputSplit).getPath();
        // Note: FileSystem.get(conf) returns the default file system, which can be the local
        // one (file:///), while the split's path is on HDFS (hdfs://XXXXX). Resolve the
        // FileSystem from the path's URI so that fs.open(path) works.
        FileSystem fs = FileSystem.get(URI.create(path.toString()), attempt.getConfiguration());
        // FileSystem fs = FileSystem.get(attempt.getConfiguration());
        FSDataInputStream fsStream = fs.open(path);
        reader = new BufferedReader(new InputStreamReader(fsStream), 1024 * 100);
        // Skip the file header, up to and including the first "Id:" line.
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("Id:")) {
                break;
            }
        }
    }

    // Builds the next key (the ASIN) and value (one record flattened into "name=value#" fields)
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        currentLineData = new StringBuffer();
        count++;
        boolean readingreview = false;
        while ((line = reader.readLine()) != null) {
            if (line.trim().length() == 0) {
                // A blank line ends the current record.
                value = new Text(currentLineData.toString());
                return true;
            } else {
                if (readingreview) {
                    Matcher matcher = pattern1.matcher(line);
                    if (matcher.matches()) {
                        // review=<customer>|<rating>|<votes>|<helpful>#
                        currentLineData.append("review=").append(matcher.group(2)).append("|")
                                .append(matcher.group(3)).append("|")
                                .append(matcher.group(4)).append("|")
                                .append(matcher.group(5)).append("#");
                    } else {
                        System.out.println("review " + line + " does not match");
                    }
                } else {
                    int indexOf = line.indexOf(":");
                    if (indexOf > 0) {
                        String key = line.substring(0, indexOf).trim();
                        String value = line.substring(indexOf + 1).trim();
                        if (value.length() == 0) {
                            continue;
                        }
                        // '#' separates fields in the output, so replace it inside values.
                        value = value.replaceAll("#", "_");
                        if (key.equals("ASIN") || key.equals("Id") || key.equals("title") || key.equals("group") || key.equals("salesrank")) {
                            if (key.equals("ASIN")) {
                                this.key = new Text(value);
                            }
                            currentLineData.append(key).append("=").append(value.replaceAll(",", "")).append("#");
                        } else if (key.equals("similar")) {
                            String[] tokens = value.split("\\s+");
                            // Skip the first token: it is the number of similar items.
                            if (tokens.length >= 2) {
                                currentLineData.append(key).append("=");
                                for (int i = 1; i < tokens.length; i++) {
                                    currentLineData.append(tokens[i].trim()).append("|");
                                }
                                currentLineData.append("#");
                            }
                        } else if (key.equals("reviews")) {
                            // Every following line of this record is a review line.
                            readingreview = true;
                        }
                    }
                }
            }
        }
        // End of file: emit the last record if the file does not end with a blank line.
        if (currentLineData.length() > 0) {
            value = new Text(currentLineData.toString());
            return true;
        }
        return false;
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        // Note: this returns the number of records read so far,
        // whereas Hadoop expects a fraction between 0.0 and 1.0.
        return count;
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}
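
The core idea of the reader, independent of Hadoop, is that a blank line ends a record. The following plain-Java sketch shows only that grouping step, assuming the input is already in memory as a string; `RecordGroupingSketch` and `readRecords` are hypothetical names, and the field filtering done by the real reader is left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class RecordGroupingSketch {
    /** Splits text into records, one per run of non-blank lines; lines are joined with '#'. */
    static List<String> readRecords(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    // A blank line closes the record collected so far.
                    if (current.length() > 0) {
                        records.add(current.toString());
                        current.setLength(0);
                    }
                } else {
                    current.append(line.trim()).append('#');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Keep the last record even when the text does not end with a blank line.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "Id: 1\nASIN: 0827229534\n\nId: 2\nASIN: 0738700797\n";
        for (String record : readRecords(data)) {
            System.out.println(record);
        }
    }
}
```

In the MapReduce version, each such record becomes one value handed to the mapper, with the record's ASIN as the key.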

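Map and Reduce

The Map code calls methods that parse the Amazon metadata (the AmazonCustomer class); they are not listed here. They simply parse the value passed in by the input format.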

package ren.snail;

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.List;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import ren.snail.AmazonCustomer.ItemData;

/**
 * Groups the items bought by each customer.
 * @author Srinath Perera (hemapani@apache.org)
 */
public class Main extends Configured implements Tool {
    public static SimpleDateFormat dateFormatter = new SimpleDateFormat("EEEE dd MMM yyyy hh:mm:ss z");

    // Emits one (customerID, customer record) pair for each AmazonCustomer parsed from an item record.
    public static class AMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // System.out.println(key + "=" + value);
            try {
                List<AmazonCustomer> customerList = AmazonCustomer.parseAItemLine(value.toString());
                for (AmazonCustomer customer : customerList) {
                    context.write(new Text(customer.customerID), new Text(customer.toString()));
                    // System.out.println(customer.customerID + "=" + customer.toString());
                }
            } catch (Exception e) {
                e.printStackTrace();
                System.out.println("Error:" + e.getMessage());
            }
        }
    }

    // Merges all items of one customer and writes (number of items, customer record).
    public static class AReducer extends Reducer<Text, Text, IntWritable, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            AmazonCustomer customer = new AmazonCustomer();
            customer.customerID = key.toString();
            for (Text value : values) {
                Set<ItemData> itemsBrought = new AmazonCustomer(value.toString()).itemsBrought;
                for (ItemData itemData : itemsBrought) {
                    customer.itemsBrought.add(itemData);
                }
            }
            // if (customer.itemsBrought.size() > 5) {
            context.write(new IntWritable(customer.itemsBrought.size()), new Text(customer.toString()));
            // }
        }
    }

    public static void main(String[] args) throws Exception {
        int result = ToolRunner.run(new Configuration(), new Main(), args);
        System.exit(result);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        Configuration configuration = getConf();
        Job job = new Job(configuration, "MostFrequentUserFinder");
        job.setJarByClass(Main.class);
        job.setMapperClass(AMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        // Uncomment this to
        // job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(AReducer.class);
        job.setInputFormatClass(TestAmazonDataFormat.class);
        FileInputFormat.addInputPath(job, new Path(arg0[0]));
        FileOutputFormat.setOutputPath(job, new Path(arg0[1]));
        // Return the exit code and let main() call System.exit, as ToolRunner expects.
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
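
The mapper relies on AmazonCustomer.parseAItemLine, which this post does not list. As a rough idea of what such parsing involves, here is a hypothetical stand-in (not the author's class) that takes one value produced by the record reader, in its `name=value#` format, and returns one customer/ASIN pair per review. The names `ItemLineSketch` and `customerItemPairs` and the tab-separated output are assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

final class ItemLineSketch {
    /** Returns "customerID<TAB>ASIN" strings, one per review field in the record. */
    static List<String> customerItemPairs(String record) {
        String asin = null;
        List<String> customers = new ArrayList<>();
        // Fields are separated by '#'; each field has the form name=value.
        for (String field : record.split("#")) {
            int eq = field.indexOf('=');
            if (eq <= 0) {
                continue;
            }
            String name = field.substring(0, eq);
            String val = field.substring(eq + 1);
            if (name.equals("ASIN")) {
                asin = val;
            } else if (name.equals("review")) {
                // review=<customer>|<rating>|<votes>|<helpful>
                customers.add(val.split("\\|")[0]);
            }
        }
        List<String> pairs = new ArrayList<>();
        if (asin != null) {
            for (String customer : customers) {
                pairs.add(customer + "\t" + asin);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String record = "ASIN=0827229534#title=Patterns of Preaching: A Sermon Sampler#group=Book"
                + "#salesrank=396585#similar=0804215715|156101074X|"
                + "#review=A2JW67OY8U6HHK|5|10|9#review=A2VE83MZF98ITY|5|6|5#";
        for (String pair : customerItemPairs(record)) {
            System.out.println(pair);
        }
    }
}
```

The real class evidently keeps more than this (title, sales rank, rating and similar items all appear in the job's output), so treat the sketch only as an outline of the field splitting.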

The final output looks like this:

customerID=A11NCO6YTE4BTJ,review=ASIN=0738700797#title=Candlemas: Feast of Flames#salesrank=168596#group=Book#rating=5#similar=0738700827|1567184960|1567182836|0738700525|0738700940|,
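
What the shuffle and the reducer do together can be simulated in plain Java: every (customer, item) pair is grouped by customer, and duplicate items collapse because they are collected in a set. This is only an in-memory sketch under that assumption; `GroupByCustomerSketch` and `group` are hypothetical names, and the pairs in `main` reuse IDs that appear earlier in this post.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class GroupByCustomerSketch {
    /** Groups {customerID, ASIN} pairs into customerID -> set of ASINs. */
    static Map<String, Set<String>> group(List<String[]> pairs) {
        Map<String, Set<String>> byCustomer = new TreeMap<>();
        for (String[] pair : pairs) {
            // A set keeps each item once per customer, like itemsBrought in the reducer.
            byCustomer.computeIfAbsent(pair[0], k -> new TreeSet<>()).add(pair[1]);
        }
        return byCustomer;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[] {"A2JW67OY8U6HHK", "0827229534"},
                new String[] {"A2VE83MZF98ITY", "0827229534"},
                new String[] {"A11NCO6YTE4BTJ", "0738700797"},
                new String[] {"A2JW67OY8U6HHK", "0827229534"}); // duplicate pair
        for (Map.Entry<String, Set<String>> e : group(pairs).entrySet()) {
            // Like the job's output: number of items first, then the customer and items.
            System.out.println(e.getValue().size() + "\t" + e.getKey() + " " + e.getValue());
        }
    }
}
```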
