Combining datasets with MapReduce

While combining two or more datasets is very easy in the SQL world (we just need to use the JOIN keyword), with MapReduce things become a little harder. Let's get into it.
Suppose we have two distinct datasets, one for users of a forum and the other for the posts in the forum (data is in TSV - Tab Separated Values - format).
Users dataset:

id   name  reputation
0102 alice 32
0511 bob   27
...

Posts dataset:

id      type      subject   body                                   userid
0028391 question  test      "Hi, what is.."                        0102
0073626 comment   bug       "Guys, I've found.."                   0511
0089234 comment   bug       "Nope, it's not that way.."            0734
0190347 answer    info      "In my opinion it's worth the time.."  1932
...

What we'd like to do is combine the reputation of each user with the number of questions he/she posted, to see whether we can relate one to the other.

The main idea behind combining the two datasets is to leverage the shuffle and sort phase: this process groups together values with the same key, so if we define the user id as the key, we can send to the reducer both the user's reputation and the number of his/her questions, because they're attached to the same key (the user id).
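To make the grouping concrete, here is a minimal sketch in plain Java (no Hadoop involved) of what the shuffle and sort phase does with the pairs emitted by the mappers. The class name `ShuffleSketch`, the method `shuffle` and the string values `"rep:32"` and `"question"` are made up for this illustration; they are not part of the Hadoop API or of the job described here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: it only mimics the grouping
// performed by the shuffle and sort phase.
class ShuffleSketch {

    // Collects all the values that share the same key.
    // A TreeMap keeps the keys sorted, as the reducer would see them.
    static Map<String, List<String>> shuffle(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // pairs as they could be emitted while reading both files (illustrative values)
        List<Map.Entry<String, String>> pairs = List.of(
                Map.entry("0511", "question"),  // from the posts file
                Map.entry("0102", "rep:32"),    // from the users file
                Map.entry("0102", "question"),  // from the posts file
                Map.entry("0102", "question"));
        // prints {0102=[rep:32, question, question], 0511=[question]}
        System.out.println(shuffle(pairs));
    }
}
```

Each key ends up with one list holding values that came from both files, which is exactly what a reducer receives. Note that in a real job Hadoop does not guarantee the order of the values inside a group, so the reducer must not rely on it.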
Let's see how.
We start with the mapper:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public static class JoinMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

        // gets filename of the input file for this record
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String filename = fileSplit.getPath().getName();

        // creates an array with all the fields of the row we're reading now
        String[] fields = value.toString().split("\t");

        // if we're reading the posts file
        if (filename.equals("forum_nodes_no_lf.tsv")) {

            // retrieves the type of the post and, for questions only, the author ID
            String type = fields[1];
            if (type.equals("question")) {
                String authorId = fields[4];
                context.write(new Text(authorId), one);
            }
        }
        // if we're reading the users file
        else {
            String authorId = fields[0];
            String reputation = fields[2];

            // we add two to the reputation, because we want the minimum value to be greater than 1,
            // not to be confused with the "one" passed by the other branch of the if
            int reputationValue = Integer.parseInt(reputation) + 2;
            context.write(new Text(authorId), new IntWritable(reputationValue));
        }
    }
}

First of all, this code assumes that the directory Hadoop is looking in for data contains two files: the users file and the posts file. We use the FileSplit class to get the name of the file Hadoop is currently reading, so we know whether we're dealing with the users file or the posts file. From here, things get a little trickier. For every user, we pass to the reducer a "1" for every question he/she posted on the forum; since we also want to pass the reputation of the user (which can itself be "0" or "1"), we have to be careful not to mix up the two kinds of values. To do this, we add 2 to the reputation, so that, even if it is "0", the value passed to the reducer will be greater than or equal to two (this relies on reputations never being negative). In this way, we know that when the reducer receives a "1" it counts a question posted on the forum, while when it receives a value greater than "1" it is the reputation of the user.
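The per-line decision made by the mapper can also be tried without a cluster. The following is a minimal sketch in plain Java that reproduces the same branching and the same "+2" encoding; `JoinMapperSketch`, `mapLine` and `REPUTATION_OFFSET` are hypothetical names introduced only for this example, and `users.tsv` is an assumed name for the users file (the mapper treats any file other than the posts file as the users file).

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the mapper logic, with no Hadoop dependency.
class JoinMapperSketch {

    static final String POSTS_FILE = "forum_nodes_no_lf.tsv";
    // offset added to the reputation so that it can never be mistaken for the "1" of a question
    static final int REPUTATION_OFFSET = 2;

    // Returns the (user id, value) pair the mapper would emit for this line,
    // or an empty Optional when the line produces no output.
    static Optional<Map.Entry<String, Integer>> mapLine(String filename, String line) {
        String[] fields = line.split("\t");
        if (filename.equals(POSTS_FILE)) {
            // posts file: only questions are counted, keyed by the author id
            if (fields[1].equals("question")) {
                return Optional.of(Map.entry(fields[4], 1));
            }
            return Optional.empty();
        }
        // users file: emit the reputation shifted by the offset
        int reputation = Integer.parseInt(fields[2]);
        return Optional.of(Map.entry(fields[0], reputation + REPUTATION_OFFSET));
    }

    public static void main(String[] args) {
        // prints Optional[0102=34]
        System.out.println(mapLine("users.tsv", "0102\talice\t32"));
        // prints Optional[0102=1]
        System.out.println(mapLine(POSTS_FILE, "0028391\tquestion\ttest\t\"Hi, what is..\"\t0102"));
        // prints Optional.empty
        System.out.println(mapLine(POSTS_FILE, "0073626\tcomment\tbug\t\"Guys, I've found..\"\t0511"));
    }
}
```

Like the mapper itself, this sketch would fail with a NumberFormatException on a header row such as "id name reputation", so it assumes the files contain data rows only.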
Let's now look at the reducer:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class JoinReducer extends Reducer<Text, IntWritable, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        int postsNumber = 0;
        int reputation = 0;
        String authorId = key.toString();

        for (IntWritable value : values) {
            int intValue = value.get();
            if (intValue == 1) {
                // a "1" stands for one question posted by this user
                postsNumber++;
            }
            else {
                // we subtract two for having the exact reputation value (see the mapper)
                reputation = intValue - 2;
            }
        }

        // output: user id, reputation and number of questions, tab separated
        context.write(new Text(authorId), new Text(reputation + "\t" + postsNumber));
    }
}

As stated before, the reducer will now receive two kinds of data: a "1" for every question posted by the user, and a value greater than one for the reputation. The code in the reducer checks exactly this: if it receives a "1" it increases the number of posts of this user, otherwise it sets his/her reputation. At the end of the method, we tell the reducer to output the authorId, his/her reputation and how many questions he/she has posted on the forum:

userid  reputation  posts#
0102    55          23
0511    05          11
0734    00          89
1932    19          32
...

and we're ready to analyze these data.
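The decoding done by the reducer can be checked in isolation as well. This is a minimal sketch in plain Java of the same loop, assuming the values for one user have already been grouped; `JoinReducerSketch` and its `reduce` method are hypothetical names used only here, not Hadoop classes.

```java
import java.util.List;

// Hypothetical stand-in for the reducer logic, with no Hadoop dependency.
class JoinReducerSketch {

    // must match the offset added by the mapper
    static final int REPUTATION_OFFSET = 2;

    // Builds the output line for one user from the values grouped under his/her id.
    static String reduce(String authorId, List<Integer> values) {
        int postsNumber = 0;
        int reputation = 0;
        for (int value : values) {
            if (value == 1) {
                // a "1" counts one question
                postsNumber++;
            } else {
                // anything else is the reputation shifted by the offset
                reputation = value - REPUTATION_OFFSET;
            }
        }
        return authorId + "\t" + reputation + "\t" + postsNumber;
    }

    public static void main(String[] args) {
        // user 0102: reputation 32 (encoded as 34) and two questions
        // prints 0102, 32 and 2 separated by tabs
        System.out.println(reduce("0102", List.of(1, 34, 1)));
    }
}
```

The sketch makes two limits of the encoding visible: a user id that appears only in the posts file is reported with reputation 0, and a reputation of -1 would be encoded as 1 and counted as a question. If either case can occur in the data, tagging each value with its source instead of using an offset would be a safer choice.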

from: http://andreaiacono.blogspot.com/2014/09/combining-datasets-with-mapreduce.html
