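Reduce Side Join Implementation
The key to a reduce side join is using the MultipleInputs.addInputPath API to assign a different mapper to each table. Each mapper tags its records with an identifier for its table, and the reducer then uses that tag to tell which table each record came from.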
Reduce Side Join Example
User and comment join
In this example, we’ll be using the users and comments tables from the StackOverflow dataset. Storing data in this manner makes sense, as storing repetitive user data with each comment is unnecessary. This would also make updating user information difficult. However, having disjoint data sets poses problems when it comes to associating a comment with the user who wrote it. Through the use of a reduce side join, these two data sets can be merged together using the user ID as the foreign key. In this example, we’ll perform an inner join, an outer join, and an antijoin. The choice of which join to execute is set during job configuration.
Hadoop supports the ability to use multiple input data types at once, allowing you to create a mapper class and input format for each input split from different data sources. This is extremely helpful, because you don’t have to code logic for two different data inputs in the same map implementation. In the following example, two mapper classes are created: one for the user data and one for the comments. Each mapper class outputs the user ID as the foreign key, and the entire record as the value along with a single character to flag which record came from what set. The reducer then copies all values for each group into memory, keeping track of which record came from what data set. The records are then joined together and output.
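The tagging and grouping described above can be sketched in plain Java without a cluster. This is only an illustration under stated assumptions: the class TaggedShuffleSketch, its methods, and the sample rows are hypothetical, and a TreeMap stands in for the shuffle that Hadoop performs between the map and reduce phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of the map and shuffle phases of a reduce side join (no Hadoop required). */
class TaggedShuffleSketch {

    /** Stand-in for the user mapper: the key is the user's Id, the value is the record tagged with 'A'. */
    static void mapUser(String id, String record, Map<String, List<String>> shuffle) {
        shuffle.computeIfAbsent(id, k -> new ArrayList<>()).add("A" + record);
    }

    /** Stand-in for the comment mapper: the key is the comment's UserId, the value is tagged with 'B'. */
    static void mapComment(String userId, String record, Map<String, List<String>> shuffle) {
        shuffle.computeIfAbsent(userId, k -> new ArrayList<>()).add("B" + record);
    }

    /** Runs both "mappers" over a few hypothetical rows and returns the values grouped by join key. */
    static Map<String, List<String>> run() {
        Map<String, List<String>> shuffle = new TreeMap<>();
        mapUser("1", "<row Id=\"1\" DisplayName=\"alice\" />", shuffle);
        mapUser("2", "<row Id=\"2\" DisplayName=\"bob\" />", shuffle);
        mapComment("1", "<row Id=\"10\" UserId=\"1\" Text=\"hi\" />", shuffle);
        mapComment("1", "<row Id=\"11\" UserId=\"1\" Text=\"bye\" />", shuffle);
        mapComment("3", "<row Id=\"12\" UserId=\"3\" Text=\"orphan\" />", shuffle);
        return shuffle;
    }

    public static void main(String[] args) {
        // Each line shows one reduce group: the join key and every tagged record that shares it
        run().forEach((key, values) -> System.out.println(key + " -> " + values));
    }
}
```

Key 1 ends up with one 'A' record and two 'B' records, key 2 has only a user, and key 3 has only a comment. These three shapes are the cases the join types have to handle. In a real job the order of values within a group is not guaranteed, which is why every record carries its own tag.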
The following descriptions of each code section explain the solution to the problem.
Problem: Given a set of user information and a list of users’ comments, enrich each comment with the information about the user who created the comment.
Driver code. The job configuration is slightly different from the standard configuration due to the use of the multiple input utility. We also set the join type in the job configuration to args[2] so it can be used in the reducer. The relevant piece of the driver code that uses MultipleInputs follows:
...
// Use MultipleInputs to set which input uses what mapper.
// This keeps the parsing of each data set separate from a logical standpoint.
// The first two elements of the args array are the two inputs.
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, UserJoinMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, CommentJoinMapper.class);
// Pass the join type to the reducer through the job configuration
job.getConfiguration().set("join.type", args[2]);
...
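The driver above passes args[2] to the configuration unchecked, so a misspelled join type would only show up as an empty job output. One possible safeguard is to validate the string in the driver first. The sketch below is an assumption rather than part of the original code: the JoinType enum and its methods are hypothetical names, and only the five join type strings come from the reducer.

```java
import java.util.Locale;

/** Hypothetical helper: validates the join type string before it is stored in the job configuration. */
enum JoinType {
    INNER, LEFTOUTER, RIGHTOUTER, FULLOUTER, ANTI;

    /** Parses a command-line value such as "inner" or "LeftOuter", ignoring case and surrounding spaces. */
    static JoinType parse(String raw) {
        if (raw == null) {
            throw new IllegalArgumentException("Join type is missing");
        }
        try {
            return valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Unknown join type: " + raw
                    + " (expected inner, leftouter, rightouter, fullouter, or anti)");
        }
    }

    /** The lowercase name that the reducer compares against with equalsIgnoreCase. */
    String configValue() {
        return name().toLowerCase(Locale.ROOT);
    }
}
```

With such a helper, the driver line could become job.getConfiguration().set("join.type", JoinType.parse(args[2]).configValue()); so that an invalid argument fails before the job is submitted.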
User mapper code. This mapper parses each input line of user data XML. It grabs the user ID associated with each record and outputs it along with the entire input value. It prepends the letter A to the entire value. This allows the reducer to know which values came from what data set.
public static class UserJoinMapper extends Mapper<Object, Text, Text, Text> {
    private Text outkey = new Text();
    private Text outvalue = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Parse the input string into a nice map
        Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
        String userId = parsed.get("Id");
        // The foreign join key is the user ID
        outkey.set(userId);
        // Flag this record for the reducer and then output
        outvalue.set("A" + value.toString());
        context.write(outkey, outvalue);
    }
}
When you output the value from the map side, the entire record doesn’t have to be sent. This is an opportunity to optimize the join by keeping only the fields of data you want to join together. It requires more processing on the map side, but is worth it in the long run. Also, since the foreign key is in the map output key, you don’t need to keep that in the value, either.
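A minimal sketch of that optimization, assuming the mapper keeps only a couple of attributes and joins them with tabs. The class ProjectionSketch and both of its methods are hypothetical; parseAttributes is a simplified regular-expression stand-in for MRDPUtils.transformXmlToMap and does not unescape XML entities.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: shrink the map output value to the fields the join actually needs. */
class ProjectionSketch {
    // Matches name="value" pairs inside a single XML row
    private static final Pattern ATTRIBUTE = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    /** Simplified stand-in for an XML-to-map helper: collects the attributes of one row. */
    static Map<String, String> parseAttributes(String xmlRow) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = ATTRIBUTE.matcher(xmlRow);
        while (m.find()) {
            attrs.put(m.group(1), m.group(2));
        }
        return attrs;
    }

    /** Builds the tagged output value from only the requested fields, separated by tabs. */
    static String project(char tag, Map<String, String> attrs, String... fields) {
        StringBuilder sb = new StringBuilder().append(tag);
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append('\t');
            }
            // A missing attribute becomes an empty column so the positions stay fixed
            sb.append(attrs.getOrDefault(fields[i], ""));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String row = "<row Id=\"1\" Reputation=\"120\" DisplayName=\"alice\" AboutMe=\"long text\" />";
        // The Id is left out of the value because it already travels as the map output key
        System.out.println(project('A', parseAttributes(row), "DisplayName", "Reputation"));
    }
}
```

The trade-off is that the reducer must then know the column order for each tag, instead of receiving self-describing XML.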
Comment mapper code. This mapper parses each input line of comment XML. Very similar to the UserJoinMapper, it too grabs the user ID associated with each record and outputs it along with the entire input value. The only difference here is that the XML attribute UserId represents the user that posted the comment, whereas Id in the user data set is the user ID. Here, this mapper prepends the letter B to the entire value.
public static class CommentJoinMapper extends Mapper<Object, Text, Text, Text> {
    private Text outkey = new Text();
    private Text outvalue = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Parse the input string into a nice map
        Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
        // The foreign join key is the user ID
        outkey.set(parsed.get("UserId"));
        // Flag this record for the reducer and then output
        outvalue.set("B" + value.toString());
        context.write(outkey, outvalue);
    }
}
Reducer code. The reducer code iterates through all the values of each group, looks at what each record is tagged with, and then puts the record in one of two lists. After all values are binned in either list, the actual join logic is executed using the two lists. The join logic differs slightly based on the type of join, but always involves iterating through both lists and writing to the Context object. The type of join is pulled from the job configuration in the setup method. Let’s look at the main reduce method before looking at the join logic.
public static class UserJoinReducer extends Reducer<Text, Text, Text, Text> {
    private static final Text EMPTY_TEXT = new Text("");
    private ArrayList<Text> listA = new ArrayList<Text>();
    private ArrayList<Text> listB = new ArrayList<Text>();
    private String joinType = null;

    @Override
    public void setup(Context context) {
        // Get the type of join from our configuration
        joinType = context.getConfiguration().get("join.type");
    }

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Clear our lists
        listA.clear();
        listB.clear();
        // Iterate through all our values, binning each record based on what
        // it was tagged with. Make sure to remove the tag!
        for (Text value : values) {
            if (value.charAt(0) == 'A') {
                listA.add(new Text(value.toString().substring(1)));
            } else if (value.charAt(0) == 'B') {
                listB.add(new Text(value.toString().substring(1)));
            }
        }
        // Execute our join logic now that the lists are filled
        executeJoinLogic(context);
    }

    private void executeJoinLogic(Context context)
            throws IOException, InterruptedException {
        ...
    }
}
The input data types to the reducer are two Text objects. The input key is the foreign join key, which in this example is the user’s ID. The input values associated with the foreign key contain one record from the “users” data set tagged with ‘A’, as well as all the comments the user posted tagged with ‘B’. Any type of data formatting you would want to perform should be done here prior to outputting. For simplicity, the raw XML value from the left data set (users) is output as the key and the raw XML value from the right data set (comments) is output as the value.
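The binning step can be tried outside Hadoop with plain strings. The sketch below is an illustration under assumptions: BinningSketch and its bin method are hypothetical names, and String replaces Text. In the actual reducer each value is copied into a new Text because Hadoop reuses the same Text instance while iterating over a group's values; with immutable strings that copy is not needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: split the tagged values of one reduce group into a left list and a right list. */
class BinningSketch {
    final List<String> listA = new ArrayList<>(); // records tagged 'A' (users)
    final List<String> listB = new ArrayList<>(); // records tagged 'B' (comments)

    /** Clears both lists, then bins every value by its first character and strips that tag. */
    void bin(Iterable<String> taggedValues) {
        listA.clear();
        listB.clear();
        for (String value : taggedValues) {
            if (value.isEmpty()) {
                continue; // nothing to inspect, so the record cannot be binned
            }
            String record = value.substring(1);
            if (value.charAt(0) == 'A') {
                listA.add(record);
            } else if (value.charAt(0) == 'B') {
                listB.add(record);
            }
            // Values with any other tag are ignored, as in the reducer
        }
    }

    public static void main(String[] args) {
        BinningSketch sketch = new BinningSketch();
        sketch.bin(Arrays.asList("Bcomment-1", "Aalice", "Bcomment-2"));
        System.out.println("A = " + sketch.listA + ", B = " + sketch.listB);
    }
}
```

Both lists live in memory for the duration of one reduce call, so a join key with a very large number of records on either side is the main scaling risk of this approach.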
Next, let’s look at each of the join types. First up is an inner join. If both the lists are not empty, simply perform two nested for loops and join each of the values together.
if (joinType.equalsIgnoreCase("inner")) {
    // If both lists are not empty, join A with B
    if (!listA.isEmpty() && !listB.isEmpty()) {
        for (Text A : listA) {
            for (Text B : listB) {
                context.write(A, B);
            }
        }
    }
}...
For a left outer join, if the right list is not empty, join A with B. If the right list is empty, output each record of A with an empty string.
... else if (joinType.equalsIgnoreCase("leftouter")) {
    // For each entry in A,
    for (Text A : listA) {
        // If list B is not empty, join A and B
        if (!listB.isEmpty()) {
            for (Text B : listB) {
                context.write(A, B);
            }
        } else {
            // Else, output A by itself
            context.write(A, EMPTY_TEXT);
        }
    }
}...
A right outer join is very similar, except that the check for an empty list switches from B to A. If the left list is empty, write records from B with an empty output key.
... else if (joinType.equalsIgnoreCase("rightouter")) {
    // For each entry in B,
    for (Text B : listB) {
        // If list A is not empty, join A and B
        if (!listA.isEmpty()) {
            for (Text A : listA) {
                context.write(A, B);
            }
        } else {
            // Else, output B by itself
            context.write(EMPTY_TEXT, B);
        }
    }
}...
A full outer join is more complex, in that we want to keep all records, ensuring that we join records where appropriate. If list A is not empty, then for every element in A, join with B when the B list is not empty, or output A by itself. If A is empty, then just output B.
... else if (joinType.equalsIgnoreCase("fullouter")) {
    // If list A is not empty
    if (!listA.isEmpty()) {
        // For each entry in A
        for (Text A : listA) {
            // If list B is not empty, join A with B
            if (!listB.isEmpty()) {
                for (Text B : listB) {
                    context.write(A, B);
                }
            } else {
                // Else, output A by itself
                context.write(A, EMPTY_TEXT);
            }
        }
    } else {
        // If list A is empty, just output B
        for (Text B : listB) {
            context.write(EMPTY_TEXT, B);
        }
    }
}...
For an antijoin, if exactly one of the lists is empty, output the records from the non-empty list with an empty Text object.
... else if (joinType.equalsIgnoreCase("anti")) {
    // If exactly one of the two lists is empty
    if (listA.isEmpty() ^ listB.isEmpty()) {
        // Iterate both A and B with empty values.
        // The previous XOR check makes sure exactly one of
        // these lists is empty, so that list's loop is skipped.
        for (Text A : listA) {
            context.write(A, EMPTY_TEXT);
        }
        for (Text B : listB) {
            context.write(EMPTY_TEXT, B);
        }
    }
}
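To compare the five join types side by side, the same branching can be sketched on plain string lists. This is an illustration under assumptions rather than the reducer itself: JoinLogicSketch and its join method are hypothetical, each output pair is rendered as "left|right", and an empty side stands in for EMPTY_TEXT.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Sketch: the join branches applied to one reduce group, using strings instead of Text. */
class JoinLogicSketch {

    /** Returns the pairs that the given join type would emit for one group. */
    static List<String> join(String joinType, List<String> listA, List<String> listB) {
        List<String> out = new ArrayList<>();
        boolean aEmpty = listA.isEmpty();
        boolean bEmpty = listB.isEmpty();
        switch (joinType.toLowerCase(Locale.ROOT)) {
            case "inner":
                // The nested loops emit nothing when either list is empty
                for (String a : listA) for (String b : listB) out.add(a + "|" + b);
                break;
            case "leftouter":
                for (String a : listA) {
                    if (bEmpty) out.add(a + "|");
                    else for (String b : listB) out.add(a + "|" + b);
                }
                break;
            case "rightouter":
                for (String b : listB) {
                    if (aEmpty) out.add("|" + b);
                    else for (String a : listA) out.add(a + "|" + b);
                }
                break;
            case "fullouter":
                if (!aEmpty) {
                    for (String a : listA) {
                        if (bEmpty) out.add(a + "|");
                        else for (String b : listB) out.add(a + "|" + b);
                    }
                } else {
                    for (String b : listB) out.add("|" + b);
                }
                break;
            case "anti":
                // Emit only when exactly one side is empty; the loop over the empty side is a no-op
                if (aEmpty ^ bEmpty) {
                    for (String a : listA) out.add(a + "|");
                    for (String b : listB) out.add("|" + b);
                }
                break;
            default:
                throw new IllegalArgumentException("Unknown join type: " + joinType);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList("alice");
        List<String> comments = Arrays.asList("c1", "c2");
        List<String> none = Collections.emptyList();
        for (String type : Arrays.asList("inner", "leftouter", "rightouter", "fullouter", "anti")) {
            System.out.println(type + ": both=" + join(type, users, comments)
                    + " onlyA=" + join(type, users, none)
                    + " onlyB=" + join(type, none, comments));
        }
    }
}
```

For a group that has records on both sides, every type except the antijoin emits the same cross product. The types differ only in how they treat a group with one side missing.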