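Reduce-Side Join in MapReduce

1. Introduction

A reduce-side join is the slowest of all the join strategies, but it works for every join type: inner join, left outer join, right outer join, full outer join, and anti join. It can join small datasets as well as large ones, although large datasets consume a lot of intra-cluster network I/O, because all of the data must eventually be shuffled to the reduce side to be joined.

If the data to be joined is very large, you have little choice but to use a reduce-side join.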

2. When to use it

- Both datasets being joined are very large;

- You want a single pattern that flexibly supports several join types.

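3. Architecture of the reduce-side join

3.1 Map phase

The map phase first extracts the join's foreign key from each record and uses it as the map output key, then emits the whole input record as the output value. Each output value must be tagged with the dataset it came from, for example by prefixing the value with 'A' or 'B'.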
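The tagging step can be sketched in plain Java, with no Hadoop dependency. The class and method names (`TagSketch`, `tag`) are hypothetical, and the comma-separated record layout is an assumption taken from the sample data in section 4.1.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical sketch of the map-side tagging step (plain Java, no Hadoop).
final class TagSketch {

    // Returns (foreign key, tag + whole record) for one comma-separated record.
    // keyIndex is the position of the foreign key field within the record.
    static Map.Entry<String, String> tag(char datasetTag, String record, int keyIndex) {
        String[] fields = record.split(",");
        String joinKey = fields[keyIndex].trim();
        return new SimpleEntry<>(joinKey, datasetTag + record);
    }

    public static void main(String[] args) {
        System.out.println(tag('A', "Lily,3", 1));  // prints 3=ALily,3
        System.out.println(tag('B', "3,Jinan", 0)); // prints 3=B3,Jinan
    }
}
```

Both records end up under the same key "3", so they would reach the same reduce call, and the leading tag tells the reducer which dataset each value came from.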

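3.2 Reduce phase

The reduce side processes all records that share the same foreign key, iterating over the records tagged 'A' and the records tagged 'B'. The pseudocode below shows how each join type is handled.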
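Before any join logic runs, the reducer has to separate the values for one key by tag. A minimal plain-Java sketch of that step follows; the class name `TagSplitSketch` is hypothetical, and a one-character tag prefix is assumed, as described in section 3.1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: split the tagged values of one key into listA and listB.
final class TagSplitSketch {
    final List<String> listA = new ArrayList<>();
    final List<String> listB = new ArrayList<>();

    // Route each value to its list by the leading tag, stripping the tag.
    static TagSplitSketch split(Iterable<String> taggedValues) {
        TagSplitSketch result = new TagSplitSketch();
        for (String value : taggedValues) {
            if (value.isEmpty()) {
                continue; // nothing to route
            }
            char tag = value.charAt(0);
            String payload = value.substring(1);
            if (tag == 'A') {
                result.listA.add(payload);
            } else if (tag == 'B') {
                result.listB.add(payload);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TagSplitSketch groups = split(Arrays.asList("ALily", "BJinan", "ALucy"));
        System.out.println(groups.listA + " " + groups.listB); // prints [Lily, Lucy] [Jinan]
    }
}
```

Note that both lists are held in memory for the duration of one reduce call, so a key with a very large number of values on both sides can become a memory concern.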

- Inner join: if records tagged 'A' and records tagged 'B' both exist, iterate over them, join each pair, and output the result.

if (!listA.isEmpty() && !listB.isEmpty()) {
    for (Text A : listA) {
        for (Text B : listB) {
            context.write(A, B);
        }
    }
}

- Left outer join: if matching right-side data exists, join it with the left record; otherwise fill the right-side fields with null, so that only the left record is output.

// For each entry in A,
for (Text A : listA) {
    // If list B is not empty, join A and B
    if (!listB.isEmpty()) {
        for (Text B : listB) {
            context.write(A, B);
        }
    } else {
        // Else, output A by itself
        context.write(A, EMPTY_TEXT);
    }
}

- Right outer join: the mirror image of the left outer join. If the left side is empty, fill the left-side fields with null, so that only the right record is output.

// For each entry in B,
for (Text B : listB) {
    // If list A is not empty, join A and B
    if (!listA.isEmpty()) {
        for (Text A : listA) {
            context.write(A, B);
        }
    } else {
        // Else, output B by itself
        context.write(EMPTY_TEXT, B);
    }
}

- Full outer join: this one is slightly more involved. First output the pairs for which neither A nor B is empty, then output the records for which one side is empty.

// If list A is not empty
if (!listA.isEmpty()) {
    // For each entry in A
    for (Text A : listA) {
        // If list B is not empty, join A with B
        if (!listB.isEmpty()) {
            for (Text B : listB) {
                context.write(A, B);
            }
        } else {
            // Else, output A by itself
            context.write(A, EMPTY_TEXT);
        }
    }
} else {
    // If list A is empty, just output B
    for (Text B : listB) {
        context.write(EMPTY_TEXT, B);
    }
}

- Anti join: output the records from A and B whose foreign key has no match in the other dataset.

// If exactly one of the two lists is empty (XOR)
if (listA.isEmpty() ^ listB.isEmpty()) {
    // Iterate both A and B with null values.
    // The XOR check above guarantees that exactly one of these
    // lists is empty, so the loop over that list is skipped.
    for (Text A : listA) {
        context.write(A, EMPTY_TEXT);
    }
    for (Text B : listB) {
        context.write(EMPTY_TEXT, B);
    }
}
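The five cases differ only in what happens when one side is empty, which is easier to see when they are folded into one pure function. The sketch below is plain Java rather than Hadoop code; `JoinSketch`, `JoinType`, and the "left|right" row format are hypothetical names and conventions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: the per-key join logic for all five join types.
final class JoinSketch {
    enum JoinType { INNER, LEFT_OUTER, RIGHT_OUTER, FULL_OUTER, ANTI }

    static final String EMPTY = "";

    // Joins the values of ONE foreign key; each result row is "left|right".
    static List<String> join(JoinType type, List<String> listA, List<String> listB) {
        List<String> out = new ArrayList<>();
        boolean matched = !listA.isEmpty() && !listB.isEmpty();
        if (matched) {
            // Every type except the anti join emits the cross product.
            if (type != JoinType.ANTI) {
                for (String a : listA) {
                    for (String b : listB) {
                        out.add(a + "|" + b);
                    }
                }
            }
            return out;
        }
        // From here on at least one side is empty.
        boolean keepA = type == JoinType.LEFT_OUTER || type == JoinType.FULL_OUTER
                || type == JoinType.ANTI;
        boolean keepB = type == JoinType.RIGHT_OUTER || type == JoinType.FULL_OUTER
                || type == JoinType.ANTI;
        if (keepA) {
            for (String a : listA) {
                out.add(a + "|" + EMPTY);
            }
        }
        if (keepB) {
            for (String b : listB) {
                out.add(EMPTY + "|" + b);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        System.out.println(join(JoinType.INNER, Arrays.asList("Lily", "Lucy"), Arrays.asList("Jinan")));
        System.out.println(join(JoinType.LEFT_OUTER, Arrays.asList("Jake"), none));
        System.out.println(join(JoinType.ANTI, none, Arrays.asList("Wuhan")));
    }
}
```

Read this way, an inner join keeps neither unmatched side, the left and right outer joins keep one, the full outer join keeps both, and the anti join keeps both while dropping the matched pairs.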

4. Example

Here is a simple example. The goal is to implement all of the joins above using a reduce-side join.

4.1 Data

User table

---------------------------
username       cityid
---------------------------
Li lei,        1
Xiao hong,     2
Lily,          3
Lucy,          3
Daive,         4
Jake,          5
Xiao Ming,     6

City table

---------------------------
cityid     cityname
---------------------------
1,         Shanghai
2,         Beijing
3,         Jinan
4,         Guangzhou
7,         Wuhan
8,         Shenzhen
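To see which keys will match, it helps to simulate the shuffle on this sample data: tag every record, then group the tagged values by cityid. The sketch below does this in plain Java. `ShuffleSketch` is a hypothetical name, and the rows are assumed to be stored as comma-separated lines, as in the tables above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: group the tagged sample records by cityid,
// roughly as the shuffle would deliver them to each reduce call.
final class ShuffleSketch {
    static final String[] USERS = {
        "Li lei,1", "Xiao hong,2", "Lily,3", "Lucy,3", "Daive,4", "Jake,5", "Xiao Ming,6"
    };
    static final String[] CITIES = {
        "1,Shanghai", "2,Beijing", "3,Jinan", "4,Guangzhou", "7,Wuhan", "8,Shenzhen"
    };

    static Map<String, List<String>> shuffle() {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String user : USERS) {
            String[] f = user.split(",");   // username,cityid
            groups.computeIfAbsent(f[1].trim(), k -> new ArrayList<>()).add("A" + f[0].trim());
        }
        for (String city : CITIES) {
            String[] f = city.split(",");   // cityid,cityname
            groups.computeIfAbsent(f[0].trim(), k -> new ArrayList<>()).add("B" + f[1].trim());
        }
        return groups;
    }

    public static void main(String[] args) {
        shuffle().forEach((cityId, values) -> System.out.println(cityId + " -> " + values));
    }
}
```

Keys 1 to 4 carry both tags (key 3 has two users), keys 5 and 6 carry only 'A' values, and keys 7 and 8 carry only 'B' values. From that, the row counts follow: 5 rows for the inner join, 7 for each of the left and right outer joins, 9 for the full outer join, and 4 for the anti join. In a real job the order of values within one key is not guaranteed, unlike in this in-memory simulation.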

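4.2 Code walkthrough

Write two mappers, one for the user data and one for the city data. In the main function, use the MultipleInputs class to add the input paths and assign each path its own Mapper.

Add a "join.type" parameter to the Configuration. It is passed to the reducer and determines which kind of join the reduce side performs.

The full code is as follows: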

package com.study.hadoop.mapreduce;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceJoin {

    // Mapper for the user data: emits (cityid, "A" + username)
    public static class UserJoinMapper extends Mapper<Object, Text, Text, Text> {
        private final Text outKey = new Text();
        private final Text outValue = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Input line format: username,cityid
            String[] items = value.toString().split(",");
            if (items.length < 2) {
                return; // skip malformed lines
            }
            // trim() so that padding around the comma does not break key matching
            outKey.set(items[1].trim());
            outValue.set("A" + items[0].trim());
            context.write(outKey, outValue);
        }
    }

    // Mapper for the city data: emits (cityid, "B" + cityname)
    public static class CityJoinMapper extends Mapper<Object, Text, Text, Text> {
        private final Text outKey = new Text();
        private final Text outValue = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Input line format: cityid,cityname
            String[] items = value.toString().split(",");
            if (items.length < 2) {
                return; // skip malformed lines
            }
            // trim() so that padding around the comma does not break key matching
            outKey.set(items[0].trim());
            outValue.set("B" + items[1].trim());
            context.write(outKey, outValue);
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        // Join type, compared case-insensitively: {inner, leftOuter, rightOuter, fullOuter, anti}
        private String joinType = null;
        private static final Text EMPTY_VALUE = new Text("");
        private final List<Text> listA = new ArrayList<Text>();
        private final List<Text> listB = new ArrayList<Text>();

        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            // Read the join type from the job configuration
            joinType = context.getConfiguration().get("join.type");
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            listA.clear();
            listB.clear();
            // Split the values for this key by dataset tag, stripping the tag
            Iterator<Text> iterator = values.iterator();
            while (iterator.hasNext()) {
                String value = iterator.next().toString();
                if (value.charAt(0) == 'A') {
                    listA.add(new Text(value.substring(1)));
                } else if (value.charAt(0) == 'B') {
                    listB.add(new Text(value.substring(1)));
                }
            }
            joinAndWrite(context);
        }

        private void joinAndWrite(Context context)
                throws IOException, InterruptedException {
            if (joinType.equalsIgnoreCase("inner")) {
                // inner join: output only when both sides have records
                if (!listA.isEmpty() && !listB.isEmpty()) {
                    for (Text A : listA) {
                        for (Text B : listB) {
                            context.write(A, B);
                        }
                    }
                }
            } else if (joinType.equalsIgnoreCase("leftouter")) {
                // left outer join: every A is output, padded if B is empty
                for (Text A : listA) {
                    if (!listB.isEmpty()) {
                        for (Text B : listB) {
                            context.write(A, B);
                        }
                    } else {
                        context.write(A, EMPTY_VALUE);
                    }
                }
            } else if (joinType.equalsIgnoreCase("rightouter")) {
                // right outer join: every B is output, padded if A is empty
                for (Text B : listB) {
                    if (!listA.isEmpty()) {
                        for (Text A : listA) {
                            context.write(A, B);
                        }
                    } else {
                        context.write(EMPTY_VALUE, B);
                    }
                }
            } else if (joinType.equalsIgnoreCase("fullouter")) {
                // full outer join: matched pairs, plus unmatched records from either side
                if (!listA.isEmpty()) {
                    for (Text A : listA) {
                        if (!listB.isEmpty()) {
                            for (Text B : listB) {
                                context.write(A, B);
                            }
                        } else {
                            context.write(A, EMPTY_VALUE);
                        }
                    }
                } else {
                    for (Text B : listB) {
                        context.write(EMPTY_VALUE, B);
                    }
                }
            } else if (joinType.equalsIgnoreCase("anti")) {
                // anti join: output only when exactly one side is empty
                if (listA.isEmpty() ^ listB.isEmpty()) {
                    for (Text A : listA) {
                        context.write(A, EMPTY_VALUE);
                    }
                    for (Text B : listB) {
                        context.write(EMPTY_VALUE, B);
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 4) {
            System.err.println("params: <UserInDir> <CityInDir> <OutDir> <join Type>");
            System.exit(1);
        }

        // Job.getInstance replaces the constructor new Job(conf, name), which is deprecated in Hadoop 2.x
        Job job = Job.getInstance(conf, "Reduce side join Job");
        job.setJarByClass(ReduceJoin.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // One input path per dataset, each with its own mapper
        MultipleInputs.addInputPath(job, new Path(otherArgs[0]), TextInputFormat.class, UserJoinMapper.class);
        MultipleInputs.addInputPath(job, new Path(otherArgs[1]), TextInputFormat.class, CityJoinMapper.class);
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));

        // Pass the join type to the reducer through the job configuration
        job.getConfiguration().set("join.type", otherArgs[3]);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

4.3 Results

Commands run and their output for each join type (inner join, left outer join, right outer join, full outer join, anti join): the original screenshots are not available.
