MapReduce Join

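Joining two data sets, data1 and data2, on a key is a very common problem. If the data is small, the join can be done in memory.

If the data is large, an in-memory join will run out of memory (OOM). A MapReduce join can be used to join large data sets.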

1 Approaches

1.1 reduce join

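In the map phase, emit the join key as the output key, and tag each value to mark whether it comes from data1 or data2. Because the shuffle phase already groups records by key, the reduce phase only has to check the tag of each value, split the values into two groups, and emit the Cartesian product of the two groups.

This approach has two problems:

1) The map phase does not slim down the data, so network transfer and sorting during the shuffle are slow.

2) The reduce side computes the product of two sets, which consumes a lot of memory and can easily cause OOM.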
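The reduce-side logic described above can be sketched in plain Java, without any Hadoop classes. This is only an illustration: the class name, the tag values "1" and "2", and the "<tag>\t<payload>" value layout are assumptions made for the example, not part of the original program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReduceJoinSketch {

    // Hypothetical tags that the map phase would prepend to each value.
    static final String TAG_DATA1 = "1";
    static final String TAG_DATA2 = "2";

    /**
     * Joins all values that share one key. Each value is "<tag>\t<payload>".
     * Both groups are buffered in memory, which is the OOM risk noted above.
     */
    static List<String> joinValuesOfOneKey(String key, Iterable<String> taggedValues) {
        List<String> fromData1 = new ArrayList<String>();
        List<String> fromData2 = new ArrayList<String>();
        for (String tagged : taggedValues) {
            int sep = tagged.indexOf('\t');
            if (sep < 0) {
                continue; // malformed value, ignore it
            }
            String tag = tagged.substring(0, sep);
            String payload = tagged.substring(sep + 1);
            if (TAG_DATA1.equals(tag)) {
                fromData1.add(payload);
            } else if (TAG_DATA2.equals(tag)) {
                fromData2.add(payload);
            }
        }
        // Cartesian product of the two groups.
        List<String> joined = new ArrayList<String>();
        for (String left : fromData1) {
            for (String right : fromData2) {
                joined.add(key + "\t" + left + "\t" + right);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("1\thanmeimei", "2\torder-1", "2\torder-2");
        for (String row : joinValuesOfOneKey("1", values)) {
            System.out.println(row);
        }
    }
}
```

In a real job this method body would live inside a Reducer, and the tagged values would arrive through the shuffle.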

1.2 map join

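If one of the two data sets is small, load it entirely into memory and index it by the join key. The large data file is used as the map input, so each input pair passed to map() can be joined directly against the small data set already in memory. Emit the joined result by key; after the shuffle phase, the reduce side receives data that is already grouped by key and already joined.

This approach uses Hadoop's DistributedCache to distribute the small data set to every compute node. Each map task loads the small data set into memory and indexes it by the join key.

This approach has an obvious limitation: one of the data sets must be small enough to be loaded into memory on the map side for the join.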

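1.3 Using a memory server to extend node memory

For a map join, one data set can be stored on a dedicated memory server. In the map() method, for each <key,value> input pair, fetch the matching data from the memory server by key and perform the join.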
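A minimal sketch of this idea follows, assuming a hypothetical RemoteStore interface as the client of the memory server (it is not a real API) and a small LRU cache inside each map task so that repeated keys do not cause repeated round trips. The in-memory store in the example is only a stand-in for a real server.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class CachedLookupSketch {

    /** Hypothetical client interface for the memory server (not a real API). */
    interface RemoteStore {
        String get(String key);
    }

    private final RemoteStore store;
    private final Map<String, String> cache;
    int remoteCalls = 0; // counts round trips, for illustration only

    CachedLookupSketch(RemoteStore store, final int cacheSize) {
        this.store = store;
        // A LinkedHashMap in access order evicts the least recently used entry.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** Looks up a key, asking the memory server only on a cache miss. */
    String lookup(String key) {
        String value = cache.get(key);
        if (value == null) {
            remoteCalls++;
            value = store.get(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    /** Stand-in for a real memory server, backed by a local map. */
    static RemoteStore inMemoryStore(final Map<String, String> data) {
        return new RemoteStore() {
            @Override
            public String get(String key) {
                return data.get(key);
            }
        };
    }

    public static void main(String[] args) {
        Map<String, String> customers = new HashMap<String, String>();
        customers.put("1", "hanmeimei\tShangHai\t110");
        CachedLookupSketch lookup = new CachedLookupSketch(inMemoryStore(customers), 1000);
        System.out.println(lookup.lookup("1"));
        System.out.println(lookup.lookup("1"));
        System.out.println("remote calls: " + lookup.remoteCalls);
    }
}
```

The trade-off is that every cache miss costs a network round trip, so this suits inputs where keys repeat often.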

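1.4 Using a BloomFilter to filter out records with no join partner

Build a BloomFilter in memory over the keys of one data set. Before joining, check each record of the other data set against the BloomFilter: if its key is reported absent, the record has no join partner and can be skipped. A BloomFilter can report false positives, so records that pass the filter still go through the real join.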
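To make the filtering step concrete, here is a small self-contained Bloom filter written for illustration. The class name, the bit-array size, and the hash mixing are assumptions of this sketch; Hadoop also ships its own org.apache.hadoop.util.bloom.BloomFilter, which is not used here.

```java
import java.util.BitSet;

public class BloomFilterSketch {

    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Records a key by setting one bit per hash function. */
    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    /** false means the key was never added; true means it may have been. */
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Derives the i-th bit position from two hashes (double hashing).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return ((h1 + i * h2) & 0x7fffffff) % numBits;
    }

    public static void main(String[] args) {
        // Build the filter over the customer ids of the small data set.
        BloomFilterSketch customerIds = new BloomFilterSketch(1024, 3);
        for (String id : new String[] {"1", "2", "3"}) {
            customerIds.add(id);
        }
        // Check the customer id of each order before attempting the join.
        for (String id : new String[] {"1", "3", "42"}) {
            System.out.println(id + " -> "
                    + (customerIds.mightContain(id) ? "may be present, join it" : "absent, skip"));
        }
    }
}
```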

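1.5 Using the MapReduce package designed for joins

The MapReduce library contains a package designed specifically for joins. I have not studied it yet and do not know how to use it; I am only recording it here as a reminder.

jar: hadoop-mapreduce-client-core.jar

package: org.apache.hadoop.mapreduce.lib.join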

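2 Implementing a map join

Relatively speaking, map join is the more common approach. The code below implements a map join using DistributedCache.

2.1 Background

There is customer data (customer) and order data (orders).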

customer

Customer ID	Name	Address	Phone
1	hanmeimei	ShangHai	110
2	leilei	BeiJing	112
3	lucy	GuangZhou	119

orders

Order ID	Customer ID	Order amount (other fields omitted)
1	1	50
2	1	200
3	3	15
4	3	350
5	3	58
6	1	42
7	1	352
8	2	1135
9	2	400
10	2	2000
11	2	300

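The task is to join customer and orders on customer ID. The result must be grouped by customer ID and sorted by order ID; there are no requirements on the other fields.

Customer ID	Order ID	Order amount	Name	Address	Phone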
1	1	50	hanmeimei	ShangHai	110
1	2	200	hanmeimei	ShangHai	110
1	6	42	hanmeimei	ShangHai	110
1	7	352	hanmeimei	ShangHai	110
2	8	1135	leilei	BeiJing	112
2	9	400	leilei	BeiJing	112
2	10	2000	leilei	BeiJing	112
2	11	300	leilei	BeiJing	112
3	3	15	lucy	GuangZhou	119
3	4	350	lucy	GuangZhou	119
3	5	58	lucy	GuangZhou	119

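The implementation proceeds in three steps:

1) When submitting the job, distribute the small data set to every node through DistributedCache.
2) On the map side, read the data from DistributedCache and build the lookup map in memory. If a dedicated memory server is used, load the data into the memory server instead, so that each map task only keeps a small local cache. If a BloomFilter is used for speed-up, build it at this point.
3) In map(), for each <key,value> pair, look up the key in the map built in step 2), perform the join, and emit the result.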
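Before running anything on a cluster, the join logic and the required ordering can be checked locally. The sketch below is an assumption-laden stand-in, not the Hadoop job: it uses plain Java collections, a hypothetical class name, and an explicit sort in place of the shuffle. It indexes the customer lines by customer ID, joins each order against that index, and sorts by customer ID and then order ID, both compared as numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocalMapJoinSketch {

    static List<String> join(List<String> customerLines, List<String> orderLines) {
        // Index the small data set by customer id.
        Map<Integer, String> customerById = new HashMap<Integer, String>();
        for (String line : customerLines) {
            String[] cols = line.split("\t", 2); // id, then "name\taddress\tphone"
            customerById.put(Integer.parseInt(cols[0]), cols[1]);
        }
        // Join each order against the index; drop orders with no customer.
        List<String[]> rows = new ArrayList<String[]>();
        for (String line : orderLines) {
            String[] cols = line.split("\t"); // order id, customer id, amount
            String customer = customerById.get(Integer.parseInt(cols[1]));
            if (customer == null) {
                continue;
            }
            rows.add(new String[] {cols[1], cols[0], cols[2], customer});
        }
        // Stand-in for the shuffle: sort by customer id, then by order id.
        Collections.sort(rows, new Comparator<String[]>() {
            @Override
            public int compare(String[] a, String[] b) {
                int res = Integer.compare(Integer.parseInt(a[0]), Integer.parseInt(b[0]));
                return res != 0 ? res
                        : Integer.compare(Integer.parseInt(a[1]), Integer.parseInt(b[1]));
            }
        });
        List<String> out = new ArrayList<String>();
        for (String[] r : rows) {
            out.add(r[0] + "\t" + r[1] + "\t" + r[2] + "\t" + r[3]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> customers = Arrays.asList(
                "1\thanmeimei\tShangHai\t110", "2\tleilei\tBeiJing\t112", "3\tlucy\tGuangZhou\t119");
        List<String> orders = Arrays.asList("3\t3\t15", "10\t2\t2000", "1\t1\t50", "8\t2\t1135");
        for (String row : join(customers, orders)) {
            System.out.println(row);
        }
    }
}
```

Comparing the IDs as integers matters here: a plain string comparison would place order 10 before order 8.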

2.2 Implementation

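import java.io.BufferedReader;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Join extends Configured implements Tool {

	// Location of the customer file on HDFS.
	// TODO: pass this in as a parameter instead.
	private static final String CUSTOMER_CACHE_URL = "hdfs://hadoop1:9000/user/hadoop/mapreduce/cache/customer.txt";

	// One record of the customer data set.
	private static class CustomerBean {
		private int custId;
		private String name;
		private String address;
		private String phone;

		public CustomerBean() {}

		public CustomerBean(int custId, String name, String address, String phone) {
			super();
			this.custId = custId;
			this.name = name;
			this.address = address;
			this.phone = phone;
		}

		public int getCustId() {
			return custId;
		}

		public String getName() {
			return name;
		}

		public String getAddress() {
			return address;
		}

		public String getPhone() {
			return phone;
		}
	}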

	// Composite map output key: sorts by customer id first, then by order id.
	private static class CustOrderMapOutKey implements WritableComparable<CustOrderMapOutKey> {
		private int custId;
		private int orderId;

		public void set(int custId, int orderId) {
			this.custId = custId;
			this.orderId = orderId;
		}

		public int getCustId() {
			return custId;
		}

		public int getOrderId() {
			return orderId;
		}

		@Override
		public void write(DataOutput out) throws IOException {
			out.writeInt(custId);
			out.writeInt(orderId);
		}

		@Override
		public void readFields(DataInput in) throws IOException {
			custId = in.readInt();
			orderId = in.readInt();
		}

		@Override
		public int compareTo(CustOrderMapOutKey o) {
			int res = Integer.compare(custId, o.custId);
			return res == 0 ? Integer.compare(orderId, o.orderId) : res;
		}

		@Override
		public boolean equals(Object obj) {
			if (obj instanceof CustOrderMapOutKey) {
				CustOrderMapOutKey o = (CustOrderMapOutKey)obj;
				return custId == o.custId && orderId == o.orderId;
			} else {
				return false;
			}
		}

		// The default HashPartitioner uses hashCode(). Hashing on custId only
		// sends all orders of one customer to the same reducer, and stays
		// consistent with equals().
		@Override
		public int hashCode() {
			return custId;
		}

		@Override
		public String toString() {
			return custId + "\t" + orderId;
		}
	}

	private static class JoinMapper extends Mapper<LongWritable, Text, CustOrderMapOutKey, Text> {
		private final CustOrderMapOutKey outputKey = new CustOrderMapOutKey();
		private final Text outputValue = new Text();

		/**
		 * Customer data held in memory, indexed by customer id.
		 */
		private static final Map<Integer, CustomerBean> CUSTOMER_MAP = new HashMap<Integer, CustomerBean>();

		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			// Format: order id	customer id	order amount
			String[] cols = value.toString().split("\t");
			if (cols.length < 3) {
				return;
			}

			int custId = Integer.parseInt(cols[1]);		// extract the customer id
			CustomerBean customerBean = CUSTOMER_MAP.get(custId);
			if (customerBean == null) {			// no matching customer record to join with
				return;
			}

			StringBuilder sb = new StringBuilder();
			sb.append(cols[2])
				.append("\t")
				.append(customerBean.getName())
				.append("\t")
				.append(customerBean.getAddress())
				.append("\t")
				.append(customerBean.getPhone());

			outputValue.set(sb.toString());
			outputKey.set(custId, Integer.parseInt(cols[0]));
			context.write(outputKey, outputValue);
		}

		@Override
		protected void setup(Context context)
				throws IOException, InterruptedException {
			// Note: this opens the customer file directly at its HDFS location,
			// not the copy localized by the distributed cache.
			FileSystem fs = FileSystem.get(URI.create(CUSTOMER_CACHE_URL), context.getConfiguration());
			FSDataInputStream fdis = fs.open(new Path(CUSTOMER_CACHE_URL));

			// try-with-resources closes the reader and the underlying stream.
			try (BufferedReader reader = new BufferedReader(new InputStreamReader(fdis))) {
				String line = null;
				String[] cols = null;
				// Format: customer id	name	address	phone
				while ((line = reader.readLine()) != null) {
					cols = line.split("\t");
					if (cols.length < 4) {				// malformed line, skip it
						continue;
					}
					CustomerBean bean = new CustomerBean(Integer.parseInt(cols[0]), cols[1], cols[2], cols[3]);
					CUSTOMER_MAP.put(bean.getCustId(), bean);
				}
			}
		}
	}

	/**
	 * reduce
	 * @author Ivan
	 *
	 */
	private static class JoinReducer extends Reducer<CustOrderMapOutKey, Text, CustOrderMapOutKey, Text> {
		@Override
		protected void reduce(CustOrderMapOutKey key, Iterable<Text> values, Context context)
				throws IOException, InterruptedException {
			// Nothing to do here: the records are already joined, so write them straight through.
			for (Text value : values) {
				context.write(key, value);
			}
		}
	}

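	/**
	 * @param args
	 * @throws Exception
	 */
	public static void main(String[] args) throws Exception {
		if (args.length < 2) {
			// The original created this exception without throwing it.
			throw new IllegalArgumentException("Usage: <inpath> <outpath>");
		}

		// Propagate the job status as the process exit code.
		System.exit(ToolRunner.run(new Configuration(), new Join(), args));
	}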

	@Override
	public int run(String[] args) throws Exception {
		Configuration conf = getConf();
		Job job = Job.getInstance(conf, Join.class.getSimpleName());
		job.setJarByClass(Join.class);

		// Add the customer file to the distributed cache.
		job.addCacheFile(URI.create(CUSTOMER_CACHE_URL));

		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		// map settings
		job.setMapperClass(JoinMapper.class);
		job.setMapOutputKeyClass(CustOrderMapOutKey.class);
		job.setMapOutputValueClass(Text.class);

		// reduce settings
		job.setReducerClass(JoinReducer.class);
		job.setOutputKeyClass(CustOrderMapOutKey.class);
		job.setOutputValueClass(Text.class);

		boolean res = job.waitForCompletion(true);
		return res ? 0 : 1;
	}

}

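Runtime environment

Operating system: CentOS 6.4
Hadoop: Apache Hadoop-2.5.0

The location of the customer data file on HDFS is hard-coded as hdfs://hadoop1:9000/user/hadoop/mapreduce/cache/customer.txt. Upload the customer data to this location before running the program.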

