Map Reduce Application (Join)
We are going to explain how joins work in MapReduce, focusing on the reduce side join and the map side join.
Reduce Side Join
Assume we have two datasets: one holds user information (id, name, ...) and the other holds comments made by users (user id, content, date, ...). We want to join the two datasets to select each username together with the comments that user posted, so this is a typical join example. You can implement all join types this way, including inner join, left/right outer join, and full outer join. As the name indicates, the join is done in the reducer.
- We use a separate mapper class for each dataset (think of each dataset as a table in an RDBMS): two mappers here, or n mappers for n datasets. We wire them up with MultipleInputs as shown below; a complete driver sketch appears at the end of this section.
MultipleInputs.addInputPath(job, userPath, TextInputFormat.class, UserMapper.class);
MultipleInputs.addInputPath(job, commentsPath, TextInputFormat.class, CommentsMapper.class);
...
MultipleInputs.addInputPath(job, otherPath, TextInputFormat.class, OtherMapper.class);
- In each mapper we only need to emit key/value pairs, because most of the work is done in the reducer. When the reduce function iterates over the values for a given key, it has to know which dataset each value came from in order to perform the join, and the reducer cannot tell by itself whether a value was produced by UserMapper or CommentsMapper. So in the map function we mark the value, for example by prefixing it with the mapper name (full mapper sketches are given right after this list):
outkey.set(userId);
// mark this value so the reduce function knows which dataset it came from
outvalue.set("UserMapper" + value.toString());
context.write(outkey, outvalue);
- In the reducer we read the join type from the configuration and perform the join. There can be multiple reduce tasks running in parallel, each handling its own set of keys.
private String joinType;
private final List<Text> listUser = new ArrayList<>();
private final List<Text> listComments = new ArrayList<>();

@Override
public void setup(Context context) {
    // the driver stores the requested join type in the job configuration
    joinType = context.getConfiguration().get("joinType");
}

@Override
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    listUser.clear();
    listComments.clear();
    // isFromUserMapper, isFromCommentsMapper and realContent are small helpers
    // that inspect and strip the tag the mappers added to each value
    for (Text t : values) {
        if (isFromUserMapper(t)) {
            listUser.add(realContent(t));
        } else if (isFromCommentsMapper(t)) {
            listComments.add(realContent(t));
        }
    }
    doJoin(context);
}

private void doJoin(Context context) throws IOException, InterruptedException {
    if (joinType.equals("inner")) {
        // inner join: emit the cross product only when both sides are non-empty
        if (!listUser.isEmpty() && !listComments.isEmpty()) {
            for (Text user : listUser) {
                for (Text comm : listComments) {
                    context.write(user, comm);
                }
            }
        }
    } else if (joinType.equals("leftouter")) {
        // ... the other join types follow the same pattern
    }
}
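To make the tagging concrete, here is a minimal sketch of the two mapper classes referred to above. It assumes tab-separated input with the user id as the first field of both files; the field indexes and class layout are assumptions, not taken from the original example.
// imports assumed: java.io.IOException, org.apache.hadoop.io.LongWritable,
// org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Mapper
public static class UserMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outkey = new Text();
    private final Text outvalue = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        outkey.set(fields[0]);                          // user id is the join key
        outvalue.set("UserMapper" + value.toString());  // tag with the mapper name
        context.write(outkey, outvalue);
    }
}

public static class CommentsMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outkey = new Text();
    private final Text outvalue = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        outkey.set(fields[0]);                              // user id of the comment
        outvalue.set("CommentsMapper" + value.toString());  // tag with the mapper name
        context.write(outkey, outvalue);
    }
}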
In a reduce side join, all of the data is shuffled across the network to the reducers, so it puts a heavy demand on network bandwidth.
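To tie the pieces together, here is a hedged sketch of what the driver might look like. The class names (ReduceSideJoinDriver, JoinReducer), the args[] indexes for paths, and the "joinType" configuration key mirror the snippets above but are otherwise illustrative assumptions.
// imports assumed: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.Path,
// org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Job,
// org.apache.hadoop.mapreduce.lib.input.MultipleInputs,
// org.apache.hadoop.mapreduce.lib.input.TextInputFormat,
// org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
Configuration conf = new Configuration();
conf.set("joinType", "inner");                        // read later in the reducer's setup()
Job job = Job.getInstance(conf, "reduce side join");
job.setJarByClass(ReduceSideJoinDriver.class);

MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, UserMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, CommentsMapper.class);

job.setReducerClass(JoinReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileOutputFormat.setOutputPath(job, new Path(args[2]));

System.exit(job.waitForCompletion(true) ? 0 : 1);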
Map Side Join/Replicated Join
As the name indicates, the join is performed on the map side, so there is no reducer. It suits joins where one dataset is large and the others are small enough to fit in the memory of a single machine. It is faster than the reduce side join because there is no reduce phase, no intermediate output, and no network transfer during a shuffle.
We still use the same example: join the user (small) and comments (large) datasets. How do we implement it?
- Set the number of reduce tasks to 0.
job.setNumReduceTasks(0);
- Add the small dataset to the Hadoop distributed cache. The first call below is the older, deprecated API; the second is the current one.
DistributedCache.addCacheFile(new Path(filename).toUri(), job.getConfiguration());
job.addCacheFile(new Path(filename).toUri());
- In the mapper's setup function, get the cached files with one of the calls below (the first one is deprecated). Read the file and put the key/value pairs into an instance variable such as a HashMap. setup runs in a single thread per map task, so this is safe.
Path[] localPaths = context.getLocalCacheFiles();
URI[] uris = context.getCacheFiles();
- In the map function, since the entire user dataset is already in the HashMap, look up the key (the user id parsed from the comments split) in the HashMap. If it exists, you have a match. Because each map task only sees one split of the comments dataset, you can only perform an inner join or a left outer join. A sketch of such a mapper is shown right after this list.
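Below is a minimal sketch of such a replicated-join mapper. The tab-separated layout (user id in the first column of both files), the "user" symlink name, and the class name are assumptions used for illustration.
// imports assumed: java.io.BufferedReader, java.io.FileReader, java.io.IOException,
// java.net.URI, java.util.HashMap, java.util.Map,
// org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text,
// org.apache.hadoop.mapreduce.Mapper
public static class ReplicatedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> userById = new HashMap<>();
    private final Text outkey = new Text();
    private final Text outvalue = new Text();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Load the small (user) dataset from the distributed cache into memory.
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles != null && cacheFiles.length > 0) {
            // The file was added with a #user fragment, so it is visible locally as "user".
            try (BufferedReader reader = new BufferedReader(new FileReader("user"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split("\t");
                    userById.put(fields[0], fields[1]);   // user id -> user name
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        String userName = userById.get(fields[0]);        // look up the comment's user id
        if (userName != null) {                           // inner join: emit only matches
            outkey.set(userName);
            outvalue.set(value);
            context.write(outkey, outvalue);
        }
        // For a left outer join, also emit comments that have no matching user.
    }
}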
What is Hadoop Distributed Cache?
"DistributedCache is a facility provided by the Map-Reduce framework to cache files needed by applications. Once you cache a file for your job, hadoop framework will make it available on(or broadcast to) each and every data nodes (in file system, not in memory) where you map/reduce tasks are running. Then you can access the cache file as local file in your Mapper Or Reducer job. Now you can easily read the cache file and populate some collection (e.g Array, Hashmap etc.) in your code" The cache will be removed once the job is done as they are temporary files.
The size of the cache can be configured in mapred-site.xml.
How to use the Distributed Cache (note that the API has changed across Hadoop versions)?
- Add cache in driver.
Note the # sign in the URI. Before it you specify the absolute path of the file in HDFS; after it you set a name (a symlink) that becomes the local file name you use in your mapper/reducer.
job.addCacheFile(new URI("/user/ricky/user.txt#user"));
job.addCacheFile(new URI("/user/ricky/org.txt#org")); return job.waitForCompletion(true) ? 0 : 1;
- Read the cache in your task (mapper/reducer), typically in the setup function.
@Override
protected void setup(Mapper<LongWritable, Text, Text, Text>.Context context)
        throws IOException, InterruptedException {
    if (context.getCacheFiles() != null
            && context.getCacheFiles().length > 0) {
        // the symlink names chosen in the driver ("user", "org") are visible
        // as local files in the task's working directory
        File some_file = new File("user");
        File other_file = new File("org");
    }
    super.setup(context);
}
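To actually use the cached files, you would read them and fill a collection, as the quoted description above suggests. A minimal continuation of the setup method, assuming a tab-separated user file (the layout is an assumption):
// continue inside setup(): read the locally symlinked "user" file into a map
Map<String, String> users = new HashMap<>();
try (BufferedReader reader = new BufferedReader(new FileReader(some_file))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] fields = line.split("\t");   // assumed tab-separated layout
        users.put(fields[0], fields[1]);      // e.g. user id -> user name
    }
}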
Reference:
https://www.youtube.com/user/pramodnarayana/videos
https://stackoverflow.com/questions/19678412/number-of-mappers-and-reducers-what-it-means