MapReduce: Reduce Join

1. Introduction

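The main idea of a reduce join is as follows.
In the map phase, the map function reads both files, File1 and File2. To tell the key/value pairs from the two sources apart, it attaches a tag to each record, for example tag=0 for records from File1 and tag=2 for records from File2. In other words, the main task of the map phase is to tag the data from the different files. In the reduce phase, the reduce function receives, for each key, the list of values from File1 and File2 that share that key, and joins the File1 records with the File2 records (a Cartesian product). In other words, the reduce phase performs the actual join.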
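The tag-then-join idea can be sketched in plain Java, without a cluster. This is only an illustration under stated assumptions, not the Hadoop job itself: the names `TaggedJoinSketch`, `Tagged`, `tagAndGroup`, and `join` are hypothetical, the join key is assumed to be the first comma-separated field, and the tag values 0 and 2 follow the text. An in-memory map stands in for the shuffle that groups records by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TaggedJoinSketch {
    // Hypothetical tagged record: tag 0 = File1, tag 2 = File2 (as in the text).
    static final class Tagged {
        final int tag;
        final String data;
        Tagged(int tag, String data) { this.tag = tag; this.data = data; }
    }

    // "Map" phase: tag every line and group it under its join key (the first field).
    static void tagAndGroup(List<String> lines, int tag, Map<String, List<Tagged>> groups) {
        for (String line : lines) {
            int comma = line.indexOf(',');
            if (comma < 0) {
                continue; // no key on this line, skip it
            }
            String key = line.substring(0, comma);
            String rest = line.substring(comma + 1);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(new Tagged(tag, rest));
        }
    }

    // "Reduce" phase: per key, split the values by tag and emit the Cartesian product.
    static List<String> join(List<String> file1, List<String> file2) {
        Map<String, List<Tagged>> groups = new TreeMap<>();
        tagAndGroup(file1, 0, groups);
        tagAndGroup(file2, 2, groups);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> e : groups.entrySet()) {
            List<String> left = new ArrayList<>();
            List<String> right = new ArrayList<>();
            for (Tagged t : e.getValue()) {
                (t.tag == 0 ? left : right).add(t.data);
            }
            for (String l : left) {
                for (String r : right) {
                    out.add(e.getKey() + "," + l + "," + r);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList("1,a", "2,b");
        List<String> file2 = Arrays.asList("1,x", "1,y", "3,z");
        join(file1, file2).forEach(System.out::println);
    }
}
```

Keys that appear on only one side (2 and 3 in this toy input) produce no rows, which is the inner-join behaviour of the Cartesian product over an empty list.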

This example assumes the following two data files.
The file that stores customer information, customers.csv:

1,stephaie leung,555-555-5555
2,edward kim,123-456-7890
3,jose madriz,281-330-8004
4,david storkk,408-55-0000

The file that stores order information, orders.csv:

3,A,12.95,02-Jun-2008
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,D,25.02,22-Jan-2009

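The required final output is as follows. Names are copied unchanged from customers.csv, and customer 4 has no orders, so the join emits no row for that customer:

1,stephaie leung,555-555-5555,B,88.25,20-May-2008
2,edward kim,123-456-7890,C,32.00,30-Nov-2007
3,jose madriz,281-330-8004,A,12.95,02-Jun-2008
3,jose madriz,281-330-8004,D,25.02,22-Jan-2009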

2. Code

A custom data type, used to tag the records coming from the different files:

 package mapreduce.reducejoin;

 import java.io.DataInput;
 import java.io.DataOutput;
 import java.io.IOException;

 import org.apache.hadoop.io.Writable;

 public class DataJoinWritable implements Writable {

     // tag identifying the source of the record: "customer" or "order"
     private String tag;

     // the remaining fields of the record
     private String data;

     // no-arg constructor, needed so Hadoop can instantiate the type
     public DataJoinWritable() {
     }

     public DataJoinWritable(String tag, String data) {
         this.set(tag, data);
     }

     public void set(String tag, String data) {
         this.setTag(tag);
         this.setData(data);
     }

     public String getTag() {
         return tag;
     }

     public void setTag(String tag) {
         this.tag = tag;
     }

     public String getData() {
         return data;
     }

     public void setData(String data) {
         this.data = data;
     }

     // serialize: tag first, then data
     @Override
     public void write(DataOutput out) throws IOException {
         out.writeUTF(this.getTag());
         out.writeUTF(this.getData());
     }

     // deserialize in the same order as write()
     @Override
     public void readFields(DataInput in) throws IOException {
         this.setTag(in.readUTF());
         this.setData(in.readUTF());
     }

     @Override
     public int hashCode() {
         final int prime = 31;
         int result = 1;
         result = prime * result + ((data == null) ? 0 : data.hashCode());
         result = prime * result + ((tag == null) ? 0 : tag.hashCode());
         return result;
     }

     @Override
     public boolean equals(Object obj) {
         if (this == obj)
             return true;
         if (obj == null)
             return false;
         if (getClass() != obj.getClass())
             return false;
         DataJoinWritable other = (DataJoinWritable) obj;
         if (data == null) {
             if (other.data != null)
                 return false;
         } else if (!data.equals(other.data))
             return false;
         if (tag == null) {
             if (other.tag != null)
                 return false;
         } else if (!tag.equals(other.tag))
             return false;
         return true;
     }

     @Override
     public String toString() {
         return tag + "," + data;
     }
 }
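The write/readFields pair above only works if both methods handle the fields in the same order. A minimal sketch of that round trip, using only `java.io` so it runs without Hadoop on the classpath: `TagDataRoundTrip`, `TagData`, and `roundTrip` are hypothetical stand-ins that mirror the two fields of DataJoinWritable, not the class itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class TagDataRoundTrip {
    // Stand-in for DataJoinWritable: same two fields, same field order, no Hadoop dependency.
    static final class TagData {
        String tag;
        String data;

        void write(DataOutput out) throws IOException {
            out.writeUTF(tag);
            out.writeUTF(data);
        }

        void readFields(DataInput in) throws IOException {
            tag = in.readUTF();  // must be read in the order it was written
            data = in.readUTF();
        }
    }

    // Serialize a record to bytes and read it back into a fresh object.
    static TagData roundTrip(String tag, String data) {
        try {
            TagData original = new TagData();
            original.tag = tag;
            original.data = data;
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                original.write(out);
            }
            TagData copy = new TagData();
            try (DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TagData copy = roundTrip("customer", "edward kim,123-456-7890");
        System.out.println(copy.tag + "," + copy.data);
    }
}
```

Note that `writeUTF` throws a NullPointerException for a null string, so in a scheme like this both fields should be set before a record is written.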

The MapReduce code:

 package mapreduce.reducejoin;

 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.List;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.NullWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;

 public class DataJoinMapReduce extends Configured implements Tool {

     // step 1: Mapper
     public static class DataJoinMapper extends
             Mapper<LongWritable, Text, LongWritable, DataJoinWritable> {

         // map output key: the customer id
         private LongWritable mapOutputKey = new LongWritable();

         // map output value: the tagged record
         private DataJoinWritable mapOutputValue = new DataJoinWritable();

         @Override
         public void setup(Context context) throws IOException,
                 InterruptedException {
         }

         @Override
         public void map(LongWritable key, Text value, Context context)
                 throws IOException, InterruptedException {

             // line value
             String lineValue = value.toString();

             // split into fields
             String[] vals = lineValue.split(",");
             int length = vals.length;

             // the source file is recognized by its field count:
             // 3 fields = customer, 4 fields = order; skip anything else
             if ((3 != length) && (4 != length)) {
                 return;
             }

             // get cid, the join key (first field in both files)
             long cid = Long.parseLong(vals[0]);

             // second field: customer name, or item name for an order
             String name = vals[1];

             // set customer
             if (3 == length) {
                 String phone = vals[2];
                 mapOutputKey.set(cid);
                 mapOutputValue.set("customer", name + "," + phone);
             }

             // set order
             if (4 == length) {
                 String price = vals[2];
                 String date = vals[3];
                 mapOutputKey.set(cid);
                 mapOutputValue.set("order", name + "," + price + "," + date);
             }

             // output
             context.write(mapOutputKey, mapOutputValue);
         }

         @Override
         public void cleanup(Context context) throws IOException,
                 InterruptedException {
         }
     }

     // step 2: Reducer
     public static class DataJoinReducer extends
             Reducer<LongWritable, DataJoinWritable, NullWritable, Text> {

         private Text outputValue = new Text();

         @Override
         protected void setup(Context context) throws IOException,
                 InterruptedException {
         }

         @Override
         protected void reduce(LongWritable key,
                 Iterable<DataJoinWritable> values, Context context)
                 throws IOException, InterruptedException {

             String customerInfo = null;
             List<String> orderList = new ArrayList<String>();

             // Hadoop reuses the value object between iterations, so keep
             // the extracted strings rather than references to value
             for (DataJoinWritable value : values) {
                 if ("customer".equals(value.getTag())) {
                     customerInfo = value.getData();
                 } else if ("order".equals(value.getTag())) {
                     orderList.add(value.getData());
                 }
             }

             // inner join: skip orders that have no customer record,
             // otherwise the literal string "null" would be written
             if (customerInfo == null) {
                 return;
             }

             // output one joined row per order
             for (String order : orderList) {
                 // set output value
                 outputValue.set(key.get() + "," + customerInfo + "," + order);
                 // output
                 context.write(NullWritable.get(), outputValue);
             }
         }

         @Override
         protected void cleanup(Context context) throws IOException,
                 InterruptedException {
         }
     }

     // step 3: Driver
     /**
      * Execute the command with the given arguments.
      *
      * @param args
      *            command specific arguments: input path, output path.
      * @return exit code.
      * @throws Exception
      */
     @Override
     public int run(String[] args) throws Exception {
         Configuration configuration = this.getConf();

         // set job
         Job job = Job.getInstance(configuration, this.getClass().getSimpleName());
         job.setJarByClass(DataJoinMapReduce.class);

         // input: a directory expected to hold both customers.csv and orders.csv
         Path inpath = new Path(args[0]);
         FileInputFormat.addInputPath(job, inpath);

         // output: must not exist before the job runs
         Path outPath = new Path(args[1]);
         FileOutputFormat.setOutputPath(job, outPath);

         // Mapper
         job.setMapperClass(DataJoinMapper.class);
         job.setMapOutputKeyClass(LongWritable.class);
         job.setMapOutputValueClass(DataJoinWritable.class);

         // Reducer
         job.setReducerClass(DataJoinReducer.class);
         job.setOutputKeyClass(NullWritable.class);
         job.setOutputValueClass(Text.class);

         // submit job -> YARN
         boolean isSuccess = job.waitForCompletion(true);
         return isSuccess ? 0 : 1;
     }

     public static void main(String[] args) throws Exception {
         Configuration configuration = new Configuration();

         // fall back to the example paths when none are given on the command line
         if (args.length < 2) {
             args = new String[] {
                     "hdfs://beifeng01:8020/user/beifeng01/mapreduce/input/reducejoin",
                     "hdfs://beifeng01:8020/user/beifeng01/mapreduce/output" };
         }

         // run job
         int status = ToolRunner.run(configuration, new DataJoinMapReduce(),
                 args);

         // exit program
         System.exit(status);
     }
 }
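The mapper decides which file a line came from purely by its field count. That rule can be pulled out into a small pure function and checked without a cluster. The sketch below is hypothetical (`LineTagger`, `TaggedLine`, and `tag` are illustrative names, not part of the job above), and it makes one assumption the mapper does not: a line whose first field is not a number is skipped rather than allowed to fail the task.

```java
import java.util.Arrays;
import java.util.Optional;

final class LineTagger {
    // Hypothetical result holder: join key, source tag, and the remaining fields.
    static final class TaggedLine {
        final long cid;
        final String tag;
        final String data;
        TaggedLine(long cid, String tag, String data) {
            this.cid = cid;
            this.tag = tag;
            this.data = data;
        }
    }

    // 3 fields = customer, 4 fields = order, anything else is rejected.
    static Optional<TaggedLine> tag(String line) {
        String[] vals = line.split(",");
        if (vals.length != 3 && vals.length != 4) {
            return Optional.empty();
        }
        long cid;
        try {
            cid = Long.parseLong(vals[0].trim());
        } catch (NumberFormatException e) {
            return Optional.empty(); // assumption: skip lines with a non-numeric id
        }
        String tag = (vals.length == 3) ? "customer" : "order";
        // everything after the id is carried along as the record's data
        String data = String.join(",", Arrays.copyOfRange(vals, 1, vals.length));
        return Optional.of(new TaggedLine(cid, tag, data));
    }

    public static void main(String[] args) {
        tag("3,A,12.95,02-Jun-2008")
                .ifPresent(t -> System.out.println(t.cid + " " + t.tag + " " + t.data));
    }
}
```

Distinguishing sources by field count is fragile if the two files ever share a column count; tagging by input file name is a common alternative, though it is not what the code above does.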

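Query the result after running the job. The two rows for customer 3 may appear in either order, because MapReduce does not guarantee the order of the values within a key: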

[hadoop@beifeng01 hadoop-2.5.0-cdh5.3.6]$ bin/hdfs dfs -text /user/beifeng01/mapreduce/output/p*

1,stephaie leung,555-555-5555,B,88.25,20-May-2008
2,edward kim,123-456-7890,C,32.00,30-Nov-2007
3,jose madriz,281-330-8004,D,25.02,22-Jan-2009
3,jose madriz,281-330-8004,A,12.95,02-Jun-2008
