Reduce Join in MapReduce

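1. Introduction

The main idea of a Reduce Join is as follows.

In the map phase, the map function reads both input files, File1 and File2. To tell the two sources of key/value pairs apart, it attaches a tag to every record, for example tag=0 for records from File1 and tag=2 for records from File2 (the values are arbitrary as long as they differ). In other words, the main job of the map phase is to tag the data coming from the different files.

In the reduce phase, the reduce function receives, for each key, a value list that mixes the values from File1 and File2. For that key it joins the File1 values with the File2 values (a Cartesian product). In other words, the actual join is performed in the reduce phase.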
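The two phases described above can be sketched in plain Java without Hadoop. This is only an illustration under stated assumptions: the class `ReduceJoinSketch`, its methods, and the `"key,rest"` line layout are hypothetical, and a sorted in-memory map stands in for the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of a reduce-side join; no Hadoop involved.
class ReduceJoinSketch {
    static final int FILE1 = 0; // tag for records from File1
    static final int FILE2 = 2; // tag for records from File2

    // A value together with the tag of the file it came from.
    static final class Tagged {
        final int tag;
        final String data;
        Tagged(int tag, String data) { this.tag = tag; this.data = data; }
    }

    // "Map" phase: tag every "key,rest" line and group it by key.
    // The sorted map plays the role of the shuffle.
    static void mapAndShuffle(List<String> lines, int tag, Map<Long, List<Tagged>> groups) {
        for (String line : lines) {
            int comma = line.indexOf(',');
            long key = Long.parseLong(line.substring(0, comma));
            groups.computeIfAbsent(key, k -> new ArrayList<>())
                  .add(new Tagged(tag, line.substring(comma + 1)));
        }
    }

    // "Reduce" phase: per key, separate the values by tag and
    // emit the Cartesian product of the two sides.
    static List<String> reduce(Map<Long, List<Tagged>> groups) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, List<Tagged>> e : groups.entrySet()) {
            List<String> left = new ArrayList<>();
            List<String> right = new ArrayList<>();
            for (Tagged t : e.getValue()) {
                if (t.tag == FILE1) {
                    left.add(t.data);
                } else {
                    right.add(t.data);
                }
            }
            for (String l : left) {
                for (String r : right) {
                    out.add(e.getKey() + "," + l + "," + r);
                }
            }
        }
        return out;
    }

    static List<String> join(List<String> file1, List<String> file2) {
        Map<Long, List<Tagged>> groups = new TreeMap<>();
        mapAndShuffle(file1, FILE1, groups);
        mapAndShuffle(file2, FILE2, groups);
        return reduce(groups);
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList("1,a", "2,b");
        List<String> file2 = Arrays.asList("2,x", "1,y", "2,z");
        // prints 1,a,y then 2,b,x then 2,b,z
        join(file1, file2).forEach(System.out::println);
    }
}
```

A key that appears in only one file produces no output here, because one side of its Cartesian product is empty. That is inner-join behaviour.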

This example assumes the following two data files.

customers.csv stores customer information (customer ID, name, phone):

1,stephaie leung,555-555-5555
2,edward kim,123-456-7890
3,jose madriz,281-330-8004
4,david storkk,408-55-0000

orders.csv stores order information (customer ID, order name, price, date):

3,A,12.95,02-Jun-2008
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,D,25.02,22-Jan-2009

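The required output is shown below. Customer 4 has no orders and therefore does not appear. The job copies the names unchanged from customers.csv, and rows that share a key (the two rows for customer 3) may come out in either order.

1,stephaie leung,555-555-5555,B,88.25,20-May-2008
2,edward kim,123-456-7890,C,32.00,30-Nov-2007
3,jose madriz,281-330-8004,A,12.95,02-Jun-2008
3,jose madriz,281-330-8004,D,25.02,22-Jan-2009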

2. Code

Custom data type: used to tag the records coming from the different files

 package mapreduce.reducejoin;

 import java.io.DataInput;
 import java.io.DataOutput;
 import java.io.IOException;
 import java.util.Objects;

 import org.apache.hadoop.io.Writable;

 // Map output value: one record plus a tag naming the file it came from.
 public class DataJoinWritable implements Writable {

     // source tag: "customer" or "order"
     private String tag;

     // record content without the join key
     private String data;

     // Hadoop needs a no-arg constructor to create instances during deserialization
     public DataJoinWritable() {
     }

     public DataJoinWritable(String tag, String data) {
         this.set(tag, data);
     }

     public void set(String tag, String data) {
         this.setTag(tag);
         this.setData(data);
     }

     public String getTag() {
         return tag;
     }

     public void setTag(String tag) {
         this.tag = tag;
     }

     public String getData() {
         return data;
     }

     public void setData(String data) {
         this.data = data;
     }

     // serialize the fields: tag first, then data
     @Override
     public void write(DataOutput out) throws IOException {
         out.writeUTF(this.getTag());
         out.writeUTF(this.getData());
     }

     // deserialize the fields in the same order write() wrote them
     @Override
     public void readFields(DataInput in) throws IOException {
         this.setTag(in.readUTF());
         this.setData(in.readUTF());
     }

     @Override
     public int hashCode() {
         return Objects.hash(data, tag);
     }

     @Override
     public boolean equals(Object obj) {
         if (this == obj)
             return true;
         if (obj == null || getClass() != obj.getClass())
             return false;
         DataJoinWritable other = (DataJoinWritable) obj;
         return Objects.equals(data, other.data)
                 && Objects.equals(tag, other.tag);
     }

     @Override
     public String toString() {
         return tag + "," + data;
     }

 }
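The pairing of `write` and `readFields` can be checked without a cluster. The sketch below is a stand-in rather than the Hadoop class itself: the hypothetical `TaggedRecordRoundTrip` uses only `java.io` streams and repeats the same `writeUTF`/`readUTF` sequence, so it shows that the fields come back as long as they are read in the order they were written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for DataJoinWritable's serialization, using only java.io.
class TaggedRecordRoundTrip {

    // Writes tag, then data: the same order as DataJoinWritable.write().
    static byte[] write(String tag, String data) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(tag);
            out.writeUTF(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads tag, then data: the same order as DataJoinWritable.readFields().
    static String[] read(byte[] buffer) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer))) {
            String tag = in.readUTF();
            String data = in.readUTF();
            return new String[] { tag, data };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] buffer = write("customer", "jose madriz,281-330-8004");
        String[] fields = read(buffer);
        System.out.println(fields[0] + " -> " + fields[1]);
    }
}
```

`writeUTF` stores a two-byte length prefix, so it rejects strings whose encoded form exceeds 65535 bytes. That is not a concern for the short CSV rows in this example, but it matters if the same pattern is reused for large records.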

The MapReduce job

 package mapreduce.reducejoin;

 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.List;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.NullWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;

 public class DataJoinMapReduce extends Configured implements Tool {

     // step 1: Mapper, tags every record with its source file
     public static class DataJoinMapper extends
             Mapper<LongWritable, Text, LongWritable, DataJoinWritable> {

         // map output key: the customer ID (join key)
         private LongWritable mapOutputKey = new LongWritable();

         // map output value: the tagged record
         private DataJoinWritable mapOutputValue = new DataJoinWritable();

         @Override
         public void map(LongWritable key, Text value, Context context)
                 throws IOException, InterruptedException {
             // split the line into fields
             String lineValue = value.toString();
             String[] vals = lineValue.split(",");
             int length = vals.length;

             // customer rows have 3 fields, order rows have 4; skip anything else
             if ((3 != length) && (4 != length)) {
                 return;
             }

             // the customer ID is the first field in both files
             long cid = Long.parseLong(vals[0]);
             mapOutputKey.set(cid);

             if (3 == length) {
                 // customer row: cid,name,phone
                 String name = vals[1];
                 String phone = vals[2];
                 mapOutputValue.set("customer", name + "," + phone);
             } else {
                 // order row: cid,orderName,price,date
                 String orderName = vals[1];
                 String price = vals[2];
                 String date = vals[3];
                 mapOutputValue.set("order", orderName + "," + price + "," + date);
             }

             // output
             context.write(mapOutputKey, mapOutputValue);
         }

     }

     // step 2: Reducer, performs the actual join
     public static class DataJoinReducer extends
             Reducer<LongWritable, DataJoinWritable, NullWritable, Text> {

         private Text outputValue = new Text();

         @Override
         protected void reduce(LongWritable key,
                 Iterable<DataJoinWritable> values, Context context)
                 throws IOException, InterruptedException {
             String customerInfo = null;
             List<String> orderList = new ArrayList<String>();

             // separate the values of this key by tag
             for (DataJoinWritable value : values) {
                 if ("customer".equals(value.getTag())) {
                     customerInfo = value.getData();
                 } else if ("order".equals(value.getTag())) {
                     orderList.add(value.getData());
                 }
             }

             // inner join: orders without a matching customer are dropped
             // instead of being written with a literal "null"
             if (customerInfo == null) {
                 return;
             }

             // emit one joined row per order
             for (String order : orderList) {
                 // set output value
                 outputValue.set(key.get() + "," + customerInfo + "," + order);
                 context.write(NullWritable.get(), outputValue);
             }
         }

     }

     // step 3: Driver
     /**
      * Execute the command with the given arguments.
      *
      * @param args
      *            command specific arguments: input path, output path.
      * @return exit code.
      * @throws Exception
      */
     public int run(String[] args) throws Exception {
         Configuration configuration = this.getConf();

         // set job
         Job job = Job.getInstance(configuration, this.getClass().getSimpleName());
         job.setJarByClass(DataJoinMapReduce.class);

         // input
         Path inpath = new Path(args[0]);
         FileInputFormat.addInputPath(job, inpath);

         // output
         Path outPath = new Path(args[1]);
         FileOutputFormat.setOutputPath(job, outPath);

         // Mapper
         job.setMapperClass(DataJoinMapper.class);
         job.setMapOutputKeyClass(LongWritable.class);
         job.setMapOutputValueClass(DataJoinWritable.class);

         // Reducer
         job.setReducerClass(DataJoinReducer.class);
         job.setOutputKeyClass(NullWritable.class);
         job.setOutputValueClass(Text.class);

         // submit job -> YARN
         boolean isSuccess = job.waitForCompletion(true);
         return isSuccess ? 0 : 1;
     }

     public static void main(String[] args) throws Exception {
         Configuration configuration = new Configuration();

         // paths are hard-coded for this demo and override the command line
         args = new String[] {
                 "hdfs://beifeng01:8020/user/beifeng01/mapreduce/input/reducejoin",
                 "hdfs://beifeng01:8020/user/beifeng01/mapreduce/output" };

         // run job
         int status = ToolRunner.run(configuration, new DataJoinMapReduce(),
                 args);

         // exit program
         System.exit(status);
     }

 }
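The mapper above decides which file a line came from purely by counting its fields. The following plain-Java sketch isolates that rule so it can be tried on individual lines. The class `RecordClassifier` and its `{cid, tag, payload}` return layout are hypothetical and exist only for this illustration.

```java
// Isolates the mapper's field-count rule; no Hadoop involved.
class RecordClassifier {

    // Returns {cid, tag, payload}, or null when the line fits neither layout.
    static String[] classify(String line) {
        String[] vals = line.split(",");
        if (vals.length == 3) {
            // customer row: cid,name,phone
            return new String[] { vals[0], "customer", vals[1] + "," + vals[2] };
        }
        if (vals.length == 4) {
            // order row: cid,orderName,price,date
            return new String[] { vals[0], "order",
                    vals[1] + "," + vals[2] + "," + vals[3] };
        }
        return null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "3,jose madriz,281-330-8004",
            "3,A,12.95,02-Jun-2008",
            "5,bob,"   // trailing empty field: split(",") yields only 2 fields
        };
        for (String line : lines) {
            String[] r = classify(line);
            System.out.println(r == null ? "skipped: " + line : r[1] + " key=" + r[0]);
        }
    }
}
```

The rule is fragile: `String.split(",")` discards trailing empty strings, so a customer row with an empty phone number is silently skipped, and a comma inside a field shifts the count. When the two inputs cannot be told apart by shape, a common alternative is to tag by input file name, which a mapper can obtain from its input split (for file-based inputs, by casting `context.getInputSplit()` to `FileSplit` and calling `getPath().getName()`).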

Checking the result after running the job

[hadoop@beifeng01 hadoop-2.5.0-cdh5.3.6]$ bin/hdfs dfs -text /user/beifeng01/mapreduce/output/p*
1,stephaie leung,555-555-5555,B,88.25,20-May-2008
2,edward kim,123-456-7890,C,32.00,30-Nov-2007
3,jose madriz,281-330-8004,D,25.02,22-Jan-2009
3,jose madriz,281-330-8004,A,12.95,02-Jun-2008
