[Original] MapReduce Programming Series: Table Join

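Problem Description

The table to be joined is shown below. The left column is child and the right column is parent. The goal is to find every grandchild-grandparent pair, which requires joining the table with itself.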

Tom Lucy

Tom Jim

Lucy David

Lucy Lili

Jim Lilei

Jim SuSan

Lily Green

Lily Bians

Green Well

Green MillShell

Havid James

James LiT

Richard Cheng

Cheng LiHua

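Approach

When writing an MR program, the design has to take the characteristics of MapReduce data processing into account. For example, if we read the input file with the default TextInputFormat, each record arrives with the byte offset of the line as its key and the content of the line as its value (for the first line above, the key is 0 and the value is "Tom Lucy"). During the shuffle, all values emitted under the same key are grouped together (this is the important part!). We exploit that property: the values grouped under one key form one set of relations. Within that set we need a way to tell the younger generation from the older one, and then we process the grouped values one by one to extract the grandchild-grandparent pairs.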
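To make the shape of the mapper input concrete, here is a minimal plain-Java sketch (no Hadoop involved) that pairs each line of a small text with the byte offset at which it starts, which is what TextInputFormat uses as the key. The class name OffsetDemo and the helper toRecords are hypothetical names chosen for this illustration, and the sketch assumes "\n" line endings and UTF-8 text.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class OffsetDemo {
    // Pairs each line with the byte offset at which it starts, mimicking
    // the (key, value) records that TextInputFormat passes to a mapper.
    // Hypothetical helper for illustration only; not part of Hadoop.
    static Map<Long, String> toRecords(String fileContent) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            records.put(offset, line);
            // +1 accounts for the newline terminator (assumes "\n" endings).
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String content = "Tom Lucy\nTom Jim\nLucy David\n";
        for (Map.Entry<Long, String> e : toRecords(content).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

For these three lines the keys are 0, 9 and 17. The first line always has key 0, but later keys depend on line lengths, so they cannot be used as row numbers. The join below ignores the input key entirely.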

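For example, when the record Tom Lucy comes in, we reverse it and emit Lucy Tom. To get the effect of a join we actually emit every record twice, because a join needs two tables: call them L1 (child parent) and L2 (child parent), both copies of the same input. To relate them through the shuffle, we reverse L1 so that parent becomes the key and leave L2 keyed by child. Whenever a parent in L1 equals a child in L2, the two records land in the same group. Finally, so that the two sides can be told apart after grouping, we append 1 to the child field of the reversed L1 record (the future grandchild) and 2 to the parent field of the L2 record (the future grandparent).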
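The tagging scheme can be tried out without a cluster. The following is a minimal in-memory sketch, assuming a TreeMap stands in for the shuffle. It is not how Hadoop implements grouping, and the class name SelfJoinSketch and the method join are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SelfJoinSketch {
    // Simulates map, shuffle and reduce in memory and returns
    // "grandchild<TAB>grandparent" lines. Illustration only.
    static List<String> join(List<String> lines) {
        // Stand-in for the shuffle: tagged values grouped by key.
        Map<String, List<String>> groups = new TreeMap<>();
        for (String line : lines) {
            String[] rela = line.trim().split(" ", 2);
            if (rela.length != 2) {
                continue; // skip malformed lines
            }
            String child = rela[0];
            String parent = rela[1];
            // L1 reversed: key = parent, value = child tagged with 1.
            groups.computeIfAbsent(parent, k -> new ArrayList<>()).add(child + "1");
            // L2 as is: key = child, value = parent tagged with 2.
            groups.computeIfAbsent(child, k -> new ArrayList<>()).add(parent + "2");
        }
        List<String> result = new ArrayList<>();
        for (List<String> values : groups.values()) {
            List<String> grandChildren = new ArrayList<>();
            List<String> grandParents = new ArrayList<>();
            for (String v : values) {
                String name = v.substring(0, v.length() - 1); // strip the tag
                if (v.endsWith("1")) {
                    grandChildren.add(name);
                } else {
                    grandParents.add(name);
                }
            }
            // Cross product of the two sides within one key group.
            for (String gc : grandChildren) {
                for (String gp : grandParents) {
                    result.add(gc + "\t" + gp);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Tom Lucy", "Tom Jim", "Lucy David", "Lucy Lili");
        for (String pair : join(lines)) {
            System.out.println(pair);
        }
    }
}
```

With these four input lines, only the group keyed by Lucy holds both a value tagged 1 (Tom) and values tagged 2 (David, Lili), so the sketch yields the pairs Tom/David and Tom/Lili. Every other group has only one side and contributes nothing.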

package com.test.join;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class STJoin {

    public static class STJoinMapper extends Mapper<Object, Text, Text, Text> {

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line is "child parent", separated by a single space.
            String[] rela = value.toString().trim().split(" ", 2);
            if (rela.length != 2) {
                return; // skip malformed lines
            }
            String child = rela[0];
            String parent = rela[1];
            // L1 reversed: key = parent, value = child tagged with 1 (future grandchild).
            context.write(new Text(parent), new Text(child + "1"));
            // L2 as is: key = child, value = parent tagged with 2 (future grandparent).
            context.write(new Text(child), new Text(parent + "2"));
        }
    }

    public static class STJoinReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> grandParents = new ArrayList<>();
            List<String> grandChildren = new ArrayList<>();
            // Separate the two sides by their trailing tag and strip the tag.
            for (Text value : values) {
                String text = value.toString();
                String name = text.substring(0, text.length() - 1);
                if (text.endsWith("1")) {
                    grandChildren.add(name);
                } else if (text.endsWith("2")) {
                    grandParents.add(name);
                }
            }
            // The key is the middle generation; emit every grandchild-grandparent pair.
            for (String grandParent : grandParents) {
                for (String grandChild : grandChildren) {
                    context.write(new Text(grandChild), new Text(grandParent));
                }
            }
        }
    }

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // This constructor matches Hadoop 1.0.3; Hadoop 2.x deprecates it
        // in favor of Job.getInstance(conf, "STJoin").
        Job job = new Job(conf, "STJoin");
        // Lets Hadoop locate the jar containing these classes when run on a cluster.
        job.setJarByClass(STJoin.class);
        job.setMapperClass(STJoinMapper.class);
        job.setReducerClass(STJoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/user/hadoop/STJoin/joinFile"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/user/hadoop/STJoin/joinResult"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Results

Richard    LiHua

Lily    Well

Lily    MillShell

Havid    LiT

Tom    Lilei

Tom    SuSan

Tom    Lili

Tom    David

The code above was implemented and run on Hadoop 1.0.3.
