Implementing Join Operations with MapReduce

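In a relational database, a join is easy to perform: SQL provides it as a built-in primitive. For massive data sets stored in HDFS, a join can be expressed just as conveniently in HiveQL. HiveQL, however, is itself translated into MapReduce jobs, so this article first looks at how to implement a join by writing MapReduce programs directly.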

1. Reduce-Join: performing the join on the reduce side

Suppose we have a user data file, users.txt, and a login log file, login_logs.txt, with the contents shown below.

User data file users.txt, columns (tab-separated): userid, name:

1    LiXiaolong
2    JetLi
3    Zhangsan
4    Lisi
5    Wangwu

Login log file login_logs.txt, columns (tab-separated): userid, login_time, login_ip:

1    2015-06-07 15:10:18    192.168.137.101
3    2015-06-07 15:12:18    192.168.137.102
3    2015-06-07 15:18:36    192.168.137.102
1    2015-06-07 15:22:38    192.168.137.101
1    2015-06-07 15:26:11    192.168.137.103

Expected result:

1    LiXiaolong    2015-06-07 15:10:18    192.168.137.101
1    LiXiaolong    2015-06-07 15:22:38    192.168.137.101
1    LiXiaolong    2015-06-07 15:26:11    192.168.137.103
3    Zhangsan    2015-06-07 15:12:18    192.168.137.102
3    Zhangsan    2015-06-07 15:18:36    192.168.137.102

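Approach:

 1) In the map phase, use the input file path to tell whether a record comes from users.txt or login_logs.txt. Records from users.txt are emitted as <userid, 'u#'+name>; records from login_logs.txt are emitted as <userid, 'l#'+login_time+'\t'+login_ip>.

 2) In the reduce phase, use the prefix to separate the values of each userid by source table, then output the Cartesian product of the two groups.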
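The two steps above can be tried out in plain Java without a cluster. The following is only a minimal in-memory sketch, not Hadoop code: the class name ReduceJoinSketch and its method names are made up for illustration, and a TreeMap is assumed as a stand-in for the shuffle, which groups values by key and sorts the keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of the tag / group / cross-product idea.
class ReduceJoinSketch {
    static final String DELIMITER = "\t";

    // "Map" step: tag every record with its source ("u#" or "l#") and group
    // the tagged values by userid. The TreeMap plays the role of the shuffle.
    static Map<String, List<String>> mapAndShuffle(List<String> users, List<String> logs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : users) {
            String[] f = line.split(DELIMITER);
            if (f.length < 2) continue; // skip malformed records
            grouped.computeIfAbsent(f[0], k -> new ArrayList<>()).add("u#" + f[1]);
        }
        for (String line : logs) {
            String[] f = line.split(DELIMITER);
            if (f.length < 3) continue; // skip malformed records
            grouped.computeIfAbsent(f[0], k -> new ArrayList<>())
                   .add("l#" + f[1] + DELIMITER + f[2]);
        }
        return grouped;
    }

    // "Reduce" step: split the values of one key by tag, then emit the
    // Cartesian product of the user values and the login values.
    static List<String> reduce(String key, List<String> values) {
        List<String> userValues = new ArrayList<>();
        List<String> logValues = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("u#")) userValues.add(v.substring(2));
            else if (v.startsWith("l#")) logValues.add(v.substring(2));
        }
        List<String> out = new ArrayList<>();
        for (String u : userValues) {
            for (String l : logValues) {
                out.add(key + DELIMITER + u + DELIMITER + l);
            }
        }
        return out;
    }

    // Runs both steps over the whole input and collects the joined rows.
    static List<String> join(List<String> users, List<String> logs) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : mapAndShuffle(users, logs).entrySet()) {
            out.addAll(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList("1\tLiXiaolong", "2\tJetLi", "3\tZhangsan");
        List<String> logs = Arrays.asList(
                "1\t2015-06-07 15:10:18\t192.168.137.101",
                "3\t2015-06-07 15:12:18\t192.168.137.102",
                "1\t2015-06-07 15:22:38\t192.168.137.101");
        for (String row : join(users, logs)) {
            System.out.println(row);
        }
    }
}
```

A userid that appears in only one of the two inputs produces an empty product and therefore no output row, which is why this behaves as an inner join.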

Implementation:

package com.hicoor.hadoop.mapreduce;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedList;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceJoinDemo {

    public static final String DELIMITER = "\t"; // field delimiter

    static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Determine which input file the current split belongs to
            FileSplit split = (FileSplit) context.getInputSplit();
            String fileName = split.getPath().getName();
            // Get the record as a string
            String line = value.toString();
            // Discard empty records
            if (line.trim().isEmpty()) return;

            String[] values = line.split(DELIMITER);
            if (fileName.equals("users.txt")) {
                // Record from users.txt: emit <userid, "u#" + name>
                if (values.length < 2) return;
                context.write(new Text(values[0]), new Text("u#" + values[1]));
            } else if (fileName.equals("login_logs.txt")) {
                // Record from login_logs.txt: emit <userid, "l#" + login_time + "\t" + login_ip>
                if (values.length < 3) return;
                context.write(new Text(values[0]), new Text("l#" + values[1] + DELIMITER + values[2]));
            }
        }
    }

    static class MyReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // All values of one key are buffered in memory before the product is emitted
            LinkedList<String> linkU = new LinkedList<String>(); // values from users
            LinkedList<String> linkL = new LinkedList<String>(); // values from login_logs

            for (Text tval : values) {
                String val = tval.toString();
                if (val.startsWith("u#")) {
                    linkU.add(val.substring(2));
                } else if (val.startsWith("l#")) {
                    linkL.add(val.substring(2));
                }
            }

            // Cartesian product of the two groups
            for (String u : linkU) {
                for (String l : linkL) {
                    context.write(key, new Text(u + DELIMITER + l));
                }
            }
        }
    }

    private final static String FILE_IN_PATH = "hdfs://cluster1/join/in";
    private final static String FILE_OUT_PATH = "hdfs://cluster1/join/out";

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
        // Local Hadoop directory, needed when submitting from a Windows development machine
        System.setProperty("hadoop.home.dir", "D:\\desktop\\hadoop-2.6.0");
        Configuration conf = getHAConfiguration();

        // Delete the output directory if it already exists
        FileSystem fileSystem = FileSystem.get(new URI(FILE_OUT_PATH), conf);
        if (fileSystem.exists(new Path(FILE_OUT_PATH))) {
            fileSystem.delete(new Path(FILE_OUT_PATH), true);
        }

        Job job = Job.getInstance(conf, "Reduce Join Demo");
        job.setJarByClass(ReduceJoinDemo.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(FILE_IN_PATH));
        FileOutputFormat.setOutputPath(job, new Path(FILE_OUT_PATH));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    private static Configuration getHAConfiguration() {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "cluster1");
        conf.set("dfs.ha.namenodes.cluster1", "hadoop1,hadoop2");
        conf.set("dfs.namenode.rpc-address.cluster1.hadoop1", "172.19.7.31:9000");
        conf.set("dfs.namenode.rpc-address.cluster1.hadoop2", "172.19.7.32:9000");
        // Required: this class is used to locate the namenode that is currently active
        conf.set("dfs.client.failover.proxy.provider.cluster1", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return conf;
    }
}

2. Map-Join: performing the join on the map side

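When one of the two tables being joined is small enough to fit comfortably in the memory of each node, DistributedCache can be used to place the small table in the distributed cache. The MapReduce framework then distributes the cached data to every node that runs a map task, and each map task reads the locally cached file directly during its computation. Completing the join on the map side reduces the volume of data sent over the network to the reduce side, which helps the whole job run faster.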

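Approach:

Assume the users.txt user table is small. Add users.txt to the DistributedCache; then, in the map computation, read the locally cached users.txt and translate the userid of each login_logs.txt record into a user name. This example needs no reduce phase.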
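The lookup at the heart of this approach can be sketched in plain Java without Hadoop. This is only an illustration under stated assumptions: the class name MapJoinSketch and the methods loadUsers and mapRecord are hypothetical, and the small table is passed in as a list of lines rather than read from a cached file.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration of a map-side (hash) join.
class MapJoinSketch {
    static final String DELIMITER = "\t";

    // Corresponds to the mapper's setup(): load the small table into a hash map.
    static Map<String, String> loadUsers(List<String> userLines) {
        Map<String, String> users = new HashMap<>();
        for (String line : userLines) {
            String[] f = line.split(DELIMITER);
            if (f.length < 2) continue; // skip malformed records
            users.put(f[0], f[1]);
        }
        return users;
    }

    // Corresponds to map(): look up the userid of one login record.
    // Returns the joined row, or null when the record is malformed or has no match.
    static String mapRecord(Map<String, String> users, String logLine) {
        String[] f = logLine.split(DELIMITER);
        if (f.length < 3) return null;
        String name = users.get(f[0]);
        if (name == null) return null; // inner join: drop unmatched records
        return f[0] + DELIMITER + name + DELIMITER + f[1] + DELIMITER + f[2];
    }

    public static void main(String[] args) {
        Map<String, String> users = loadUsers(Arrays.asList("1\tLiXiaolong", "3\tZhangsan"));
        List<String> logs = Arrays.asList(
                "1\t2015-06-07 15:10:18\t192.168.137.101",
                "9\t2015-06-07 15:11:00\t192.168.137.109");
        for (String log : logs) {
            String row = mapRecord(users, log);
            if (row != null) System.out.println(row);
        }
    }
}
```

Because every record is resolved with a single hash lookup, nothing has to be grouped by key, which is why no shuffle or reduce phase is needed.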

Implementation:

package com.hicoor.hadoop.mapreduce;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistributedCacheDemo {

    public static final String DELIMITER = "\t"; // field delimiter

    static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        // userid -> name, loaded once per map task from the cached users.txt
        private final Map<String, String> userMaps = new HashMap<String, String>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // The local path of the cached file can also be obtained through localCacheFiles:
            //Configuration conf = context.getConfiguration();
            //Path[] localCacheFiles = DistributedCache.getLocalCacheFiles(conf);
            // Here the file is opened through its symlink, users.txt
            try (BufferedReader br = new BufferedReader(new FileReader("users.txt"))) {
                String line;
                while ((line = br.readLine()) != null) {
                    // Load the cached data on the map side
                    String[] splits = line.split(DELIMITER);
                    if (splits.length < 2) continue;
                    userMaps.put(splits[0], splits[1]);
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Get the record as a string
            String line = value.toString();
            // Discard empty records
            if (line.trim().isEmpty()) return;

            String[] values = line.split(DELIMITER);
            if (values.length < 3) return;

            String name = userMaps.get(values[0]);
            // Inner join: skip login records whose userid is not in users.txt
            if (name == null) return;

            Text t_key = new Text(values[0]);
            Text t_value = new Text(name + DELIMITER + values[1] + DELIMITER + values[2]);
            context.write(t_key, t_value);
        }
    }

    private final static String FILE_IN_PATH = "hdfs://cluster1/join/in/login_logs.txt";
    private final static String FILE_OUT_PATH = "hdfs://cluster1/join/out";

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
        // Local Hadoop directory, needed when submitting from a Windows development machine
        System.setProperty("hadoop.home.dir", "D:\\desktop\\hadoop-2.6.0");
        Configuration conf = getHAConfiguration();

        // Delete the output directory if it already exists
        FileSystem fileSystem = FileSystem.get(new URI(FILE_OUT_PATH), conf);
        if (fileSystem.exists(new Path(FILE_OUT_PATH))) {
            fileSystem.delete(new Path(FILE_OUT_PATH), true);
        }

        // Add the distributed cache file. The "#users.txt" fragment creates a symlink, so
        // map or reduce tasks can open the cached file simply as users.txt.
        // DistributedCache is deprecated in Hadoop 2.x; job.addCacheFile(URI) is the newer equivalent.
        DistributedCache.addCacheFile(new URI("hdfs://cluster1/join/in/users.txt#users.txt"), conf);

        Job job = Job.getInstance(conf, "Map Distributed Cache Demo");
        job.setJarByClass(DistributedCacheDemo.class);
        job.setMapperClass(MyMapper.class);
        // Map-only job: the join is completed in the mapper, so no reduce tasks are needed
        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(FILE_IN_PATH));
        FileOutputFormat.setOutputPath(job, new Path(FILE_OUT_PATH));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    private static Configuration getHAConfiguration() {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "cluster1");
        conf.set("dfs.ha.namenodes.cluster1", "hadoop1,hadoop2");
        conf.set("dfs.namenode.rpc-address.cluster1.hadoop1", "172.19.7.31:9000");
        conf.set("dfs.namenode.rpc-address.cluster1.hadoop2", "172.19.7.32:9000");
        // Required: this class is used to locate the namenode that is currently active
        conf.set("dfs.client.failover.proxy.provider.cluster1", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return conf;
    }
}

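3. Using HiveQL to perform the join

With HiveQL the task is straightforward: write a table join statement, and Hive automatically generates and optimizes the MapReduce program that executes the query.

Steps:

 1) Under /join/in/, create a users directory and a login_logs directory, and move users.txt and login_logs.txt into the corresponding directories;

 2) Create the users external table: create external table users(userid int, name string) row format delimited fields terminated by '\t' location '/join/in/users';

 3) Create the login_logs external table (userid is declared as int here so that its type matches users.userid): create external table login_logs(userid int,login_time string,login_ip string) row format delimited fields terminated by '\t' location '/join/in/login_logs';

 4) Run the join query and save the result: create table user_login_logs as select A.*,B.login_time,B.login_ip from users A,login_logs B where A.userid=B.userid;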

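4. Summary

In practice we usually let Hive perform joins for us; map-join and reduce-join are for complex or special requirements. There is also a third approach, SemiJoin, which sits between map-join and reduce-join: records that cannot take part in the join are filtered out on the map side, so only records that participate in the join are sent over the network. This reduces shuffle traffic and improves overall efficiency.

Execution efficiency, in general: map-join > SemiJoin > reduce-join.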
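The SemiJoin filtering step can be illustrated with a small plain-Java sketch. It rests on assumptions the article does not spell out: the set of distinct join keys from login_logs.txt is assumed to fit in memory (in a real job it would be computed first and shipped to the mappers, for example through the distributed cache), and the names SemiJoinSketch, joinKeys and filterUsers are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory illustration of the map-side filter used by a semi-join.
class SemiJoinSketch {
    static final String DELIMITER = "\t";

    // Step 1: collect the distinct join keys (userid) that occur in the log table.
    static Set<String> joinKeys(List<String> logLines) {
        Set<String> keys = new HashSet<>();
        for (String line : logLines) {
            String[] f = line.split(DELIMITER);
            if (f.length >= 1 && !f[0].isEmpty()) keys.add(f[0]);
        }
        return keys;
    }

    // Step 2: on the map side, keep only the user records whose key can match.
    // Only these survivors would be shuffled to the reducers for the actual join.
    static List<String> filterUsers(List<String> userLines, Set<String> keys) {
        List<String> kept = new ArrayList<>();
        for (String line : userLines) {
            String[] f = line.split(DELIMITER);
            if (f.length >= 2 && keys.contains(f[0])) kept.add(line);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList(
                "1\tLiXiaolong", "2\tJetLi", "3\tZhangsan", "4\tLisi", "5\tWangwu");
        List<String> logs = Arrays.asList(
                "1\t2015-06-07 15:10:18\t192.168.137.101",
                "3\t2015-06-07 15:12:18\t192.168.137.102");
        List<String> kept = filterUsers(users, joinKeys(logs));
        System.out.println("user records kept for the shuffle: " + kept.size() + " of " + users.size());
    }
}
```

With the sample data, only users 1 and 3 have login records, so the other three user records never enter the shuffle; the surviving records are then joined on the reduce side as in section 1.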

Reference: http://database.51cto.com/art/201410/454277.htm
