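Hello everyone, and happy Mid-Autumn Festival! It has been a while since my last post. Today I want to share how to perform a join with MapReduce.

In offline computing we rarely work on a single file. More often we need to relate two or more files to derive richer data, much like a join in SQL.

This post walks through how to implement such a join in MapReduce.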

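Requirements

We have two tables: a product table and an order table. The order table stores only the product ID, so to look up an order together with its product details the two tables have to be joined.

Implementation

A well-known property of MapReduce is that, with the default partitioner, all key/value pairs that share a key reach the same reduce task and are handed to a single reduce() call.

This property makes a join easy to implement, as the example below shows.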

Product table

ID	brand	model
p0001	苹果	iphone11 pro max
p0002	华为	p30
p0003	小米	mate10

Order table

id	name	address	productID	num
00001	kris	深圳市福田区	p0001	1
00002	pony	深圳市南山区	p0001	2
00003	jack	深圳市坂田区	p0001	3

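Suppose the data volume is huge and both tables are stored as files in HDFS. We need a MapReduce program that performs the equivalent of the following SQL query:

select a.id,a.name,a.address,a.num,b.brand,b.model from t_orders a join t_products b on a.productID=b.ID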

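MapReduce approach

Use the join condition (productID) as the map output key. Each record from either table is emitted with a tag saying which file it came from, so that all records satisfying the join condition reach the same reduce task, where they are stitched together.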
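The idea can be tried out without a cluster. What follows is a minimal plain-Java sketch, not Hadoop code: the class name `ReduceSideJoinSketch`, the `Tagged` helper, the in-memory `TreeMap` standing in for the shuffle, and the ASCII sample rows are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSideJoinSketch {

    /** One input line plus a tag naming the table it came from (hypothetical helper). */
    static final class Tagged {
        final String source;   // "product" or "order"
        final String[] fields; // tab-separated columns of the line

        Tagged(String source, String line) {
            this.source = source;
            this.fields = line.split("\t");
        }
    }

    /** "Map" phase: key every record by productID; the TreeMap plays the role of the shuffle. */
    static Map<String, List<Tagged>> shuffle(List<String> products, List<String> orders) {
        Map<String, List<Tagged>> groups = new TreeMap<>();
        for (String line : products) {
            Tagged t = new Tagged("product", line);
            groups.computeIfAbsent(t.fields[0], k -> new ArrayList<>()).add(t);
        }
        for (String line : orders) {
            Tagged t = new Tagged("order", line);
            groups.computeIfAbsent(t.fields[3], k -> new ArrayList<>()).add(t);
        }
        return groups;
    }

    /** "Reduce" phase: within each group, attach the product columns to every order. */
    public static List<String> join(List<String> products, List<String> orders) {
        List<String> out = new ArrayList<>();
        for (List<Tagged> group : shuffle(products, orders).values()) {
            Tagged product = null;
            List<Tagged> groupOrders = new ArrayList<>();
            for (Tagged t : group) {
                if (t.source.equals("product")) {
                    product = t;
                } else {
                    groupOrders.add(t);
                }
            }
            if (product == null) {
                continue; // inner join: orders without a matching product are dropped
            }
            for (Tagged o : groupOrders) {
                out.add(String.join("\t", o.fields[0], o.fields[1], o.fields[2], o.fields[4],
                        product.fields[1], product.fields[2]));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> products = Arrays.asList(
                "p0001\tApple\tiphone11 pro max", "p0002\tHuawei\tp30");
        List<String> orders = Arrays.asList(
                "00001\tkris\tFutian\tp0001\t1", "00002\tpony\tNanshan\tp0001\t2");
        for (String row : join(products, orders)) {
            System.out.println(row);
        }
    }
}
```

In the real job the grouping is done by the framework between map and reduce; only the tagging and the per-group merge have to be written by hand.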

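Approach 1 - reduce-side join

Define a Bean

public class RJoinInfo implements Writable {
    private String customerName = "";
    private String customerAddr = "";
    private String orderID = "";
    private int orderNum;
    private String productID = "";
    private String productBrand = "";
    private String productModel = "";
    // 0 = product record, 1 = order record
    private int flag;

    // Getters and setters for every field are omitted here, as are the
    // write(DataOutput) and readFields(DataInput) methods required by Writable.
}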
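Because the bean implements `Writable`, it also needs `write(DataOutput)` and `readFields(DataInput)`, and the fields must be read back in exactly the order they were written. The sketch below shows that round trip using only the JDK. `OrderRecord` is a hypothetical two-field stand-in for the bean, and it deliberately does not implement Hadoop's `Writable` so that it runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WritableRoundTripSketch {

    /** Hypothetical stand-in for the join bean; mirrors the two Writable methods. */
    static final class OrderRecord {
        String productID = "";
        int orderNum;

        void write(DataOutput out) throws IOException {
            out.writeUTF(productID);
            out.writeInt(orderNum);
        }

        void readFields(DataInput in) throws IOException {
            productID = in.readUTF(); // same order as in write()
            orderNum = in.readInt();
        }
    }

    /** Serializes a record to bytes and reads it back into a new instance. */
    static OrderRecord roundTrip(OrderRecord original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                original.write(out);
            }
            OrderRecord copy = new OrderRecord();
            try (DataInputStream in =
                         new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        OrderRecord record = new OrderRecord();
        record.productID = "p0001";
        record.orderNum = 3;
        OrderRecord copy = roundTrip(record);
        System.out.println(copy.productID + " " + copy.orderNum);
    }
}
```

The full bean would follow the same pattern with all eight fields. Initializing the String fields to "" matters here, since `writeUTF` cannot take null.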

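Write the Mapper

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
// log4j 1.x, as bundled with Hadoop 2.x, is assumed; with Log4j 2 import org.apache.logging.log4j.* instead
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;

public class RJoinMapper extends Mapper<LongWritable, Text, Text, RJoinInfo> {
    private static final Logger logger = LogManager.getLogger(RJoinMapper.class);
    private final Text k = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Many input sources are supported (databases and so on). The input here is
        // plain files, so the split can be cast directly to a FileSplit.
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        // The file name tells us which table this line belongs to.
        String name = fileSplit.getPath().getName();
        logger.info("splitPathName:" + name);
        String line = value.toString();
        String[] split = line.split("\t");
        // A fresh bean per record, so no field can leak over from the previous line.
        RJoinInfo rJoinInfo = new RJoinInfo();
        String productID;
        if (name.contains("product")) {
            productID = split[0];
            String productBrand = split[1];
            String productModel = split[2];
            rJoinInfo.setProductID(productID);
            rJoinInfo.setProductBrand(productBrand);
            rJoinInfo.setProductModel(productModel);
            rJoinInfo.setFlag(0);
        } else if (name.contains("orders")) {
            String orderID = split[0];
            String customerName = split[1];
            String customerAddr = split[2];
            productID = split[3];
            String orderNum = split[4];
            rJoinInfo.setProductID(productID);
            rJoinInfo.setCustomerName(customerName);
            rJoinInfo.setCustomerAddr(customerAddr);
            rJoinInfo.setOrderID(orderID);
            rJoinInfo.setOrderNum(Integer.parseInt(orderNum));
            rJoinInfo.setFlag(1);
        } else {
            // Neither of the two tables: emit nothing rather than a record with an empty key.
            return;
        }
        k.set(productID);
        context.write(k, rJoinInfo);
    }
}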

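Code explanation: the mapper uses the split's file name to tell whether a line comes from the products file or the orders file, extracts the corresponding fields, and in both cases sends the record to the reduce side keyed by productID.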

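Write the Reducer

import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// log4j 1.x, as bundled with Hadoop 2.x, is assumed; with Log4j 2 import org.apache.logging.log4j.* instead
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;

public class RJoinReducer extends Reducer<Text, RJoinInfo, RJoinInfo, NullWritable> {
    private static final Logger logger = LogManager.getLogger(RJoinReducer.class);

    @Override
    protected void reduce(Text key, Iterable<RJoinInfo> values, Context context) throws IOException, InterruptedException {
        List<RJoinInfo> orders = new ArrayList<>();
        String productID = key.toString();
        logger.info("productID:" + productID);
        RJoinInfo rJoinInfo = new RJoinInfo();
        boolean productFound = false;
        for (RJoinInfo value : values) {
            int flag = value.getFlag();
            if (flag == 0) {
                // Product record: copy it, because the framework reuses the value object.
                try {
                    BeanUtils.copyProperties(rJoinInfo, value);
                    productFound = true;
                } catch (IllegalAccessException | InvocationTargetException e) {
                    logger.error(e.getMessage());
                }
            } else {
                // Order record: copy it into a new bean before keeping a reference.
                RJoinInfo orderInfo = new RJoinInfo();
                try {
                    BeanUtils.copyProperties(orderInfo, value);
                } catch (IllegalAccessException | InvocationTargetException e) {
                    logger.error(e.getMessage());
                }
                orders.add(orderInfo);
            }
        }
        if (!productFound) {
            // Inner join: orders whose product is missing produce no output.
            return;
        }
        for (RJoinInfo order : orders) {
            rJoinInfo.setOrderNum(order.getOrderNum());
            rJoinInfo.setOrderID(order.getOrderID());
            rJoinInfo.setCustomerName(order.getCustomerName());
            rJoinInfo.setCustomerAddr(order.getCustomerAddr());
            // Only the key needs to be written; NullWritable serves as the value.
            context.write(rJoinInfo, NullWritable.get());
        }
    }
}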

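Code explanation: records are grouped by productID and sent to the reduce side. Each group the reducer receives contains one product object and any number of order objects.

We iterate over the objects and use flag to tell products from orders: the product object is saved, and each order object is added to a list. Once every object has been sorted into one of the two kinds, we iterate over the order list, combine each order with the product information, and write it out.

Note: this is not the most efficient implementation; the main goal is to illustrate the idea behind the join.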
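One detail the reducer depends on: Hadoop reuses a single value object while iterating over `values`, refilling it for each record, which is why every order is copied into a new `RJoinInfo` before it is added to the list. The plain-Java sketch below imitates that behavior. `ReusingIterable` and `Holder` are hypothetical stand-ins written for this illustration, not Hadoop classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ValueReuseSketch {

    /** Hypothetical mutable value, playing the role of the Writable bean. */
    static final class Holder {
        String orderID;
    }

    /** Iterates over IDs but hands out the SAME Holder every time, refilled per step. */
    static final class ReusingIterable implements Iterable<Holder> {
        private final List<String> ids;
        private final Holder shared = new Holder();

        ReusingIterable(List<String> ids) {
            this.ids = ids;
        }

        @Override
        public Iterator<Holder> iterator() {
            final Iterator<String> it = ids.iterator();
            return new Iterator<Holder>() {
                @Override
                public boolean hasNext() {
                    return it.hasNext();
                }

                @Override
                public Holder next() {
                    shared.orderID = it.next();
                    return shared;
                }
            };
        }
    }

    /** Buggy: stores references, so every element ends up showing the last record. */
    public static List<String> collectWithoutCopy(List<String> ids) {
        List<Holder> kept = new ArrayList<>();
        for (Holder h : new ReusingIterable(ids)) {
            kept.add(h);
        }
        return toIds(kept);
    }

    /** Correct: copies each value before keeping it. */
    public static List<String> collectWithCopy(List<String> ids) {
        List<Holder> kept = new ArrayList<>();
        for (Holder h : new ReusingIterable(ids)) {
            Holder copy = new Holder();
            copy.orderID = h.orderID;
            kept.add(copy);
        }
        return toIds(kept);
    }

    private static List<String> toIds(List<Holder> holders) {
        List<String> ids = new ArrayList<>();
        for (Holder h : holders) {
            ids.add(h.orderID);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("00001", "00002", "00003");
        System.out.println(collectWithoutCopy(ids)); // prints [00003, 00003, 00003]
        System.out.println(collectWithCopy(ids));    // prints [00001, 00002, 00003]
    }
}
```

Buffering every order in a list is also what makes this reducer memory-hungry for hot keys, which ties in with the efficiency note above.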

Write the Driver

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RJoinDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // Settings for running on a cluster:
//        conf.set("mapreduce.framework.name","yarn");
//        conf.set("yarn.resourcemanager.hostname","server1");
//        conf.set("fs.defaultFS","hdfs://server1:9000");
        // Settings for running locally:
        conf.set("mapreduce.framework.name", "local");
        conf.set("fs.defaultFS", "file:///");
        Job job = Job.getInstance(conf);
        // When running locally the jar path does not need to be set,
        // because the jar is not copied anywhere.
        job.setJarByClass(RJoinDriver.class);
//        job.setJar("/Users/kris/IdeaProjects/bigdatahdfs/target/rjoin.jar");
        job.setMapperClass(RJoinMapper.class);
        job.setReducerClass(RJoinReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(RJoinInfo.class);
        job.setOutputKeyClass(RJoinInfo.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.setInputPaths(job, new Path("/Users/kris/Downloads/rjoin/input"));
        FileOutputFormat.setOutputPath(job, new Path("/Users/kris/Downloads/rjoin/output"));
        boolean waitForCompletion = job.waitForCompletion(true);
        System.out.println(waitForCompletion);
    }
}

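The drawback of this approach is that the join happens in the reduce phase. The reduce side carries most of the processing load while the map nodes do very little work, so resources are poorly used, and the reduce phase is very prone to data skew.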
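The skew is easy to see with the sample data, where every order refers to p0001. The sketch below counts how many records land in each reduce partition using the arithmetic of Hadoop's default `HashPartitioner`, `(hashCode & Integer.MAX_VALUE) % numReduceTasks`. It uses `String.hashCode()` as a stand-in for the hash of a `Text` key, which is an assumption: the partition numbers on a real cluster will differ, but the imbalance is the same.

```java
import java.util.Arrays;
import java.util.List;

public class SkewSketch {

    /** Same arithmetic as the default hash partitioner, applied to a String key. */
    public static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Counts how many keys fall into each of numReduceTasks partitions. */
    public static int[] load(List<String> keys, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (String key : keys) {
            counts[partitionFor(key, numReduceTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Join keys emitted by the mapper: three product rows plus three orders for p0001.
        List<String> keys = Arrays.asList("p0001", "p0002", "p0003", "p0001", "p0001", "p0001");
        System.out.println(Arrays.toString(load(keys, 3)));
    }
}
```

Whichever partition receives p0001 handles at least four of the six records. With millions of orders for one popular product, a single reduce task would end up doing most of the work.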

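Approach 2 - map-side join

This approach suits joins in which one of the tables is small:

the small table can be distributed to every map node, so each map task joins the big-table records it reads against its local copy and emits the result directly. This greatly increases the parallelism of the join and speeds up processing.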
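Stripped of the Hadoop plumbing, a map-side join is a hash lookup: load the small table into a map once, then stream the big table past it. The following is a minimal plain-Java sketch under that reading. The class name `MapSideJoinSketch`, its method names, and the ASCII sample rows are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapSideJoinSketch {

    /** Loads the small table into memory, keyed by product ID (column 0). */
    public static Map<String, String[]> loadProducts(List<String> productLines) {
        Map<String, String[]> products = new HashMap<>();
        for (String line : productLines) {
            String[] fields = line.split("\t");
            products.put(fields[0], fields);
        }
        return products;
    }

    /** Joins one order line against the in-memory table; returns null when nothing matches. */
    public static String joinOrder(String orderLine, Map<String, String[]> products) {
        String[] o = orderLine.split("\t");
        String[] p = products.get(o[3]); // column 3 of an order is the product ID
        if (p == null) {
            return null; // inner join: unmatched orders are dropped
        }
        return String.join("\t", o[0], o[1], o[2], o[4], p[1], p[2]);
    }

    public static void main(String[] args) {
        Map<String, String[]> products = loadProducts(Arrays.asList(
                "p0001\tApple\tiphone11 pro max", "p0002\tHuawei\tp30"));
        List<String> orders = Arrays.asList(
                "00001\tkris\tFutian\tp0001\t1", "00002\tpony\tNanshan\tp0002\t2");
        for (String order : orders) {
            String joined = joinOrder(order, products);
            if (joined != null) {
                System.out.println(joined);
            }
        }
    }
}
```

No shuffle and no reducer are involved, which is where the speed-up comes from. The price is that the small table must fit in the memory of every map task.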

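Write the Mapper

On the mapper side we can either load the data directly in one go, or use DistributedCache to copy the file to every node that runs a map task and load it from there.

Here we use the second option and join against the small table inside the mapper class.

// These imports belong at the top of the enclosing driver file.
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

static class RjoinMapper extends Mapper<LongWritable, Text, RJoinInfo, NullWritable> {
        // One lookup table per mapper instance: productID -> product fields.
        private final Map<String, RJoinInfo> productMap = new HashMap<>();

        // setup() runs once before the framework starts calling map(),
        // so the small table can be loaded here.
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Print the cache file URIs registered by the driver; for testing and verification only.
            URI[] cacheFiles = context.getCacheFiles();
            System.out.println(Arrays.toString(cacheFiles));
            // A bare file name is resolved against the task's working directory,
            // which is where job.addCacheFile() in the driver places the file.
            try (BufferedReader bufferedReader = new BufferedReader(
                    new InputStreamReader(new FileInputStream("products.txt"), StandardCharsets.UTF_8))) {
                String line;
                while ((line = bufferedReader.readLine()) != null) {
                    String[] split = line.split("\t");
                    String productID = split[0];
                    String productBrand = split[1];
                    String productModel = split[2];
                    RJoinInfo product = new RJoinInfo();
                    product.setProductID(productID);
                    product.setProductBrand(productBrand);
                    product.setProductModel(productModel);
                    product.setFlag(0);
                    productMap.put(productID, product);
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            FileSplit fileSplit = (FileSplit) context.getInputSplit();
            String name = fileSplit.getPath().getName();
            // Only lines from the orders file are joined; anything else in the input directory is skipped.
            if (name.contains("orders")) {
                String line = value.toString();
                String[] split = line.split("\t");
                String orderID = split[0];
                String customerName = split[1];
                String customerAddr = split[2];
                String productID = split[3];
                String orderNum = split[4];
                RJoinInfo product = productMap.get(productID);
                if (product == null) {
                    // No matching product: an inner join drops the order.
                    return;
                }
                // Build a new output bean instead of mutating the cached product entry.
                RJoinInfo rJoinInfo = new RJoinInfo();
                rJoinInfo.setProductID(productID);
                rJoinInfo.setProductBrand(product.getProductBrand());
                rJoinInfo.setProductModel(product.getProductModel());
                rJoinInfo.setCustomerName(customerName);
                rJoinInfo.setCustomerAddr(customerAddr);
                rJoinInfo.setOrderID(orderID);
                rJoinInfo.setOrderNum(Integer.parseInt(orderNum));
                rJoinInfo.setFlag(1);
                context.write(rJoinInfo, NullWritable.get());
            }
        }
    }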

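Code explanation: this mapper also overrides setup(), which runs before any call to map(), so the data can be loaded there ahead of time.

The code above opens products.txt simply by name. How the file ends up on the node running the map task is determined by the driver below.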

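Write the Driver

// Imports for the top of the driver file.
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// main() of RJoinDemoInMapDriver, the class that also nests RjoinMapper.
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local");
        conf.set("fs.defaultFS", "file:///");
        Job job = Job.getInstance(conf);
        job.setJarByClass(RJoinDemoInMapDriver.class);
        job.setMapperClass(RjoinMapper.class);
        job.setOutputKeyClass(RJoinInfo.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.setInputPaths(job, new Path("/Users/kris/Downloads/rjoin/input"));
        FileOutputFormat.setOutputPath(job, new Path("/Users/kris/Downloads/rjoin/output2"));
        // Ways to ship a file to the nodes that run the tasks:
//        job.addFileToClassPath();    adds a plain file to the classpath of the task node
//        job.addArchiveToClassPath(); adds a jar to the classpath of the task node
//        job.addCacheArchive();       caches an archive in the working directory of the task node
//        job.addCacheFile();          caches a plain file in the working directory of the task node
        // This call is what lets the mapper's setup() open products.txt by name.
        job.addCacheFile(new URI("/Users/kris/Downloads/rjoin/products.txt"));
        // A map-side join needs no reduce phase, so set the number of reduce tasks to 0.
        job.setNumReduceTasks(0);
        boolean waitForCompletion = job.waitForCompletion(true);
        System.out.println(waitForCompletion);
    }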

Code explanation: in the driver above, job.addCacheFile() is given the URI of a local file. At run time MapReduce copies that file into the working directory of each map task.
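The mapper can only open the file by a bare name if it knows what that name will be. As documented for Hadoop's distributed cache, the link created in the working directory is named after the URI fragment when one is given (`...#alias`), and after the file name otherwise. The helper below is a small sketch of that naming rule using only `java.net.URI`. `CacheLinkNameSketch`, `localName`, and the `/data/products-2019.txt` path are hypothetical, and the exact behavior may vary between Hadoop versions and run modes.

```java
import java.net.URI;

public class CacheLinkNameSketch {

    /** Name a task would open: the URI fragment if present, otherwise the file name. */
    public static String localName(URI uri) {
        if (uri.getFragment() != null) {
            return uri.getFragment();
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // The URI used by the driver: no fragment, so the file keeps its own name.
        System.out.println(localName(URI.create("/Users/kris/Downloads/rjoin/products.txt")));
        // Hypothetical path with a fragment that renames the link to products.txt.
        System.out.println(localName(URI.create("/data/products-2019.txt#products.txt")));
    }
}
```

Both calls print products.txt, so in either case `new FileInputStream("products.txt")` in setup() would look for the right name.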

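That's it for today. This post was heavy on code, but the main point was to share the ideas behind implementing a join with MapReduce. In the next post I will cover the approach and code for computing common friends.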

