Multiple Join Implementations in Hadoop MapReduce: A Worked Analysis

Reposted from: http://zengzhaozheng.blog.51cto.com/8219051/1392961

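1. Join on the reduce side.

Joining on the reduce side is the most common pattern for joining tables in the MapReduce framework. It works as follows:

Map side: tag each key/value pair coming from the different tables (files) so that records from different sources can be told apart. Then emit the join field as the key, and the remaining fields plus the new tag as the value.

Reduce side: by the time reduce runs, the records have already been grouped by the join field. Within each group we only need to separate the records by source file (using the tag added in the map phase) and then compute the Cartesian product of the two sides. The principle is simple; the Hadoop example in this section shows it in full.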
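Before the Hadoop version, the tag, group, and cross steps described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class `ReduceJoinSketch`, its `Tagged` type, and the helper names are hypothetical, a `TreeMap` stands in for the shuffle's grouping, and the city names are romanized sample values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the reduce-side join steps; no Hadoop classes are used.
class ReduceJoinSketch {

    // A tagged record: flag "0" = left table, "1" = right table (same convention as the article).
    static final class Tagged {
        final String flag;
        final String rest;
        Tagged(String flag, String rest) { this.flag = flag; this.rest = rest; }
    }

    // "Map" step: tag the record and file it under its join key (the map stands in for the shuffle).
    static void tag(Map<String, List<Tagged>> groups, String joinKey, String flag, String rest) {
        groups.computeIfAbsent(joinKey, k -> new ArrayList<>()).add(new Tagged(flag, rest));
    }

    // "Reduce" step: within each group, split by flag and emit the Cartesian product.
    static List<String> join(Map<String, List<Tagged>> groups) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> e : groups.entrySet()) {
            List<String> left = new ArrayList<>();
            List<String> right = new ArrayList<>();
            for (Tagged t : e.getValue()) {
                if ("0".equals(t.flag)) {
                    left.add(t.rest);
                } else {
                    right.add(t.rest);
                }
            }
            for (String l : left) {
                for (String r : right) {
                    out.add(e.getKey() + "\t" + l + "\t" + r);
                }
            }
        }
        return out;
    }

    // Small demo: three cities, three user records; city 5 has no users and produces no output.
    static List<String> demo() {
        Map<String, List<Tagged>> groups = new TreeMap<>();
        tag(groups, "1", "0", "Changchun");
        tag(groups, "2", "0", "Jilin");
        tag(groups, "5", "0", "Tonghua");
        tag(groups, "1", "1", "1\t2G\t123");
        tag(groups, "1", "1", "3\t3G\t555");
        tag(groups, "2", "1", "2\t3G\t333");
        return join(groups);
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Because the nested loops run only when both lists are non-empty, a key that appears on one side only contributes nothing to the output.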

(1) Define a custom value type:

package com.mr.reduceSizeJoin;

import java.io.DataInput;

import java.io.DataOutput;

import java.io.IOException;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.io.WritableComparable;

public class CombineValues implements WritableComparable<CombineValues>{

    //private static final Logger logger = LoggerFactory.getLogger(CombineValues.class);

    private Text joinKey;// join key

    private Text flag;// tag identifying the source file

    private Text secondPart;// all fields other than the join key

    public void setJoinKey(Text joinKey) {

        this.joinKey = joinKey;

    }

    public void setFlag(Text flag) {

        this.flag = flag;

    }

    public void setSecondPart(Text secondPart) {

        this.secondPart = secondPart;

    }

    public Text getFlag() {

        return flag;

    }

    public Text getSecondPart() {

        return secondPart;

    }

    public Text getJoinKey() {

        return joinKey;

    }

    public CombineValues() {

        this.joinKey =  new Text();

        this.flag = new Text();

        this.secondPart = new Text();

    }

    @Override

    public void write(DataOutput out) throws IOException {

        this.joinKey.write(out);

        this.flag.write(out);

        this.secondPart.write(out);

    }

    @Override

    public void readFields(DataInput in) throws IOException {

        this.joinKey.readFields(in);

        this.flag.readFields(in);

        this.secondPart.readFields(in);

    }

    @Override

    public int compareTo(CombineValues o) {

        return this.joinKey.compareTo(o.getJoinKey());

    }

    @Override

    public String toString() {

        // TODO Auto-generated method stub

        return "[flag="+this.flag.toString()+",joinKey="+this.joinKey.toString()+",secondPart="+this.secondPart.toString()+"]";

    }

}

(2) Main map and reduce code:

package com.mr.reduceSizeJoin;

import java.io.IOException;

import java.util.ArrayList;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.conf.Configured;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.Mapper;

import org.apache.hadoop.mapreduce.Reducer;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import org.apache.hadoop.util.Tool;

import org.apache.hadoop.util.ToolRunner;

import org.slf4j.Logger;

import org.slf4j.LoggerFactory;

/**

* @author zengzhaozheng

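* Purpose:

* A left outer join implemented as a reduce-side join.

* Note: the reducer below only emits keys that appear in both files, so the result listed at the end is in effect an inner join.

* The two files represent two tables; the join fields are table1.id and table2.cityID.

* table1 (left table): tb_dim_city(id int, name string, orderid int, city_code, is_show)

* Contents of tb_dim_city.dat, fields separated by "|":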

* id     name  orderid  city_code  is_show

* 0       其他        9999     9999         0

* 1       长春        1        901          1

* 2       吉林        2        902          1

* 3       四平        3        903          1

* 4       松原        4        904          1

* 5       通化        5        905          1

* 6       辽源        6        906          1

* 7       白城        7        907          1

* 8       白山        8        908          1

* 9       延吉        9        909          1

* ----------------------------------------------------------------

* table2 (right table): tb_user_profiles(userID int, network string, flow double, cityID int)

* Contents of tb_user_profiles.dat, fields separated by "|":

* userID   network     flow    cityID

* 1           2G       123      1

* 2           3G       333      2

* 3           3G       555      1

* 4           2G       777      3

* 5           3G       666      4

*

* ----------------------------------------------------------------

*  Result:

*  1   长春  1   901 1   1   2G  123

*  1   长春  1   901 1   3   3G  555

*  2   吉林  2   902 1   2   3G  333

*  3   四平  3   903 1   4   2G  777

*  4   松原  4   904 1   5   3G  666

*/

public class ReduceSideJoin_LeftOuterJoin extends Configured implements Tool{

    private static final Logger logger = LoggerFactory.getLogger(ReduceSideJoin_LeftOuterJoin.class);

    public static class LeftOutJoinMapper extends Mapper<Object, Text, Text, CombineValues> {

        private CombineValues combineValues = new CombineValues();

        private Text flag = new Text();

        private Text joinKey = new Text();

        private Text secondPart = new Text();

        @Override

        protected void map(Object key, Text value, Context context)

                throws IOException, InterruptedException {

            // Get the path of the input file this split belongs to

            String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();

            // Records from tb_dim_city.dat are tagged "0"

            if(pathName.endsWith("tb_dim_city.dat")){

                String[] valueItems = value.toString().split("\\|");

                // Skip malformed records

                if(valueItems.length != 5){

                    return;

                }

                flag.set("0");

                joinKey.set(valueItems[0]);

                secondPart.set(valueItems[1]+"\t"+valueItems[2]+"\t"+valueItems[3]+"\t"+valueItems[4]);

                combineValues.setFlag(flag);

                combineValues.setJoinKey(joinKey);

                combineValues.setSecondPart(secondPart);

                context.write(combineValues.getJoinKey(), combineValues);

                }// Records from tb_user_profiles.dat are tagged "1"

            else if(pathName.endsWith("tb_user_profiles.dat")){

                String[] valueItems = value.toString().split("\\|");

                // Skip malformed records

                if(valueItems.length != 4){

                    return;

                }

                flag.set("1");

                joinKey.set(valueItems[3]);

                secondPart.set(valueItems[0]+"\t"+valueItems[1]+"\t"+valueItems[2]);

                combineValues.setFlag(flag);

                combineValues.setJoinKey(joinKey);

                combineValues.setSecondPart(secondPart);

                context.write(combineValues.getJoinKey(), combineValues);

            }

        }

    }

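    public static class LeftOutJoinReducer extends Reducer<Text, CombineValues, Text, Text> {

        // Left-table records of the current group

        private ArrayList<Text> leftTable = new ArrayList<Text>();

        // Right-table records of the current group

        private ArrayList<Text> rightTable = new ArrayList<Text>();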

        private Text secondPar = null;

        private Text output = new Text();

        /**

         * reduce() is called once per group, that is, once per join key

         */

        @Override

        protected void reduce(Text key, Iterable<CombineValues> value, Context context)

                throws IOException, InterruptedException {

            leftTable.clear();

            rightTable.clear();

            /**

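             * Store the records of the group separately, by source file.

             * Caveat with this approach:

             * if a single group holds too many records, the reduce phase may run out of memory (OOM).

             * Before tackling a distributed problem, it is best to understand how the data is

             * distributed and pick the approach that fits; this helps avoid OOM and severe data skew.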

             */

            for(CombineValues cv : value){

                secondPar = new Text(cv.getSecondPart().toString());

                // Left table: tb_dim_city

                if("0".equals(cv.getFlag().toString().trim())){

                    leftTable.add(secondPar);

                }

                // Right table: tb_user_profiles

                else if("1".equals(cv.getFlag().toString().trim())){

                    rightTable.add(secondPar);

                }

            }

            logger.info("tb_dim_city:"+leftTable.toString());

            logger.info("tb_user_profiles:"+rightTable.toString());

            for(Text leftPart : leftTable){

                for(Text rightPart : rightTable){

                    output.set(leftPart+ "\t" + rightPart);

                    context.write(key, output);

                }

            }

        }

    }

    @Override

    public int run(String[] args) throws Exception {

            Configuration conf=getConf(); // get the job configuration

            Job job=new Job(conf,"LeftOutJoinMR");

            job.setJarByClass(ReduceSideJoin_LeftOuterJoin.class);

            FileInputFormat.addInputPath(job, new Path(args[0])); // map input path

            FileOutputFormat.setOutputPath(job, new Path(args[1])); // reduce output path

            job.setMapperClass(LeftOutJoinMapper.class);

            job.setReducerClass(LeftOutJoinReducer.class);

            job.setInputFormatClass(TextInputFormat.class); // input format

            job.setOutputFormatClass(TextOutputFormat.class);// use the default output format

            // map output key and value types

            job.setMapOutputKeyClass(Text.class);

            job.setMapOutputValueClass(CombineValues.class);

            // reduce output key and value types

            job.setOutputKeyClass(Text.class);

            job.setOutputValueClass(Text.class);

            job.waitForCompletion(true);

            return job.isSuccessful()?0:1;

    }

    public static void main(String[] args) throws IOException,

            ClassNotFoundException, InterruptedException {

        try {

            int returnCode =  ToolRunner.run(new ReduceSideJoin_LeftOuterJoin(),args);

            System.exit(returnCode);

        } catch (Exception e) {

            // Log the failure with its stack trace and exit with a non-zero status

            logger.error(e.getMessage(), e);

            System.exit(1);

        }

    }

}
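The reducer above produces output only when a join key has records on both sides, so cities without any user are dropped even though the class is named after a left outer join. A minimal sketch of what a per-group left outer join could look like, in plain Java with no Hadoop classes. The class name `LeftOuterSketch`, the `joinGroup` helper, and the `"NULL"` placeholder are assumptions made for illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Per-group left outer join: every left record appears in the output at least once.
class LeftOuterSketch {

    // Placeholder written when no right-side record exists; an arbitrary choice for this sketch.
    static final String MISSING = "NULL";

    // left and right hold the records of ONE join key, already separated by source tag.
    static List<String> joinGroup(String key, List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            if (right.isEmpty()) {
                // No matching right record: keep the left record and pad the right side.
                out.add(key + "\t" + l + "\t" + MISSING);
            } else {
                for (String r : right) {
                    out.add(key + "\t" + l + "\t" + r);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A city with two users, then a city with none.
        System.out.println(joinGroup("1", Arrays.asList("Changchun"),
                Arrays.asList("1\t2G\t123", "3\t3G\t555")));
        System.out.println(joinGroup("5", Arrays.asList("Tonghua"),
                Collections.<String>emptyList()));
    }
}
```

In the Hadoop reducer the same branch would sit inside the outer loop over `leftTable`; whether the padding value should be an empty string, `NULL`, or something else depends on the downstream consumer.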

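The detailed analysis and the input and output data are described in the code comments above, so here we focus on the shortcomings of the reduce-side join. The reason this approach exists is clear: the overall data set is split up, each map task processes only part of it and cannot see all the records needed for the join, so we use the join key as the reduce-side grouping key and bring all records with the same join key together for processing. The obvious drawback is that a large amount of data is transferred between the map and reduce sides, that is, during the shuffle phase, which makes this approach inefficient.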

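2. Join on the map side.

Use case: one table is very small and the other is very large.

Usage: when submitting the job, first put the small table's file into the job's DistributedCache. Each task then reads the small table from the DistributedCache, splits each record into join key and value, and keeps the result in memory (for example in a HashMap). It then scans the large table and checks, for each record, whether a record with the same join key exists in memory; if so, it emits the joined result directly.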
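Stripped of the Hadoop plumbing, the procedure above is a hash join with a build phase and a probe phase. A minimal sketch in plain Java, assuming "|"-separated lines with the join key as the first field of the small table and the last field of the large table; the class `MapJoinSketch` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hash join: build an in-memory map from the small table, then probe it while scanning the large one.
class MapJoinSketch {

    // Build phase: key = first field, value = the rest of the line.
    static Map<String, String> build(List<String> smallTableLines) {
        Map<String, String> byKey = new HashMap<>();
        for (String line : smallTableLines) {
            String[] parts = line.split("\\|", 2);
            if (parts.length == 2) {
                byKey.put(parts[0], parts[1]);
            }
        }
        return byKey;
    }

    // Probe phase: the join key is the last field of each large-table line.
    static List<String> probe(Map<String, String> small, List<String> bigTableLines) {
        List<String> out = new ArrayList<>();
        for (String line : bigTableLines) {
            int cut = line.lastIndexOf('|');
            if (cut < 0) {
                continue; // malformed line, no separator
            }
            String key = line.substring(cut + 1);
            String match = small.get(key);
            if (match != null) {
                // Emit: key, small-table fields, then the remaining large-table fields.
                out.add(key + "|" + match + "|" + line.substring(0, cut));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> cities = build(Arrays.asList("1|Changchun", "2|Jilin"));
        probe(cities, Arrays.asList("1|2G|123|1", "9|3G|1|7")).forEach(System.out::println);
    }
}
```

No grouping step is needed because each large-table record is joined independently, which is why the Hadoop version can run without a reduce phase.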

The full code is fairly simple:

package com.mr.mapSideJoin;

import java.io.BufferedReader;

import java.io.FileReader;

import java.io.IOException;

import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.conf.Configured;

import org.apache.hadoop.filecache.DistributedCache;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.Mapper;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import org.apache.hadoop.util.Tool;

import org.apache.hadoop.util.ToolRunner;

import org.slf4j.Logger;

import org.slf4j.LoggerFactory;

/**

* @author zengzhaozheng

*

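* Purpose:

* A left outer join implemented as a map-side join.

* Note: the mapper below only emits user records whose cityID is found in the cached table, so the result listed at the end is in effect an inner join.

* The two files represent two tables; the join fields are table1.id and table2.cityID.

* table1 (left table): tb_dim_city(id int, name string, orderid int, city_code, is_show),

* assumed to contain very few records. Contents of tb_dim_city.dat, fields separated by "|":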

* id     name  orderid  city_code  is_show

* 0       其他        9999     9999         0

* 1       长春        1        901          1

* 2       吉林        2        902          1

* 3       四平        3        903          1

* 4       松原        4        904          1

* 5       通化        5        905          1

* 6       辽源        6        906          1

* 7       白城        7        907          1

* 8       白山        8        908          1

* 9       延吉        9        909          1

* ----------------------------------------------------------------

* table2 (right table): tb_user_profiles(userID int, network string, flow double, cityID int)

* Contents of tb_user_profiles.dat, fields separated by "|":

* userID   network     flow    cityID

* 1           2G       123      1

* 2           3G       333      2

* 3           3G       555      1

* 4           2G       777      3

* 5           3G       666      4

* ----------------------------------------------------------------

*  Result:

*  1   长春  1   901 1   1   2G  123

*  1   长春  1   901 1   3   3G  555

*  2   吉林  2   902 1   2   3G  333

*  3   四平  3   903 1   4   2G  777

*  4   松原  4   904 1   5   3G  666

*/

public class MapSideJoinMain extends Configured implements Tool{

    private static final Logger logger = LoggerFactory.getLogger(MapSideJoinMain.class);

    public static class LeftOutJoinMapper extends Mapper<Object, Text, Text, Text> {

        private HashMap<String, String> city_info = new HashMap<String, String>();

        private Text outPutKey = new Text();

        private Text outPutValue = new Text();

        private String mapInputStr = null;

        private String mapInputSpit[] = null;

        private String city_secondPart = null;

        /**

         * Runs once before each task starts. Here it fetches the tb_dim_city file from the

         * DistributedCache and loads its records into memory.

         */

        @Override

        protected void setup(Context context)

                throws IOException, InterruptedException {

            BufferedReader br = null;

            // Get the DistributedCache files of the current job

            Path[] distributePaths = DistributedCache.getLocalCacheFiles(context.getConfiguration());

            String cityInfo = null;

            for(Path p : distributePaths){

                if(p.toString().endsWith("tb_dim_city.dat")){

                    // Read the cached file and load it into memory

                    br = new BufferedReader(new FileReader(p.toString()));

                    while(null!=(cityInfo=br.readLine())){

                        String[] cityPart = cityInfo.split("\\|",5);

                        if(cityPart.length ==5){

                            city_info.put(cityPart[0], cityPart[1]+"\t"+cityPart[2]+"\t"+cityPart[3]+"\t"+cityPart[4]);

                        }

                    }

                    br.close(); // release the file handle once the table is loaded

                }

            }

        }

        /**

         * The map side is simple: check whether the cityID of each tb_user_profiles.dat

         * record exists in the in-memory map; if it does, emit the joined record.

         */

        @Override

        protected void map(Object key, Text value, Context context)

                throws IOException, InterruptedException {

            // Skip empty lines

            if(value == null || value.toString().equals("")){

                return;

            }

            mapInputStr = value.toString();

            mapInputSpit = mapInputStr.split("\\|",4);

            // Skip malformed records

            if(mapInputSpit.length != 4){

                return;

            }

            // Check whether the join key exists in the in-memory map

            city_secondPart = city_info.get(mapInputSpit[3]);

            if(city_secondPart != null){

                this.outPutKey.set(mapInputSpit[3]);

                this.outPutValue.set(city_secondPart+"\t"+mapInputSpit[0]+"\t"+mapInputSpit[1]+"\t"+mapInputSpit[2]);

                context.write(outPutKey, outPutValue);

            }

        }

    }

    @Override

    public int run(String[] args) throws Exception {

            Configuration conf=getConf(); // get the job configuration

            DistributedCache.addCacheFile(new Path(args[1]).toUri(), conf);// register the small table as a cache file for this job

            Job job=new Job(conf,"MapJoinMR");

            job.setNumReduceTasks(0);

            FileInputFormat.addInputPath(job, new Path(args[0])); // map input path (the large table)

            FileOutputFormat.setOutputPath(job, new Path(args[2])); // job output path (map-only job, no reduce phase)

            job.setJarByClass(MapSideJoinMain.class);

            job.setMapperClass(LeftOutJoinMapper.class);

            job.setInputFormatClass(TextInputFormat.class); // input format

            job.setOutputFormatClass(TextOutputFormat.class);// use the default output format

            // map output key type

            job.setMapOutputKeyClass(Text.class);

            // job output key and value types

            job.setOutputKeyClass(Text.class);

            job.setOutputValueClass(Text.class);

            job.waitForCompletion(true);

            return job.isSuccessful()?0:1;

    }

    public static void main(String[] args) throws IOException,

            ClassNotFoundException, InterruptedException {

        try {

            int returnCode =  ToolRunner.run(new MapSideJoinMain(),args);

            System.exit(returnCode);

        } catch (Exception e) {

            // Log the failure with its stack trace and exit with a non-zero status

            logger.error(e.getMessage(), e);

            System.exit(1);

        }

    }

}

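A few words on DistributedCache. DistributedCache is an implementation of a distributed cache that plays an important role in the MapReduce framework and makes it possible to write fairly complex yet efficient distributed programs. In this case, before the job starts the JobTracker obtains the list of resource URIs registered in the DistributedCache and distributes the corresponding files to every TaskTracker that runs tasks of the job. Other aspects of how DistributedCache relates to a job, such as permissions, storage path separation, and the public and private attributes, are left for a separate post and are not covered here.

There is also a more extreme variant of the map join: combining it with HBase. This removes the memory limit, so a map join can be used freely, and according to the author it also performs well.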

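3. SemiJoin.

SemiJoin, the so-called semi join, is on closer inspection a variant of the reduce-side join: records are filtered on the map side so that only data taking part in the join crosses the network, while data that does not take part is never transferred. This reduces the shuffle traffic and improves overall efficiency; everything else is identical to the reduce-side join. More concretely: extract the join keys of the small table into a separate file, distribute it to the relevant nodes through the DistributedCache, and load it into memory (for example into a HashSet). During the map phase, scan the tables being joined and drop every record whose join key is not in the in-memory HashSet; the remaining records go through the shuffle to the reduce side, where the join proceeds exactly as in the reduce-side join. The code is given below.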
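The two semi-join steps described above, extracting the join keys and filtering on the map side, can be sketched in plain Java. The original post does not show how the key file is produced, so `extractKeys` is only one assumed way to do it; the class `SemiJoinSketch` and both method names are hypothetical, and records are assumed to be "|"-separated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Semi-join filtering: ship only the distinct join keys, then drop non-matching records before the shuffle.
class SemiJoinSketch {

    // Step 1: collect the distinct join keys of one table; keyIndex is the position of the join field.
    static Set<String> extractKeys(List<String> lines, int keyIndex) {
        Set<String> keys = new HashSet<>();
        for (String line : lines) {
            String[] parts = line.split("\\|");
            if (parts.length > keyIndex) {
                keys.add(parts[keyIndex]);
            }
        }
        return keys;
    }

    // Step 2: map-side filter that keeps only records whose join key is in the key set.
    static List<String> filter(List<String> lines, int keyIndex, Set<String> keys) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split("\\|");
            if (parts.length > keyIndex && keys.contains(parts[keyIndex])) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList("1|2G|123|1", "2|3G|333|2");
        List<String> cities = Arrays.asList("0|Other", "1|Changchun", "2|Jilin", "5|Tonghua");
        Set<String> keys = extractKeys(users, 3);
        // Only the cities that some user refers to survive the filter.
        filter(cities, 0, keys).forEach(System.out::println);
    }
}
```

Only the key set has to fit in memory, not the full records, which is what distinguishes this from the map-side join.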

package com.mr.SemiJoin;

import java.io.BufferedReader;

import java.io.FileReader;

import java.io.IOException;

import java.util.ArrayList;

import java.util.HashSet;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.conf.Configured;

import org.apache.hadoop.filecache.DistributedCache;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.Mapper;

import org.apache.hadoop.mapreduce.Reducer;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import org.apache.hadoop.util.Tool;

import org.apache.hadoop.util.ToolRunner;

import org.slf4j.Logger;

import org.slf4j.LoggerFactory;

/**

* @author zengzhaozheng

*

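* Purpose:

* A semi join: a reduce-side join whose mappers first drop records with join keys not listed in joinKey.dat.

* As in the reduce-side example, only keys present in both files are emitted. The two files represent two tables; the join fields are table1.id and table2.cityID.

* table1 (left table): tb_dim_city(id int, name string, orderid int, city_code, is_show)

* Contents of tb_dim_city.dat, fields separated by "|":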

* id     name  orderid  city_code  is_show

* 0       其他        9999     9999         0

* 1       长春        1        901          1

* 2       吉林        2        902          1

* 3       四平        3        903          1

* 4       松原        4        904          1

* 5       通化        5        905          1

* 6       辽源        6        906          1

* 7       白城        7        907          1

* 8       白山        8        908          1

* 9       延吉        9        909          1

* ----------------------------------------------------------------

* table2 (right table): tb_user_profiles(userID int, network string, flow double, cityID int)

* Contents of tb_user_profiles.dat, fields separated by "|":

* userID   network     flow    cityID

* 1           2G       123      1

* 2           3G       333      2

* 3           3G       555      1

* 4           2G       777      3

* 5           3G       666      4

* ----------------------------------------------------------------

* Contents of joinKey.dat:

* city_code

* 1

* 2

* 3

* 4

* ----------------------------------------------------------------

*  Result:

*  1   长春  1   901 1   1   2G  123

*  1   长春  1   901 1   3   3G  555

*  2   吉林  2   902 1   2   3G  333

*  3   四平  3   903 1   4   2G  777

*  4   松原  4   904 1   5   3G  666

*/

public class SemiJoin extends Configured implements Tool{

    private static final Logger logger = LoggerFactory.getLogger(SemiJoin.class);

    public static class SemiJoinMapper extends Mapper<Object, Text, Text, CombineValues> {

        private CombineValues combineValues = new CombineValues();

        private HashSet<String> joinKeySet = new HashSet<String>();

        private Text flag = new Text();

        private Text joinKey = new Text();

        private Text secondPart = new Text();

        /**

         * Load the join keys from the DistributedCache into memory so that the map side can keep only the records that take part in the join.

         */

        @Override

        protected void setup(Context context)

                throws IOException, InterruptedException {

            BufferedReader br = null;

            // Get the DistributedCache files of the current job

            Path[] distributePaths = DistributedCache.getLocalCacheFiles(context.getConfiguration());

            String joinKeyStr = null;

            for(Path p : distributePaths){

                if(p.toString().endsWith("joinKey.dat")){

                    // Read the cached file and load it into memory

                    br = new BufferedReader(new FileReader(p.toString()));

                    while(null!=(joinKeyStr=br.readLine())){

                        joinKeySet.add(joinKeyStr);

                    }

                    br.close(); // release the file handle once the keys are loaded

                }

            }

        }

        @Override

        protected void map(Object key, Text value, Context context)

                throws IOException, InterruptedException {

            // Get the path of the input file this split belongs to

            String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();

            // Records from tb_dim_city.dat are tagged "0"

            if(pathName.endsWith("tb_dim_city.dat")){

                String[] valueItems = value.toString().split("\\|");

                // Skip malformed records

                if(valueItems.length != 5){

                    return;

                }

                // Drop records that do not take part in the join

                if(joinKeySet.contains(valueItems[0])){

                    flag.set("0");

                    joinKey.set(valueItems[0]);

                    secondPart.set(valueItems[1]+"\t"+valueItems[2]+"\t"+valueItems[3]+"\t"+valueItems[4]);

                    combineValues.setFlag(flag);

                    combineValues.setJoinKey(joinKey);

                    combineValues.setSecondPart(secondPart);

                    context.write(combineValues.getJoinKey(), combineValues);

                }else{

                    return ;

                }

            }// Records from tb_user_profiles.dat are tagged "1"

            else if(pathName.endsWith("tb_user_profiles.dat")){

                String[] valueItems = value.toString().split("\\|");

                // Skip malformed records

                if(valueItems.length != 4){

                    return;

                }

                // Drop records that do not take part in the join

                if(joinKeySet.contains(valueItems[3])){

                    flag.set("1");

                    joinKey.set(valueItems[3]);

                    secondPart.set(valueItems[0]+"\t"+valueItems[1]+"\t"+valueItems[2]);

                    combineValues.setFlag(flag);

                    combineValues.setJoinKey(joinKey);

                    combineValues.setSecondPart(secondPart);

                    context.write(combineValues.getJoinKey(), combineValues);

                }else{

                    return ;

                }

            }

        }

    }

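    public static class SemiJoinReducer extends Reducer<Text, CombineValues, Text, Text> {

        // Left-table records of the current group

        private ArrayList<Text> leftTable = new ArrayList<Text>();

        // Right-table records of the current group

        private ArrayList<Text> rightTable = new ArrayList<Text>();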

        private Text secondPar = null;

        private Text output = new Text();

        /**

         * reduce() is called once per group, that is, once per join key

         */

        @Override

        protected void reduce(Text key, Iterable<CombineValues> value, Context context)

                throws IOException, InterruptedException {

            leftTable.clear();

            rightTable.clear();

            /**

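             * Store the records of the group separately, by source file.

             * Caveat with this approach:

             * if a single group holds too many records, the reduce phase may run out of memory (OOM).

             * Before tackling a distributed problem, it is best to understand how the data is

             * distributed and pick the approach that fits; this helps avoid OOM and severe data skew.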

             */

            for(CombineValues cv : value){

                secondPar = new Text(cv.getSecondPart().toString());

                // Left table: tb_dim_city

                if("0".equals(cv.getFlag().toString().trim())){

                    leftTable.add(secondPar);

                }

                // Right table: tb_user_profiles

                else if("1".equals(cv.getFlag().toString().trim())){

                    rightTable.add(secondPar);

                }

            }

            logger.info("tb_dim_city:"+leftTable.toString());

            logger.info("tb_user_profiles:"+rightTable.toString());

            for(Text leftPart : leftTable){

                for(Text rightPart : rightTable){

                    output.set(leftPart+ "\t" + rightPart);

                    context.write(key, output);

                }

            }

        }

    }

    @Override

    public int run(String[] args) throws Exception {

            Configuration conf=getConf(); // get the job configuration

            DistributedCache.addCacheFile(new Path(args[2]).toUri(), conf);// register joinKey.dat as a cache file for this job

            Job job=new Job(conf,"LeftOutJoinMR");

            job.setJarByClass(SemiJoin.class);

            FileInputFormat.addInputPath(job, new Path(args[0])); // map input path

            FileOutputFormat.setOutputPath(job, new Path(args[1])); // reduce output path

            job.setMapperClass(SemiJoinMapper.class);

            job.setReducerClass(SemiJoinReducer.class);

            job.setInputFormatClass(TextInputFormat.class); // input format

            job.setOutputFormatClass(TextOutputFormat.class);// use the default output format

            // map output key and value types

            job.setMapOutputKeyClass(Text.class);

            job.setMapOutputValueClass(CombineValues.class);

            // reduce output key and value types

            job.setOutputKeyClass(Text.class);

            job.setOutputValueClass(Text.class);

            job.waitForCompletion(true);

            return job.isSuccessful()?0:1;

    }

    public static void main(String[] args) throws IOException,

            ClassNotFoundException, InterruptedException {

        try {

            int returnCode =  ToolRunner.run(new SemiJoin(),args);

            System.exit(returnCode);

        } catch (Exception e) {

            // Log the failure with its stack trace and exit with a non-zero status

            logger.error(e.getMessage(), e);

            System.exit(1);

        }

    }

}

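Note that SemiJoin also has a limited range of application: the extracted join keys must be held in memory, so the key set cannot be too large, otherwise the map side can easily run out of memory.

Summary

This post introduced three ways to join. They suit different scenarios and differ considerably in efficiency, mainly because of how much data each sends over the network. The map join is the most efficient, SemiJoin comes next, and the reduce-side join is the least efficient. In addition, when writing distributed big-data programs it is best to understand the overall distribution of the data to be processed first; this helps make the code more efficient, keeps data skew to a minimum, and lets the approach be matched to the data.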
